training

source: https://www.wandb.courses/courses/take/training-fine-tuning-LLMs/lessons/44579584-hardware-requirements

N: number of parameters, D: number of training tokens (dataset size)

  • Approximation for the total number of FLOPs needed to train an LLM: FLOPs ≈ 6 * N * D (see the sketch after this list)

  • Optimal amount of training data according to the Chinchilla scaling laws: D ≈ 20 * N tokens

  • Peak speed of GPUs (approximate bf16 tensor-core throughput)

    • A100: 312 TFLOP/s = 312e12 FLOP/s
    • H100: 989 TFLOP/s

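A minimal Python sketch that encodes these three rules of thumb. The function and constant names are illustrative (not from the course), and it assumes GPUs sustain their peak throughput, i.e. no utilization/MFU factor:

```python
# Back-of-envelope estimate of LLM pre-training time.
# Assumptions: FLOPs ≈ 6 * N * D, Chinchilla-optimal D ≈ 20 * N,
# and GPUs running at peak throughput (100% utilization).

SECONDS_PER_DAY = 24 * 60 * 60

A100_FLOPS = 312e12  # peak bf16 tensor-core throughput, FLOP/s
H100_FLOPS = 989e12


def train_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute: FLOPs ≈ 6 * N * D."""
    return 6 * n_params * n_tokens


def chinchilla_tokens(n_params: float) -> float:
    """Chinchilla-optimal dataset size: D ≈ 20 * N tokens."""
    return 20 * n_params


def training_days(n_params: float, n_gpus: int, gpu_flops: float = A100_FLOPS,
                  n_tokens=None) -> float:
    """Wall-clock days to train, assuming perfect GPU utilization."""
    if n_tokens is None:
        n_tokens = chinchilla_tokens(n_params)
    seconds = train_flops(n_params, n_tokens) / (n_gpus * gpu_flops)
    return seconds / SECONDS_PER_DAY
```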
Example: training a 7B model on 64 A100s with the Chinchilla-optimal amount of data takes around 3.4 days at peak throughput.
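Plugging the numbers into the hypothetical helper above reproduces that figure:

```python
# 7B params -> D = 20 * 7e9 = 1.4e11 tokens, 6 * N * D = 5.88e21 FLOPs,
# divided by 64 * 312e12 FLOP/s ≈ 2.95e5 s ≈ 3.4 days.
days = training_days(n_params=7e9, n_gpus=64)
print(f"{days:.1f} days")  # -> 3.4 days
```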