N: number of model parameters, D: number of training tokens (dataset size)
-
Approximation for the number of FLOPs needed to train an LLM
FLOPs = 6 * N * D
- ignores self-attention FLOPs, which are negligible here (see Choosing architecture for training LLM)
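A minimal sketch of this approximation (the function name and the 7B / 140B-token example are illustrative, not from a specific codebase):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token
    (roughly 2*N*D for the forward pass + 4*N*D for the backward pass),
    ignoring self-attention."""
    return 6 * n_params * n_tokens

# Example: 7B parameters trained on 140B tokens
print(f"{training_flops(7e9, 140e9):.2e} FLOPs")  # ~5.9e21
```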
-
Optimal amount of data needed according to the Chinchilla scaling laws
D = 20 * N
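⇒ e.g. a 7B-parameter model is compute-optimally trained on D = 20 * 7e9 = 140e9 ≈ 140B tokens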
-
Speed of GPUs (approximately)
- A100: 312 TFLOP/s = 312e12 FLOP/s
- H100: 989 TFLOP/s
⇒ training a 7B model on 64 A100s with the Chinchilla-optimal amount of data (~140B tokens) takes around 3.4 days at full utilization
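A quick back-of-the-envelope check of that 3.4-day figure (assumes 100% utilization; see MFU below — the function and its defaults are just a sketch of the arithmetic above):

```python
def training_days(n_params, tokens_per_param=20, n_gpus=64,
                  peak_flops_per_gpu=312e12, utilization=1.0):
    """Estimate wall-clock training time from FLOPs = 6 * N * D."""
    total_flops = 6 * n_params * (tokens_per_param * n_params)
    flops_per_sec = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / flops_per_sec / 86400  # seconds -> days

print(f"{training_days(7e9):.1f} days")  # ~3.4 days on 64 A100s at 100% utilization
```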
-
In practice, GPUs are never fully utilized:
Achieved FLOP/s = peak FLOP/s * MFU
(MFU: Model FLOPs Utilization)
Another measure of utilization is HFU (Hardware FLOPs Utilization), which also counts extra hardware work such as activation recomputation.
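A sketch of how MFU is commonly estimated from observed training throughput (the 100k tokens/s figure is hypothetical, purely for illustration):

```python
def model_flops_utilization(tokens_per_sec, n_params, n_gpus, peak_flops_per_gpu):
    """MFU = model FLOP/s actually delivered (6*N per token) / theoretical peak FLOP/s.
    HFU uses the same ratio but also counts extra hardware FLOPs
    (e.g. activation recomputation), so HFU >= MFU."""
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical: a 7B model on 64 A100s processing 100k tokens/s
print(f"MFU = {model_flops_utilization(1e5, 7e9, 64, 312e12):.1%}")  # ~21%
```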
-
- The reported utilization numbers apply only to the MosaicML codebase, but they should give a general sense.
-
In practice, a really good utilization (MFU) is around 50%.
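⇒ at 50% MFU, the 3.4-day estimate above roughly doubles to ~6.8 days of wall-clock training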