training
what
- how many of the hardware's available FLOPs are actually used productively
- i.e. utilization measured against the floating-point operations required for a single forward/backward pass of the model (see the formula after this list)
- only counts the FLOPs that effectively train the model (the per-step number counted in the calculation section below)
- does not account for the additional compute required by other implementation details (such as activation checkpointing)
- not every operation is actually productive in moving the model forward
- e.g. with activation checkpointing, some of the FLOPs are not used productively; they are spent purely on re-computing activations
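written out, the definition is just a ratio; a minimal formulation, where "model FLOPs per token" means the forward/backward FLOPs counted in the calculation section below:

```latex
\mathrm{MFU}
= \frac{\text{achieved model FLOPs per second}}{\text{peak hardware FLOPs per second}}
= \frac{\text{model FLOPs per token} \times \text{tokens per second}}{\text{peak hardware FLOPs per second}}
```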
calculation
- run the model once and count the FLOPs required for the first training step (one forward/backward pass)
- run the model for a while and measure tokens per second (every token costs the per-token FLOPs counted above)
- divide the achieved FLOPs per second by the hardware's peak FLOPs per second (a sketch follows this list)
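a minimal PyTorch sketch of this procedure, assuming a decoder-style transformer where training costs roughly 6N FLOPs per token for a model with N parameters (2N forward, 4N backward); `train_step`, `batch`, and the peak value are placeholder assumptions, not a definitive implementation:

```python
import time
import torch

def measure_tokens_per_sec(train_step, batch, n_steps=10):
    # time a few full training steps; batch is assumed to be a
    # (B, T) tensor of token ids, so it holds batch.numel() tokens
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_steps):
        train_step(batch)  # one forward/backward/optimizer step
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return batch.numel() * n_steps / (time.time() - start)

def estimate_mfu(model, tokens_per_sec, peak_flops_per_sec):
    # model FLOPs per token ~= 6 * N for training (2N forward,
    # 4N backward); recomputation from activation checkpointing
    # is deliberately NOT counted, matching the MFU definition
    n_params = sum(p.numel() for p in model.parameters())
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec

# hypothetical usage:
# tps = measure_tokens_per_sec(train_step, batch)
# mfu = estimate_mfu(model, tps, peak_flops_per_sec=312e12)  # A100 bf16 peak
```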
note
nvidia-smi does not report MFU: its GPU utilization metric measures the fraction of time at least one kernel is running, not the fraction of peak FLOPs achieved
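for reference, a small sketch of what nvidia-smi does expose; utilization.gpu only reports kernel occupancy over the sample window, so it can read near 100% even when MFU is low:

```python
import subprocess

# query the "utilization" nvidia-smi reports; this is time-based
# kernel occupancy, not achieved FLOPs, so it says nothing about MFU
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "98 %"
```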