training

what

  • how many of the available FLOPs are actually used productively
    • i.e. the utilization derived from the floating-point operations required for a single forward/backward pass of the model
    • only counts the FLOPs that effectively train the model (the number counted in the first step of the calculation below)
    • does not account for the additional compute required by other implementation details (such as activation checkpointing)
      • not every operation is actually productive in moving the model forward
        • e.g. with activation checkpointing, some FLOPs are not used productively; they are spent on re-computation (see the MFU vs. HFU sketch below)
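
To make the "productive FLOPs" distinction concrete, here is a minimal sketch contrasting MFU with hardware FLOPs utilization (HFU). It assumes the usual rule of thumb of ~6 * n_params training FLOPs per token, and that full activation checkpointing re-runs the forward pass (~2 * n_params extra FLOPs per token); every concrete number is an illustrative placeholder.

```
# MFU vs. HFU under full activation checkpointing (illustrative numbers).
# Training ~ 6 * n_params FLOPs/token (forward ~2N + backward ~4N).
# Full recomputation re-runs the forward pass during backward (~2N more):
# FLOPs the hardware executes, but that MFU deliberately does not count.

n_params = 1.3e9             # assumed model size
tokens_per_sec = 80_000      # assumed measured throughput
peak_flops = 8 * 312e12      # assumed 8x A100 at BF16 peak

model_flops = 6 * n_params * tokens_per_sec     # "useful" training FLOPs
hardware_flops = 8 * n_params * tokens_per_sec  # includes recomputation

mfu = model_flops / peak_flops
hfu = hardware_flops / peak_flops
print(f"MFU: {mfu:.1%}  HFU: {hfu:.1%}")  # MFU 25.0%, HFU 33.3% here
```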

calculation

  • count the number of FLOPs required for a single forward/backward pass, per token
  • run the model for a while and measure the tokens per second (each token costs the number of FLOPs counted above)
  • divide the achieved FLOPs per second by the hardware's peak FLOPs per second, as in the sketch below
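
A minimal sketch of this two-step calculation, assuming the standard ~6 * n_params FLOPs-per-token rule and ignoring the attention term (covered under references below). The model size, throughput, and peak-FLOPs values are illustrative placeholders (312e12 is the A100 BF16 peak):

```
# Step 1: FLOPs for a single forward/backward pass, per token.
# Forward ~ 2 * n_params (one MAC per parameter); backward ~ 2x forward,
# so training costs ~ 6 * n_params FLOPs per token.
def train_flops_per_token(n_params):
    return 6 * n_params

# Step 2: measure throughput, then divide achieved FLOP/s by peak FLOP/s.
def mfu(n_params, tokens_per_sec, gpu_num, peak_flops_per_gpu):
    achieved = train_flops_per_token(n_params) * tokens_per_sec
    return achieved / (gpu_num * peak_flops_per_gpu)

# Example with made-up numbers: a 7B model on 8 GPUs (A100 BF16 peak
# = 312e12 FLOP/s), training at 12,000 tokens/sec in aggregate.
print(f"MFU: {mfu(7e9, 12_000, 8, 312e12):.1%}")  # ~20.2%
```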

note

  • nvidia-smi does not report MFU (its GPU-utilization figure measures how often kernels are active, not what fraction of peak FLOPs is achieved)

references

Per token, each parameter is used for a MAC (2 FLOPs) per network operation. Neural network training has 3 network operations: the forward pass, the backward pass, and the computation of parameter gradients.
    
For the attention mechanism, the forward-pass FLOPs are:
`attn_flops_per_seq = n_layers * 2 * 2 * (d_model * (seq_len**2))`
 
    ```
    flops_per_token = 2 * n_params
    flops_per_seq = flops_per_token * seq_len

    # approximate MFU, ignoring attention FLOPs
    # (factor 3 = forward + backward + parameter-gradient computation)
    mfu_approx = 3 * flops_per_seq * seq_per_sec / (gpu_num * GPU_AVAILABLE_FLOPS)

    # MFU including attention FLOPs
    attn_flops_per_seq = n_layers * 2 * 2 * (d_model * (seq_len**2))
    mfu = (3 * flops_per_seq + 3 * attn_flops_per_seq) * seq_per_sec / (gpu_num * GPU_AVAILABLE_FLOPS)
    ```
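
Plugging made-up numbers into the reference formula above (attention term included). Every concrete value here (model shape, throughput, hardware peak) is an illustrative assumption:

```
n_params = 1.3e9               # total parameters (assumed)
n_layers = 24                  # transformer layers (assumed)
d_model = 2048                 # hidden size (assumed)
seq_len = 2048                 # training sequence length (assumed)
seq_per_sec = 40.0             # measured sequences/sec (assumed)
gpu_num = 8
GPU_AVAILABLE_FLOPS = 312e12   # A100 BF16 peak

flops_per_token = 2 * n_params
flops_per_seq = flops_per_token * seq_len
attn_flops_per_seq = n_layers * 2 * 2 * (d_model * (seq_len ** 2))

# factor 3 = forward pass + backward pass + parameter-gradient computation
mfu = (3 * flops_per_seq + 3 * attn_flops_per_seq) * seq_per_sec \
    / (gpu_num * GPU_AVAILABLE_FLOPS)
print(f"MFU: {mfu:.1%}")  # ~29.6% with these placeholders
```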