Training

Loss spikes

  • in billion-parameter models, you are going to run into loss spikes if you push your learning rate high enough to get SOTA results (source)
  • for reasons that are not well understood

Resolution

  • there are some algorithmic fixes
  • one of the most popular in the literature (in practical settings) is simply the following recipe (source; sketched in code after this list):
    • roll back to a checkpoint
    • change the random seed
    • lower the learning rate
    • retry
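
A minimal, self-contained sketch of this rollback recipe on a toy model. The 2× running-average spike test, the 0.7× learning-rate cut, and the checkpoint interval are illustrative choices, not values from the source:

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ckpt = {"model": copy.deepcopy(model.state_dict()),
        "opt": copy.deepcopy(opt.state_dict())}
running, seed = None, 0

for step in range(1, 1001):
    x = torch.randn(32, 16)
    y = x.sum(dim=1, keepdim=True)
    loss = nn.functional.mse_loss(model(x), y)

    if running is not None and loss.item() > 2 * running:
        model.load_state_dict(ckpt["model"])      # roll back to a checkpoint
        opt.load_state_dict(ckpt["opt"])
        seed += 1
        torch.manual_seed(seed)                   # change the random seed
        for g in opt.param_groups:
            g["lr"] *= 0.7                        # lower the learning rate
        continue                                  # retry

    opt.zero_grad()
    loss.backward()
    opt.step()
    running = loss.item() if running is None else 0.9 * running + 0.1 * loss.item()

    if step % 100 == 0:                           # keep a rolling "last good" checkpoint
        ckpt = {"model": copy.deepcopy(model.state_dict()),
                "opt": copy.deepcopy(opt.state_dict())}
```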

Hardware failures

  • GPUs die or run into various other issues fairly frequently (source)
  • A persistent problem at MosaicML (source)

Consequences and Resolution

  • once a GPU dies, the training job dies
    • -> need a fault-tolerant way to train
    • just have to resume from a checkpoint (see the sketch after this list)
  • GPUs usually die in groups of 8, because on major cloud providers they are soldered to the motherboard to get the fast multi-GPU interconnect
    • you can't just swap out 1 GPU; you must swap the whole board or a whole node
  • sometimes you cannot resume training if you have fewer GPUs:
    • your checkpoint may be sharded across the original number of GPUs
    • you might lose determinism
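
A minimal sketch of the resume-from-checkpoint pattern on a toy model: checkpoints carry the step counter, and simply rerunning the script after a crash picks up where it left off. The path, save interval, and atomic-rename trick are illustrative choices:

```python
import os

import torch
import torch.nn as nn

CKPT = "ckpt.pt"
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
start_step = 0

if os.path.exists(CKPT):                         # job was killed? pick up where we left off
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    x = torch.randn(32, 16)
    loss = nn.functional.mse_loss(model(x), x.sum(dim=1, keepdim=True))
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 500 == 0:                          # checkpoint periodically
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT + ".tmp")
        os.replace(CKPT + ".tmp", CKPT)          # atomic rename: no half-written checkpoints
```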

Mitigation

  • Automatic detection of failures (watchdog sketch after this list)
    • look for NVIDIA errors (e.g., Xid errors in the kernel log)
    • look for failure conditions, e.g. the job suddenly gets really slow
    • identify problems proactively
      • a lot of the time the job won't just crash; it will get really slow or get stuck somewhere
      • oftentimes the error messages are cryptic if you just look at the command line
  • keep spare GPUs available:
    • use them for lower-priority stuff
    • swap them in when they're needed
  • Sharded checkpointing (sketch after this list)
  • Data loaders with random access
    • after loading a checkpoint, you may have to iterate all the way through the data loader to get back to where you were, which can take a long time for a large dataset
    • data loaders from MosaicML (the MosaicML Streaming data loader) allow random access (generic sketch after this list)
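
A minimal sketch of the proactive-detection idea: the training loop records a heartbeat every step, and a watchdog thread raises an alert when steps stop arriving or get much slower than the recent average. The thresholds are made up for illustration; in practice you would also scan logs for the NVIDIA errors mentioned above:

```python
import threading
import time

last_beat = time.monotonic()
avg_step_time = None

def heartbeat(step_time):
    """Call from the training loop after every optimizer step."""
    global last_beat, avg_step_time
    last_beat = time.monotonic()
    avg_step_time = (step_time if avg_step_time is None
                     else 0.9 * avg_step_time + 0.1 * step_time)

def watchdog(stall_after=300.0, slow_factor=3.0, poll=30.0):
    """Flags stalls (no heartbeat at all) and slowdowns (steps far above average)."""
    while True:
        time.sleep(poll)
        silence = time.monotonic() - last_beat
        if silence > stall_after:
            print(f"ALERT: no step for {silence:.0f}s -- job may be hung")
        elif avg_step_time is not None and silence > slow_factor * avg_step_time:
            print(f"ALERT: {silence:.0f}s since last step vs ~{avg_step_time:.1f}s average -- job is slow")

threading.Thread(target=watchdog, daemon=True).start()
```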
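
A bare-bones sketch of sharded checkpointing, assuming `torch.distributed` is already initialized: each rank saves and loads only its own shard, which parallelizes checkpoint I/O but is exactly why resuming can require the same number of GPUs. A generic illustration, not MosaicML's implementation:

```python
import torch
import torch.distributed as dist

def save_sharded(state_dict, prefix):
    # Each rank writes only its own shard, in parallel.
    torch.save(state_dict, f"{prefix}.rank{dist.get_rank()}.pt")

def load_sharded(prefix):
    # Resuming requires the same world size the shards were written with.
    return torch.load(f"{prefix}.rank{dist.get_rank()}.pt")
```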
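
The random-access idea in miniature, using a generic map-style dataset rather than the actual MosaicML Streaming API: because any sample is addressable by index, resuming can seek directly to the right offset instead of replaying everything already consumed:

```python
import torch
from torch.utils.data import Dataset

class RangeDataset(Dataset):
    """Toy map-style dataset: any sample is reachable by index in O(1)."""
    def __len__(self):
        return 1_000_000
    def __getitem__(self, i):
        return torch.tensor([float(i)])

ds = RangeDataset()
batch_size = 32
resume_step = 12_345                 # restored from the checkpoint

# Jump straight to the first unconsumed sample instead of replaying
# resume_step * batch_size samples through an iterator.
offset = resume_step * batch_size
batch = torch.stack([ds[offset + i] for i in range(batch_size)])
```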