GitHub - mosaicml/streaming: A Data Streaming Library for Efficient Neural Network Training
- Allow data to be stored in many places during training.
- Ensure determinism even when training crashes or the number of GPUs changes.
- Determinism is really tricky. Key ideas:
- The idea of virtual GPUs.
- When spinning up the streaming data loader, you set a number of virtual GPUs.
- The number of virtual GPUs has to be divisible by the number of physical GPUs, or vice versa.
- Once the number of virtual GPUs is fixed, determinism is preserved.
- ⇒ Virtual GPUs act as a layer of abstraction that determines the order in which you see the data.
- Allow for elastic sharded checkpointing.
- Allow `n` sharded checkpoints to be resumed using `m` GPUs (`m != n`).
- e.g. have a checkpoint for 512 GPUs and resume just fine on 504 GPUs (or whatever the number).
- allow
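
To make the two ideas above concrete, here is a toy sketch in plain Python. It is not the library's implementation or API: every function name (`virtual_gpu_streams`, `per_gpu_batches`, `global_order`, `reshard`) is hypothetical, the round-robin sample assignment is an assumption chosen for simplicity, and only the case where the physical GPU count divides the virtual GPU count is handled. The point it illustrates is that the sample order is a function of the virtual GPU count alone, and that a sharded state can be re-split across a different number of GPUs.

```python
from typing import List


def virtual_gpu_streams(num_samples: int, num_virtual: int) -> List[List[int]]:
    """Give each virtual GPU its own ordered stream of sample ids.

    Assumption: a simple round-robin split. It depends only on
    num_virtual, never on how many physical GPUs exist.
    """
    return [list(range(v, num_samples, num_virtual)) for v in range(num_virtual)]


def per_gpu_batches(num_samples: int, num_virtual: int,
                    num_physical: int) -> List[List[List[int]]]:
    """Return schedule[step][gpu] = sample ids that GPU reads at that step."""
    if num_virtual % num_physical != 0:
        # The "vice versa" case (more physical than virtual GPUs) is
        # deliberately left out of this sketch.
        raise ValueError("sketch only handles num_virtual % num_physical == 0")
    streams = virtual_gpu_streams(num_samples, num_virtual)
    hosted_per_gpu = num_virtual // num_physical
    steps = num_samples // num_virtual
    schedule = []
    for step in range(steps):
        step_batches = []
        for gpu in range(num_physical):
            # Each physical GPU hosts a contiguous block of virtual GPUs.
            hosted = range(gpu * hosted_per_gpu, (gpu + 1) * hosted_per_gpu)
            step_batches.append([streams[v][step] for v in hosted])
        schedule.append(step_batches)
    return schedule


def global_order(schedule: List[List[List[int]]]) -> List[List[int]]:
    """Flatten each step's per-GPU batches into one global batch."""
    return [[s for gpu_batch in step for s in gpu_batch] for step in schedule]


def reshard(shards: List[List[float]], m: int) -> List[List[float]]:
    """Toy elastic resharding: re-split n shards of a flat state into m shards."""
    flat = [x for shard in shards for x in shard]
    size = -(-len(flat) // m)  # ceiling division
    return [flat[i * size:(i + 1) * size] for i in range(m)]


if __name__ == "__main__":
    on_8_gpus = per_gpu_batches(num_samples=16, num_virtual=8, num_physical=8)
    on_4_gpus = per_gpu_batches(num_samples=16, num_virtual=8, num_physical=4)
    # Same global batch at every step, so a run can stop on 8 GPUs after
    # step k and continue on 4 GPUs from step k + 1 without changing order.
    assert global_order(on_8_gpus) == global_order(on_4_gpus)

    five_shards = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
    four_shards = reshard(five_shards, 4)
    assert [x for s in four_shards for x in s] == [float(i) for i in range(10)]
```

In this sketch, changing the physical GPU count only changes which device reads which virtual stream; the per-step global batch stays the same, which is the property the notes attribute to virtual GPUs.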