lecture link: https://www.wandb.courses/courses/take/training-fine-tuning-LLMs/lessons/49181376-logistics-of-data-loading
How
- Stream data from multiple sources to the GPU cluster during training (a minimal sketch follows below).
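One way to sketch this is with the open-source `mosaicml-streaming` library (not necessarily the exact setup from the course); the bucket paths, cache directories, and mixing proportions below are hypothetical placeholders, and the data is assumed to already be written as MDS shards:

```python
from streaming import Stream, StreamingDataset
from torch.utils.data import DataLoader

# Two hypothetical data sources living in different S3 buckets.
# Shards are downloaded on demand and cached locally, so the full
# dataset never has to fit on the training nodes' disks.
streams = [
    Stream(remote="s3://bucket-a/web-text", local="/tmp/cache/a", proportion=0.7),
    Stream(remote="s3://bucket-b/code", local="/tmp/cache/b", proportion=0.3),
]

# StreamingDataset mixes and shuffles across the streams; it assumes
# the data was converted to MDS shards beforehand (e.g. with MDSWriter).
dataset = StreamingDataset(streams=streams, shuffle=True, batch_size=32)

# A regular PyTorch DataLoader then feeds the GPUs from the streamed shards.
loader = DataLoader(dataset, batch_size=32, num_workers=8)

for batch in loader:
    ...  # forward/backward pass
```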
Why
- Data is usually very large and cannot all be stored in one place.
- Training usually happens in a different cloud from where the data is stored.
- The GPU cluster might move between clouds over time due to availability, cost, use cases, etc.
- Data only needs to live in cheap object storage like S3 instead of being kept on local SSDs.
- The network cost of moving the data is very small compared to the cost of one training epoch (1-2% or less, source: from the course). ⇒ It's a good tradeoff between cost and convenience (rough check below).
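A quick back-of-envelope check of that ratio; every price and size here is an illustrative assumption, not a figure from the course:

```python
# All numbers below are illustrative assumptions, not course figures.
dataset_tb = 1.0                  # dataset size moved once per epoch, in TB
egress_per_gb = 0.09              # assumed cloud egress price, $/GB
gpu_node_per_hour = 32.0          # assumed price of one 8-GPU node, $/hour
num_nodes = 8
epoch_hours = 24.0

network_cost = dataset_tb * 1024 * egress_per_gb            # ~$92
compute_cost = gpu_node_per_hour * num_nodes * epoch_hours  # ~$6,144

print(f"network cost: ${network_cost:,.0f}")
print(f"compute cost: ${compute_cost:,.0f}")
print(f"ratio: {network_cost / compute_cost:.1%}")          # ~1.5%
```

Under these assumed prices the egress cost is roughly 1.5% of one epoch's compute cost, consistent with the 1-2% claim above.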