distributed

What

  • place each layer of the network on a different GPU
  • activations are constantly sent forward, and gradients backward, from one GPU to the next
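
A minimal sketch of how batches move through the GPUs, forward pass only. It assumes each GPU holds one layer and every layer takes one equal time step per batch; the function name `forward_schedule` and its parameters are made up for illustration and do not come from any library.

```python
def forward_schedule(num_stages, num_batches):
    """Return grid[t][s]: the batch index GPU s works on at step t, or None if idle.

    Assumption: one layer per GPU, one time step per layer per batch.
    """
    total_steps = num_stages + num_batches - 1
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            b = t - s  # batch b reaches GPU s exactly s steps after entering GPU 0
            row.append(b if 0 <= b < num_batches else None)
        grid.append(row)
    return grid


# Print the schedule for 3 GPUs and 4 batches, one line per time step.
for t, row in enumerate(forward_schedule(num_stages=3, num_batches=4)):
    print(t, ["idle" if b is None else f"batch {b}" for b in row])
```

Each printed row is one time step and each column is one GPU, so the idle slots at the start and end of the run are directly visible.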

Hence

  • a very complicated strategy to implement and schedule
  • very scalable
  • bubbles (low utilization):
    • all other GPUs are idle while the 1st batch is in the 1st GPU
    • the GPUs become idle one by one as the last batch passes through
    • need to push a lot of data through before this makes sense, because these bubbles cost a lot of utilization
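
A rough way to put a number on the bubbles, under simplifying assumptions: every GPU takes one equal time step per batch, and the pipeline fills once and drains once. Under those assumptions a run over p GPUs and m batches lasts m + p - 1 steps, of which each GPU is busy for m, so the idle fraction is (p - 1) / (m + p - 1). The helper name `bubble_fraction` is hypothetical.

```python
def bubble_fraction(num_stages, num_batches):
    """Fraction of GPU time slots spent idle in a simple fill-and-drain pipeline.

    Assumption: each stage takes one equal time step per batch.
    """
    total_slots = num_stages * (num_stages + num_batches - 1)
    busy_slots = num_stages * num_batches
    return 1 - busy_slots / total_slots


# With 4 GPUs, the idle fraction shrinks as more batches are pushed through.
for m in (1, 4, 32, 256):
    print(m, round(bubble_fraction(4, m), 3))
```

With a single batch on 4 GPUs, three quarters of the slots are idle; the fraction only becomes small once the number of batches is much larger than the number of GPUs.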