distributed (data-parallel) training

What it is

  • use multiple GPUs
  • keep a separate copy of the model on each GPU
  • feed a different slice of the data to each GPU
  • average the gradients across GPUs before each update
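The steps above can be sketched without any GPU framework. This is a toy NumPy simulation of data parallelism under assumed details (a linear model with squared error, two "replicas", equal-sized shards): each replica holds the same weights, computes a gradient on its own shard, and the gradients are averaged before the shared update.

```python
import numpy as np

def local_grad(w, X, y):
    # Gradient of mean squared error 0.5 * ||Xw - y||^2 / n on one shard.
    n = len(y)
    return X.T @ (X @ w - y) / n

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # the full batch
y = rng.normal(size=8)
w = rng.normal(size=3)        # identical initial weights on every replica

# One shard per "GPU": each replica sees a different slice of the data.
shards = [(X[:4], y[:4]), (X[4:], y[4:])]

# Each replica computes its gradient locally, then the gradients are
# averaged -- the role played by an all-reduce in a real framework.
grads = [local_grad(w, Xs, ys) for Xs, ys in shards]
avg_grad = np.mean(grads, axis=0)

# Every replica applies the same update, so the copies stay in sync.
w_new = w - 0.1 * avg_grad
```

With equal shard sizes the averaged gradient equals the full-batch gradient, so the replicas take the same step a single device would while sharing the per-step work.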

Why

  • speeds up training: more data is processed per step, so fewer wall-clock hours per epoch