training

Why

When the model is so large that there is not enough memory to store all of its activations from the forward pass for use in the backward pass.

How

  • During the forward pass, compute the activations of some layers and then discard them, keeping only a subset of saved checkpoint activations, to save memory
  • On the backward pass, recompute the discarded activations when they are needed by re-running the forward pass from the nearest saved checkpoint (see the sketch below)
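As a concrete illustration, here is a minimal PyTorch sketch (the model, sizes, and names are made up for the example) that applies this idea per block using torch.utils.checkpoint.checkpoint: each block's intermediate activations are dropped after the forward pass and recomputed during backward.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Toy model (hypothetical) where each block's activations are recomputed
    during backward instead of being stored during forward."""

    def __init__(self, dim: int = 1024, n_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # checkpoint() runs the block without keeping its intermediate
            # activations; it re-runs the block's forward during backward
            # to get them back (use_reentrant=False needs a recent PyTorch).
            x = checkpoint(block, x, use_reentrant=False)
        return x


if __name__ == "__main__":
    model = CheckpointedMLP()
    x = torch.randn(32, 1024, requires_grad=True)
    model(x).sum().backward()  # gradients flow as usual, just with extra recompute
```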

Hence

  • Saves memory, since only a subset of activations is kept during the forward pass
  • But the forward pass must be re-run for the checkpointed layers during backward, which costs extra compute, up to roughly one additional full forward pass (a rough illustration follows this list)
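To make the trade-off visible, here is a rough sketch (sizes, segment count, and the helper name are arbitrary; it assumes a CUDA device and a recent PyTorch) that compares peak GPU memory with and without segment-wise checkpointing via torch.utils.checkpoint.checkpoint_sequential. The checkpointed run should peak lower but spend extra time recomputing activations in the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential


def peak_memory_mb(stack: nn.Sequential, x: torch.Tensor, segments) -> float:
    """Run forward + backward and report peak CUDA memory in MB.
    segments=None means no checkpointing."""
    torch.cuda.reset_peak_memory_stats()
    if segments is None:
        out = stack(x)
    else:
        # Only activations at segment boundaries are stored; everything else
        # is recomputed during backward.
        out = checkpoint_sequential(stack, segments, x, use_reentrant=False)
    out.sum().backward()
    return torch.cuda.max_memory_allocated() / 1e6


if __name__ == "__main__" and torch.cuda.is_available():
    blocks = [nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()) for _ in range(16)]
    stack = nn.Sequential(*blocks).cuda()
    x = torch.randn(64, 4096, device="cuda", requires_grad=True)
    print("no checkpointing:", peak_memory_mb(stack, x, None), "MB")
    print("4 segments:      ", peak_memory_mb(stack, x, 4), "MB")
```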