Why
When the model is so large that there is not enough memory to store all of the activations for the backward pass.
How
- During the forward pass, compute the activations but discard them for some layers to save memory
- During the backward pass, recompute those discarded activations when needed, starting from the activations that were saved
Hence
- Saves memory
- But requires running the forward pass twice for some layers, which costs extra compute
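This technique is commonly called activation (or gradient) checkpointing. A minimal sketch of the idea, using a toy chain of scalar-multiply layers: the forward pass keeps activations only every `every` layers, and the backward pass recomputes each segment's activations from its nearest saved checkpoint. All names here (`forward_checkpointed`, `backward_checkpointed`, `every`) are illustrative, not from any real library.

```python
# Toy activation checkpointing sketch (illustrative, not a real API).
# Each "layer" is a scalar multiply f_i(x) = w_i * x, so df/dx = w_i, df/dw_i = x.

def forward_checkpointed(x, weights, every):
    """Forward pass that keeps activations only at every `every`-th layer."""
    saved = {0: x}                     # layer index -> checkpointed activation
    a = x
    for i, w in enumerate(weights):
        a = w * a
        if (i + 1) % every == 0:
            saved[i + 1] = a           # keep this one; the rest are dropped
    return a, saved

def backward_checkpointed(weights, saved, every):
    """Backward pass: recompute each segment's activations from its checkpoint."""
    n = len(weights)                   # assumes n is a multiple of `every`
    grad_w = [0.0] * n
    upstream = 1.0                     # dL/d(output), taking loss = output
    for start in range(n - every, -1, -every):
        # Second forward pass over this segment: the extra compute we pay.
        acts = [saved[start]]
        for w in weights[start:start + every]:
            acts.append(w * acts[-1])
        for j in range(every - 1, -1, -1):
            grad_w[start + j] = acts[j] * upstream   # dL/dw_i = layer input * upstream
            upstream *= weights[start + j]           # dL/dx for the earlier layer
    return grad_w

weights = [2.0, 3.0, 4.0, 5.0]
out, saved = forward_checkpointed(1.0, weights, every=2)
grads = backward_checkpointed(weights, saved, every=2)
print(out)            # 120.0
print(sorted(saved))  # [0, 2, 4] -- 3 stored activations instead of 5
print(grads)          # [60.0, 40.0, 30.0, 24.0]
```

With `every=2` only 3 activations are stored instead of 5, and in exchange every layer's activation inside a segment is computed a second time during the backward pass, which is exactly the memory-for-compute trade described above.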