Gradient accumulation
- Only update params every \(n\) steps.
- Essentially replicates batching in stochastic gradient descent: summing gradients over \(n\) small batches is (almost) statistically the same as computing the gradient of one batch that is \(n\) times larger.
- Thus allows one to simulate a larger batch size on a smaller machine without running into out-of-memory errors.
Gradient accumulation in PyTorch
See here for more detail and for how to do this in PyTorch (or see CS 330 HW3).
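A minimal sketch of the idea in PyTorch (the model, synthetic data, and `accum_steps` value below are illustrative assumptions, not taken from CS 330 HW3): gradients accumulate in each parameter's `.grad` across `accum_steps` micro-batches, and `optimizer.step()` runs only once per accumulated batch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset (assumption for illustration).
dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=8)       # small per-step (micro) batch

model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                                  # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    # Dividing by accum_steps makes the accumulated gradient the mean over
    # the larger effective batch rather than the sum.
    (loss / accum_steps).backward()              # gradients accumulate in p.grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # update params every n micro-batches
        optimizer.zero_grad()                    # reset accumulated gradients
```

Scaling the loss by `accum_steps` before `backward()` keeps the accumulated gradient equal to what a single batch of the full effective size would have produced.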