Why

  • fine-tuning is often modifying the weights of the whole model
  • but not all of us have fancy GPUs that can fit all the weights
  • sometimes you want different copies of the models fine-tuned for different purposes

What

  • the left is the pre-trained weights
  • learn the things on the right
    • the delta that needed to modify the pre-trained weight
  • in LoRA, the delta doesn’t need to be as big as the pre-trained weights
    • factorize them into 2 low-rank matrices that when multiply will approximate the big one

Hence

  • tends to work really well in practice, thus it is popular
  • dramatically reduces memory usage
    • don’t need to store much during forward and backward passes, just the LoRA weights and the main model
    • don’t need to do backward pass on the main model
      • no optimizer states
      • no gradients
      • no need to store the activations
  • doesn’t help speed very much
    • still have to do forward pass through the whole network
    • plus forward and backward for the LoRA weights
  • good for low memory GPU