Scaling Sparse Fine-Tuning to Large Language Models

sparse fine-tuning (SFT) for LLMs
- adding a sparse vector $ϕ$ to the model weight $θ$
- LoRA (top) vs Sparse Fine-Tuning (bottom)
address the memory footprint problem
- SFTs usually consider all LLM parameters
- by only keep a fixed-size vector of SFT parameters of model weight indices, then iteratively perform 3 operations:
  - update: update the PEFT deltas corresponds to the active indices
  - drop: prune indices based on the change of magnitude of their deltas
  - grow:
    - choosing new indices to be active, base on 1 of 2 criteria
      - accumulated gradients of a few candidate parameters
      - approximate momenta estimated using SM3 optimizer
    - gradient estimation are estimated for some training steps not just 1
      - this is called the “gradient estimation phase”
    - only maintain estimation for $k$ “growth candidates”
      - selected as top-k gradient magnitude during the first batch of the gradient estimation phase
      - this might be biased towards the first training step (and the training examples used in that step) idea
        
        instead of that, dynamically update the candidates by getting top-k at each step
        
        get the top-k out of the existing candidates and top-k from the current step (up to $2 k$ total)
        
        based on their running average of gradient magnitude
        
        with higher priority for the more “senior” candidates

Lone's notes