• sparse fine-tuning (SFT) for LLMs
    • adds a sparse vector of deltas to the model weights
    • figure: LoRA (top) vs. sparse fine-tuning (bottom)
  • address the memory footprint problem
    • SFT methods usually consider all LLM parameters, so their memory cost grows with model size
    • instead, keep only a fixed-size vector of SFT parameters (deltas) together with their model weight indices, then iteratively perform 3 operations:
      • update: update the PEFT deltas corresponding to the active indices
      • drop: prune indices based on the change of magnitude of their deltas
      • grow:
        • choose new indices to activate, based on 1 of 2 criteria:
          • accumulated gradients of a few candidate parameters
          • approximate momenta estimated with the SM3 optimizer
        • gradients are estimated over several training steps, not just 1
          • this is called the “gradient estimation phase”
        • only maintain estimates for “growth candidates”
          • selected as the top-k by gradient magnitude on the first batch of the gradient estimation phase
          • this might be biased towards the first training step (and the training examples used in that step); idea:
            • instead, update the candidates dynamically by taking the top-k at each step
            • take the top-k out of the existing candidates plus the current step's top-k (up to total)
              • based on their running average of gradient magnitude
              • with higher priority for the more “senior” candidates
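
The drop and grow steps, and the dynamic candidate update proposed in the last bullets, can be sketched roughly as follows. This is a minimal illustration in pure Python, not the method's actual implementation. All function names, the dict-based storage, and the `beta` smoothing factor are hypothetical. Two further assumptions: the "change" used for dropping is measured as the absolute difference between the current and the previous delta, and seniority only breaks ties between equal running averages.

```python
import heapq

def top_k(scores, k):
    """Return the k keys of `scores` (dict: index -> value) with the largest values."""
    return heapq.nlargest(k, scores, key=scores.get)

def drop(active, deltas, prev_deltas, n_drop):
    """Drop step: remove the n_drop active indices whose deltas changed the least.

    Assumption: change is measured as |delta_now - delta_before|.
    """
    change = {i: abs(deltas[i] - prev_deltas[i]) for i in active}
    return sorted(top_k(change, len(active) - n_drop))

def grow(active, acc_grads, n_grow):
    """Grow step: activate the n_grow inactive candidates with the largest
    accumulated gradient magnitude (their new deltas would start at 0)."""
    scores = {i: abs(g) for i, g in acc_grads.items() if i not in active}
    return sorted(active + top_k(scores, n_grow))

def update_candidates(cands, grad_mags, k, beta=0.9):
    """Dynamic candidate update (the "idea" bullets).

    cands:     dict index -> (running average of |grad|, age in steps)
    grad_mags: dict index -> |grad| at the current step
    Keeps the k best entries from the existing candidates plus the current
    step's top-k, ranked by running average; on a tie the older candidate wins
    (assumption: seniority is only a tie-breaker).
    """
    merged = {}
    for i, (avg, age) in cands.items():
        new_avg = beta * avg + (1 - beta) * grad_mags.get(i, 0.0)
        merged[i] = (new_avg, age + 1)
    for i in top_k(grad_mags, k):
        merged.setdefault(i, (grad_mags[i], 0))  # newcomers start with age 0
    best = heapq.nlargest(k, merged, key=lambda i: merged[i])  # compares (avg, age)
    return {i: merged[i] for i in best}

# One drop/grow round on a toy flattened weight vector
active = drop([0, 1, 2],
              deltas={0: 0.5, 1: 0.1, 2: -0.3},
              prev_deltas={0: 0.1, 1: 0.1, 2: -0.1},
              n_drop=1)
active = grow(active, acc_grads={0: 9.0, 1: 0.05, 4: -3.0, 6: 1.0}, n_grow=1)
```

In this toy round, index 1 is dropped because its delta did not move, and index 4 is grown because it has the largest accumulated gradient magnitude among the inactive candidates. The number of active indices stays fixed, which is what keeps the memory footprint constant.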