llm quantization

what

A quantization format

  • based on the same method as GPTQ (sketched below)
  • supports 2, 3, 4, 5, 6, and 8-bit quantization
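
the GPTQ connection, sketched: GPTQ quantizes a weight matrix one column at a time, folding each column's rounding error back into the not-yet-quantized columns using second-order (Hessian) statistics from calibration data. below is a rough numpy sketch of that loop, not this format's actual code; it assumes H_inv (the upper-triangular Cholesky factor of the inverse Hessian) is already computed, and quantize_rtn is just a plain round-to-nearest helper, not the real quantizer.

```python
import numpy as np

def quantize_rtn(col, bits):
    # plain round-to-nearest onto a symmetric 2**bits grid
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(col).max(), 1e-8) / qmax
    return np.clip(np.round(col / scale), -qmax - 1, qmax) * scale

def gptq_like_quantize(W, H_inv, bits):
    # W: (rows, cols); H_inv: upper-triangular Cholesky factor of the
    # inverse Hessian from calibration activations (assumed given here).
    # quantize column by column, spreading each column's rounding error
    # onto the columns not yet quantized -- the core GPTQ update.
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quantize_rtn(W[:, j], bits)
        err = (W[:, j] - Q[:, j]) / H_inv[j, j]
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return Q
```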

features

  • allows for mixing quantization levels within a model
    • to achieve any average bitrate between 2 and 8 bits per weight
    • possible to apply multiple quantization levels to each linear layer
      • more important weights (columns) are quantized with more bits, akin to sparse quantization (see the first sketch after this list)
  • automatic parameter selection (see the second sketch after this list)
    • quantizes each matrix multiple times
    • measures the quantization error for each possible setting, per layer
    • returns a combination that
      • minimizes the maximum quantization error over the entire model
      • meets a target average bitrate
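
a toy illustration of mixing levels within one linear layer (the first sketch referenced above): the most important columns get more bits, the rest fewer, and the layer's average bitrate lands in between. the importance scores, the 6-bit/3-bit split, and the 10% fraction are made-up knobs, not the format's real settings; quantize_rtn is the helper from the earlier sketch.

```python
import numpy as np  # quantize_rtn as defined in the sketch above

def quantize_layer_mixed(W, importance, hi_bits=6, lo_bits=3, hi_frac=0.1):
    # top hi_frac most important columns get hi_bits, the rest lo_bits;
    # returns the quantized matrix and its average bits per weight
    n_cols = W.shape[1]
    n_hi = max(1, int(hi_frac * n_cols))
    is_hi = np.zeros(n_cols, dtype=bool)
    is_hi[np.argsort(importance)[-n_hi:]] = True
    Q = np.empty_like(W)
    for j in range(n_cols):
        Q[:, j] = quantize_rtn(W[:, j], hi_bits if is_hi[j] else lo_bits)
    avg_bpw = (hi_bits * n_hi + lo_bits * (n_cols - n_hi)) / n_cols
    return Q, avg_bpw
```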
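and one plausible shape for the selection step (the second sketch; the data layout and the binary search over an error cap are assumptions about how it could be done, not the actual implementation): given pre-measured (bits-per-weight, error) candidates per layer, binary-search the smallest max-error cap whose cheapest feasible per-layer picks still meet the target average bitrate.

```python
def select_settings(layers, target_bpw):
    # layers: [{"n_weights": int, "candidates": [(bpw, err), ...]}],
    # err measured beforehand by actually quantizing at that setting.
    # under a given error cap, each layer takes its cheapest candidate
    # with err <= cap; a cap is feasible if the weight-averaged bitrate
    # then meets the target. the smallest feasible cap minimizes the
    # maximum per-layer error over the whole model.
    caps = sorted({e for l in layers for _, e in l["candidates"]})

    def plan(cap):
        choice = []
        for l in layers:
            ok = [c for c in l["candidates"] if c[1] <= cap]
            if not ok:
                return None
            choice.append(min(ok))  # fewest bits among candidates under the cap
        return choice

    def avg_bpw(choice):
        total = sum(l["n_weights"] for l in layers)
        bits = sum(b * l["n_weights"] for (b, _), l in zip(choice, layers))
        return bits / total

    best, lo, hi = None, 0, len(caps) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = plan(caps[mid])
        if c is not None and avg_bpw(c) <= target_bpw:
            best, hi = c, mid - 1
        else:
            lo = mid + 1
    return best  # None if the target bitrate is unreachable

# e.g. select_settings([{"n_weights": 4096 * 4096,
#                        "candidates": [(2.6, 0.09), (4.2, 0.02), (6.3, 0.004)]},
#                       ...], target_bpw=4.0)
```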