backend quantization
what
- the official backend engine of PygmalionAI
- an LLM inference engine that integrates features from various projects
features
- Continuous Batching
- Efficient K/V management with PagedAttention from vLLM
- Optimized CUDA kernels for improved inference
- Quantization support via AQLM, AWQ, Bitsandbytes, EXL2, GGUF, GPTQ, [[QuIP#]], SmoothQuant+, and SqueezeLLM
- Distributed inference
- Variety of sampling methods (Mirostat, Locally Typical Sampling, Tail-Free Sampling, etc.)
- 8-bit KV cache for higher context lengths and throughput, in both FP8 and INT8 formats (see the sketch below)
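A minimal sketch of how a few of these features might be combined, assuming the engine described is PygmalionAI's Aphrodite Engine and that it exposes a vLLM-style Python API; the `aphrodite` module path, the `quantization` and `kv_cache_dtype` arguments, and the model checkpoint name are assumptions for illustration, not confirmed by this note.

```python
# Sketch: load an AWQ-quantized model with an FP8 KV cache and sample from it.
# Parameter names follow vLLM conventions and are assumptions here.
from aphrodite import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # hypothetical quantized checkpoint
    quantization="awq",        # assumed flag selecting the AWQ kernels
    kv_cache_dtype="fp8",      # assumed flag enabling the 8-bit KV cache
)

params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    # Mirostat / locally typical / tail-free knobs would also go here;
    # their exact parameter names are not confirmed by this note.
)

outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```

Under continuous batching, requests submitted concurrently are scheduled together, so the same call pattern scales to many prompts in one `generate` call.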
note