loader
Tools for loading LLM weights and running them on CPU and/or GPU; a brief usage sketch follows the table.
The columns fall into three groups: supported OS, supported hardware / backends (including offload style), and supported formats / quantization.

| Loader | macOS | Windows | Windows WSL | Linux | Android | iOS / iPadOS | Web | Apple Silicon | Apple Intel | Nvidia (CUDA) | AMD (ROCm) | Intel/AMD CPU | Intel Arc | Intel iGPU | offload | GGML | GGUF | GPTQ | AWQ | EXL2 | [[QuIP#]] | MLC | Hugging Face | safetensors | SqueezeLLM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| huggingface/transformers | + | + | + | + | + | + | + | + | + | + (PyTorch) | mem | + | + | ||||||||||||
| llama.cpp (llama-cpp-python) | + | + | + | + | ++ (1st class) | + | + | + | + (SYCL) | layer | - (dropped) | + |||||||||||||
| GPTQ-for-LLaMa (paused, use AutoGPTQ instead) | - | - | + | + | - | - | - | - | - | + | + | - | - | layer | + (LLaMa only) | ||||||||||
| AutoGPTQ | + | + | + | - | - | - | + | + | mem | y | ~ (indirect) | ~ (indirect) | |||||||||||||
| ExLlamaV2 | + | + | + | - | - | - | + | + | + | + | |||||||||||||||
| AutoAWQ | - | + | + | + | - | - | - | - | - | + | + | + | - | - | layer (accelerate) | + | |||||||||
| CTransformers | + (Metal, LLaMa 1/2 only) | + | + | + | ~ (Metal, LLaMa 1/2 only) | ~ (Metal, LLaMa 1/2 only) | ~ (limited) | ~ (limited ?) | + | layer | + | + | + | ||||||||||||
| [[QuIP#]] | + | + | + | + | + | + | ~ (indirect) | ? |||||||||||||||||
| MLC LLM | + (Metal) | + | + | + | + (OpenCL on Adreno, Mali) | + (Metal on A-series) | + (WebGPU, WASM) | + (Metal) | + (Metal) | + | + | + (Vulkan) | + (Vulkan, Metal) | + | ~ (indirect) | ? | |||||||||
| GPT4All | + | + | + | + | + | + | + | + | + (AVX/AVX2 instructions) | + (Vulkan) | + (Vulkan) | ~ (limited architectures) | |||||||||||||
| vLLM | + | - (uses GPU) | + | + | + | + | + | + | + | + | |||||||||||||||
| Aphrodite |
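
The "offload" column distinguishes memory-based offload ("mem", splitting weights between VRAM and system RAM) from per-layer GPU offload ("layer"). A minimal sketch of both styles, assuming `transformers`, `accelerate`, and `llama-cpp-python` are installed; the model id, GGUF path, and layer count below are placeholders, not recommendations.

```python
# 1) huggingface/transformers: "mem" offload via device_map="auto",
#    which lets Accelerate spread weights across GPU VRAM and system RAM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-model"  # placeholder Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))

# 2) llama-cpp-python: "layer" offload for GGUF files, moving the first
#    n_gpu_layers transformer layers onto the GPU and keeping the rest on CPU.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", n_gpu_layers=35)  # placeholder path/count
print(llm("Hello", max_tokens=32)["choices"][0]["text"])
```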