loader
Tools for loading LLMs and running them on CPU/GPU.
Columns are grouped as: OS (macOS … Web), Hardware (Apple Silicon … offload), Format / quantization (GGML … SqueezeLLM).

Loader | macOS | Windows | Windows WSL | Linux | Android | iOS / iPadOS | Web | Apple Silicon | Apple Intel | Nvidia (CUDA) | AMD (ROCm) | Intel/AMD CPU | Intel Arc | Intel iGPU | offload | GGML | GGUF | GPTQ | AWQ | EXL2 | [[QuIP#]] | MLC | Hugging Face | safetensors | SqueezeLLM
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
huggingface/transformers | + | + | + | + | + | + | + | + | + | + (PyTorch) | mem | + | + | ||||||||||||
llama-cpp (llama-cpp-python) | + | + | + | + | ++ (1st class) | + | + | + | + (SYCL) | layer | - (dropped) | + | |||||||||||||
GPTQ-for-LLaMa (paused, use AutoGPTQ instead) | - | - | + | + | - | - | - | - | - | + | + | - | - | layer | + (LLaMa only) | ||||||||||
AutoGPTQ | + | + | + | - | - | - | + | + | mem | y | ~ (indirect) | ~ (indirect) | |||||||||||||
ExLlamaV2 | + | + | + | - | - | - | + | + | + | + | |||||||||||||||
AutoAWQ | - | + | + | + | - | - | - | - | - | + | + | + | - | - | layer (accelerate) | + | |||||||||
CTransformers | + (Metal, LLaMa 1/2 only) | + | + | + | ~ (Metal, LLaMa 1/2 only) | ~ (Metal, LLaMa 1/2 only) | ~ (limited) | ~ (limited ?) | + | layer | + | + | + | ||||||||||||
[[QuIP#]] | + | + | + | + | + | + | ~ (indirect) | ?
MLC LLM | + (Metal) | + | + | + | + (OpenCL on Adreno, Mali) | + (Metal on A-series) | + (WebGPU, WASM) | + (Metal) | + (Metal) | + | + | + (Vulkan) | + (Vulkan, Metal) | + | ~ (indirect) | ? | |||||||||
GPT4All | + | + | + | + | + | + | + | + | + (AVX/AVX2 instructions) | + (Vulkan) | + (Vulkan) | ~ (limited architectures) | |||||||||||||
vLLM | + | - (uses GPU) | + | + | + | + | + | + | + | + | |||||||||||||||
Aphrodite |
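A minimal sketch of querying a slice of the matrix above programmatically. The `FORMATS` dict below is a hypothetical, hand-encoded subset of the format columns (only pairings I am confident of are included); it is not generated from the table.

```python
# Hypothetical encoding of part of the format/quantization columns above.
# Only well-known loader/format pairings are listed; "~" (partial/indirect)
# entries from the table are omitted.
FORMATS = {
    "huggingface/transformers": {"Hugging Face", "safetensors"},
    "llama-cpp": {"GGUF"},
    "AutoGPTQ": {"GPTQ"},
    "AutoAWQ": {"AWQ"},
    "ExLlamaV2": {"GPTQ", "EXL2"},
}

def loaders_for(fmt: str) -> list[str]:
    """Return loaders from the (partial) matrix that support a given format."""
    return sorted(name for name, fmts in FORMATS.items() if fmt in fmts)

print(loaders_for("GPTQ"))  # → ['AutoGPTQ', 'ExLlamaV2']
```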