loader
Tools for loading LLM weights and running them on CPU and/or GPU; a brief usage sketch follows the table.
The columns fall into three groups: supported OS, supported hardware / backends (including offload style), and supported formats / quantization.

| Loader | macOS | Windows | Windows WSL | Linux | Android | iOS / iPadOS | Web | Apple Silicon | Apple Intel | Nvidia (CUDA) | AMD (ROCm) | Intel/AMD CPU | Intel Arc | Intel iGPU | offload | GGML | GGUF | GPTQ | AWQ | EXL2 | [[QuIP#]] | MLC | Hugging Face | safetensors | SqueezeLLM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| huggingface/transformers | + | + | + | + | + | + | + | + | + | + (PyTorch) | mem | + | + | ||||||||||||
| llama.cpp (llama-cpp-python) | + | + | + | + | ++ (1st class) | + | + | + | + (SYCL) | layer | - (dropped) | + |||||||||||||
| GPTQ-for-LLaMa (paused, use AutoGPTQ instead) | - | - | + | + | - | - | - | - | - | + | + | - | - | layer | + (LLaMa only) | ||||||||||
| AutoGPTQ | + | + | + | - | - | - | + | + | mem | y | ~ (indirect) | ~ (indirect) | |||||||||||||
| ExLlamaV2 | + | + | + | - | - | - | + | + | + | + | |||||||||||||||
| AutoAWQ | - | + | + | + | - | - | - | - | - | + | + | + | - | - | layer (accelerate) | + | |||||||||
| CTransformers | + (Metal, LLaMa 1/2 only) | + | + | + | ~ (Metal, LLaMa 1/2 only) | ~ (Metal, LLaMa 1/2 only) | ~ (limited) | ~ (limited ?) | + | layer | + | + | + | ||||||||||||
| [[QuIP#]] | + | + | + | + | + | + | ~ (indirect) | ? |||||||||||||||||
| MLC LLM | + (Metal) | + | + | + | + (OpenCL on Adreno, Mali) | + (Metal on A-series) | + (WebGPU, WASM) | + (Metal) | + (Metal) | + | + | + (Vulkan) | + (Vulkan, Metal) | + | ~ (indirect) | ? | |||||||||
| GPT4All | + | + | + | + | + | + | + | + | + (AVX/AVX2 instructions) | + (Vulkan) | + (Vulkan) | ~ (limited architectures) | |||||||||||||
| vLLM | + | - (uses GPU) | + | + | + | + | + | + | + | + | |||||||||||||||
| Aphrodite |
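
The "offload" column distinguishes memory-based offload ("mem", splitting weights between VRAM and system RAM) from per-layer GPU offload ("layer"). A minimal sketch of both styles, assuming `transformers`, `accelerate`, and `llama-cpp-python` are installed; the model id, GGUF path, and layer count below are placeholders, not recommendations.

```python
# 1) huggingface/transformers: "mem" offload via device_map="auto",
#    which lets Accelerate spread weights across GPU VRAM and system RAM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-model"  # placeholder Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))

# 2) llama-cpp-python: "layer" offload for GGUF files, moving the first
#    n_gpu_layers transformer layers onto the GPU and keeping the rest on CPU.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", n_gpu_layers=35)  # placeholder path/count
print(llm("Hello", max_tokens=32)["choices"][0]["text"])
```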