backend inference

about

  • vLLM is a fast and easy-to-use library for LLM inference and serving.

features

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests (illustrated in the sketch after this list)
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
  • Optimized CUDA kernels
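
As a rough illustration of how these features surface to the user, here is a minimal offline-inference sketch: a small batch of prompts goes through the LLM API, with continuous batching and PagedAttention applied internally. The model name, sampling values, and the optional quantization argument are example choices, not prescribed by this document.

```python
# Minimal offline-inference sketch; "facebook/opt-125m" is only an example model.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

# Sampling settings are illustrative; vLLM batches the prompts internally
# (continuous batching) and manages the KV cache with PagedAttention.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# For quantized checkpoints, pass e.g. quantization="awq" or quantization="gptq".
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```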

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the client sketch after this list)
  • Support for NVIDIA GPUs and AMD GPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-LoRA support
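
A minimal client-side sketch for the OpenAI-compatible server, assuming the server was started with python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m and is listening on the default port 8000; the model name, prompt, and sampling values are placeholders.

```python
# Client sketch: assumes a vLLM server launched with
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# listening on the default port 8000. Uses the official openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real API key by default
)

# stream=True demonstrates the streaming outputs feature; drop it to get
# a single complete response instead.
stream = client.completions.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=64,
    temperature=0.8,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```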

vLLM seamlessly supports many popular Hugging Face models.
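
As a short sketch of that integration, any supported Hugging Face Hub ID (or a local checkpoint path) can be passed as the model name; the model IDs below are examples only, and the tensor-parallel line assumes a machine with at least two GPUs.

```python
# Sketch only: model IDs are examples; tensor_parallel_size=2 assumes
# at least two GPUs are available.
from vllm import LLM

# Weights and tokenizer are pulled from the Hugging Face Hub on first use;
# a local path to a downloaded checkpoint works the same way.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# For models too large for a single GPU, shard the weights across devices
# with tensor parallelism (distributed inference).
# llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=2)
```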