Large language models (LLMs) have transformed a wide range of domains, but deploying them in real-world applications remains challenging because of their high computational demands. This is where vLLM steps in. vLLM stands for Virtual Large Language Model and is an actively developed open-source library for efficient LLM inference and model serving.
vLLM was first introduced in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention by Kwon et al. The paper identifies memory allocation as a central challenge in serving LLMs and measures its impact on performance. In particular, it highlights how inefficient management of Key-Value (KV) cache memory in existing LLM serving systems leads to slow inference and a high memory footprint.
To address this, the paper presents PagedAttention, an attention algorithm inspired by virtual memory and paging techniques commonly used in operating systems. PagedAttention enables efficient memory management by allowing for non-contiguous storage of attention keys and values. Following this idea, the paper develops vLLM, a high-throughput distributed LLM serving engine that is built on PagedAttention. vLLM achieves near-zero waste in KV cache memory, significantly improving serving performance. Moreover, leveraging techniques like virtual memory and copy-on-write, vLLM efficiently manages the KV cache and handles various decoding algorithms. This results in 2-4 times throughput improvements compared to state-of-the-art systems such as FasterTransformer and Orca. This improvement is especially noticeable with longer sequences, larger models, and complex decoding algorithms.
What is the core idea in vLLM?
PagedAttention
The attention mechanism allows LLMs to focus on relevant parts of the input sequence while generating output. Computing attention requires the key and value (KV) vectors of all previously processed tokens, which are kept in a KV cache. Existing systems store these KV pairs in contiguous memory, which limits memory sharing and leads to inefficient memory management.
PagedAttention is an attention algorithm inspired by the concept of paging in operating systems. It allows the KV pairs of a contiguous sequence to be stored in non-contiguous memory by partitioning each sequence's KV cache into fixed-size blocks, with a per-sequence block table mapping logical blocks to physical ones. This enables flexible management of KV vectors across layers and attention heads, optimizing memory usage, reducing fragmentation, and minimizing redundant duplication.
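To make the idea concrete, here is a minimal, purely illustrative Python sketch of a block table (not vLLM's actual implementation; the block size and pool size are just example values). Each sequence's logical KV blocks are mapped to whichever physical blocks happen to be free, so the cache never needs to be contiguous:

# Illustrative sketch only -- a toy block table, not vLLM's internal code.
# A sequence's KV cache is split into fixed-size blocks; a per-sequence block
# table maps logical block positions to arbitrary physical blocks.

BLOCK_SIZE = 16                      # tokens per KV block (illustrative value)
free_blocks = list(range(1024))      # pool of free physical block ids
physical_blocks = {}                 # physical block id -> list of (key, value)
block_tables = {}                    # sequence id -> list of physical block ids

def append_kv(seq_id, key, value):
    """Append one token's KV pair, allocating a new physical block on demand."""
    table = block_tables.setdefault(seq_id, [])
    if not table or len(physical_blocks[table[-1]]) == BLOCK_SIZE:
        block_id = free_blocks.pop()           # any free block; non-contiguous
        physical_blocks[block_id] = []
        table.append(block_id)
    physical_blocks[table[-1]].append((key, value))

# Two sequences grow independently; their blocks interleave in physical memory.
for step in range(40):
    append_kv("seq_a", f"k{step}", f"v{step}")
    append_kv("seq_b", f"k{step}", f"v{step}")
print(block_tables["seq_a"], block_tables["seq_b"])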
What are the other techniques used in vLLM for efficient serving?
vLLM doesn’t stop at PagedAttention. It incorporates a suite of techniques to further optimize LLM serving.
- Continuous Batching: Incoming requests are continuously batched together to maximize hardware utilization and reduce computing waste, minimizing idle time.
- Quantization: vLLM supports reduced-precision formats such as FP16, representing the model and its KV cache in fewer bits, which shrinks the memory footprint and speeds up computation (see the configuration sketch after this list).
- Optimized CUDA Kernels: vLLM hand-tunes the code executed on the GPU for maximum performance. For example, for fused reshape and block write, optimized kernels are developed to split the new KV cache into blocks, reshape them for efficient memory access, and save them based on a block table, all fused into a single kernel to reduce overheads.
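As a small illustration of the quantization point above, reduced precision can typically be requested when constructing the engine. The snippet below is a sketch assuming a recent vLLM release; the dtype argument name and accepted values may differ across versions.

from vllm import LLM

# Sketch: load the model in half precision (FP16) so that weights, activations,
# and the KV cache take roughly half the memory of FP32.
# Argument names follow recent vLLM releases and may differ across versions.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dtype="float16",
)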
How to use vLLM?
vLLM is easy to use. Here is a glimpse of how it can be used in Python:
One can install vLLM via pip:
# (Recommended) Create a new conda environment.
conda create -n myenv python=3.9 -y
conda activate myenv
# Install vLLM with CUDA 12.1.
pip install vllm
Offline inference
Then import the vLLM module into your code and run offline inference with vLLM's engine. The LLM class initializes the vLLM engine with a specific built-in LLM model; models are downloaded from Hugging Face by default. The SamplingParams class sets the parameters for inference.
from vllm import LLM, SamplingParams
Next, define an input prompt and the sampling parameters, then initialize vLLM's engine for offline inference with the LLM class and a model name:
prompts = ["The future of humanity is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
Finally, the output/response can be generated by:
responses = llm.generate(prompts, sampling_params)
print(f"Prompt: {responses[0].prompt!r}, Generated text: {responses[0].outputs[0].text!r}")
The code example can be found here.
Online serving
For online serving, vLLM provides a server that exposes OpenAI-compatible Completions and Chat APIs. The server can be started with Python:
python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct
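Once the server is running (by default on http://localhost:8000), it can be queried with any OpenAI-compatible client. The snippet below is a sketch that assumes the openai Python package is installed and the server is using its default port; the API key is a placeholder since vLLM does not require one by default.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is unused
# by default, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="The future of humanity is",
    max_tokens=64,
)
print(completion.choices[0].text)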