In this blog post, the MosaicML engineering team shares best practices for how to capitalize on popular open source large language models (LLMs) for production usage. We also provide guidelines for deploying inference services built around these models, to help users select models and deployment hardware. We have worked with multiple PyTorch-based backends in production; these guidelines are drawn from our experience with FasterTransformer, vLLM, NVIDIA’s soon-to-be-released TensorRT-LLM, and others.
Understanding LLM Text Generation
Large Language Models (LLMs) generate text in a two-step process: “prefill”, where the tokens in the input prompt are processed in parallel, and “decoding”, where text is generated one ‘token’ at a time in an autoregressive manner. Each generated token is appended to the input and fed back into the model to generate the next token. Generation stops when the LLM outputs a special stop token or when a user-defined condition is met (e.g., some maximum number of tokens has been generated). If you’d like more background on how LLMs use decoder blocks, check out this blog post.
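To make the prefill/decode split concrete, here is a minimal sketch of greedy autoregressive generation using the Hugging Face transformers library. The checkpoint name and token limit are placeholders, and the loop naively re-feeds the full sequence at every step (production servers avoid this recomputation with KV caching, discussed below).

```python
# Minimal sketch of prefill + autoregressive (greedy) decoding.
# Assumes the Hugging Face `transformers` library; "gpt2" is only a small
# placeholder checkpoint -- any causal LM behaves the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
max_new_tokens = 32

with torch.no_grad():
    for _ in range(max_new_tokens):
        # Prefill on the first pass, then one decode step per new token:
        # the whole sequence (prompt + generated so far) is fed back in.
        logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:  # stop token
            break

print(tokenizer.decode(input_ids[0]))
```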
Tokens can be words or sub-words; the exact rules for splitting text into tokens vary from model to model. For instance, you can compare how Llama models tokenize text to how OpenAI models tokenize text. Although LLM inference providers often talk about performance in token-based metrics (e.g., tokens/second), these numbers are not always comparable across model types given these variations. For a concrete example, the team at Anyscale found that Llama 2 tokenization is 19% longer than ChatGPT tokenization (but still has a much lower overall cost). Researchers at Hugging Face similarly found that Llama 2 required ~20% more tokens than GPT-4 to train over the same amount of text.
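If you want to run this comparison yourself, a sketch along these lines works. The model IDs are illustrative; the Llama 2 tokenizer is gated on the Hugging Face Hub, so substitute any tokenizer you have access to.

```python
# Sketch: comparing token counts across tokenizers.
import tiktoken
from transformers import AutoTokenizer

text = "Memory bandwidth is the key bottleneck for LLM inference."

# Gated repo -- requires access approval; swap in another tokenizer if needed.
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
openai_enc = tiktoken.get_encoding("cl100k_base")  # used by GPT-3.5/GPT-4

n_llama = len(llama_tok.encode(text))
n_openai = len(openai_enc.encode(text))
print(f"Llama 2: {n_llama} tokens, cl100k_base: {n_openai} tokens")
```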
Important Metrics for LLM Serving
So, how exactly should we think about inference speed?
Our team uses four key metrics for LLM serving:
- Time To First Token (TTFT): How quickly users start seeing the model’s output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
- Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds to how each user perceives the “speed” of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
- Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated). A short worked calculation is sketched just after this list.
- Throughput: The number of output tokens per second an inference server can generate across all users and requests.
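As a back-of-the-envelope illustration of how these metrics relate, here is the latency formula applied to some made-up numbers (none of these are measurements):

```python
# Back-of-the-envelope latency/throughput math from the metrics above.
ttft_s = 0.5          # time to first token, seconds
tpot_s = 0.1          # time per output token, seconds (10 tok/s per user)
output_tokens = 250   # expected response length

latency_s = ttft_s + tpot_s * output_tokens
per_user_tok_per_s = 1.0 / tpot_s

concurrent_users = 16
# Upper bound if decoding for every user ran at the same per-user rate;
# in practice TPOT degrades as the batch grows (the tradeoff noted below).
aggregate_throughput = concurrent_users * per_user_tok_per_s

print(f"latency ~= {latency_s:.1f} s")
print(f"per-user speed ~= {per_user_tok_per_s:.0f} tokens/s")
print(f"aggregate throughput <= {aggregate_throughput:.0f} tokens/s")
```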
Our goal? The fastest time to first token, the highest throughput, and the quickest time per output token. In other words, we want our models to generate text as fast as possible for as many users as we can support.
Notably, there is a tradeoff between throughput and time per output token: if we process 16 user queries concurrently, we’ll have higher throughput compared to running the queries sequentially, but we’ll take longer to generate output tokens for each user.
If you have overall inference latency targets, here are some useful heuristics for evaluating models:
- Output length dominates overall response latency: For average latency, you can usually just take your expected/max output token length and multiply it by an overall average time per output token for the model.
- Input length is not significant for performance but important for hardware requirements: The addition of 512 input tokens increases latency less than the production of 8 additional output tokens in the MPT models. However, the need to support long inputs can make models harder to serve. For example, we recommend using the A100-80GB (or newer) to serve MPT-7B with its maximum context length of 2048 tokens.
- Overall latency scales sub-linearly with model size: On the same hardware, larger models are slower, but the speed ratio won’t necessarily match the parameter count ratio. MPT-30B latency is ~2.5x that of MPT-7B latency. Llama2-70B latency is ~2x that of Llama2-13B latency.
We are often asked by prospective customers to provide an average inference latency. We recommend that before you anchor yourself to specific latency targets (“we need less than 20 ms per token”), you should spend some time characterizing your expected input and desired output lengths.
Challenges in LLM Inference
Optimizing LLM inference benefits from general techniques such as:
- Operator Fusion: Combining different adjacent operators together often results in better latency.
- Quantization: Activations and weights are compressed to use a smaller number of bits (a minimal sketch follows this list).
- Compression: Sparsity or Distillation.
- Parallelization: Tensor parallelism across multiple devices or pipeline parallelism for larger models.
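As one concrete example from the list above, here is a minimal sketch of symmetric per-tensor int8 weight quantization. Production stacks typically use finer-grained schemes (per-channel or group-wise, e.g. GPTQ or AWQ), but the memory-savings mechanism is the same.

```python
# Sketch of symmetric per-tensor int8 weight quantization.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0            # map the float range onto int8
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                  # an example weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("memory:", w.numel() * 4, "->", q.numel(), "bytes")   # 4x smaller than fp32
print("max abs error:", (w - w_hat).abs().max().item())
```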
Beyond these methods, there are many important Transformer-specific optimizations. A prime example of this is KV (key-value) caching. The Attention mechanism in decoder-only Transformer-based models is computationally inefficient. Each token attends to all previously seen tokens, and thus recomputes many of the same values as each new token is generated. For example, while generating the Nth token, the (N-1)th token attends to (N-2)th, (N-3)th … 1st tokens. Similarly, while generating (N+1)th token, attention for the Nth token again needs to look at the (N-1)th, (N-2)th, (N-3)th, … 1st tokens. KV caching, i.e., saving of intermediate keys/values for the attention layers, is used to preserve those results for later reuse, avoiding repeated computation.
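Here is a minimal single-head, single-sequence sketch of the idea. Real implementations maintain a cache per layer and per attention head, and batch it across requests.

```python
# Sketch of KV caching for one attention head and one sequence.
# Uses the standard scaled-dot-product attention math.
import torch

d = 64                       # head dimension
k_cache = torch.empty(0, d)  # keys of all previously seen tokens
v_cache = torch.empty(0, d)  # values of all previously seen tokens

def decode_step(q_new, k_new, v_new):
    """Attend the newest token's query against all cached keys/values."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new])   # reuse old K/V, append only the new ones
    v_cache = torch.cat([v_cache, v_new])
    scores = (q_new @ k_cache.T) / d**0.5   # shape (1, t)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache                # shape (1, d)

# One decode step: only the new token's K/V are computed; everything from
# earlier steps is read back from the cache instead of being recomputed.
out = decode_step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d))
print(out.shape)  # torch.Size([1, 64])
```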
Memory Bandwidth is Key
Computations in LLMs are mainly dominated by matrix-matrix multiplication operations; at small dimensions, these operations are typically memory-bandwidth-bound on most hardware. When generating tokens in an autoregressive manner, one of the activation matrix dimensions (defined by batch size and number of tokens in the sequence) is small at small batch sizes. Therefore, speed depends on how quickly we can load model parameters from GPU memory into local caches/registers, rather than on how quickly we can compute on the loaded data. Available and achieved memory bandwidth in inference hardware is a better predictor of token generation speed than its peak compute performance.
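A rough roofline calculation makes this concrete. The hardware numbers below are approximate A100-80GB specs, and the model size (a 7B-parameter model served in 16-bit precision) is illustrative:

```python
# Back-of-the-envelope roofline for a single decode step at batch size 1.
params = 7e9
bytes_per_param = 2                 # fp16/bf16
peak_bw = 2.0e12                    # ~2 TB/s HBM bandwidth (approx. A100-80GB)
peak_flops = 312e12                 # ~312 TFLOPS dense bf16 (approx. A100)

weight_bytes = params * bytes_per_param
# Each decode step performs roughly 2 FLOPs per parameter per sequence.
batch = 1
flops = 2 * params * batch

t_memory = weight_bytes / peak_bw   # time just to stream the weights
t_compute = flops / peak_flops      # time just to do the math

print(f"memory-bound time : {t_memory*1e3:.2f} ms")   # ~7 ms
print(f"compute-bound time: {t_compute*1e3:.2f} ms")  # ~0.04 ms
# Memory time dominates by roughly two orders of magnitude at batch size 1,
# which is why decode speed tracks memory bandwidth, not peak FLOPS.
```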
Inference hardware utilization is very important in terms of serving costs. GPUs are expensive and we need them to do as much work as possible. Shared inference services promise to keep costs low by combining workloads from many users, filling in individual gaps and batching together overlapping requests. For large models like Llama2-70B, we only achieve good cost/performance at large batch sizes. Having an inference serving system that can operate at large batch sizes is critical for cost efficiency. However, a large batch means larger KV cache size, and that in turn increases the number of GPUs required to serve the model. There’s a tug-of-war here and shared service operators need to make some cost trade-offs and implement systems optimizations.
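To see why the KV cache drives hardware requirements at large batch sizes, here is a rough sizing sketch using approximate Llama2-70B shapes (80 layers, 8 grouped-query KV heads, head dimension 128) in 16-bit precision; treat the exact numbers as illustrative.

```python
# Rough KV cache sizing as batch size grows.
def kv_cache_bytes(batch, seq_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2x for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

for batch in (1, 16, 64, 128):
    gb = kv_cache_bytes(batch, seq_len=4096) / 1e9
    print(f"batch {batch:>3}: ~{gb:.1f} GB of KV cache")
# ~1.3 GB per 4096-token sequence quickly adds up to more memory than a
# single GPU has once the batch grows, forcing more GPUs per replica.
```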
Model Bandwidth Utilization (MBU)
How optimized is an LLM inference server?
As briefly explained earlier, inference for LLMs at smaller batch sizes—especially at decode time—is bottlenecked on how quickly we can load model parameters from the device memory to the compute units. Memory bandwidth dictates how quickly the data movement happens. To measure the underlying hardware’s utilization, we introduce a new metric called Model Bandwidth Utilization (MBU).
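As a preview, here is a sketch that assumes MBU = (achieved memory bandwidth) / (peak memory bandwidth), with achieved bandwidth estimated as the bytes moved per decode step divided by the measured time per output token; all inputs below are illustrative.

```python
# Sketch of an MBU estimate under the assumption stated above.
param_bytes = 7e9 * 2        # 7B params in 16-bit precision
kv_cache_bytes = 2e9         # KV cache read per step (illustrative)
tpot_s = 0.015               # measured time per output token, seconds
peak_bw = 2.0e12             # ~2 TB/s (approximate A100-80GB HBM)

achieved_bw = (param_bytes + kv_cache_bytes) / tpot_s
mbu = achieved_bw / peak_bw
print(f"achieved ~= {achieved_bw/1e12:.2f} TB/s, MBU ~= {mbu:.0%}")
```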