Understanding the theory behind PagedAttention and continuous batching is essential, but implementing these techniques from scratch would require months of engineering effort and deep expertise in GPU programming. Fortunately, several production-ready frameworks have emerged that implement these optimizations and provide simple APIs for serving LLMs at scale.
In this chapter, we'll explore the three leading LLM serving frameworks: vLLM, Text Generation Inference (TGI), and TensorRT-LLM. We'll compare their strengths, walk through practical deployment examples, and provide guidance on choosing the right framework for your use case.
vLLM: The PagedAttention Pioneer
vLLM is an open-source inference and serving engine developed by researchers at UC Berkeley. It is the original implementation of PagedAttention and has quickly become the de facto standard for high-throughput LLM serving.
Key Features
- PagedAttention: near-elimination of KV-cache memory waste, enabling much larger effective batch sizes
- Continuous batching: requests join and leave the running batch dynamically for high GPU utilization
- OpenAI-compatible server: change your client's base URL from https://api.openai.com/v1 to http://your-server:8000/v1 and you're serving from your own infrastructure
- Streaming: token-by-token responses on both the completion and chat endpoints
- Quantization: AWQ and GPTQ support for reduced memory footprint
- Multi-GPU: tensor parallelism for models that don't fit on a single GPU
Installation and Setup
# Install vLLM (requires CUDA 11.8+, Python 3.8-3.11)
pip install vllm
# Or with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
# Verify installation
python -c "import vllm; print(vllm.__version__)"
Basic Usage: Python API
from vllm import LLM, SamplingParams
# Initialize the model
# This loads the model and prepares the vLLM engine
llm = LLM(
model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=1, # Number of GPUs
dtype="float16",
max_model_len=2048,
)
# Define sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
)
# Single generation
prompts = ["Explain quantum computing in simple terms."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
generated_text = output.outputs[0].text
print(f"Generated: {generated_text}")
# Batch generation (much more efficient!)
prompts = [
"Explain quantum computing in simple terms.",
"What is the capital of France?",
"Write a haiku about AI.",
"How do neural networks work?",
]
# vLLM automatically batches and optimizes these
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print(f"\nPrompt: {prompt}")
print(f"Response: {output.outputs[0].text}")
print(f"Tokens generated: {len(output.outputs[0].token_ids)}")
print(f"Time: {output.metrics.finished_time - output.metrics.started_time:.2f}s")
OpenAI-Compatible Server
# Start vLLM server with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--dtype float16
# Server provides endpoints:
# - POST /v1/completions (text completion)
# - POST /v1/chat/completions (chat completion)
# - GET /v1/models (list models)
# - GET /health (health check)
# Client code (drop-in OpenAI replacement)
from openai import OpenAI
# Point to your vLLM server instead of OpenAI
client = OpenAI(
api_key="EMPTY", # vLLM doesn't require API key
base_url="http://localhost:8000/v1",
)
# Use exactly like OpenAI API
response = client.chat.completions.create(
model="meta-llama/Llama-2-7b-chat-hf",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain PagedAttention."}
],
temperature=0.7,
max_tokens=300,
)
print(response.choices[0].message.content)
# Streaming example
stream = client.chat.completions.create(
model="meta-llama/Llama-2-7b-chat-hf",
messages=[
{"role": "user", "content": "Write a short story about AI."}
],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
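The /v1/completions endpoint listed above works the same way for plain text completion. Here's a minimal sketch reusing the client created above (the prompt is illustrative):
# Plain text completion via /v1/completions
completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="List three benefits of PagedAttention:",
    max_tokens=150,
    temperature=0.7,
)
print(completion.choices[0].text)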
Advanced Configuration
from vllm import LLM, SamplingParams
# Production configuration with optimizations
llm = LLM(
model="meta-llama/Llama-2-13b-chat-hf",
# GPU configuration
tensor_parallel_size=2, # Use 2 GPUs with tensor parallelism
gpu_memory_utilization=0.90, # Give vLLM 90% of GPU memory (model weights + KV cache)
# Performance tuning
max_num_batched_tokens=8192, # Max tokens processed in one batch
max_num_seqs=256, # Max concurrent sequences
max_model_len=4096, # Max sequence length
# Quantization for speed and memory
quantization="awq", # Use AWQ quantization
dtype="float16",
# Enable optimizations
enforce_eager=False, # Use CUDA graphs for faster inference
trust_remote_code=True, # Required for some custom models
)
# Advanced sampling parameters
sampling_params = SamplingParams(
# Sampling strategy
temperature=0.8,
top_p=0.95,
top_k=50,
# Generation control
max_tokens=1024,
min_tokens=10,
stop=["", "\n\n\n"],
# Frequency and presence penalties
frequency_penalty=0.1,
presence_penalty=0.1,
# Multiple outputs per prompt
n=3, # Generate 3 different completions
best_of=5, # Sample 5 candidates and return the best n=3
# (streaming is handled by the server/async engine, not by SamplingParams)
)
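With n=3, each returned RequestOutput carries three candidate completions under output.outputs, drawn from the best_of=5 samples. A minimal sketch of consuming them (the prompt is illustrative):
# Inspect all n candidate completions per prompt
outputs = llm.generate(["Suggest a name for an AI startup."], sampling_params)
for output in outputs:
    for i, candidate in enumerate(output.outputs):
        print(f"--- Candidate {i + 1} ---")
        print(candidate.text)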
Text Generation Inference (TGI): Hugging Face's Solution
Text Generation Inference (TGI) is Hugging Face's production-grade serving framework. It powers inference for many of Hugging Face's hosted models and provides excellent integration with the Hugging Face ecosystem.
Key Features
- Hugging Face native: Seamless integration with Hub models, works with any Transformers-compatible model
- Continuous batching: Dynamic batching for high throughput
- Token streaming: Server-Sent Events (SSE) for real-time streaming
- Safetensors support: Fast and secure model loading
- Quantization: Built-in support for bitsandbytes, GPTQ, AWQ
- Flash Attention: Optimized attention implementation for speed
Installation (Docker Recommended)
# Pull TGI Docker image
docker pull ghcr.io/huggingface/text-generation-inference:latest
# Run TGI server
docker run -d --gpus all \
--shm-size 1g \
-p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-2-7b-chat-hf \
--num-shard 1 \
--max-concurrent-requests 128 \
--max-total-tokens 4096
# Check server health
curl http://localhost:8080/health
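With the server up, you can also exercise TGI's native REST API directly. A minimal sketch with requests, assuming the default /generate and /generate_stream endpoints on the port mapped above:
import json
import requests

# Non-streaming generation via TGI's native /generate endpoint
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain machine learning in simple terms.",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
)
print(resp.json()["generated_text"])

# Streaming via /generate_stream (Server-Sent Events)
with requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "Write a haiku about GPUs.", "parameters": {"max_new_tokens": 60}},
    stream=True,
) as stream:
    for line in stream.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)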
Client Usage
from huggingface_hub import InferenceClient
# Connect to TGI server
client = InferenceClient(model="http://localhost:8080")
# Text generation
prompt = "Explain machine learning in simple terms."
response = client.text_generation(
prompt,
max_new_tokens=500,
temperature=0.7,
top_p=0.95,
repetition_penalty=1.1,
)
print(response)
# Streaming generation
for token in client.text_generation(
prompt,
max_new_tokens=500,
stream=True,
):
print(token, end="", flush=True)
# Chat completion format (for chat models)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the meaning of life?"}
]
response = client.chat_completion(
messages=messages,
max_tokens=300,
temperature=0.8,
)
print(response.choices[0].message.content)
TGI vs vLLM: When to Choose TGI
| Use Case | TGI Strengths |
|---|---|
| Hugging Face Ecosystem | Native integration, can serve any HF model instantly |
| Production Deployment | Battle-tested, powers HF Inference API |
| Docker/Kubernetes | Official Docker images, easy container deployment |
| Multi-model Serving | Can easily switch between models without code changes |
⚖️ Performance Comparison
In benchmarks, vLLM typically achieves 20-30% higher throughput than TGI due to its more aggressive PagedAttention optimizations. However, TGI offers better ecosystem integration and is easier to deploy in containerized environments. For maximum performance, choose vLLM. For maximum compatibility and ease of deployment, choose TGI.
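Throughput comparisons like this depend heavily on model, hardware, and traffic shape, so measure on your own workload before committing. A rough benchmarking sketch against any OpenAI-compatible endpoint (the URL, model name, concurrency, and prompts are all illustrative placeholders):
import asyncio
import time
from openai import AsyncOpenAI

async def one_request(client, prompt):
    # Send a single chat request and return the number of output tokens
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens if resp.usage else 0

async def benchmark(concurrency=32, total=128):
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    sem = asyncio.Semaphore(concurrency)

    async def bounded(i):
        async with sem:
            return await one_request(client, f"Summarize request {i} in one sentence.")

    start = time.perf_counter()
    token_counts = await asyncio.gather(*[bounded(i) for i in range(total)])
    elapsed = time.perf_counter() - start
    print(f"{total / elapsed:.1f} req/s, {sum(token_counts) / elapsed:.0f} output tokens/s")

asyncio.run(benchmark())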
TensorRT-LLM: NVIDIA's Performance Champion
TensorRT-LLM is NVIDIA's framework for optimizing and serving LLMs on NVIDIA GPUs. It provides the absolute highest performance but requires more setup and is less flexible than vLLM or TGI.
Key Features
- Maximum performance: 2-3x faster than vLLM for certain models and hardware configurations
- Advanced optimizations: Custom CUDA kernels, graph optimizations, kernel fusion
- Multi-GPU support: Tensor parallelism, pipeline parallelism, expert parallelism (for MoE models)
- In-flight batching: NVIDIA's implementation of continuous batching
- Quantization: INT8, INT4, FP8 support with minimal accuracy loss
When to Use TensorRT-LLM
Reach for TensorRT-LLM when you need the absolute maximum throughput on NVIDIA data-center GPUs and can afford the engineering time for per-model compilation; the decision matrix later in this chapter lays out the criteria in detail.
Installation (Requires NVIDIA GPUs)
# TensorRT-LLM requires Docker for easiest setup
docker pull nvcr.io/nvidia/tensorrt-llm:latest
# Or install from source (complex)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install -r requirements.txt
python setup.py install
Model Compilation (Required)
Unlike vLLM and TGI, TensorRT-LLM requires you to compile your model into an optimized engine before serving. This is a one-time process per model:
# Convert Hugging Face model to TensorRT-LLM format
python convert_checkpoint.py \
--model_dir meta-llama/Llama-2-7b-chat-hf \
--output_dir ./llama-2-7b-trt \
--dtype float16
# Build TensorRT engine
trtllm-build \
--checkpoint_dir ./llama-2-7b-trt \
--output_dir ./llama-2-7b-engine \
--gemm_plugin float16 \
--max_batch_size 256 \
--max_input_len 2048 \
--max_output_len 1024
# This creates optimized engine files that are ready to serve
Serving with Triton Inference Server
# TensorRT-LLM integrates with NVIDIA Triton for serving
# Start Triton server with TensorRT-LLM backend
docker run -d --gpus all \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
-v ${PWD}/llama-2-7b-engine:/models \
nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
# Note: the mounted directory must be laid out as a Triton model repository
# (config.pbtxt plus the compiled engine) for the tensorrt_llm backend to load it
# Client code using Triton
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

client = httpclient.InferenceServerClient(url="localhost:8000")

# Tokenize the prompt with the model's own tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
prompt = "Explain quantum entanglement."
input_ids = np.array([tokenizer.encode(prompt)], dtype=np.int32)

# Prepare input tensor (tensor names and shapes depend on your Triton model config)
inputs = [httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")]
inputs[0].set_data_from_numpy(input_ids)

# Request inference
outputs = [httpclient.InferRequestedOutput("output_ids")]
response = client.infer("tensorrt_llm", inputs, outputs=outputs)

# Decode output tokens back to text
output_ids = response.as_numpy("output_ids")
generated_text = tokenizer.decode(output_ids.flatten().tolist(), skip_special_tokens=True)
print(generated_text)
Performance Benchmarks
| Model | Framework | Throughput (req/sec) | Latency (ms) |
|---|---|---|---|
| Llama 2 7B | TensorRT-LLM | 42 | 85 |
| Llama 2 7B | vLLM | 35 | 102 |
| Llama 2 7B | TGI | 28 | 125 |
| Llama 2 7B | Naive (HF) | 2 | 1200+ |
Benchmark conditions: A100 80GB GPU, batch size=32, input=512 tokens, output=128 tokens
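To turn the table into capacity-planning numbers, a quick back-of-the-envelope calculation helps; the throughput figures come from the table above, while the utilization factor is an assumption:
# Rough capacity planning from the benchmark table above
throughput_rps = {"TensorRT-LLM": 42, "vLLM": 35, "TGI": 28, "Naive (HF)": 2}
output_tokens_per_request = 128   # matches the benchmark conditions above
sustained_utilization = 0.6       # assumption: leave headroom for traffic spikes

for framework, rps in throughput_rps.items():
    tokens_per_sec = rps * output_tokens_per_request
    requests_per_day = int(rps * sustained_utilization * 86_400)
    print(f"{framework}: ~{tokens_per_sec:,} output tokens/s at peak, "
          f"~{requests_per_day:,} requests/day at {sustained_utilization:.0%} utilization")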
Choosing the Right Framework
Each framework has distinct strengths. Here's a comprehensive comparison to guide your decision:
| Criteria | vLLM | TGI | TensorRT-LLM |
|---|---|---|---|
| Performance | Excellent (95/100) | Very Good (85/100) | Best (100/100) |
| Ease of Use | Excellent | Excellent | Moderate |
| Model Support | 50+ models | Any HF model | Manual porting |
| Setup Complexity | pip install | Docker run | Complex build |
| Deployment | Simple | Very Simple | Moderate |
| OpenAI Compatible | Yes | Yes | No (Triton API) |
| Quantization | AWQ, GPTQ | AWQ, GPTQ, bitsandbytes | INT8, INT4, FP8 |
| Multi-GPU | Tensor parallel | Tensor parallel | All strategies |
| Community | Very Active | Very Active | Moderate |
| Cost (Setup Time) | 1-2 hours | 1-2 hours | 1-2 days |
Decision Matrix
Choose vLLM if:
- ✅ You want the best performance-to-ease ratio
- ✅ You need OpenAI API compatibility
- ✅ You're using popular open-source models (Llama, Mistral, Phi, etc.)
- ✅ You want fast iteration and experimentation
- ✅ You value community support and frequent updates
Choose TGI if:
- ✅ You're deeply integrated with the Hugging Face ecosystem
- ✅ You need to serve many different models dynamically
- ✅ You prefer containerized deployment (Docker/K8s)
- ✅ You want production-proven stability (TGI powers the HF Inference API)
- ✅ You need excellent documentation and enterprise support
Choose TensorRT-LLM if:
- ✅ You need absolute maximum performance
- ✅ You're serving millions of requests per day
- ✅ You have NVIDIA hardware (A100, H100, L40S)
- ✅ You can invest engineering time in model optimization
- ✅ Infrastructure cost savings justify setup complexity
Recommendation for most practitioners: Start with vLLM. It provides 90-95% of TensorRT-LLM's performance with 10% of the setup complexity. If you later identify specific performance bottlenecks and have the engineering resources, you can migrate to TensorRT-LLM for maximum optimization.
Production Deployment Best Practices
1. Docker Deployment (Recommended)
# Dockerfile for vLLM production deployment
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git
# Install vLLM
RUN pip3 install vllm
# Copy model (or download at runtime)
# COPY ./models /models
# Set environment variables
ENV MODEL_NAME="meta-llama/Llama-2-7b-chat-hf"
ENV TENSOR_PARALLEL_SIZE="1"
ENV GPU_MEMORY_UTILIZATION="0.90"
# Expose port
EXPOSE 8000
# Start server
CMD python3 -m vllm.entrypoints.openai.api_server \
--model $MODEL_NAME \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size $TENSOR_PARALLEL_SIZE \
--gpu-memory-utilization $GPU_MEMORY_UTILIZATION
# Build and run
docker build -t vllm-server:latest .
docker run -d \
--name vllm-llama2 \
--gpus all \
-p 8000:8000 \
-e HF_TOKEN=your_huggingface_token \
vllm-server:latest
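Once the container has pulled and loaded the model (which can take several minutes), a quick smoke test confirms it is serving. A minimal sketch assuming the port mapping above:
import time
import requests
from openai import OpenAI

# Poll the health endpoint until the model has finished loading
for _ in range(60):
    try:
        if requests.get("http://localhost:8000/health", timeout=5).ok:
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print("Smoke test response:", response.choices[0].message.content)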
2. Kubernetes Deployment
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 2 # Scale as needed
selector:
matchLabels:
app: vllm-server
template:
metadata:
labels:
app: vllm-server
spec:
containers:
- name: vllm
image: vllm-server:latest
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1 # Request 1 GPU per pod
requests:
memory: "32Gi"
cpu: "8"
env:
- name: MODEL_NAME
value: "meta-llama/Llama-2-7b-chat-hf"
- name: TENSOR_PARALLEL_SIZE
value: "1"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 30
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm-server
ports:
- protocol: TCP
port: 80
targetPort: 8000
type: LoadBalancer
3. Load Balancing and Scaling
# nginx.conf for load balancing multiple vLLM instances
upstream vllm_backend {
# Least connections strategy for LLM serving
least_conn;
server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
server 10.0.1.12:8000 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name llm-api.example.com;
# Increase timeouts for long-running LLM requests
proxy_connect_timeout 300s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
# Enable streaming
proxy_buffering off;
proxy_cache off;
location / {
proxy_pass http://vllm_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# For Server-Sent Events (streaming)
proxy_set_header Connection '';
proxy_http_version 1.1;
chunked_transfer_encoding on;
}
# Health check endpoint
location /health {
proxy_pass http://vllm_backend/health;
}
}
4. Monitoring and Observability
# Add Prometheus metrics to vLLM deployment
from prometheus_client import Counter, Histogram, Gauge
import time
# Metrics
request_count = Counter('vllm_requests_total', 'Total requests')
request_duration = Histogram('vllm_request_duration_seconds', 'Request duration')
tokens_generated = Counter('vllm_tokens_generated_total', 'Total tokens generated')
active_requests = Gauge('vllm_active_requests', 'Currently active requests')
# Wrap generation with metrics (assumes `llm` is the vLLM LLM instance created earlier)
@request_duration.time()
def generate_with_metrics(prompt, sampling_params):
request_count.inc()
active_requests.inc()
try:
output = llm.generate([prompt], sampling_params)[0]
tokens_generated.inc(len(output.outputs[0].token_ids))
return output
finally:
active_requests.dec()
# Expose metrics endpoint
from prometheus_client import start_http_server
start_http_server(9090) # Metrics available at :9090/metrics
Summary: Building Production LLM Infrastructure
Key Takeaways
- vLLM is the default choice: Best performance-to-ease ratio, excellent community support, OpenAI-compatible API
- TGI for Hugging Face integration: Seamless ecosystem integration, production-proven, containerization-friendly
- TensorRT-LLM for extreme optimization: Maximum performance but requires significant setup investment
- All three are production-ready: Each powers large-scale deployments serving millions of requests
- Start simple, optimize later: Begin with vLLM, profile your workload, migrate to TensorRT-LLM only if justified
With these frameworks, you can deploy production-grade LLM endpoints that achieve 20-50x better throughput than naive implementations. In the next chapter, we'll explore cost and latency optimization techniques to further improve efficiency and user experience.