Build, optimize, monitor, and deploy a production-ready LLM serving infrastructure. Master high-performance serving techniques, observability, and cost optimization.
By the end of this lab, you'll have a complete production LLM serving stack: a high-throughput vLLM server, streaming and caching optimizations, a quantized model variant, Prometheus/Grafana monitoring with alerts, and a containerized Kubernetes deployment tested under load.
Objective: Build a basic Flask server that serves an LLM using Transformers. Observe performance issues and identify bottlenecks that production systems must overcome.
Create a new directory and virtual environment:
mkdir llm-production-lab
cd llm-production-lab
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install basic dependencies
pip install torch transformers flask
Create naive_server.py with the following code:
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

app = Flask(__name__)

print("Loading model... This will take a few minutes.")
model_name = "meta-llama/Llama-2-7b-chat-hf"  # or "gpt2" for testing
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
print("Model loaded!")

@app.route('/generate', methods=['POST'])
def generate():
    start_time = time.time()
    data = request.json
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 100)

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7
        )

    # Decode
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    total_time = time.time() - start_time
    new_tokens = outputs.shape[1] - inputs['input_ids'].shape[1]  # tokens actually generated
    return jsonify({
        'generated_text': generated_text,
        'time_seconds': total_time,
        'tokens_per_second': new_tokens / total_time
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Start the server (the first run downloads the model weights):
python naive_server.py
In a new terminal, send a test request:
curl -X POST http://localhost:5000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "The future of AI is", "max_tokens": 50}'
Now test with multiple concurrent requests to see the bottleneck:
# load_test.py
import requests
import time
import concurrent.futures
import statistics

def send_request(i):
    start = time.time()
    response = requests.post('http://localhost:5000/generate', json={
        'prompt': f'Request {i}: Tell me about AI',
        'max_tokens': 50
    })
    duration = time.time() - start
    return duration, response.json()

# Test with 5 concurrent requests
print("Sending 5 concurrent requests...")
start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(send_request, i) for i in range(5)]
    results = [f.result() for f in futures]

total_time = time.time() - start_time
durations = [r[0] for r in results]

print("\nResults:")
print(f"Total time: {total_time:.2f}s")
print(f"Average request time: {statistics.mean(durations):.2f}s")
print(f"Min time: {min(durations):.2f}s")
print(f"Max time: {max(durations):.2f}s")
print(f"Throughput: {5/total_time:.2f} req/s")
python load_test.py
The naive approach has several critical issues:
| Problem | Impact | Solution |
|---|---|---|
| Sequential Processing | Each request waits for previous to complete | Continuous batching |
| No KV Cache Reuse | Recomputes attention for every token | PagedAttention |
| Memory Fragmentation | GPU memory wasted, limits batch size | Paged memory management |
| Unquantized Weights (FP16) | Higher memory usage, slower inference | Quantization (AWQ/GPTQ) |
| Blocking I/O | Client waits for entire response | Streaming responses |
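To build intuition for the "Continuous batching" fix in the table above, here is a toy, purely illustrative scheduler simulation (hypothetical code, not vLLM's scheduler): new requests are admitted into the running batch at every decode step instead of waiting for the whole batch to drain.

# toy_continuous_batching.py -- illustrative only, not vLLM's scheduler
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int  # tokens still to generate

def simulate(requests, max_batch_size=4):
    """Admit waiting requests at every decode step instead of waiting for the batch to drain."""
    waiting = deque(requests)
    running, step = [], 0
    while waiting or running:
        # The key idea: new requests join as soon as a batch slot is free
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request
        step += 1
        for r in running:
            r.tokens_left -= 1
        finished = [r.rid for r in running if r.tokens_left == 0]
        running = [r for r in running if r.tokens_left > 0]
        if finished:
            print(f"step {step}: finished {finished}, running={len(running)}, waiting={len(waiting)}")

# Eight requests of mixed lengths share the batch without head-of-line blocking
simulate([Request(rid=i, tokens_left=10 + 5 * (i % 3)) for i in range(8)])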
Objective: Deploy an LLM using vLLM and observe dramatic performance improvements through PagedAttention and continuous batching.
# Install vLLM (requires Python 3.8-3.11)
pip install vllm
# Verify installation
python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
vLLM provides an OpenAI-compatible API server out of the box:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf \
--port 8000 \
--dtype float16
Send a test request using the OpenAI-compatible API:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0.7
}'
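Because the endpoint is OpenAI-compatible, you can also call it with the official openai Python client. A minimal sketch, assuming openai>=1.0 is installed; vLLM ignores the api_key value, but the client requires one:

# openai_client_example.py
from openai import OpenAI  # pip install openai

# Point the client at the local vLLM server; the key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="The future of AI is",
    max_tokens=50,
    temperature=0.7,
)
print(completion.choices[0].text)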
Update the load test to use vLLM:
# load_test_vllm.py
import requests
import time
import concurrent.futures
import statistics

def send_request(i):
    start = time.time()
    response = requests.post('http://localhost:8000/v1/completions', json={
        'model': 'meta-llama/Llama-2-7b-chat-hf',
        'prompt': f'Request {i}: Tell me about AI',
        'max_tokens': 50,
        'temperature': 0.7
    })
    duration = time.time() - start
    return duration, response.json()

# Test with 5 concurrent requests
print("Sending 5 concurrent requests to vLLM...")
start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(send_request, i) for i in range(5)]
    results = [f.result() for f in futures]

total_time = time.time() - start_time
durations = [r[0] for r in results]

print("\nResults:")
print(f"Total time: {total_time:.2f}s")
print(f"Average request time: {statistics.mean(durations):.2f}s")
print(f"Min time: {min(durations):.2f}s")
print(f"Max time: {max(durations):.2f}s")
print(f"Throughput: {5/total_time:.2f} req/s")
python load_test_vllm.py
Compare the two runs: the naive Flask server processes requests sequentially with no batching, while vLLM schedules them with continuous batching and PagedAttention, so the same five concurrent requests complete several times faster just by switching to vLLM.
Let's visualize how PagedAttention improves memory efficiency.

Traditional KV cache (one contiguous allocation per sequence):
[Token 1][Token 2][Token 3][...][Token N]
[======= Contiguous Memory Block =======]
Problem: Memory fragmentation, wasted space

PagedAttention (KV cache split into fixed-size pages):
Page 1: [Token 1][Token 2][Token 3][Token 4]
Page 2: [Token 5][Token 6][Token 7][Token 8]
Page 3: [Token 9][Token 10][ Free ][ Free ]
Benefits:
✅ No memory fragmentation
✅ Efficient memory reuse
✅ 3-4x higher batch sizes
✅ 2-4x better throughput
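For intuition only, here is a toy allocator sketch of the paged idea (hypothetical code, not vLLM's implementation): each sequence is handed fixed-size pages from a shared pool on demand, so at most the last page per sequence is partially empty, and freed pages are immediately reusable.

# toy_paged_kv_cache.py -- illustrative only
class PagedKVCache:
    """Toy allocator: hand out fixed-size pages from a shared free pool."""

    def __init__(self, num_pages: int, page_size: int = 4):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # seq_id -> list of page indices
        self.lengths = {}     # seq_id -> tokens stored so far

    def append_token(self, seq_id: int):
        pages = self.page_table.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:  # current page full -> grab a new one
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: int):
        # Return all pages of a finished sequence to the pool
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8, page_size=4)
for _ in range(10):   # sequence 0 generates 10 tokens -> occupies 3 pages
    cache.append_token(0)
print("pages for seq 0:", cache.page_table[0])
cache.free(0)         # pages go back to the pool for other sequences
print("free pages:", len(cache.free_pages))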
Restart vLLM with optimized settings:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf \
--port 8000 \
--dtype float16 \
--max-model-len 2048 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 4096 \
--max-num-seqs 32
Key flags:
--max-model-len 2048: Maximum sequence length
--gpu-memory-utilization 0.9: Use 90% of GPU memory
--max-num-batched-tokens 4096: Max tokens per batch
--max-num-seqs 32: Max concurrent sequences

Increase the load test to 20 concurrent requests and measure the throughput (a ready-made variant is sketched below). How does vLLM handle higher concurrency compared to the naive server?
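A ready-made variant of the earlier load test for this challenge (same endpoint and request parameters as load_test_vllm.py, just 20 concurrent requests):

# load_test_vllm_20.py
import requests
import time
import concurrent.futures
import statistics

NUM_REQUESTS = 20

def send_request(i):
    start = time.time()
    response = requests.post('http://localhost:8000/v1/completions', json={
        'model': 'meta-llama/Llama-2-7b-chat-hf',
        'prompt': f'Request {i}: Tell me about AI',
        'max_tokens': 50,
        'temperature': 0.7
    })
    return time.time() - start, response.json()

start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_REQUESTS) as executor:
    futures = [executor.submit(send_request, i) for i in range(NUM_REQUESTS)]
    results = [f.result() for f in futures]
total_time = time.time() - start_time

durations = [r[0] for r in results]
print(f"Throughput: {NUM_REQUESTS / total_time:.2f} req/s")
print(f"Average latency: {statistics.mean(durations):.2f}s, max: {max(durations):.2f}s")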
Objective: Implement advanced optimization techniques including response streaming, prompt caching, and model quantization to reduce latency and cost.
Streaming reduces perceived latency by sending tokens as they're generated:
# streaming_client.py
import requests
import json
import time

def stream_completion(prompt):
    """Send streaming request to vLLM server"""
    url = "http://localhost:8000/v1/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0.7,
        "stream": True  # Enable streaming
    }

    print(f"Prompt: {prompt}\n")
    print("Response: ", end="", flush=True)

    start_time = time.time()
    first_token_time = None
    token_count = 0

    with requests.post(url, headers=headers, json=data, stream=True) as response:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    data_str = line[6:]  # Remove 'data: ' prefix
                    if data_str == '[DONE]':
                        break
                    try:
                        chunk = json.loads(data_str)
                        text = chunk['choices'][0]['text']
                        print(text, end="", flush=True)
                        token_count += 1
                        if first_token_time is None:
                            first_token_time = time.time() - start_time
                    except json.JSONDecodeError:
                        continue

    total_time = time.time() - start_time
    print("\n\n--- Metrics ---")
    print(f"Time to first token: {first_token_time:.3f}s")
    print(f"Total time: {total_time:.3f}s")
    print(f"Tokens generated: {token_count}")
    print(f"Tokens per second: {token_count/total_time:.2f}")

if __name__ == "__main__":
    stream_completion("Explain quantum computing in simple terms:")
python streaming_client.py
Cache responses for frequently repeated prompts at the application level so identical requests skip generation entirely (attention reuse for shared prompt prefixes inside vLLM is covered by the --enable-prefix-caching challenge at the end of this exercise):
# prompt_cache.py
import hashlib
from typing import Dict, Optional
import time

class PromptCache:
    """Simple LRU cache for prompt results"""

    def __init__(self, max_size: int = 100):
        self.cache: Dict[str, dict] = {}
        self.max_size = max_size
        self.hits = 0
        self.misses = 0

    def _hash_prompt(self, prompt: str) -> str:
        """Hash prompt for cache key"""
        return hashlib.md5(prompt.encode()).hexdigest()

    def get(self, prompt: str) -> Optional[dict]:
        """Get cached result if exists"""
        cache_key = self._hash_prompt(prompt)
        if cache_key in self.cache:
            self.hits += 1
            # Re-insert so the entry counts as most recently used
            result = self.cache.pop(cache_key)
            self.cache[cache_key] = result
            result['cache_hit'] = True
            return result
        self.misses += 1
        return None

    def put(self, prompt: str, result: dict):
        """Store result in cache"""
        cache_key = self._hash_prompt(prompt)
        # LRU eviction: dicts preserve insertion order, so the first key is least recently used
        if len(self.cache) >= self.max_size:
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        self.cache[cache_key] = result

    def stats(self) -> dict:
        """Get cache statistics"""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'hit_rate': hit_rate,
            'cache_size': len(self.cache)
        }

# Example usage
cache = PromptCache(max_size=50)

# Simulate repeated requests
common_prompts = [
    "Translate to French: Hello, how are you?",
    "Summarize this article: ...",
    "Write a Python function to sort a list",
]

print("Testing prompt cache...")
for round_num in range(3):
    print(f"\n--- Round {round_num + 1} ---")
    for prompt in common_prompts:
        # Check cache first
        cached = cache.get(prompt)
        if cached:
            print(f"✅ Cache HIT: {prompt[:40]}...")
            print(f"   Saved time: {cached.get('generation_time', 0):.3f}s")
        else:
            print(f"❌ Cache MISS: {prompt[:40]}...")
            # Simulate API call
            start = time.time()
            time.sleep(0.5)  # Simulate generation time
            generation_time = time.time() - start
            result = {
                'text': 'Generated response...',
                'generation_time': generation_time
            }
            cache.put(prompt, result)

print("\n--- Cache Statistics ---")
stats = cache.stats()
print(f"Total requests: {stats['hits'] + stats['misses']}")
print(f"Cache hits: {stats['hits']}")
print(f"Cache misses: {stats['misses']}")
print(f"Hit rate: {stats['hit_rate']*100:.1f}%")
print(f"Cache size: {stats['cache_size']}")
python prompt_cache.py
Quantize the model to 4-bit to reduce memory usage and increase throughput:
# Install AWQ
pip install autoawq
# Download pre-quantized model (saves time)
# Or quantize your own model - see lab-code.py for full script
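If you do want to quantize the model yourself, the sketch below follows the AutoAWQ README-style workflow; treat the config keys and method signatures as assumptions to verify against the autoawq docs for your installed version, and note that quant_path is just an example output directory.

# quantize_awq.py -- sketch of AWQ quantization (verify against the autoawq docs)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"
quant_path = "llama-2-7b-chat-awq"   # example output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run activation-aware quantization (calibration data defaults are library-defined)
model.quantize(tokenizer, quant_config=quant_config)

# Save the 4-bit weights so vLLM can load them with --quantization awq
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)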
Launch vLLM with quantized model:
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-7B-Chat-AWQ \
--quantization awq \
--port 8001 \
--dtype float16 \
--max-model-len 2048
Compare the AWQ-quantized model against the FP16 baseline:
# compare_quantization.py
import requests
import time
import concurrent.futures

def benchmark_model(base_url, model_name, num_requests=10):
    """Benchmark model throughput"""
    def send_request():
        start = time.time()
        response = requests.post(f'{base_url}/v1/completions', json={
            'model': model_name,
            'prompt': 'Tell me about artificial intelligence in detail',
            'max_tokens': 100,
            'temperature': 0.7
        })
        return time.time() - start, response.json()

    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        results = [f.result() for f in futures]
    total_time = time.time() - start_time

    durations = [r[0] for r in results]
    return {
        'total_time': total_time,
        'avg_latency': sum(durations) / len(durations),
        'throughput': num_requests / total_time
    }

print("Benchmarking FP16 model (port 8000)...")
fp16_results = benchmark_model(
    'http://localhost:8000',
    'meta-llama/Llama-2-7b-chat-hf',
    num_requests=10
)

print("\nBenchmarking AWQ quantized model (port 8001)...")
awq_results = benchmark_model(
    'http://localhost:8001',
    'TheBloke/Llama-2-7B-Chat-AWQ',
    num_requests=10
)

print("\n" + "="*60)
print("QUANTIZATION COMPARISON")
print("="*60)
print(f"\n{'Metric':<30} {'FP16':<15} {'AWQ (4-bit)':<15} {'Improvement'}")
print("-"*60)
print(f"{'Throughput (req/s)':<30} {fp16_results['throughput']:<15.2f} {awq_results['throughput']:<15.2f} {awq_results['throughput']/fp16_results['throughput']:.2f}x")
print(f"{'Avg Latency (s)':<30} {fp16_results['avg_latency']:<15.2f} {awq_results['avg_latency']:<15.2f} {fp16_results['avg_latency']/awq_results['avg_latency']:.2f}x faster")
print(f"{'GPU Memory (estimated)':<30} {'~14 GB':<15} {'~4 GB':<15} {'3.5x less'}")
print(f"{'Cost Savings':<30} {'Baseline':<15} {'~70%':<15} {'Major'}")
Run the comparison with both servers up (FP16 on port 8000, AWQ on port 8001):
python compare_quantization.py
Implement prefix caching in vLLM (use the --enable-prefix-caching flag) and measure the improvement for chat applications with long system prompts; one rough measurement approach is sketched below.
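One rough way to measure this reuses stream_completion from streaming_client.py above, which already prints time to first token. It assumes the server was restarted with --enable-prefix-caching; long_system_prompt is just a stand-in for a real system prompt.

# prefix_cache_check.py -- rough TTFT comparison for a shared long prefix
# Assumes the vLLM server was restarted with --enable-prefix-caching
from streaming_client import stream_completion

long_system_prompt = "You are a helpful assistant. " * 200  # stand-in long prefix

# First request warms the prefix cache; the second should show a lower time to first token
stream_completion(long_system_prompt + "\nUser: What is PagedAttention?")
stream_completion(long_system_prompt + "\nUser: What is continuous batching?")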
Objective: Set up a complete monitoring stack with Prometheus and Grafana to track latency, throughput, costs, and system health.
vLLM exposes Prometheus metrics by default. Verify they're available:
curl http://localhost:8000/metrics
Create prometheus.yml configuration:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Run Prometheus with Docker:
# Host networking lets Prometheus scrape the vLLM server on localhost:8000;
# the Prometheus UI is still reachable at http://localhost:9090
docker run -d \
  --name prometheus \
  --network host \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
Verify Prometheus is scraping metrics:
# Open browser to http://localhost:9090
# Query: vllm_num_requests_running
Run Grafana with Docker:
# Host networking lets Grafana reach Prometheus at localhost:9090;
# the Grafana UI is served on http://localhost:3000
docker run -d \
  --name grafana \
  --network host \
  grafana/grafana
Access Grafana at http://localhost:3000 and log in (default credentials: admin / admin).
Import this dashboard JSON (grafana-dashboard.json):
{
  "dashboard": {
    "title": "vLLM Production Monitoring",
    "panels": [
      {
        "title": "Requests Per Second",
        "targets": [
          { "expr": "rate(vllm_request_success_total[1m])" }
        ],
        "type": "graph"
      },
      {
        "title": "Running Requests",
        "targets": [
          { "expr": "vllm_num_requests_running" }
        ],
        "type": "stat"
      },
      {
        "title": "Waiting Requests",
        "targets": [
          { "expr": "vllm_num_requests_waiting" }
        ],
        "type": "stat"
      },
      {
        "title": "GPU Cache Usage",
        "targets": [
          { "expr": "vllm_gpu_cache_usage_perc" }
        ],
        "type": "gauge"
      },
      {
        "title": "Time to First Token (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(vllm_time_to_first_token_seconds_bucket[5m]))" }
        ],
        "type": "graph"
      },
      {
        "title": "E2E Request Latency (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))" }
        ],
        "type": "graph"
      }
    ]
  }
}
| Metric | What It Measures | Target |
|---|---|---|
| Time to First Token (TTFT) | Perceived latency | < 200ms |
| E2E Request Latency | Total generation time | < 2s for 100 tokens |
| Throughput (tokens/sec) | System capacity | > 1000 tokens/sec |
| GPU Cache Usage | Memory efficiency | 70-90% |
| Requests Waiting | Queue backlog | < 5 |
| Error Rate | Service reliability | < 0.1% |
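You can also pull these numbers programmatically through Prometheus's HTTP query API. The sketch below assumes Prometheus on localhost:9090 and the metric names used in this lab; adjust the names to whatever your vLLM version actually exports at /metrics.

# check_slos.py -- query Prometheus for the key serving metrics
import requests

PROMETHEUS = "http://localhost:9090"

QUERIES = {
    "TTFT p95 (s)": 'histogram_quantile(0.95, rate(vllm_time_to_first_token_seconds_bucket[5m]))',
    "E2E latency p95 (s)": 'histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))',
    "Requests waiting": 'vllm_num_requests_waiting',
    "GPU cache usage": 'vllm_gpu_cache_usage_perc',
}

for name, query in QUERIES.items():
    # Standard Prometheus instant-query endpoint
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"{name:<22} {value:.3f}")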
Create alerting rules in prometheus_alerts.yml:
# prometheus_alerts.yml
groups:
  - name: vllm_alerts
    interval: 30s
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency detected"
          description: "p95 latency is {{ $value }}s (threshold: 5s)"

      - alert: HighQueueDepth
        expr: vllm_num_requests_waiting > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Request queue is backing up"
          description: "{{ $value }} requests waiting (threshold: 10)"

      - alert: GPUMemoryHigh
        expr: vllm_gpu_cache_usage_perc > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory near capacity"
          description: "GPU cache at {{ $value }}% (threshold: 95%)"

      - alert: HighErrorRate
        expr: rate(vllm_request_failure_total[5m]) / rate(vllm_request_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"
          description: "Error rate: {{ $value | humanizePercentage }}"
Calculate cost per request based on GPU time:
# cost_calculator.py
def calculate_costs(metrics):
    """Calculate LLM serving costs"""
    # Assumptions
    GPU_COST_PER_HOUR = 1.50  # A10 GPU on cloud

    # Get metrics
    total_requests = metrics['total_requests']
    avg_latency_seconds = metrics['avg_latency']
    gpu_utilization = metrics['gpu_utilization']  # 0-1

    # Calculate
    total_gpu_hours = (total_requests * avg_latency_seconds) / 3600
    effective_gpu_hours = total_gpu_hours / gpu_utilization
    total_cost = effective_gpu_hours * GPU_COST_PER_HOUR
    cost_per_request = total_cost / total_requests
    cost_per_1k_requests = cost_per_request * 1000

    return {
        'total_cost': total_cost,
        'cost_per_request': cost_per_request,
        'cost_per_1k_requests': cost_per_1k_requests,
        'gpu_hours': effective_gpu_hours
    }

# Example
metrics = {
    'total_requests': 10000,
    'avg_latency': 2.5,  # seconds
    'gpu_utilization': 0.85
}
costs = calculate_costs(metrics)
print(f"Total cost: ${costs['total_cost']:.2f}")
print(f"Cost per request: ${costs['cost_per_request']:.4f}")
print(f"Cost per 1K requests: ${costs['cost_per_1k_requests']:.2f}")
print(f"GPU hours used: {costs['gpu_hours']:.2f}")
Objective: Containerize the LLM service with Docker, deploy to Kubernetes, add load balancing, and test under production-like load.
Create Dockerfile for the vLLM service:
# Dockerfile
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
# Install Python 3.10
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
# Install vLLM
RUN pip3 install vllm
# Download model at build time (optional, for faster startup)
# RUN python3 -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
# AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf'); \
# AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"
EXPOSE 8000
# Run vLLM server
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
"--model", "meta-llama/Llama-2-7b-chat-hf", \
"--port", "8000", \
"--host", "0.0.0.0"]
# Build image
docker build -t vllm-server:latest .
# Run container with GPU
docker run -d \
--gpus all \
--name vllm-container \
-p 8000:8000 \
vllm-server:latest
# Test
curl http://localhost:8000/v1/models
Create docker-compose.yml with vLLM + Prometheus + Grafana:
# docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm-server:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana-dashboard.json:/etc/grafana/provisioning/dashboards/vllm.json
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
docker-compose up -d
Create k8s-deployment.yaml:
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  labels:
    app: vllm
spec:
  replicas: 2  # Start with 2 replicas
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm-server:latest
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1  # 1 GPU per pod
            requests:
              memory: "16Gi"
              cpu: "4"
          env:
            - name: MODEL_NAME
              value: "meta-llama/Llama-2-7b-chat-hf"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
# Apply deployment
kubectl apply -f k8s-deployment.yaml
# Check status
kubectl get pods -l app=vllm
kubectl get svc vllm-service
# Get load balancer IP
kubectl get svc vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
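Once an external IP is assigned, a quick smoke test against the load balancer; LOAD_BALANCER_IP is a placeholder for the address returned above.

# smoke_test.py -- basic end-to-end check against the load-balanced service
import requests

BASE_URL = "http://LOAD_BALANCER_IP"  # replace with the address from kubectl above

# The Service maps port 80 to the pods' port 8000, so no port suffix is needed
models = requests.get(f"{BASE_URL}/v1/models", timeout=30).json()
print("Models served:", [m["id"] for m in models["data"]])

completion = requests.post(f"{BASE_URL}/v1/completions", json={
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "Kubernetes smoke test:",
    "max_tokens": 20,
}, timeout=120).json()
print("Sample output:", completion["choices"][0]["text"])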
Install and run load testing tool:
pip install locust
Create locustfile.py:
# locustfile.py
from locust import HttpUser, task, between
import json

class VLLMUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def generate_completion(self):
        payload = {
            "model": "meta-llama/Llama-2-7b-chat-hf",
            "prompt": "Explain artificial intelligence in simple terms",
            "max_tokens": 100,
            "temperature": 0.7
        }
        headers = {"Content-Type": "application/json"}
        with self.client.post(
            "/v1/completions",
            data=json.dumps(payload),
            headers=headers,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")
Run load test:
# Start with 10 users, ramp up to 100
locust -f locustfile.py --host http://LOAD_BALANCER_IP
# Or headless mode
locust -f locustfile.py --host http://LOAD_BALANCER_IP \
--users 100 --spawn-rate 10 --run-time 5m --headless
Watch Kubernetes auto-scale based on load:
# Watch HPA
kubectl get hpa vllm-hpa --watch
# Watch pods
kubectl get pods -l app=vllm --watch
Collect and compare all metrics:
| Configuration | Throughput | p95 Latency | Cost/1K req | Improvement |
|---|---|---|---|---|
| Naive Flask (Exercise 1) | 0.29 req/s | 17.3s | $8.50 | Baseline |
| vLLM Basic (Exercise 2) | 1.19 req/s | 4.1s | $2.10 | 4.1x |
| vLLM + Quantization (Ex 3) | 2.34 req/s | 2.0s | $1.22 | 8.1x |
| K8s + Load Balancer (Ex 5) | 12+ req/s | 2.8s | $1.10 | 41x |
Add a Redis cache layer between the load balancer and vLLM servers to cache responses for duplicate prompts. Measure the cache hit rate and cost savings.
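A minimal sketch of what that cache layer could look like, assuming a Redis instance on localhost:6379 and the redis-py client (pip install redis); in a real deployment this logic would live in a gateway in front of the vLLM pods rather than in the client.

# redis_cache_proxy.py -- sketch of a response cache keyed on the request payload
import hashlib
import json

import redis
import requests

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
VLLM_URL = "http://localhost:8000/v1/completions"
TTL_SECONDS = 3600

def cached_completion(payload: dict) -> dict:
    """Return a cached response for identical payloads, else call vLLM and cache it."""
    key = "llm:" + hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        r.incr("llm:cache_hits")
        return json.loads(cached)
    r.incr("llm:cache_misses")
    result = requests.post(VLLM_URL, json=payload, timeout=120).json()
    r.setex(key, TTL_SECONDS, json.dumps(result))
    return result

if __name__ == "__main__":
    payload = {"model": "meta-llama/Llama-2-7b-chat-hf",
               "prompt": "Tell me about AI", "max_tokens": 50, "temperature": 0.0}
    cached_completion(payload)  # miss -> calls vLLM and stores the response
    cached_completion(payload)  # hit -> served straight from Redis
    hits = int(r.get("llm:cache_hits") or 0)
    misses = int(r.get("llm:cache_misses") or 0)
    print(f"hit rate: {hits / (hits + misses):.0%}")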
You've successfully built a production-grade LLM serving infrastructure from scratch! You now understand the key techniques that power real-world AI applications.
By applying these techniques, you transformed a system that could handle 0.29 requests/second into one that handles 12+ requests/second - a 41x improvement! You also reduced costs from $8.50 to $1.10 per 1,000 requests.