Module 5 - Hands-On Lab

Deploy a Production LLM Service with vLLM

Build, optimize, monitor, and deploy a production-ready LLM serving infrastructure. Master high-performance serving techniques, observability, and cost optimization.

⏱️ 2.5-3 hours 🔧 5 Exercises ☁️ Cloud GPU Required 🐳 Docker & K8s

Lab Overview

What You'll Build

By the end of this lab, you'll have a complete production LLM serving stack including:

  • ✅ High-performance vLLM server with PagedAttention
  • ✅ Continuous batching for maximum throughput
  • ✅ Response streaming for low latency
  • ✅ Intelligent prompt caching
  • ✅ Model quantization (AWQ) for cost savings
  • ✅ Complete monitoring stack (Prometheus + Grafana)
  • ✅ Kubernetes deployment with auto-scaling
  • ✅ Load testing and performance benchmarking

Production LLM Architecture


Client Requests
        ↓
Load Balancer (nginx)
        ↓
vLLM Server Pool (PagedAttention + Continuous Batching)
        ↓
Quantized LLM (Llama-2-7B-AWQ)
        ↓ Metrics ↓
Prometheus → Grafana Dashboards
⚠️ Cost Warning: This lab uses cloud GPU instances (A10 or A100). Expected cost: $10-20 for 3 hours. Make sure to stop instances when done!

Prerequisites

  • ✅ Python 3.9+ installed
  • ✅ Docker and Docker Compose installed
  • ✅ kubectl configured (for Exercise 5)
  • ✅ Cloud GPU access (Google Cloud, AWS, or RunPod)
  • ✅ 50GB disk space
  • ✅ Basic Linux command line skills

Learning Objectives

  1. Understand the bottlenecks in naive LLM serving
  2. Master vLLM deployment and configuration
  3. Implement advanced optimization techniques
  4. Set up comprehensive monitoring and observability
  5. Deploy production-grade infrastructure with Kubernetes
EXERCISE 1

Naive Serving: Understanding the Bottlenecks

⏱️ 15 minutes

Objective: Build a basic Flask server that serves an LLM using Transformers. Observe performance issues and identify bottlenecks that production systems must overcome.

Step 1: Set Up Environment

Create a new directory and virtual environment:

mkdir llm-production-lab
cd llm-production-lab
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install basic dependencies
pip install torch transformers flask

Step 2: Create Naive Flask Server

Create naive_server.py with the following code:

from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

app = Flask(__name__)

print("Loading model... This will take a few minutes.")
model_name = "meta-llama/Llama-2-7b-chat-hf"  # or "gpt2" for testing
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
print("Model loaded!")

@app.route('/generate', methods=['POST'])
def generate():
    start_time = time.time()

    data = request.json
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 100)

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7
        )

    # Decode the full sequence (prompt + completion)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    total_time = time.time() - start_time
    # Count only the newly generated tokens for the throughput figure
    num_new_tokens = outputs.shape[1] - inputs['input_ids'].shape[1]

    return jsonify({
        'generated_text': generated_text,
        'time_seconds': total_time,
        'tokens_per_second': num_new_tokens / total_time
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Step 3: Run the Server

python naive_server.py
Expected Output:
Loading model... This will take a few minutes.
Model loaded!
* Running on http://0.0.0.0:5000

Step 4: Test with Single Request

In a new terminal, send a test request:

curl -X POST http://localhost:5000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The future of AI is", "max_tokens": 50}'
Expected Output:
{
  "generated_text": "The future of AI is...",
  "time_seconds": 3.5,
  "tokens_per_second": 14.3
}

Step 5: Load Test with Concurrent Requests

Now test with multiple concurrent requests to see the bottleneck:

# load_test.py
import requests
import time
import concurrent.futures
import statistics

def send_request(i):
    start = time.time()
    response = requests.post('http://localhost:5000/generate', json={
        'prompt': f'Request {i}: Tell me about AI',
        'max_tokens': 50
    })
    duration = time.time() - start
    return duration, response.json()

# Test with 5 concurrent requests
print("Sending 5 concurrent requests...")
start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(send_request, i) for i in range(5)]
    results = [f.result() for f in futures]

total_time = time.time() - start_time
durations = [r[0] for r in results]

print(f"\nResults:")
print(f"Total time: {total_time:.2f}s")
print(f"Average request time: {statistics.mean(durations):.2f}s")
print(f"Min time: {min(durations):.2f}s")
print(f"Max time: {max(durations):.2f}s")
print(f"Throughput: {5/total_time:.2f} req/s")
python load_test.py
Expected Output:
Sending 5 concurrent requests...

Results:
Total time: 17.50s
Average request time: 10.50s
Min time: 3.50s
Max time: 17.30s
Throughput: 0.29 req/s
🚨 Performance Problem Identified!
Notice how 5 requests that should each take ~3.5s actually take 17.5s total. The server is processing requests sequentially, not in parallel. This is the fundamental problem with naive serving.

Step 6: Analyze Bottlenecks

The naive approach has several critical issues:

Problem               | Impact                                            | Solution
Sequential processing | Each request waits for the previous one to finish | Continuous batching
No KV cache reuse     | Prompt prefixes are recomputed for every request  | PagedAttention (+ prefix caching)
Memory fragmentation  | GPU memory is wasted, limiting batch size         | Paged memory management
Full precision (FP16) | Higher memory usage, slower inference             | Quantization (AWQ/GPTQ)
Blocking I/O          | The client waits for the entire response          | Streaming responses
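
To isolate the first of these problems (sequential versus batched generation), here is a minimal, self-contained sketch. It is not part of the lab code: it uses gpt2 so it runs quickly even without a large GPU, and compares one generate() call per prompt against a single padded, batched call.

# batching_demo.py -- illustrative only: sequential vs. batched generation
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # small model so the demo runs anywhere
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token
tokenizer.padding_side = "left"            # required for decoder-only batched generation
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = [f"Request {i}: Tell me about AI" for i in range(5)]

# Sequential: one generate() call per prompt (what the Flask server does)
start = time.time()
with torch.no_grad():
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        model.generate(**inputs, max_new_tokens=50, do_sample=False,
                       pad_token_id=tokenizer.eos_token_id)
sequential = time.time() - start

# Batched: all prompts padded into a single generate() call
start = time.time()
with torch.no_grad():
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    model.generate(**inputs, max_new_tokens=50, do_sample=False,
                   pad_token_id=tokenizer.eos_token_id)
batched = time.time() - start

print(f"Sequential: {sequential:.2f}s   Batched: {batched:.2f}s")

Batching amortizes GPU work across prompts, which is exactly what vLLM automates with continuous batching, without you having to pad or group requests yourself.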


💡 Key Takeaway: Naive LLM serving with standard frameworks (Transformers + Flask) cannot handle production workloads. We need specialized serving frameworks that implement batching, efficient memory management, and optimized kernels. That's where vLLM comes in!
EXERCISE 2

vLLM Setup: Deploy High-Performance Server

⏱️ 25 minutes

Objective: Deploy an LLM using vLLM and observe dramatic performance improvements through PagedAttention and continuous batching.

Step 1: Install vLLM

# Install vLLM (requires Python 3.8-3.11)
pip install vllm

# Verify installation
python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
Expected Output:
vLLM version: 0.2.6

Step 2: Launch vLLM Server

vLLM provides an OpenAI-compatible API server out of the box:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --dtype float16
Expected Output:
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
💡 What Just Happened?
vLLM automatically configured:
  • PagedAttention for efficient KV cache management
  • Continuous batching for concurrent requests
  • Optimized CUDA kernels for faster inference
  • OpenAI-compatible API endpoints

Step 3: Test vLLM Server

Send a test request using the OpenAI-compatible API:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.7
    }'
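
Because the API is OpenAI-compatible, you can also call it with the official openai Python client. A short sketch, assuming `pip install openai` (version 1.x); the dummy api_key is only there because the client requires one:

# openai_client.py -- call the local vLLM server through the OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="The future of AI is",
    max_tokens=50,
    temperature=0.7,
)
print(response.choices[0].text)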

Step 4: Compare Performance with Load Test

Update the load test to use vLLM:

# load_test_vllm.py
import requests
import time
import concurrent.futures
import statistics

def send_request(i):
    start = time.time()
    response = requests.post('http://localhost:8000/v1/completions', json={
        'model': 'meta-llama/Llama-2-7b-chat-hf',
        'prompt': f'Request {i}: Tell me about AI',
        'max_tokens': 50,
        'temperature': 0.7
    })
    duration = time.time() - start
    return duration, response.json()

# Test with 5 concurrent requests
print("Sending 5 concurrent requests to vLLM...")
start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(send_request, i) for i in range(5)]
    results = [f.result() for f in futures]

total_time = time.time() - start_time
durations = [r[0] for r in results]

print(f"\nResults:")
print(f"Total time: {total_time:.2f}s")
print(f"Average request time: {statistics.mean(durations):.2f}s")
print(f"Min time: {min(durations):.2f}s")
print(f"Max time: {max(durations):.2f}s")
print(f"Throughput: {5/total_time:.2f} req/s")
python load_test_vllm.py
Expected Output:
Sending 5 concurrent requests to vLLM...

Results:
Total time: 4.20s
Average request time: 3.80s
Min time: 3.50s
Max time: 4.10s
Throughput: 1.19 req/s

Step 5: Performance Comparison

Naive Flask Server
0.29 req/s

Sequential processing, no batching

vLLM Server
1.19 req/s

Continuous batching, PagedAttention

Performance Improvement
4.1x faster

Just by switching to vLLM!

Step 6: Understanding PagedAttention

Let's visualize how PagedAttention improves memory efficiency:

Traditional Attention (Naive Serving)

[Token 1][Token 2][Token 3][...][Token N]
[======= Contiguous Memory Block =======]

Problem: Memory fragmentation, wasted space
⬇️ PagedAttention ⬇️

PagedAttention (vLLM)

Page 1: [Token 1][Token 2][Token 3][Token 4]
Page 2: [Token 5][Token 6][Token 7][Token 8]
Page 3: [Token 9][Token 10][ Free ][ Free ]

Benefits:
✅ No memory fragmentation
✅ Efficient memory reuse
✅ 3-4x higher batch sizes
✅ 2-4x better throughput
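
To get a feel for why KV-cache memory management matters so much, here is a back-of-the-envelope calculation. It is a sketch using the published Llama-2-7B dimensions; the exact overhead depends on batch size and vLLM's block size.

# kv_cache_math.py -- rough KV cache size for Llama-2-7B at float16
layers, heads, head_dim = 32, 32, 128   # Llama-2-7B architecture
bytes_per_value = 2                     # float16
seq_len = 2048

# Keys and values, for every layer, for every token
kv_bytes_per_token = 2 * layers * heads * head_dim * bytes_per_value
print(f"KV cache per token:      {kv_bytes_per_token / 1024:.0f} KiB")          # ~512 KiB
print(f"One 2048-token sequence: {kv_bytes_per_token * seq_len / 2**30:.1f} GiB")

At roughly half a megabyte per token, a handful of long sequences can exhaust GPU memory if each one reserves a worst-case contiguous block. Splitting the cache into fixed-size pages is what lets vLLM pack many sequences together.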

Step 7: Configure Advanced vLLM Settings

Restart vLLM with optimized settings:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --dtype float16 \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 32
💡 Configuration Explained:
  • --max-model-len 2048: Maximum sequence length
  • --gpu-memory-utilization 0.9: Use 90% of GPU memory
  • --max-num-batched-tokens 4096: Max tokens per batch
  • --max-num-seqs 32: Max concurrent sequences
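
The same knobs are available from Python through vLLM's offline LLM class if you prefer embedding the engine in your own process. A sketch; argument names follow vLLM's EngineArgs and may shift between versions:

# offline_inference.py -- same configuration via the vLLM Python API
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    dtype="float16",
    max_model_len=2048,
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=4096,
    max_num_seqs=32,
)

params = SamplingParams(temperature=0.7, max_tokens=50)
outputs = llm.generate(["The future of AI is"], params)
print(outputs[0].outputs[0].text)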


🎯 Bonus Challenge

Increase the load test to 20 concurrent requests and measure the throughput. How does vLLM handle higher concurrency compared to the naive server?

EXERCISE 3

Optimization Techniques: Streaming, Caching, Quantization

⏱️ 35 minutes

Objective: Implement advanced optimization techniques including response streaming, prompt caching, and model quantization to reduce latency and cost.

Part A: Response Streaming (10 minutes)

Streaming reduces perceived latency by sending tokens as they're generated:

# streaming_client.py
import requests
import json
import time

def stream_completion(prompt):
    """Send streaming request to vLLM server"""
    url = "http://localhost:8000/v1/completions"

    headers = {"Content-Type": "application/json"}
    data = {
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0.7,
        "stream": True  # Enable streaming
    }

    print(f"Prompt: {prompt}\n")
    print("Response: ", end="", flush=True)

    start_time = time.time()
    first_token_time = None
    token_count = 0

    with requests.post(url, headers=headers, json=data, stream=True) as response:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    data_str = line[6:]  # Remove 'data: ' prefix
                    if data_str == '[DONE]':
                        break

                    try:
                        chunk = json.loads(data_str)
                        text = chunk['choices'][0]['text']
                        print(text, end="", flush=True)

                        token_count += 1
                        if first_token_time is None:
                            first_token_time = time.time() - start_time
                    except json.JSONDecodeError:
                        continue

    total_time = time.time() - start_time
    print(f"\n\n--- Metrics ---")
    print(f"Time to first token: {first_token_time:.3f}s")
    print(f"Total time: {total_time:.3f}s")
    print(f"Tokens generated: {token_count}")
    print(f"Tokens per second: {token_count/total_time:.2f}")

if __name__ == "__main__":
    stream_completion("Explain quantum computing in simple terms:")
python streaming_client.py
Expected Output:
Prompt: Explain quantum computing in simple terms:

Response: Quantum computing is a new type of computing that uses quantum mechanics...

--- Metrics ---
Time to first token: 0.085s
Total time: 2.340s
Tokens generated: 100
Tokens per second: 42.74
💡 Why Streaming Matters:
Time to first token (TTFT) is critical for user experience. With streaming:
  • User sees response in ~0.1s instead of waiting 2-3s
  • Perceived latency is 20-30x lower
  • Better for chatbots and interactive applications

Part B: Prompt Caching (12 minutes)

Cache common prompt prefixes to avoid recomputing attention:

# prompt_cache.py
import hashlib
import json
from typing import Dict, Optional
import time

class PromptCache:
    """Simple LRU cache keyed on the exact prompt text"""

    def __init__(self, max_size: int = 100):
        self.cache: Dict[str, dict] = {}
        self.max_size = max_size
        self.hits = 0
        self.misses = 0

    def _hash_prompt(self, prompt: str) -> str:
        """Hash prompt for cache key"""
        return hashlib.md5(prompt.encode()).hexdigest()

    def get(self, prompt: str) -> Optional[dict]:
        """Get cached result if it exists (and mark it recently used)"""
        cache_key = self._hash_prompt(prompt)
        if cache_key in self.cache:
            self.hits += 1
            # Re-insert so the entry moves to the end (most recently used)
            result = self.cache.pop(cache_key)
            self.cache[cache_key] = result
            result['cache_hit'] = True
            return result
        self.misses += 1
        return None

    def put(self, prompt: str, result: dict):
        """Store result in cache"""
        cache_key = self._hash_prompt(prompt)

        # Evict the least recently used entry if full
        # (dicts keep insertion order, so the first key is the oldest)
        if len(self.cache) >= self.max_size:
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]

        self.cache[cache_key] = result

    def stats(self) -> dict:
        """Get cache statistics"""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'hit_rate': hit_rate,
            'cache_size': len(self.cache)
        }

# Example usage
cache = PromptCache(max_size=50)

# Simulate repeated requests
common_prompts = [
    "Translate to French: Hello, how are you?",
    "Summarize this article: ...",
    "Write a Python function to sort a list",
]

print("Testing prompt cache...")
for round_num in range(3):
    print(f"\n--- Round {round_num + 1} ---")
    for prompt in common_prompts:
        # Check cache first
        cached = cache.get(prompt)

        if cached:
            print(f"✅ Cache HIT: {prompt[:40]}...")
            print(f"   Saved time: {cached.get('generation_time', 0):.3f}s")
        else:
            print(f"❌ Cache MISS: {prompt[:40]}...")
            # Simulate API call
            start = time.time()
            time.sleep(0.5)  # Simulate generation time
            generation_time = time.time() - start

            result = {
                'text': 'Generated response...',
                'generation_time': generation_time
            }
            cache.put(prompt, result)

print("\n--- Cache Statistics ---")
stats = cache.stats()
print(f"Total requests: {stats['hits'] + stats['misses']}")
print(f"Cache hits: {stats['hits']}")
print(f"Cache misses: {stats['misses']}")
print(f"Hit rate: {stats['hit_rate']*100:.1f}%")
print(f"Cache size: {stats['cache_size']}")
python prompt_cache.py
Expected Output:
Testing prompt cache...

--- Round 1 ---
❌ Cache MISS: Translate to French: Hello, how are you?
❌ Cache MISS: Summarize this article: ...
❌ Cache MISS: Write a Python function to sort a list

--- Round 2 ---
✅ Cache HIT: Translate to French: Hello, how are you?
   Saved time: 0.502s
✅ Cache HIT: Summarize this article: ...
   Saved time: 0.501s
✅ Cache HIT: Write a Python function to sort a list
   Saved time: 0.500s

--- Cache Statistics ---
Total requests: 9
Cache hits: 6
Cache misses: 3
Hit rate: 66.7%
Cache size: 3
💡 Caching Impact:
For applications with common prompt patterns (chatbots, translation, etc.), caching can reduce:
  • Latency by 50-90% for cached requests
  • GPU usage by 30-60%
  • API costs proportional to hit rate
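
To wire the cache in front of the vLLM server from Exercise 2, a thin wrapper is enough. A sketch: it imports the PromptCache class from prompt_cache.py above and uses temperature 0 so identical prompts really do produce identical answers, which is what makes exact-match caching safe.

# cached_completion.py -- exact-match caching in front of the vLLM server
import requests
from prompt_cache import PromptCache   # the class defined above

cache = PromptCache(max_size=100)

def cached_completion(prompt: str, max_tokens: int = 100) -> dict:
    """Return the cached response for this exact prompt, or call vLLM."""
    hit = cache.get(prompt)
    if hit:
        return hit
    response = requests.post("http://localhost:8000/v1/completions", json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,   # deterministic output keeps cached answers valid
    })
    result = response.json()
    cache.put(prompt, result)
    return result

if __name__ == "__main__":
    print(cached_completion("Translate to French: Hello, how are you?"))
    print(cached_completion("Translate to French: Hello, how are you?"))  # cache hit
    print(cache.stats())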

Part C: Model Quantization with AWQ (13 minutes)

Quantize the model to 4-bit to reduce memory usage and increase throughput:

# Install AWQ
pip install autoawq

# Download pre-quantized model (saves time)
# Or quantize your own model - see lab-code.py for full script
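
If you do want to quantize your own checkpoint rather than download TheBloke's, the AutoAWQ workflow looks roughly like this. A sketch based on the autoawq README; verify the API against the version you installed, and expect the calibration pass to take a while on a 7B model.

# quantize_awq.py -- quantize a Llama-2 checkpoint to 4-bit AWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"
quant_path = "llama-2-7b-chat-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs the calibration pass
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)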

Launch vLLM with quantized model:

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-Chat-AWQ \
    --quantization awq \
    --port 8001 \
    --dtype float16 \
    --max-model-len 2048

Compare quantized vs full precision:

# compare_quantization.py
import requests
import time
import concurrent.futures

def benchmark_model(base_url, model_name, num_requests=10):
    """Benchmark model throughput"""
    def send_request():
        start = time.time()
        response = requests.post(f'{base_url}/v1/completions', json={
            'model': model_name,
            'prompt': 'Tell me about artificial intelligence in detail',
            'max_tokens': 100,
            'temperature': 0.7
        })
        return time.time() - start, response.json()

    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        results = [f.result() for f in futures]

    total_time = time.time() - start_time
    durations = [r[0] for r in results]

    return {
        'total_time': total_time,
        'avg_latency': sum(durations) / len(durations),
        'throughput': num_requests / total_time
    }

print("Benchmarking FP16 model (port 8000)...")
fp16_results = benchmark_model(
    'http://localhost:8000',
    'meta-llama/Llama-2-7b-chat-hf',
    num_requests=10
)

print("\nBenchmarking AWQ quantized model (port 8001)...")
awq_results = benchmark_model(
    'http://localhost:8001',
    'TheBloke/Llama-2-7B-Chat-AWQ',
    num_requests=10
)

print("\n" + "="*60)
print("QUANTIZATION COMPARISON")
print("="*60)
print(f"\n{'Metric':<30} {'FP16':<15} {'AWQ (4-bit)':<15} {'Improvement'}")
print("-"*60)
print(f"{'Throughput (req/s)':<30} {fp16_results['throughput']:<15.2f} {awq_results['throughput']:<15.2f} {awq_results['throughput']/fp16_results['throughput']:.2f}x")
print(f"{'Avg Latency (s)':<30} {fp16_results['avg_latency']:<15.2f} {awq_results['avg_latency']:<15.2f} {fp16_results['avg_latency']/awq_results['avg_latency']:.2f}x faster")
print(f"{'GPU Memory (estimated)':<30} {'~14 GB':<15} {'~4 GB':<15} {'3.5x less'}")
print(f"{'Cost Savings':<30} {'Baseline':<15} {'~70%':<15} {'Major'}")
python compare_quantization.py
Expected Output:
============================================================
QUANTIZATION COMPARISON
============================================================

Metric                         FP16            AWQ (4-bit)     Improvement
------------------------------------------------------------
Throughput (req/s)             1.19            2.34            1.97x
Avg Latency (s)                3.80            1.95            1.95x faster
GPU Memory (estimated)         ~14 GB          ~4 GB           3.5x less
Cost Savings                   Baseline        ~70%            Major
⚠️ Quantization Trade-offs:
While AWQ provides massive efficiency gains, be aware:
  • Slight quality degradation (usually <5% on benchmarks)
  • Not all models have pre-quantized versions
  • Quantization process takes time (1-2 hours for 7B model)
For most production use cases, the trade-off is worth it!


🎯 Bonus Challenge

Implement prefix caching in vLLM (use --enable-prefix-caching flag) and measure the improvement for chat applications with long system prompts.

EXERCISE 4

Monitoring & Observability: Prometheus + Grafana

⏱️ 30 minutes

Objective: Set up a complete monitoring stack with Prometheus and Grafana to track latency, throughput, costs, and system health.

Step 1: Export Metrics from vLLM

vLLM exposes Prometheus metrics by default. Verify they're available:

curl http://localhost:8000/metrics
Expected Output (sample):
# HELP vllm_num_requests_running Number of requests currently running
# TYPE vllm_num_requests_running gauge
vllm_num_requests_running 2.0
# HELP vllm_num_requests_waiting Number of requests waiting to be processed
# TYPE vllm_num_requests_waiting gauge
vllm_num_requests_waiting 0.0
# HELP vllm_gpu_cache_usage_perc GPU KV cache usage
# TYPE vllm_gpu_cache_usage_perc gauge
vllm_gpu_cache_usage_perc 45.2
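
If you just want a quick sanity check from Python before wiring up Prometheus, a few lines of regex over the exposition format go a long way. A sketch; the metric names match the sample above but can differ between vLLM versions:

# parse_metrics.py -- pull a few gauges straight from the /metrics endpoint
import re
import requests

text = requests.get("http://localhost:8000/metrics").text
for name in ("vllm_num_requests_running",
             "vllm_num_requests_waiting",
             "vllm_gpu_cache_usage_perc"):
    match = re.search(rf"^{name}\s+([0-9.eE+-]+)", text, re.MULTILINE)
    print(f"{name}: {match.group(1) if match else 'not found'}")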

Step 2: Set Up Prometheus

Create prometheus.yml configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Run Prometheus with Docker:

# With --network host, published ports (-p) are ignored; host networking also
# lets Prometheus scrape the vLLM server at localhost:8000.
docker run -d \
    --name prometheus \
    --network host \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

Verify Prometheus is scraping metrics:

# Open browser to http://localhost:9090
# Query: vllm_num_requests_running

Step 3: Set Up Grafana

Run Grafana with Docker:

docker run -d \
    --name grafana \
    --network host \
    grafana/grafana

Access Grafana:

  • URL: http://localhost:3000
  • Default credentials: admin / admin
  • Add Prometheus data source: http://localhost:9090

Step 4: Create LLM Dashboard

Import this dashboard JSON (grafana-dashboard.json):

{
  "dashboard": {
    "title": "vLLM Production Monitoring",
    "panels": [
      {
        "title": "Requests Per Second",
        "targets": [
          {
            "expr": "rate(vllm_request_success_total[1m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Running Requests",
        "targets": [
          {
            "expr": "vllm_num_requests_running"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Waiting Requests",
        "targets": [
          {
            "expr": "vllm_num_requests_waiting"
          }
        ],
        "type": "stat"
      },
      {
        "title": "GPU Cache Usage",
        "targets": [
          {
            "expr": "vllm_gpu_cache_usage_perc"
          }
        ],
        "type": "gauge"
      },
      {
        "title": "Time to First Token (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(vllm_time_to_first_token_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "E2E Request Latency (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

Step 5: Key Metrics to Monitor

Metric                     | What It Measures      | Target
Time to First Token (TTFT) | Perceived latency     | < 200 ms
E2E Request Latency        | Total generation time | < 2 s for 100 tokens
Throughput (tokens/sec)    | System capacity       | > 1000 tokens/sec
GPU Cache Usage            | Memory efficiency     | 70-90%
Requests Waiting           | Queue backlog         | < 5
Error Rate                 | Service reliability   | < 0.1%

Step 6: Set Up Alerts

Create alerting rules in prometheus_alerts.yml:

# prometheus_alerts.yml
groups:
  - name: vllm_alerts
    interval: 30s
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency detected"
          description: "p95 latency is {{ $value }}s (threshold: 5s)"

      - alert: HighQueueDepth
        expr: vllm_num_requests_waiting > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Request queue is backing up"
          description: "{{ $value }} requests waiting (threshold: 10)"

      - alert: GPUMemoryHigh
        expr: vllm_gpu_cache_usage_perc > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory near capacity"
          description: "GPU cache at {{ $value }}% (threshold: 95%)"

      - alert: HighErrorRate
        expr: rate(vllm_request_failure_total[5m]) / rate(vllm_request_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"
          description: "Error rate: {{ $value | humanizePercentage }}"

Step 7: Cost Tracking

Calculate cost per request based on GPU time:

# cost_calculator.py
def calculate_costs(metrics):
    """Calculate LLM serving costs"""

    # Assumptions
    GPU_COST_PER_HOUR = 1.50  # A10 GPU on cloud

    # Get metrics
    total_requests = metrics['total_requests']
    avg_latency_seconds = metrics['avg_latency']
    gpu_utilization = metrics['gpu_utilization']  # 0-1

    # Calculate (a rough upper bound: this treats requests as occupying the
    # GPU one at a time; continuous batching overlaps them, so real cost is lower)
    total_gpu_hours = (total_requests * avg_latency_seconds) / 3600
    effective_gpu_hours = total_gpu_hours / gpu_utilization
    total_cost = effective_gpu_hours * GPU_COST_PER_HOUR
    cost_per_request = total_cost / total_requests
    cost_per_1k_requests = cost_per_request * 1000

    return {
        'total_cost': total_cost,
        'cost_per_request': cost_per_request,
        'cost_per_1k_requests': cost_per_1k_requests,
        'gpu_hours': effective_gpu_hours
    }

# Example
metrics = {
    'total_requests': 10000,
    'avg_latency': 2.5,  # seconds
    'gpu_utilization': 0.85
}

costs = calculate_costs(metrics)
print(f"Total cost: ${costs['total_cost']:.2f}")
print(f"Cost per request: ${costs['cost_per_request']:.4f}")
print(f"Cost per 1K requests: ${costs['cost_per_1k_requests']:.2f}")
print(f"GPU hours used: {costs['gpu_hours']:.2f}")
Expected Output:
Total cost: $12.25
Cost per request: $0.0012
Cost per 1K requests: $1.22
GPU hours used: 8.17
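
To feed calculate_costs() with live numbers instead of the hard-coded example, you can pull them from Prometheus's HTTP API. A sketch: it assumes the Prometheus instance from Step 2 on port 9090, that cost_calculator.py is importable, and that your vLLM build exports these metric names (adjust the queries to whatever /metrics actually shows); GPU cache usage is only a rough stand-in for utilization here.

# live_costs.py -- drive the cost calculator from Prometheus
import requests
from cost_calculator import calculate_costs

PROM = "http://localhost:9090"

def prom_query(expr: str) -> float:
    """Run an instant query and return the first value (0.0 if empty)."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

gpu_util = prom_query("avg(vllm_gpu_cache_usage_perc)") / 100

metrics = {
    "total_requests": prom_query("sum(vllm_request_success_total)") or 1,
    "avg_latency": prom_query(
        "sum(rate(vllm_e2e_request_latency_seconds_sum[1h]))"
        " / sum(rate(vllm_e2e_request_latency_seconds_count[1h]))"
    ),
    "gpu_utilization": gpu_util or 0.85,  # fall back if the gauge is empty
}

print(calculate_costs(metrics))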


💡 Production Monitoring Best Practices:
  • Track p95/p99 latency, not just averages
  • Monitor GPU cache usage to optimize batch sizes
  • Set up alerts for queue depth spikes
  • Calculate cost per request for budget planning
  • Use distributed tracing (Jaeger/Zipkin) for complex workflows
EXERCISE 5

Production Deployment: Kubernetes + Load Balancing

⏱️ 35 minutes

Objective: Containerize the LLM service with Docker, deploy to Kubernetes, add load balancing, and test under production-like load.

Step 1: Create Dockerfile

Create Dockerfile for the vLLM service:

# Dockerfile
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

# Install Python 3.10
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM
RUN pip3 install vllm

# Download model at build time (optional, for faster startup)
# RUN python3 -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
#     AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf'); \
#     AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"

EXPOSE 8000

# Run vLLM server
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "meta-llama/Llama-2-7b-chat-hf", \
     "--port", "8000", \
     "--host", "0.0.0.0"]

Step 2: Build and Test Docker Image

# Build image
docker build -t vllm-server:latest .

# Run container with GPU
docker run -d \
    --gpus all \
    --name vllm-container \
    -p 8000:8000 \
    vllm-server:latest

# Test
curl http://localhost:8000/v1/models
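
Model download and load can take several minutes inside the container, so the first curl often fails. A small poller saves guessing (a sketch):

# wait_for_server.py -- poll until the containerized vLLM server is ready
import time
import requests

URL = "http://localhost:8000/v1/models"
deadline = time.time() + 15 * 60   # allow up to 15 minutes for download + load

while time.time() < deadline:
    try:
        if requests.get(URL, timeout=5).status_code == 200:
            print("Server is ready")
            break
    except requests.RequestException:
        pass   # not up yet
    time.sleep(10)
else:
    raise SystemExit("Server not ready in time; check `docker logs vllm-container`")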

Step 3: Create Docker Compose for Full Stack

Create docker-compose.yml with vLLM + Prometheus + Grafana:

# docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm-server:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana-dashboard.json:/etc/grafana/provisioning/dashboards/vllm.json
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
docker-compose up -d

Step 4: Kubernetes Deployment

Create k8s-deployment.yaml:

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  labels:
    app: vllm
spec:
  replicas: 2  # Start with 2 replicas
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1  # 1 GPU per pod
          requests:
            memory: "16Gi"
            cpu: "4"
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-2-7b-chat-hf"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Step 5: Deploy to Kubernetes

# Apply deployment
kubectl apply -f k8s-deployment.yaml

# Check status
kubectl get pods -l app=vllm
kubectl get svc vllm-service

# Get load balancer IP
kubectl get svc vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
Expected Output:
deployment.apps/vllm-server created
service/vllm-service created
horizontalpodautoscaler.autoscaling/vllm-hpa created

NAME                      READY   STATUS    RESTARTS   AGE
vllm-server-abc123      1/1     Running   0          2m
vllm-server-def456      1/1     Running   0          2m

Step 6: Load Testing

Install and run load testing tool:

pip install locust

Create locustfile.py:

# locustfile.py
from locust import HttpUser, task, between
import json

class VLLMUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def generate_completion(self):
        payload = {
            "model": "meta-llama/Llama-2-7b-chat-hf",
            "prompt": "Explain artificial intelligence in simple terms",
            "max_tokens": 100,
            "temperature": 0.7
        }

        headers = {"Content-Type": "application/json"}

        with self.client.post(
            "/v1/completions",
            data=json.dumps(payload),
            headers=headers,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")

Run load test:

# Start with 10 users, ramp up to 100
locust -f locustfile.py --host http://LOAD_BALANCER_IP

# Or headless mode
locust -f locustfile.py --host http://LOAD_BALANCER_IP \
    --users 100 --spawn-rate 10 --run-time 5m --headless
Expected Output:
[2024-01-15 10:30:00] Starting Locust 2.15.1
[2024-01-15 10:30:00] Ramping to 100 users at rate of 10 users/s

Type     Name                                   # reqs      # fails     Avg     Min     Max   Median
POST     /v1/completions                     5000        0        2100    1800    3200    2050

Aggregated                              5000        0        2100    1800    3200    2050

Response time percentiles (approximated):
50%    2050ms
95%    2800ms
99%    3100ms

Step 7: Monitor Auto-Scaling

Watch Kubernetes auto-scale based on load:

# Watch HPA
kubectl get hpa vllm-hpa --watch

# Watch pods
kubectl get pods -l app=vllm --watch
Expected Output:
NAME        REFERENCE                TARGETS    MINPODS   MAXPODS   REPLICAS
vllm-hpa   Deployment/vllm-server    45%/70%    2         10        2
vllm-hpa   Deployment/vllm-server    78%/70%    2         10        3
vllm-hpa   Deployment/vllm-server    85%/70%    2         10        4
💡 Auto-Scaling Strategy:
In production, tune auto-scaling based on:
  • Queue depth (vllm_num_requests_waiting)
  • GPU utilization (target 70-85%)
  • Request latency (scale up if p95 > threshold)
  • Cost constraints (max replicas based on budget)
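
As an illustration of scaling on queue depth rather than CPU, here is a minimal external scaler loop. It is a sketch only, not a replacement for the HPA or a proper metrics-adapter setup; it assumes kubectl access to the cluster and the Prometheus instance from Exercise 4 aggregating the pods' metrics.

# queue_scaler.py -- scale the vllm-server Deployment on vllm_num_requests_waiting
import subprocess
import time
import requests

PROM = "http://localhost:9090"
MIN_REPLICAS, MAX_REPLICAS = 2, 10
TARGET_WAITING_PER_POD = 5   # matches the "< 5 waiting" target from Exercise 4

def waiting_requests() -> float:
    """Total queued requests across all pods, via Prometheus."""
    resp = requests.get(f"{PROM}/api/v1/query",
                        params={"query": "sum(vllm_num_requests_waiting)"})
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def current_replicas() -> int:
    out = subprocess.run(
        ["kubectl", "get", "deployment", "vllm-server",
         "-o", "jsonpath={.spec.replicas}"],
        capture_output=True, text=True, check=True)
    return int(out.stdout)

while True:
    waiting = waiting_requests()
    replicas = current_replicas()
    if waiting > TARGET_WAITING_PER_POD * replicas:
        desired = min(MAX_REPLICAS, replicas + 1)   # queue backing up: scale out
    elif waiting == 0:
        desired = max(MIN_REPLICAS, replicas - 1)   # idle: scale in slowly
    else:
        desired = replicas
    if desired != replicas:
        subprocess.run(["kubectl", "scale", "deployment/vllm-server",
                        f"--replicas={desired}"], check=True)
        print(f"waiting={waiting:.0f}, scaled {replicas} -> {desired}")
    time.sleep(30)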

Step 8: Final Performance Summary

Collect and compare all metrics:

Configuration              | Throughput | p95 Latency | Cost/1K req | Improvement
Naive Flask (Exercise 1)   | 0.29 req/s | 17.3s       | $8.50       | Baseline
vLLM Basic (Exercise 2)    | 1.19 req/s | 4.1s        | $2.10       | 4.1x
vLLM + Quantization (Ex 3) | 2.34 req/s | 2.0s        | $1.22       | 8.1x
K8s + Load Balancer (Ex 5) | 12+ req/s  | 2.8s        | $1.10       | 41x


🎯 Bonus Challenge

Add a Redis cache layer between the load balancer and vLLM servers to cache responses for duplicate prompts. Measure the cache hit rate and cost savings.

🎉 Congratulations!

You've successfully built a production-grade LLM serving infrastructure from scratch! You now understand the key techniques that power real-world AI applications.

What You Learned

  • ✅ Why naive serving doesn't scale (sequential processing bottleneck)
  • ✅ How PagedAttention and continuous batching improve throughput by 4-8x
  • ✅ Streaming for reduced perceived latency (20-30x improvement)
  • ✅ Prompt caching to save 50-90% on repeated requests
  • ✅ Model quantization (AWQ) for 3.5x memory savings and 2x speedup
  • ✅ Production monitoring with Prometheus and Grafana
  • ✅ Kubernetes deployment with auto-scaling
  • ✅ Load testing and performance benchmarking

Real-World Impact

By applying these techniques, you transformed a system that could handle 0.29 requests/second into one that handles 12+ requests/second - a 41x improvement! You also reduced costs from $8.50 to $1.10 per 1,000 requests.

Next Steps

  • Take the Module 5 Quiz to test your knowledge
  • Experiment with different models and batch sizes
  • Try other serving frameworks (TensorRT-LLM, Text Generation Inference)
  • Implement A/B testing for model evaluation
  • Add distributed tracing (Jaeger) for complex workflows