MODULE 5 - CHAPTER 3

vLLM and Production Serving Frameworks

Deploy production-grade LLM endpoints with vLLM, TGI, and TensorRT-LLMβ€”framework comparison and hands-on implementation

Understanding the theory behind PagedAttention and continuous batching is essential, but implementing these techniques from scratch would require months of engineering effort and deep expertise in GPU programming. Fortunately, several production-ready frameworks have emerged that implement these optimizations and provide simple APIs for serving LLMs at scale.

In this chapter, we'll explore the three leading LLM serving frameworks: vLLM, Text Generation Inference (TGI), and TensorRT-LLM. We'll compare their strengths, walk through practical deployment examples, and provide guidance on choosing the right framework for your use case.

vLLM: The PagedAttention Pioneer

vLLM is an open-source inference and serving engine developed by researchers at UC Berkeley. It is the original implementation of PagedAttention and has quickly become a de facto standard for high-throughput LLM serving.

Key Features

  • PagedAttention: vLLM's core innovation. It virtualizes KV Cache memory using fixed-size blocks, reducing memory waste from 60-80% to less than 4%. Impact: 2-4x higher throughput than other optimized frameworks, and 20-50x better than naive implementations.
  • Continuous batching: Dynamically schedules requests, adding new ones to the active batch as soon as others complete. This keeps GPU utilization at 85-95%.
  • OpenAI-compatible API: A drop-in replacement for OpenAI's API. Switch from GPT-4 to your self-hosted model by changing one URL, with no code changes required. Example: change https://api.openai.com/v1 to http://your-server:8000/v1 and you're serving from your own infrastructure.
  • Efficient memory sharing: For tasks like parallel sampling (generating multiple outputs from one prompt), vLLM shares the prompt's KV Cache across sequences using copy-on-write, saving massive amounts of memory (see the sketch below).
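To make the last point concrete, here is a minimal sketch of parallel sampling using the Python API introduced below (the model name and prompt are illustrative). vLLM computes the prompt's KV Cache once and shares those blocks across all n candidate sequences via copy-on-write.

from vllm import LLM, SamplingParams

# One prompt, several completions: the prompt's KV Cache blocks are shared
# across all n sequences, so memory grows only with the newly generated tokens.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(n=4, temperature=0.9, max_tokens=128)

outputs = llm.generate(["Suggest a name for a coffee shop."], params)
for i, candidate in enumerate(outputs[0].outputs):
    print(f"--- Candidate {i + 1} ---\n{candidate.text}")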

Installation and Setup

# Install vLLM (requires CUDA 11.8+, Python 3.8-3.11)
pip install vllm

# Or with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118

# Verify installation
python -c "import vllm; print(vllm.__version__)"

Basic Usage: Python API

from vllm import LLM, SamplingParams

# Initialize the model
# This loads the model and prepares the vLLM engine
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,  # Number of GPUs
    dtype="float16",
    max_model_len=2048,
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Single generation
prompts = ["Explain quantum computing in simple terms."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Generated: {generated_text}")


# Batch generation (much more efficient!)
prompts = [
    "Explain quantum computing in simple terms.",
    "What is the capital of France?",
    "Write a haiku about AI.",
    "How do neural networks work?",
]

# vLLM automatically batches and optimizes these
outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"\nPrompt: {prompt}")
    print(f"Response: {output.outputs[0].text}")
    print(f"Tokens generated: {len(output.outputs[0].token_ids)}")
    print(f"Time: {output.metrics.finished_time - output.metrics.started_time:.2f}s")

OpenAI-Compatible Server

# Start vLLM server with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --dtype float16

# Server provides endpoints:
# - POST /v1/completions (text completion)
# - POST /v1/chat/completions (chat completion)
# - GET /v1/models (list models)
# - GET /health (health check)

# Client code (drop-in OpenAI replacement)
from openai import OpenAI

# Point to your vLLM server instead of OpenAI
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require API key
    base_url="http://localhost:8000/v1",
)

# Use exactly like OpenAI API
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain PagedAttention."}
    ],
    temperature=0.7,
    max_tokens=300,
)

print(response.choices[0].message.content)


# Streaming example
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Write a short story about AI."}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
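The /v1/completions endpoint listed above works the same way for plain text completion (no chat template applied). A minimal sketch, reusing the client created above; the prompt is illustrative:

# Plain text completion via /v1/completions
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="PagedAttention reduces KV Cache waste by",
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].text)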

Advanced Configuration

from vllm import LLM, SamplingParams

# Production configuration with optimizations
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",

    # GPU configuration
    tensor_parallel_size=2,  # Use 2 GPUs with tensor parallelism
    gpu_memory_utilization=0.90,  # Let vLLM use up to 90% of GPU memory (weights + KV cache)

    # Performance tuning
    max_num_batched_tokens=8192,  # Max tokens processed in one batch
    max_num_seqs=256,  # Max concurrent sequences
    max_model_len=4096,  # Max sequence length

    # Quantization for speed and memory (only for AWQ-quantized checkpoints;
    # leave unset for the fp16 base model used above)
    # quantization="awq",
    dtype="float16",

    # Enable optimizations
    enforce_eager=False,  # Use CUDA graphs for faster inference
    trust_remote_code=True,  # Required for some custom models
)

# Advanced sampling parameters
sampling_params = SamplingParams(
    # Sampling strategy
    temperature=0.8,
    top_p=0.95,
    top_k=50,

    # Generation control
    max_tokens=1024,
    min_tokens=10,
    stop=["</s>", "\n\n\n"],  # stop strings (e.g., the Llama 2 end-of-sequence token)

    # Frequency and presence penalties
    frequency_penalty=0.1,
    presence_penalty=0.1,

    # Multiple outputs per prompt
    n=3,  # Generate 3 different completions
    best_of=5,  # Sample 5 candidates and return the best 3
)
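With n=3, each prompt returns a single RequestOutput holding three completions. A short usage sketch (assuming the llm object and sampling_params configured above; the prompt is illustrative):

# Iterate over the n completions generated for one prompt
outputs = llm.generate(["Propose a tagline for an AI startup."], sampling_params)
for i, candidate in enumerate(outputs[0].outputs):
    print(f"Completion {i + 1}: {candidate.text.strip()}")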

Text Generation Inference (TGI): Hugging Face's Solution

Text Generation Inference (TGI) is Hugging Face's production-grade serving framework. It powers inference for many of Hugging Face's hosted models and provides excellent integration with the Hugging Face ecosystem.

Key Features

  • Hugging Face native: Seamless integration with Hub models, works with any Transformers-compatible model
  • Continuous batching: Dynamic batching for high throughput
  • Token streaming: Server-Sent Events (SSE) for real-time streaming
  • Safetensors support: Fast and secure model loading
  • Quantization: Built-in support for bitsandbytes, GPTQ, AWQ
  • Flash Attention: Optimized attention implementation for speed

Installation (Docker Recommended)

# Pull TGI Docker image
docker pull ghcr.io/huggingface/text-generation-inference:latest

# Run TGI server
docker run -d --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --num-shard 1 \
    --max-concurrent-requests 128 \
    --max-total-tokens 4096

# Check server health
curl http://localhost:8080/health

Client Usage

from huggingface_hub import InferenceClient

# Connect to TGI server
client = InferenceClient(model="http://localhost:8080")

# Text generation
prompt = "Explain machine learning in simple terms."
response = client.text_generation(
    prompt,
    max_new_tokens=500,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(response)


# Streaming generation
for token in client.text_generation(
    prompt,
    max_new_tokens=500,
    stream=True,
):
    print(token, end="", flush=True)


# Chat completion format (for chat models)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the meaning of life?"}
]

response = client.chat_completion(
    messages=messages,
    max_tokens=300,
    temperature=0.8,
)
print(response.choices[0].message.content)
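If you prefer plain HTTP over the InferenceClient wrapper, TGI also exposes a REST endpoint at /generate (and /generate_stream for SSE streaming). A minimal sketch against the server started above:

import requests

# Raw REST call to TGI's /generate endpoint
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain machine learning in simple terms.",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])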

TGI vs vLLM: When to Choose TGI

Use Case                  TGI Strengths
Hugging Face Ecosystem    Native integration, can serve any HF model instantly
Production Deployment     Battle-tested, powers HF Inference API
Docker/Kubernetes         Official Docker images, easy container deployment
Multi-model Serving       Can easily switch between models without code changes

βš–οΈ Performance Comparison

In benchmarks, vLLM typically achieves 20-30% higher throughput than TGI due to its more aggressive PagedAttention optimizations. However, TGI offers better ecosystem integration and is easier to deploy in containerized environments. For maximum performance, choose vLLM. For maximum compatibility and ease of deployment, choose TGI.

TensorRT-LLM: NVIDIA's Performance Champion

TensorRT-LLM is NVIDIA's framework for optimizing and serving LLMs on NVIDIA GPUs. It provides the absolute highest performance but requires more setup and is less flexible than vLLM or TGI.

Key Features

  • Maximum performance: 2-3x faster than vLLM for certain models and hardware configurations
  • Advanced optimizations: Custom CUDA kernels, graph optimizations, kernel fusion
  • Multi-GPU support: Tensor parallelism, pipeline parallelism, expert parallelism (for MoE models)
  • In-flight batching: NVIDIA's implementation of continuous batching
  • Quantization: INT8, INT4, FP8 support with minimal accuracy loss

When to Use TensorRT-LLM

  • Best for high-volume production: If you're serving millions of requests per day and every millisecond matters, TensorRT-LLM's extreme optimizations justify the additional setup complexity. Example: a large-scale chatbot serving 10M+ daily users, where even a 10% latency improvement translates to significant infrastructure cost savings.
  • NVIDIA hardware stack: If you're already invested in NVIDIA's ecosystem (A100, H100, L40S GPUs) and want to squeeze maximum performance from your hardware.

Installation (Requires NVIDIA GPUs)

# TensorRT-LLM requires Docker for easiest setup
docker pull nvcr.io/nvidia/tensorrt-llm:latest

# Or install from source (complex)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install -r requirements.txt
python setup.py install

Model Compilation (Required)

Unlike vLLM and TGI, TensorRT-LLM requires you to compile your model into an optimized engine before serving. This is a one-time process per model:

# Convert Hugging Face model to TensorRT-LLM format
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-2-7b-chat-hf \
    --output_dir ./llama-2-7b-trt \
    --dtype float16

# Build TensorRT engine
trtllm-build \
    --checkpoint_dir ./llama-2-7b-trt \
    --output_dir ./llama-2-7b-engine \
    --gemm_plugin float16 \
    --max_batch_size 256 \
    --max_input_len 2048 \
    --max_output_len 1024

# This creates optimized engine files that are ready to serve

Serving with Triton Inference Server

# TensorRT-LLM integrates with NVIDIA Triton for serving
# Start Triton server with TensorRT-LLM backend
docker run -d --gpus all \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    -v ${PWD}/llama-2-7b-engine:/models \
    nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

# Client code using Triton (simplified sketch; the exact tensor names and shapes
# depend on the model repository's config.pbtxt for the tensorrt_llm backend)
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

client = httpclient.InferenceServerClient(url="localhost:8000")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Prepare input: send token IDs, not raw text
prompt = "Explain quantum entanglement."
input_ids = np.array([tokenizer.encode(prompt)], dtype=np.int32)  # shape [1, num_tokens]
inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
]
inputs[0].set_data_from_numpy(input_ids)

# Request inference
outputs = [httpclient.InferRequestedOutput("output_ids")]
response = client.infer("tensorrt_llm", inputs, outputs=outputs)

# Decode the generated token IDs back to text
output_ids = response.as_numpy("output_ids")
generated_text = tokenizer.decode(output_ids.flatten(), skip_special_tokens=True)
print(generated_text)

Performance Benchmarks

Model        Framework      Throughput (req/sec)   Latency (ms)
Llama 2 7B   TensorRT-LLM   42                      85
Llama 2 7B   vLLM           35                      102
Llama 2 7B   TGI            28                      125
Llama 2 7B   Naive (HF)     2                       1200+

Benchmark conditions: A100 80GB GPU, batch size=32, input=512 tokens, output=128 tokens
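To translate requests per second into token throughput, multiply by the 128 output tokens per request used in this benchmark. A quick back-of-the-envelope calculation:

# Approximate output-token throughput implied by the benchmark table above
output_tokens_per_request = 128
for framework, req_per_sec in [("TensorRT-LLM", 42), ("vLLM", 35), ("TGI", 28), ("Naive (HF)", 2)]:
    print(f"{framework:>14}: ~{req_per_sec * output_tokens_per_request:,} output tokens/sec")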

Choosing the Right Framework

Each framework has distinct strengths. Here's a comprehensive comparison to guide your decision:

Criteria            vLLM                  TGI                          TensorRT-LLM
Performance         Excellent (95/100)    Very Good (85/100)           Best (100/100)
Ease of Use         Excellent             Excellent                    Moderate
Model Support       50+ models            Any HF model                 Manual porting
Setup Complexity    pip install           Docker run                   Complex build
Deployment          Simple                Very Simple                  Moderate
OpenAI Compatible   Yes                   Yes                          No (Triton API)
Quantization        AWQ, GPTQ             AWQ, GPTQ, bitsandbytes      INT8, INT4, FP8
Multi-GPU           Tensor parallel       Tensor parallel              All strategies
Community           Very Active           Very Active                  Moderate
Cost (Setup Time)   1-2 hours             1-2 hours                    1-2 days

Decision Matrix

Choose vLLM if:

  • βœ… You want the best performance-to-ease ratio
  • βœ… You need OpenAI API compatibility
  • βœ… You're using popular open-source models (Llama, Mistral, Phi, etc.)
  • βœ… You want fast iteration and experimentation
  • βœ… You value community support and frequent updates

Choose TGI if:

  • βœ… You're deeply integrated with Hugging Face ecosystem
  • βœ… You need to serve many different models dynamically
  • βœ… You prefer containerized deployment (Docker/K8s)
  • βœ… You want production-proven stability (powers HF Inference API)
  • βœ… You need excellent documentation and enterprise support

Choose TensorRT-LLM if:

  • βœ… You need absolute maximum performance
  • βœ… You're serving millions of requests per day
  • βœ… You have NVIDIA hardware (A100, H100, L40S)
  • βœ… You can invest engineering time in model optimization
  • βœ… Infrastructure cost savings justify setup complexity

Recommendation for most practitioners: Start with vLLM. It delivers most of TensorRT-LLM's throughput (roughly 80-95% in the benchmarks above) with a fraction of the setup effort. If you later identify specific performance bottlenecks and have the engineering resources, you can migrate to TensorRT-LLM for maximum optimization.

Production Deployment Best Practices

1. Docker Deployment (Recommended)

# Dockerfile for vLLM production deployment
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git

# Install vLLM
RUN pip3 install vllm

# Copy model (or download at runtime)
# COPY ./models /models

# Set environment variables
ENV MODEL_NAME="meta-llama/Llama-2-7b-chat-hf"
ENV TENSOR_PARALLEL_SIZE="1"
ENV GPU_MEMORY_UTILIZATION="0.90"

# Expose port
EXPOSE 8000

# Start server
CMD python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION

# Build and run
docker build -t vllm-server:latest .

docker run -d \
    --name vllm-llama2 \
    --gpus all \
    -p 8000:8000 \
    -e HF_TOKEN=your_huggingface_token \
    vllm-server:latest
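Once the container is up, a quick smoke test against the endpoints listed earlier (assuming the server is reachable on localhost:8000):

import requests

# Basic liveness and model-listing checks against the running container
print(requests.get("http://localhost:8000/health", timeout=10).status_code)  # expect 200
print(requests.get("http://localhost:8000/v1/models", timeout=10).json())    # served model(s)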

2. Kubernetes Deployment

# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2  # Scale as needed
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1  # Request 1 GPU per pod
          requests:
            memory: "32Gi"
            cpu: "8"
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-2-7b-chat-hf"
        - name: TENSOR_PARALLEL_SIZE
          value: "1"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10

---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer

3. Load Balancing and Scaling

# nginx.conf for load balancing multiple vLLM instances
upstream vllm_backend {
    # Least connections strategy for LLM serving
    least_conn;

    server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name llm-api.example.com;

    # Increase timeouts for long-running LLM requests
    proxy_connect_timeout 300s;
    proxy_send_timeout 300s;
    proxy_read_timeout 300s;

    # Enable streaming
    proxy_buffering off;
    proxy_cache off;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # For Server-Sent Events (streaming)
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding on;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://vllm_backend/health;
    }
}

4. Monitoring and Observability

# Add Prometheus metrics to vLLM deployment
from prometheus_client import Counter, Histogram, Gauge
import time

# Metrics
request_count = Counter('vllm_requests_total', 'Total requests')
request_duration = Histogram('vllm_request_duration_seconds', 'Request duration')
tokens_generated = Counter('vllm_tokens_generated_total', 'Total tokens generated')
active_requests = Gauge('vllm_active_requests', 'Currently active requests')

# Wrap generation with metrics ("llm" is the LLM instance created earlier)
@request_duration.time()
def generate_with_metrics(prompt, sampling_params):
    request_count.inc()
    active_requests.inc()

    try:
        output = llm.generate([prompt], sampling_params)[0]
        tokens_generated.inc(len(output.outputs[0].token_ids))
        return output
    finally:
        active_requests.dec()


# Expose metrics endpoint
from prometheus_client import start_http_server
start_http_server(9090)  # Metrics available at :9090/metrics
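Recent vLLM versions also expose built-in Prometheus metrics at /metrics on the OpenAI-compatible server's port, so a custom wrapper like the one above is optional. A quick way to inspect them (assuming the server from earlier on localhost:8000):

import requests

# List a few of vLLM's built-in Prometheus metrics (names are prefixed with "vllm:")
metrics_text = requests.get("http://localhost:8000/metrics", timeout=10).text
vllm_lines = [line for line in metrics_text.splitlines() if line.startswith("vllm:")]
print("\n".join(vllm_lines[:20]))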

Summary: Building Production LLM Infrastructure

πŸ”‘ Key Takeaways

  • vLLM is the default choice: Best performance-to-ease ratio, excellent community support, OpenAI-compatible API
  • TGI for Hugging Face integration: Seamless ecosystem integration, production-proven, containerization-friendly
  • TensorRT-LLM for extreme optimization: Maximum performance but requires significant setup investment
  • All three are production-ready: Each powers large-scale deployments serving millions of requests
  • Start simple, optimize later: Begin with vLLM, profile your workload, migrate to TensorRT-LLM only if justified

With these frameworks, you can deploy production-grade LLM endpoints that achieve 20-50x better throughput than naive implementations. In the next chapter, we'll explore cost and latency optimization techniques to further improve efficiency and user experience.
