Test your knowledge of production LLM serving, optimization techniques, monitoring, and deployment.
Sequential processing forces each request to wait for the previous one to complete, so total latency scales linearly with the number of requests instead of benefiting from concurrency. This is why 5 requests take ~5x as long as a single request rather than being processed in parallel.
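To make the linear-vs-concurrent difference concrete, here is a minimal client-side sketch (the localhost:8000 endpoint and payload are assumptions based on the vLLM server used later in this quiz) that times five requests sent one after another against five sent concurrently:

```python
# Minimal sketch: sequential vs. concurrent request timing.
# The endpoint URL and payload are placeholders, not from the lab.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server
PAYLOAD = {"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello", "max_tokens": 64}

def one_request():
    return requests.post(URL, json=PAYLOAD, timeout=60).json()

# Sequential: total time grows linearly with the number of requests.
start = time.perf_counter()
for _ in range(5):
    one_request()
print(f"sequential: {time.perf_counter() - start:.2f}s")

# Concurrent: the server can batch overlapping requests, so wall-clock
# time is far less than 5x a single request.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(lambda _: one_request(), range(5)))
print(f"concurrent: {time.perf_counter() - start:.2f}s")
```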
The lab showed vLLM improving throughput from 0.29 req/s to 1.19 req/s, a 4.1x improvement from continuous batching and PagedAttention alone, before any other optimizations were applied.
The lab achieved 12+ req/s with Kubernetes + load balancing + quantization, representing a 41x improvement over the naive baseline.
Time to first token (TTFT) is crucial for perceived latency. Users see a response in ~0.1s with streaming instead of waiting 2-3s, making the experience feel 20-30x faster even if total latency is the same.
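As a rough way to see this yourself, the sketch below (assuming the same local vLLM server, and using the streaming request pattern covered at the end of this quiz) records when the first streamed chunk arrives versus when the full response finishes:

```python
# Sketch: measure time-to-first-token (TTFT) vs. total latency with a
# streaming request. URL and model name assumed from the lab setup.
import time

import requests

start = time.perf_counter()
with requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Explain PagedAttention in one sentence.",
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
) as resp:
    ttft = None
    for line in resp.iter_lines():
        if line and ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrives here
total = time.perf_counter() - start
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```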
PagedAttention manages KV cache in fixed-size pages (like OS virtual memory), eliminating fragmentation. This allows 3-4x higher batch sizes and 2-4x better throughput.
Continuous batching dynamically adds new requests to the batch as soon as slots become available (when other requests complete), maximizing GPU utilization without waiting for entire batches to finish.
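The scheduling idea can be illustrated with a toy simulation (the slot count and per-request lengths below are made up purely for illustration): whenever a request finishes, a waiting request is pulled into the batch on the very next step rather than waiting for the whole batch to drain.

```python
# Toy simulation of continuous batching: requests join the batch as soon
# as a slot frees up, instead of waiting for the whole batch to finish.
from collections import deque

MAX_SLOTS = 4
queue = deque([(f"req{i}", (i % 3) + 2) for i in range(8)])  # (id, decode steps remaining)
running = []

step = 0
while queue or running:
    # Fill any free slots immediately (continuous batching).
    while queue and len(running) < MAX_SLOTS:
        running.append(list(queue.popleft()))
    # One decode step for every request currently in the batch.
    for req in running:
        req[1] -= 1
    finished = [r[0] for r in running if r[1] == 0]
    running = [r for r in running if r[1] > 0]
    step += 1
    if finished:
        print(f"step {step}: finished {finished}, batch refills next step")
```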
PagedAttention typically enables 3-4x higher batch sizes by eliminating KV-cache memory fragmentation: because pages are fixed-size and allocated on demand, nearly all of the reserved cache memory stays usable instead of being lost to fragmented gaps.
When a request completes, its PagedAttention pages are returned to a free pool and can immediately be reused by new requests, ensuring efficient memory recycling without fragmentation.
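A minimal sketch of that free-pool mechanism (the page counts and API below are illustrative, not vLLM's actual internals):

```python
# Sketch of the free-page pool behind PagedAttention: fixed-size pages are
# handed out on demand and returned to the pool when a request completes,
# so memory is recycled without fragmentation.
class PagePool:
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))      # indices of free pages
        self.owned: dict[str, list[int]] = {}   # request id -> its pages

    def allocate(self, request_id: str, n: int) -> list[int]:
        if len(self.free) < n:
            raise MemoryError("no free KV-cache pages")
        pages = [self.free.pop() for _ in range(n)]
        self.owned.setdefault(request_id, []).extend(pages)
        return pages

    def release(self, request_id: str) -> None:
        # Pages go straight back to the pool and can be reused immediately.
        self.free.extend(self.owned.pop(request_id, []))

pool = PagePool(num_pages=16)
pool.allocate("req-A", 6)
pool.allocate("req-B", 4)
pool.release("req-A")   # req-A finishes; its 6 pages are free again
print(len(pool.free))   # 12
```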
vLLM was used in the lab due to its ease of deployment, OpenAI-compatible API, and excellent performance through PagedAttention and continuous batching.
TensorRT-LLM provides the highest raw inference speed on NVIDIA GPUs through aggressive optimizations, but requires more complex setup. vLLM offers better ease of use with good performance.
The lab used AWQ (Activation-aware Weight Quantization) which quantizes to 4-bit while maintaining quality, providing 3.5x memory savings and ~2x speedup.
AWQ quantizes from FP16 (2 bytes/weight) to 4-bit (0.5 bytes/weight), reducing a 7B model from ~14 GB to ~4 GB, a 3.5x reduction.
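A quick back-of-the-envelope check of those numbers (weights only, ignoring KV cache and runtime overhead):

```python
# Weight memory for a 7B-parameter model at FP16 vs. 4-bit.
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, ratio: {fp16_gb / int4_gb:.1f}x")
# FP16: 14.0 GB, 4-bit: 3.5 GB in theory; ~4 GB in practice once overhead is included
```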
Model quantization (AWQ) was the primary cost optimization, reducing memory from 14GB to 4GB, allowing use of cheaper GPUs (A10 vs A100) and enabling higher throughput per GPU.
Streaming reduced perceived latency by 20-30x: users saw first token in ~0.1s instead of waiting 2-3s for complete response. Total latency remained the same, but UX improved dramatically.
Prompt caching stores responses for identical prompts (or KV cache for common prompt prefixes), avoiding redundant computation. For applications with repeated patterns, this can save 50-90% of compute.
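A minimal sketch of the simplest variant, response-level caching keyed on the exact prompt (the endpoint, model name, and helper below are illustrative; prefix/KV caching happens inside the serving engine itself):

```python
# Response-level prompt cache: identical prompts are answered from a local
# dict instead of hitting the model again.
import hashlib

import requests

_cache: dict[str, str] = {}

def cached_completion(prompt: str, max_tokens: int = 100) -> str:
    key = hashlib.sha256(f"{prompt}|{max_tokens}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # cache hit: no GPU work at all
    resp = requests.post(
        "http://localhost:8000/v1/completions",  # assumed local vLLM server
        json={"model": "meta-llama/Llama-2-7b-chat-hf",
              "prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    ).json()
    text = resp["choices"][0]["text"]
    _cache[key] = text
    return text
```

Response-level caching like this is only safe when decoding is deterministic (e.g. temperature 0) or mild staleness is acceptable.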
# Which command correctly launches vLLM with high GPU utilization?
The correct command is: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.9. This launches the OpenAI-compatible API server with 90% GPU memory utilization.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.9
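Once the server is up, a quick way to confirm it is serving the expected model is to query the OpenAI-compatible /v1/models route (assuming the default host and port):

```python
# Sanity check that the server launched above is reachable.
import requests

models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])  # should include 'meta-llama/Llama-2-7b-chat-hf'
```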
Time to first token (TTFT) at p95 is the most important metric for user experience, as it directly measures perceived responsiveness. Target: < 200ms.
70-90% GPU cache usage is ideal: high enough to maximize throughput, but with headroom to handle traffic spikes without OOM errors. Above 95% risks instability.
The correct Prometheus query for p95 latency is: histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m])). This calculates the 95th percentile from histogram buckets.
histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))
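The same query can also be run programmatically through Prometheus's HTTP API, for example from a health-check script (the localhost:9090 address is an assumption about where Prometheus is running):

```python
# Run the p95 latency query against Prometheus's HTTP API.
import requests

QUERY = 'histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))'
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": QUERY},
    timeout=10,
).json()
for result in resp["data"]["result"]:
    print(f"p95 e2e latency: {float(result['value'][1]):.3f}s")
```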
```python
response = requests.post(
    'http://localhost:8000/v1/completions',
    json={
        'model': 'meta-llama/Llama-2-7b-chat-hf',
        'prompt': 'Hello',
        'max_tokens': 100,
        # What goes here?
    },
)
```
The correct parameter is 'stream': True (OpenAI-compatible API format). You must also use stream=True in the requests.post() call to receive streaming responses.
'stream': True
stream=True
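Putting both pieces together, a minimal streaming client looks like the sketch below; it assumes the server emits OpenAI-style server-sent events ("data: {...}" lines ending with "data: [DONE]"), which is the format vLLM's OpenAI-compatible server follows:

```python
# Minimal streaming client: print tokens as they arrive.
import json

import requests

with requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello",
        "max_tokens": 100,
        "stream": True,   # ask the server to stream tokens
    },
    stream=True,          # let requests yield the response incrementally
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        data = line.decode().removeprefix("data: ")
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["text"], end="", flush=True)
print()
```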