Module 5 Quiz: LLMs in Production

⏱️ 30 minutes 📝 20 questions ✅ 70% to pass 💻 3 code questions

Test your knowledge of production LLM serving, optimization techniques, monitoring, and deployment.

Q1
What is the primary bottleneck in naive LLM serving with sequential request processing?
Serving Basics
✅ Correct Answer: B

Sequential processing forces each request to wait for the previous one to complete, so total latency scales linearly with the number of requests. This is why 5 requests take roughly 5x as long as one instead of being processed in parallel.
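
As a rough illustration of that linear scaling (the 2-second per-request latency below is an assumed figure, not a lab measurement):

# Hypothetical illustration of linear scaling under sequential serving.
# The 2.0 s per-request latency is an assumption, not from the lab.
per_request_s = 2.0
num_requests = 5

sequential_total = num_requests * per_request_s  # 10.0 s: each request waits its turn
batched_total = per_request_s                    # ~2.0 s if all 5 run in one batch (idealized)

print(f"sequential: {sequential_total:.1f} s, batched (idealized): {batched_total:.1f} s")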

Q2
In the lab, switching from naive Flask to vLLM provided approximately what performance improvement?
Serving Basics
✅ Correct Answer: B

The lab showed vLLM improving throughput from 0.29 req/s to 1.19 req/s, a 4.1x improvement from continuous batching and PagedAttention alone, before any other optimization.

Q3
What is the maximum throughput achieved in the lab after all optimizations (quantization + Kubernetes)?
Serving Basics
✅ Correct Answer: C

The lab achieved 12+ req/s with Kubernetes + load balancing + quantization, representing a 41x improvement over the naive baseline.

Q4
Which metric is most important for perceived latency in interactive applications?
Serving Basics
✅ Correct Answer: B

Time to first token (TTFT) is crucial for perceived latency. With streaming, users see a response start in ~0.1s instead of waiting 2-3s for the full answer, making the experience feel 20-30x faster even though total generation time is unchanged.

Q5
What is PagedAttention's primary benefit?
KV Caching
✅ Correct Answer: B

PagedAttention manages the KV cache in fixed-size pages (analogous to pages in OS virtual memory), which largely eliminates memory fragmentation. This allows 3-4x larger batch sizes and 2-4x better throughput.

Q6
What is continuous batching?
KV Caching
✅ Correct Answer: B

Continuous batching dynamically adds new requests to the batch as soon as slots become available (when other requests complete), maximizing GPU utilization without waiting for entire batches to finish.
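
A toy scheduling loop, just to illustrate the idea; this is a sketch, not vLLM's actual scheduler, and the slot count and per-request lengths are arbitrary assumptions:

from collections import deque

# Toy continuous-batching loop: admit waiting requests whenever slots free up,
# rather than waiting for the whole batch to drain.
MAX_BATCH = 4                       # assumed number of concurrent slots
waiting = deque(range(10))          # request ids waiting to be served
running = {}                        # request id -> decode steps remaining

steps = 0
while waiting or running:
    # Admit new requests as soon as slots become available.
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = 3 + len(running)  # arbitrary generation lengths
    # One decode step for every running request.
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]        # a finished request frees its slot immediately
    steps += 1

print(f"all requests served in {steps} decode steps")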

Q7
How much memory savings does PagedAttention typically provide?
KV Caching
✅ Correct Answer: C

PagedAttention typically enables 3-4x larger batch sizes by eliminating memory fragmentation, which translates to roughly 3-4x more efficient use of KV cache memory.

Q8
What happens to KV cache entries when a request completes in vLLM?
KV Caching
✅ Correct Answer: C

When a request completes, its PagedAttention pages are returned to a free pool and can immediately be reused by new requests, ensuring efficient memory recycling without fragmentation.
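
A toy page-pool allocator showing the recycling behavior described above; the class name, pool size, and page counts are assumptions for illustration, not vLLM internals:

# Toy page-pool allocator illustrating the PagedAttention idea (not vLLM's code).
class KVPagePool:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # all pages start free
        self.page_tables = {}                      # request id -> list of page ids

    def allocate(self, request_id, pages_needed):
        if len(self.free_pages) < pages_needed:
            raise MemoryError("no free KV pages")
        pages = [self.free_pages.pop() for _ in range(pages_needed)]
        self.page_tables.setdefault(request_id, []).extend(pages)

    def release(self, request_id):
        # On completion, pages go straight back to the pool for reuse.
        self.free_pages.extend(self.page_tables.pop(request_id, []))

pool = KVPagePool(num_pages=8)      # assumed pool size
pool.allocate("req-1", 3)
pool.allocate("req-2", 2)
pool.release("req-1")               # req-1's 3 pages become reusable immediately
pool.allocate("req-3", 3)           # served from recycled pages, no fragmentation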

Q9
Which LLM serving framework was used in the lab?
Frameworks
✅ Correct Answer: B

vLLM was used in the lab due to its ease of deployment, OpenAI-compatible API, and excellent performance through PagedAttention and continuous batching.

Q10
What is the primary advantage of TensorRT-LLM over vLLM?
Frameworks
✅ Correct Answer: C

TensorRT-LLM provides the highest raw inference speed on NVIDIA GPUs through aggressive optimizations, but requires more complex setup. vLLM offers better ease of use with good performance.

Q11
What quantization method was used in the lab to reduce model size?
Frameworks
✅ Correct Answer: B

The lab used AWQ (Activation-aware Weight Quantization), which quantizes weights to 4-bit while maintaining quality, providing about 3.5x memory savings and roughly 2x speedup.

Q12
How much memory does AWQ quantization save for a 7B parameter model?
Frameworks
✅ Correct Answer: C

AWQ quantizes from FP16 (2 bytes/weight) to 4-bit (0.5 bytes/weight), reducing a 7B model from ~14 GB to ~4 GB, a 3.5x reduction.
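
The arithmetic behind those numbers, as a quick sanity check; the small overhead term is an assumption to account for quantization scales and other metadata:

# Back-of-envelope memory for 7B parameters; the overhead figure is an assumption.
params = 7e9
fp16_gb = params * 2 / 1e9              # 14.0 GB at 2 bytes per weight
int4_gb = params * 0.5 / 1e9            # 3.5 GB at 4 bits (0.5 bytes) per weight
int4_with_overhead_gb = int4_gb + 0.5   # ~4 GB once scales and metadata are included

print(fp16_gb, int4_with_overhead_gb, fp16_gb / int4_with_overhead_gb)  # 14.0, ~4.0, ~3.5x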

Q13
What is the primary cost optimization strategy demonstrated in the lab?
Optimization
✅ Correct Answer: B

Model quantization (AWQ) was the primary cost optimization, reducing memory from 14GB to 4GB, allowing use of cheaper GPUs (A10 vs A100) and enabling higher throughput per GPU.

Q14
Response streaming reduced perceived latency by approximately what factor in the lab?
Optimization
✅ Correct Answer: C

Streaming reduced perceived latency by 20-30x: users saw the first token in ~0.1s instead of waiting 2-3s for the complete response. Total latency remained the same, but the UX improved dramatically.

Q15
What is the purpose of prompt caching?
Optimization
✅ Correct Answer: B

Prompt caching stores responses for identical prompts (or KV cache for common prompt prefixes), avoiding redundant computation. For applications with repeated patterns, this can save 50-90% of compute.
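
A minimal exact-match response cache, as a sketch of the simplest case; the function names are hypothetical, and real systems also cache KV prefixes, which this does not attempt:

import hashlib

# Minimal exact-match prompt cache; a sketch, not a production design.
_cache = {}

def cached_generate(prompt, generate_fn):
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:                 # only pay for generation on a cache miss
        _cache[key] = generate_fn(prompt)
    return _cache[key]

# Usage: cached_generate("Summarize our refund policy", call_llm)
# where call_llm is whatever function actually hits the model server.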

Q16
What is the correct command to launch vLLM with optimized settings?
Code Question
# Which command correctly launches vLLM with high GPU utilization?
✅ Correct Answer: B

The correct command is:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.9

This launches the OpenAI-compatible API server and lets vLLM use up to 90% of GPU memory for model weights and KV cache.

Q17
Which Prometheus metric is most important for tracking user experience?
Monitoring
✅ Correct Answer: B

Time to first token (TTFT) at p95 is the most important metric for user experience, as it directly measures perceived responsiveness. Target: < 200ms.

Q18
What is the ideal GPU cache usage percentage for production?
Monitoring
✅ Correct Answer: B

70-90% GPU cache usage is ideal: high enough to maximize throughput, but with headroom to handle traffic spikes without OOM errors. Above 95% risks instability.

Q19
What is the correct Prometheus query to get p95 request latency?
Code Question
✅ Correct Answer: B

The correct Prometheus query for p95 latency is:

histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))

This computes the 95th percentile from the latency histogram buckets over a 5-minute window.
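
To run that query programmatically, Prometheus exposes it through its HTTP API at /api/v1/query; the localhost:9090 address below is an assumed local setup, not from the lab:

import requests

# Run the p95 latency query against Prometheus's HTTP API.
# 'localhost:9090' is an assumed Prometheus address.
PROM_URL = "http://localhost:9090/api/v1/query"
query = 'histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))'

result = requests.get(PROM_URL, params={"query": query}, timeout=5).json()
for sample in result["data"]["result"]:
    print(sample["metric"], sample["value"])   # value is [timestamp, p95_seconds]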

Q20
What is the correct way to enable streaming in a vLLM API request?
Code Question
response = requests.post('http://localhost:8000/v1/completions', json={
    'model': 'meta-llama/Llama-2-7b-chat-hf',
    'prompt': 'Hello',
    'max_tokens': 100,
    # What goes here?
})
✅ Correct Answer: B

The correct parameter is 'stream': True in the JSON body (OpenAI-compatible API format). You must also pass stream=True to requests.post() so the client reads the response incrementally instead of buffering it.
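
Putting it together, a sketch of the full streaming call; the endpoint and model follow the quiz example, while the chunk-parsing details assume the OpenAI-style SSE format and may vary slightly by vLLM version:

import json
import requests

# Streaming completion request against the OpenAI-compatible endpoint.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello",
        "max_tokens": 100,
        "stream": True,            # ask the server to stream tokens
    },
    stream=True,                   # let requests yield the body incrementally
)

for line in response.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue                   # skip keep-alives and blank lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":       # OpenAI-style end-of-stream marker
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["text"], end="", flush=True)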
