Test your knowledge of production LLM serving, optimization techniques, monitoring, and deployment.
Sequential processing forces each request to wait for the previous one to complete, so total latency scales linearly with the number of requests instead of benefiting from concurrency. This is why 5 requests take ~5x as long as a single request rather than being processed in parallel.
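To make the linear-vs-concurrent difference concrete, here is a minimal client-side sketch (the localhost:8000 endpoint and payload are assumptions based on the vLLM server used later in this quiz) that times five requests sent one after another against five sent concurrently:

```python
# Minimal sketch: sequential vs. concurrent request timing.
# The endpoint URL and payload are placeholders, not from the lab.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server
PAYLOAD = {"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello", "max_tokens": 64}

def one_request():
    return requests.post(URL, json=PAYLOAD, timeout=60).json()

# Sequential: total time grows linearly with the number of requests.
start = time.perf_counter()
for _ in range(5):
    one_request()
print(f"sequential: {time.perf_counter() - start:.2f}s")

# Concurrent: the server can batch overlapping requests, so wall-clock
# time is far less than 5x a single request.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(lambda _: one_request(), range(5)))
print(f"concurrent: {time.perf_counter() - start:.2f}s")
```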
The lab showed vLLM improving throughput from 0.29 req/s to 1.19 req/s, a 4.1x improvement from continuous batching and PagedAttention alone, before any other optimizations were applied.
The lab achieved 12+ req/s with Kubernetes + load balancing + quantization, representing a 41x improvement over the naive baseline.
Time to first token (TTFT) is crucial for perceived latency. Users see a response in ~0.1s with streaming instead of waiting 2-3s, making the experience feel 20-30x faster even if total latency is the same.
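As a rough way to see this yourself, the sketch below (assuming the same local vLLM server, and using the streaming request pattern covered at the end of this quiz) records when the first streamed chunk arrives versus when the full response finishes:

```python
# Sketch: measure time-to-first-token (TTFT) vs. total latency with a
# streaming request. URL and model name assumed from the lab setup.
import time

import requests

start = time.perf_counter()
with requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Explain PagedAttention in one sentence.",
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
) as resp:
    ttft = None
    for line in resp.iter_lines():
        if line and ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrives here
total = time.perf_counter() - start
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```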
PagedAttention manages KV cache in fixed-size pages (like OS virtual memory), eliminating fragmentation. This allows 3-4x higher batch sizes and 2-4x better throughput.
Continuous batching dynamically adds new requests to the batch as soon as slots become available (when other requests complete), maximizing GPU utilization without waiting for entire batches to finish.
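The scheduling idea can be illustrated with a toy simulation (the slot count and per-request lengths below are made up purely for illustration): whenever a request finishes, a waiting request is pulled into the batch on the very next step rather than waiting for the whole batch to drain.

```python
# Toy simulation of continuous batching: requests join the batch as soon
# as a slot frees up, instead of waiting for the whole batch to finish.
from collections import deque

MAX_SLOTS = 4
queue = deque([(f"req{i}", (i % 3) + 2) for i in range(8)])  # (id, decode steps remaining)
running = []

step = 0
while queue or running:
    # Fill any free slots immediately (continuous batching).
    while queue and len(running) < MAX_SLOTS:
        running.append(list(queue.popleft()))
    # One decode step for every request currently in the batch.
    for req in running:
        req[1] -= 1
    finished = [r[0] for r in running if r[1] == 0]
    running = [r for r in running if r[1] > 0]
    step += 1
    if finished:
        print(f"step {step}: finished {finished}, batch refills next step")
```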
PagedAttention typically enables 3-4x higher batch sizes by eliminating KV-cache memory fragmentation: because pages are fixed-size and allocated on demand, nearly all of the reserved cache memory stays usable instead of being lost to fragmented gaps.
When a request completes, its PagedAttention pages are returned to a free pool and can immediately be reused by new requests, ensuring efficient memory recycling without fragmentation.
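A minimal sketch of that free-pool mechanism (the page counts and API below are illustrative, not vLLM's actual internals):

```python
# Sketch of the free-page pool behind PagedAttention: fixed-size pages are
# handed out on demand and returned to the pool when a request completes,
# so memory is recycled without fragmentation.
class PagePool:
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))      # indices of free pages
        self.owned: dict[str, list[int]] = {}   # request id -> its pages

    def allocate(self, request_id: str, n: int) -> list[int]:
        if len(self.free) < n:
            raise MemoryError("no free KV-cache pages")
        pages = [self.free.pop() for _ in range(n)]
        self.owned.setdefault(request_id, []).extend(pages)
        return pages

    def release(self, request_id: str) -> None:
        # Pages go straight back to the pool and can be reused immediately.
        self.free.extend(self.owned.pop(request_id, []))

pool = PagePool(num_pages=16)
pool.allocate("req-A", 6)
pool.allocate("req-B", 4)
pool.release("req-A")   # req-A finishes; its 6 pages are free again
print(len(pool.free))   # 12
```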
vLLM was used in the lab due to its ease of deployment, OpenAI-compatible API, and excellent performance through PagedAttention and continuous batching.
TensorRT-LLM provides the highest raw inference speed on NVIDIA GPUs through aggressive optimizations, but requires more complex setup. vLLM offers better ease of use with good performance.
The lab used AWQ (Activation-aware Weight Quantization) which quantizes to 4-bit while maintaining quality, providing 3.5x memory savings and ~2x speedup.
AWQ quantizes from FP16 (2 bytes/weight) to 4-bit (0.5 bytes/weight), reducing a 7B model from ~14 GB to ~4 GB, a 3.5x reduction.
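A quick back-of-the-envelope check of those numbers (weights only, ignoring KV cache and runtime overhead):

```python
# Weight memory for a 7B-parameter model at FP16 vs. 4-bit.
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, ratio: {fp16_gb / int4_gb:.1f}x")
# FP16: 14.0 GB, 4-bit: 3.5 GB in theory; ~4 GB in practice once overhead is included
```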
Model quantization (AWQ) was the primary cost optimization, reducing memory from 14GB to 4GB, allowing use of cheaper GPUs (A10 vs A100) and enabling higher throughput per GPU.
Streaming reduced perceived latency by 20-30x: users saw first token in ~0.1s instead of waiting 2-3s for complete response. Total latency remained the same, but UX improved dramatically.
Prompt caching stores responses for identical prompts (or KV cache for common prompt prefixes), avoiding redundant computation. For applications with repeated patterns, this can save 50-90% of compute.
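A minimal sketch of the simplest variant, response-level caching keyed on the exact prompt (the endpoint, model name, and helper below are illustrative; prefix/KV caching happens inside the serving engine itself):

```python
# Response-level prompt cache: identical prompts are answered from a local
# dict instead of hitting the model again.
import hashlib

import requests

_cache: dict[str, str] = {}

def cached_completion(prompt: str, max_tokens: int = 100) -> str:
    key = hashlib.sha256(f"{prompt}|{max_tokens}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # cache hit: no GPU work at all
    resp = requests.post(
        "http://localhost:8000/v1/completions",  # assumed local vLLM server
        json={"model": "meta-llama/Llama-2-7b-chat-hf",
              "prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    ).json()
    text = resp["choices"][0]["text"]
    _cache[key] = text
    return text
```

Response-level caching like this is only safe when decoding is deterministic (e.g. temperature 0) or mild staleness is acceptable.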
# Which command correctly launches vLLM with high GPU utilization?
The correct command is: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.9. This launches the OpenAI-compatible API server with 90% GPU memory utilization.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.9
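Once the server is up, a quick way to confirm it is serving the expected model is to query the OpenAI-compatible /v1/models route (assuming the default host and port):

```python
# Sanity check that the server launched above is reachable.
import requests

models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])  # should include 'meta-llama/Llama-2-7b-chat-hf'
```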
Time to first token (TTFT) at p95 is the most important metric for user experience, as it directly measures perceived responsiveness. Target: < 200ms.
70-90% GPU cache usage is ideal: high enough to maximize throughput, but with headroom to handle traffic spikes without OOM errors. Above 95% risks instability.
The correct Prometheus query for p95 latency is: histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m])). This calculates the 95th percentile from histogram buckets.
histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))
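The same query can also be run programmatically through Prometheus's HTTP API, for example from a health-check script (the localhost:9090 address is an assumption about where Prometheus is running):

```python
# Run the p95 latency query against Prometheus's HTTP API.
import requests

QUERY = 'histogram_quantile(0.95, rate(vllm_e2e_request_latency_seconds_bucket[5m]))'
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": QUERY},
    timeout=10,
).json()
for result in resp["data"]["result"]:
    print(f"p95 e2e latency: {float(result['value'][1]):.3f}s")
```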
```python
response = requests.post(
    'http://localhost:8000/v1/completions',
    json={
        'model': 'meta-llama/Llama-2-7b-chat-hf',
        'prompt': 'Hello',
        'max_tokens': 100,
        # What goes here?
    },
)
```
The correct parameter is 'stream': True (OpenAI-compatible API format). You must also use stream=True in the requests.post() call to receive streaming responses.
'stream': True
stream=True
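Putting both pieces together, a minimal streaming client looks like the sketch below; it assumes the server emits OpenAI-style server-sent events ("data: {...}" lines ending with "data: [DONE]"), which is the format vLLM's OpenAI-compatible server follows:

```python
# Minimal streaming client: print tokens as they arrive.
import json

import requests

with requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Hello",
        "max_tokens": 100,
        "stream": True,   # ask the server to stream tokens
    },
    stream=True,          # let requests yield the response incrementally
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        data = line.decode().removeprefix("data: ")
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["text"], end="", flush=True)
print()
```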