Building a production LLM system involves navigating two competing priorities: minimizing operational costs while delivering excellent user experience through low latency. These goals are often in tension—the fastest solution may be prohibitively expensive, while the cheapest option may provide unacceptably slow responses.
This chapter explores proven techniques for optimizing both cost and latency. We'll cover semantic caching, intelligent model routing, quantization, streaming, prompt optimization, and advanced techniques like speculative decoding. By combining these strategies, you can build systems that are both affordable and performant.
Cost Optimization Strategies
1. Semantic Caching: The 30-70% Solution
Caching is the single most impactful cost optimization technique. By storing and reusing responses to common or similar queries, you can reduce API calls by 30-70% in most applications.
Exact Caching (Simple but Limited)
import redis
import hashlib
import json

class ExactCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.ttl = 3600  # 1 hour expiration

    def _make_key(self, prompt, params):
        """Create a cache key from the prompt and parameters"""
        cache_input = {
            'prompt': prompt,
            'temperature': params.get('temperature'),
            'max_tokens': params.get('max_tokens'),
            # Include other params that affect output
        }
        key_str = json.dumps(cache_input, sort_keys=True)
        return hashlib.sha256(key_str.encode()).hexdigest()

    def get(self, prompt, params):
        """Check if a response is cached"""
        key = self._make_key(prompt, params)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, prompt, params, response):
        """Store a response in the cache with expiration"""
        key = self._make_key(prompt, params)
        self.redis.setex(key, self.ttl, json.dumps(response))

# Usage
redis_client = redis.Redis(host='localhost', port=6379, db=0)
cache = ExactCache(redis_client)

def generate_with_cache(prompt, params):
    # Check the cache first
    cached_response = cache.get(prompt, params)
    if cached_response:
        print("Cache hit! Saved an API call.")
        return cached_response
    # Cache miss - call the LLM
    response = llm.generate(prompt, params)
    # Store in the cache for next time
    cache.set(prompt, params, response)
    return response

# Example: identical prompt and params, so the second call hits the cache
generate_with_cache("What is 2+2?", {"temperature": 0.7})
generate_with_cache("What is 2+2?", {"temperature": 0.7})  # Cache hit!
⚠️ Exact Caching Limitation
Exact caching only works for identical prompts. Small variations—even a single character difference—result in cache misses. This severely limits effectiveness for natural language queries where users express the same intent in different ways.
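To see the limitation concretely, here is a quick check using the ExactCache key function from above; the second prompt is a hypothetical paraphrase of the first, yet it produces a different SHA-256 key and therefore misses the cache.

cache = ExactCache(redis_client)

key_a = cache._make_key("What is 2+2?", {"temperature": 0.7})
key_b = cache._make_key("What's 2+2?", {"temperature": 0.7})  # same intent, slightly different wording

print(key_a == key_b)  # False: the hashes differ, so the second query is a cache miss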
Semantic Caching (Powerful)
Semantic caching uses embeddings to match queries by meaning, not exact text. This dramatically increases cache hit rates.
from sentence_transformers import SentenceTransformer
import numpy as np
import pickle
import hashlib

class SemanticCache:
    def __init__(self, redis_client, similarity_threshold=0.95):
        self.redis = redis_client
        self.threshold = similarity_threshold
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

    def _get_embedding(self, text):
        """Generate an embedding for the text"""
        return self.encoder.encode(text, convert_to_numpy=True)

    def _cosine_similarity(self, vec1, vec2):
        """Calculate cosine similarity between two vectors"""
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    def search(self, prompt, params):
        """Search for semantically similar cached responses"""
        query_embedding = self._get_embedding(prompt)
        # Linear scan over all cache entries
        # (in production, use a vector DB like Pinecone/Weaviate)
        cache_keys = self.redis.keys('semantic_cache:*')
        best_match = None
        best_similarity = 0.0
        for key in cache_keys:
            cached_data = pickle.loads(self.redis.get(key))
            # Parameters must match exactly; only the prompt is matched semantically
            if cached_data['params'] != params:
                continue
            # Calculate semantic similarity
            similarity = self._cosine_similarity(
                query_embedding,
                cached_data['embedding']
            )
            if similarity > best_similarity:
                best_similarity = similarity
                best_match = cached_data
        # Return only if similarity exceeds the threshold
        if best_match is not None and best_similarity >= self.threshold:
            print(f"Semantic cache hit! Similarity: {best_similarity:.3f}")
            return best_match['response']
        return None

    def store(self, prompt, params, response):
        """Store prompt, embedding, and response"""
        embedding = self._get_embedding(prompt)
        cache_data = {
            'prompt': prompt,
            'embedding': embedding,
            'params': params,
            'response': response
        }
        # Use the embedding hash as the key
        key = f"semantic_cache:{hashlib.sha256(embedding.tobytes()).hexdigest()}"
        self.redis.setex(key, 3600, pickle.dumps(cache_data))

# Usage
semantic_cache = SemanticCache(redis_client, similarity_threshold=0.95)

def generate_with_semantic_cache(prompt, params):
    # Check the semantic cache first
    cached = semantic_cache.search(prompt, params)
    if cached:
        return cached
    # Cache miss - generate and store for future queries
    response = llm.generate(prompt, params)
    semantic_cache.store(prompt, params, response)
    return response

# These are semantically similar and will hit the cache!
generate_with_semantic_cache("What's 2 plus 2?", {"temperature": 0.7})
generate_with_semantic_cache("What is two plus two?", {"temperature": 0.7})  # Hit!
generate_with_semantic_cache("Calculate 2+2", {"temperature": 0.7})  # Hit!
💰 Cost Impact
Example Application: Customer Support Chatbot
- 1 million queries/month
- Average cost: $0.002/query = $2,000/month
- 50% semantic cache hit rate
- New cost: $1,000/month
- Savings: $1,000/month or $12,000/year
2. Intelligent Model Routing
Not all queries require your most powerful (and expensive) model. Route simple queries to cheaper models and complex ones to premium models.
class IntelligentRouter:
    def __init__(self):
        # Approximate model costs (per 1K tokens) and rough capability scores
        self.models = {
            'gpt-4-turbo': {'cost': 0.01, 'capability': 10},
            'gpt-3.5-turbo': {'cost': 0.0005, 'capability': 7},
            'llama-2-70b': {'cost': 0.0003, 'capability': 6},
            'llama-2-13b': {'cost': 0.0001, 'capability': 4},
        }

    def classify_complexity(self, prompt):
        """
        Classify query complexity.
        Returns: 'simple', 'medium', or 'complex'
        """
        # Simple heuristics (use a classifier model for better accuracy)
        prompt_lower = prompt.lower()
        # Simple queries
        simple_patterns = [
            'what is', 'who is', 'when is', 'where is',
            'define', 'meaning of', 'explain in simple terms'
        ]
        if any(pattern in prompt_lower for pattern in simple_patterns):
            return 'simple'
        # Complex queries
        complex_patterns = [
            'analyze', 'compare and contrast', 'evaluate',
            'multi-step', 'reasoning', 'code review'
        ]
        if any(pattern in prompt_lower for pattern in complex_patterns):
            return 'complex'
        # Fall back to length-based classification
        if len(prompt.split()) > 100:
            return 'complex'
        elif len(prompt.split()) < 20:
            return 'simple'
        return 'medium'

    def route(self, prompt):
        """Route the query to an appropriate model"""
        complexity = self.classify_complexity(prompt)
        routing_map = {
            'simple': 'llama-2-13b',
            'medium': 'gpt-3.5-turbo',
            'complex': 'gpt-4-turbo'
        }
        model = routing_map[complexity]
        print(f"Routing to {model} (complexity: {complexity})")
        return model

# Usage
router = IntelligentRouter()

def generate_with_routing(prompt, params):
    model = router.route(prompt)
    # Generate with the selected model
    response = llm.generate(prompt, params, model=model)
    return response

# Examples
generate_with_routing("What is Python?", {})
# Output: Routing to llama-2-13b (complexity: simple)

generate_with_routing(
    "Analyze the architectural trade-offs between microservices and monoliths",
    {}
)
# Output: Routing to gpt-4-turbo (complexity: complex)
| Query Type | Model | Cost/1K tokens | Token Volume | Total Cost |
|---|---|---|---|---|
| Simple (50%) | Llama 2 13B | $0.0001 | 500K tokens | $0.05 |
| Medium (35%) | GPT-3.5 Turbo | $0.0005 | 350K tokens | $0.18 |
| Complex (15%) | GPT-4 Turbo | $0.01 | 150K tokens | $1.50 |
| Total (Routed) | Mixed | - | 1M tokens | $1.73 |
| All GPT-4 (baseline) | GPT-4 Turbo | $0.01 | 1M tokens | $10.00 |
Savings: 82.7% cost reduction ($10.00 → $1.73)
3. Prompt Optimization
Shorter prompts = lower costs. Every token you send (input) and receive (output) increases cost.
# BAD: Verbose prompt (234 tokens)
verbose_prompt = """
You are an AI assistant designed to help users with their questions.
Please read the following user query carefully and provide a detailed,
comprehensive, and accurate response. Make sure to consider all aspects
of the question and provide examples where appropriate. Your response
should be informative and helpful.
Here are some guidelines to follow:
1. Always be polite and professional
2. Provide clear explanations
3. Use examples to illustrate your points
4. Keep your response well-structured
Now, here is the user's question:
What is machine learning?
Please provide your response below:
"""
# GOOD: Concise prompt (11 tokens)
concise_prompt = "Explain machine learning in 2-3 sentences."
# Cost comparison (GPT-4 Turbo input pricing, $0.01 per 1K tokens):
# Verbose: 234 input tokens × $0.01/1K = $0.00234
# Concise: 11 input tokens × $0.01/1K = $0.00011
# Savings per query: 95.3%
# At 1M queries: $2,340 vs $110 = $2,230 saved!
🎯 Prompt Optimization Techniques
- Remove fluff: Cut unnecessary politeness, disclaimers, and context
- Use system prompts: Place instructions once in the system message instead of repeating them in every query (see the sketch after this list)
- Constrain output: Use max_tokens to prevent excessively long responses
- Test variations: Iterate to find shortest prompt that maintains quality
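Here is a minimal sketch of the system-prompt technique, assuming an OpenAI-compatible chat API; the model name, guideline text, and helper function are placeholders. The standing guidelines live in one short system message rather than being restated inside every user prompt, and max_tokens caps the output cost.

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "Be concise. Answer in 2-3 sentences with one example."

def ask(question):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # instructions kept here, not in the user text
            {"role": "user", "content": question},         # the user turn stays short
        ],
        max_tokens=150,  # constrain output length to cap output-token cost
    )
    return response.choices[0].message.content

print(ask("What is machine learning?"))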
4. Batching Requests
For self-hosted models, batch processing dramatically reduces cost per token by maximizing GPU utilization.
# Cost comparison: Self-hosted Llama 2 70B on A100 ($3/hour)
# No batching: 1 request at a time
# Throughput: 5 requests/sec
# Hourly throughput: 18,000 requests
# Cost per request: $3 / 18,000 = $0.000167
# With batching (batch_size=32):
# Throughput: 40 requests/sec
# Hourly throughput: 144,000 requests
# Cost per request: $3 / 144,000 = $0.000021
# Savings: 88% cost reduction per request!
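The sketch below shows batched offline inference with vLLM, which is one way to get the throughput assumed above; the model name, tensor_parallel_size, and prompts are illustrative. Passing a list of prompts in a single call lets the engine's continuous batching keep the GPU busy instead of serving requests one at a time.

from vllm import LLM, SamplingParams

# Illustrative: load the model once, then submit many prompts in one call
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.7, max_tokens=200)

prompts = [f"Summarize ticket #{i} in one sentence." for i in range(32)]

# One call, 32 prompts: the scheduler batches them internally
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)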
Latency Optimization Strategies
1. Streaming: The Perception Hack
Streaming is the single most impactful latency optimization for user-facing applications. It doesn't make generation faster, but it dramatically improves perceived latency.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Non-streaming (poor UX)
def generate_blocking(prompt):
    start = time.time()
    response = client.chat.completions.create(
        model="llama-2-7b",
        messages=[{"role": "user", "content": prompt}],
        stream=False
    )
    elapsed = time.time() - start
    print(f"Time to first response: {elapsed:.2f}s")  # Could be 5-10 seconds!
    print(response.choices[0].message.content)

# Streaming (excellent UX)
def generate_streaming(prompt):
    start = time.time()
    first_token_time = None
    stream = client.chat.completions.create(
        model="llama-2-7b",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time() - start
                print(f"Time to first token: {first_token_time:.2f}s")  # ~0.5s!
            print(chunk.choices[0].delta.content, end="", flush=True)

# Example
generate_streaming("Write a story about AI")
# Output:
# Time to first token: 0.48s  ← User sees output almost immediately!
# Once upon a time, in a world where artificial intelligence...
⚡ Latency Metrics
- TTFT (Time-To-First-Token): Most critical for UX - aim for <500ms
- TPOT (Time-Per-Output-Token): Throughput metric - aim for <50ms (see the measurement sketch after this list)
- End-to-end latency: Total time - less important than TTFT for streaming
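To make these metrics concrete, here is a small measurement sketch against the same local OpenAI-compatible endpoint used in the streaming example; it approximates the token count by the number of streamed chunks, which is close enough for monitoring purposes.

import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_latency(prompt, model="llama-2-7b"):
    """Measure TTFT and TPOT from a streaming response (each chunk ≈ one token)."""
    start = time.time()
    first_token_time = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time() - start  # TTFT
            chunks += 1
    total = time.time() - start
    # Average time per output token after the first one
    tpot = (total - first_token_time) / max(chunks - 1, 1)
    print(f"TTFT: {first_token_time * 1000:.0f}ms | TPOT: {tpot * 1000:.1f}ms | total: {total:.2f}s")

measure_latency("Explain streaming in one paragraph.")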
2. Model Quantization
Quantization reduces model precision (e.g., float16 → int8), which speeds up inference and reduces memory usage.
# Load quantized models with vLLM
# (illustrative: in practice you would load only one of these per process)
from vllm import LLM

# FP16 (baseline)
llm_fp16 = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    dtype="float16"  # default precision
)

# AWQ quantization (4-bit)
llm_awq = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq"
)

# GPTQ quantization (4-bit)
llm_gptq = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization="gptq"
)
| Precision | Memory | Speed | Quality Loss |
|---|---|---|---|
| FP16 (float16) | 14 GB | 1.0x (baseline) | 0% |
| INT8 (8-bit) | 7 GB | 1.5x faster | 1-2% |
| AWQ (4-bit) | 4 GB | 2-3x faster | 2-3% |
| GPTQ (4-bit) | 4 GB | 2-3x faster | 3-5% |
For most applications, AWQ 4-bit quantization provides the best balance: 2-3x faster inference, 70% memory reduction, and minimal (2-3%) quality loss.
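The memory column follows from simple arithmetic: parameter count times bytes per parameter, plus some overhead for quantization metadata, activations, and the KV cache. A rough back-of-the-envelope check for a 7B model:

# Rough memory estimate for model weights only (excludes KV cache and activations)
def weight_memory_gb(n_params_billion, bits_per_param):
    return n_params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(weight_memory_gb(7, 16))  # FP16: ~14 GB
print(weight_memory_gb(7, 8))   # INT8: ~7 GB
print(weight_memory_gb(7, 4))   # 4-bit (AWQ/GPTQ): ~3.5 GB, ~4 GB with overhead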
3. Speculative Decoding (Advanced)
Speculative decoding is a cutting-edge technique that can speed up generation by 2-3x. It uses a small "draft" model to predict multiple tokens ahead, then verifies them with the large model in a single pass.
from vllm import LLM, SamplingParams

# Load the target model (large, accurate)
target_model = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4
)

# Load the draft model (small, fast)
draft_model = LLM(
    model="meta-llama/Llama-2-7b-chat-hf"
)

def speculative_decoding(prompt, num_speculative=4):
    """
    Illustrative sketch of speculative decoding.
    How it works:
    1. The draft model generates a few candidate tokens quickly
    2. The target model verifies all candidates in a single pass
    3. Correct tokens are accepted; generation resumes from the first mismatch
    4. Repeat until the sequence is complete
    """
    current_sequence = prompt
    generated = []
    while len(generated) < 100:  # cap the number of accepted chunks
        # Draft phase: the small model proposes candidate tokens (fast)
        draft_output = draft_model.generate(
            [current_sequence],
            SamplingParams(max_tokens=num_speculative)
        )[0].outputs[0].text
        # Target phase: the large model scores the candidates in one pass
        verification_prompt = current_sequence + draft_output
        target_output = target_model.generate(
            [verification_prompt],
            SamplingParams(max_tokens=1)
        )[0].outputs[0].text
        # Accept tokens up to the first disagreement
        # (simplified here: a real implementation compares token-level outputs;
        # in practice 80-90% of draft tokens are accepted)
        accepted_tokens = draft_output
        generated.append(accepted_tokens)
        current_sequence += accepted_tokens
    return ''.join(generated)

# Performance comparison:
# Normal decoding: 100 tokens × 50ms/token = 5000ms
# Speculative (4 ahead, 85% acceptance): 100 tokens × 20ms/token = 2000ms
# Speedup: 2.5x faster!
⚠️ Speculative Decoding Trade-offs
- Complexity: Requires running two models simultaneously
- Memory: Both models must fit in VRAM
- Compatibility: Not all frameworks support it yet
- Best for: High-throughput scenarios where 2-3x speedup justifies complexity
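If your serving stack already supports speculative decoding, you usually do not need to hand-roll the loop above. The sketch below shows roughly how built-in speculative decoding is configured in recent vLLM releases; the argument names are an assumption and have changed between versions, so verify them against the documentation for your installed version.

from vllm import LLM, SamplingParams

# Assumed configuration for vLLM's built-in speculative decoding;
# check your vLLM version's docs for the exact argument names.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-2-7b-chat-hf",  # small draft model
    num_speculative_tokens=4,                           # tokens proposed per step
)

outputs = llm.generate(
    ["Write a story about AI"],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)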
4. Hardware Acceleration
| Hardware | Cost/Hour | Throughput | Cost per 1M Tokens |
|---|---|---|---|
| NVIDIA H100 | $4.00 | 80K tokens/sec | $0.014 |
| NVIDIA A100 | $3.00 | 50K tokens/sec | $0.017 |
| NVIDIA L4 | $0.80 | 10K tokens/sec | $0.022 |
| NVIDIA T4 | $0.50 | 5K tokens/sec | $0.028 |
Counterintuitively, more expensive GPUs can be cheaper per token due to higher throughput. Always calculate cost per token, not just hourly cost.
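The cost-per-token column is just hourly cost divided by hourly throughput, so it is worth recomputing with your own hardware quotes; a quick helper reproduces the table above:

def cost_per_million_tokens(cost_per_hour, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(4.00, 80_000))  # H100: ~$0.014
print(cost_per_million_tokens(3.00, 50_000))  # A100: ~$0.017
print(cost_per_million_tokens(0.50, 5_000))   # T4:   ~$0.028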
Combining Strategies: Real-World Architecture
The most effective production systems combine multiple optimization techniques. Here's a reference architecture:
import time

from vllm import LLM, SamplingParams

class ProductionLLMService:
    def __init__(self):
        # Layer 1: Semantic cache (fastest, cheapest)
        self.cache = SemanticCache(redis_client)
        # Layer 2: Model router
        self.router = IntelligentRouter()
        # Layer 3: One model endpoint per complexity tier
        self.models = {
            'simple': LLM(model="llama-2-7b", quantization="awq"),
            'medium': LLM(model="llama-2-13b", quantization="awq"),
            'complex': LLM(model="llama-2-70b", dtype="float16")
        }

    def generate(self, prompt, params):
        # Try the cache first (saves roughly half of the calls)
        cached = self.cache.search(prompt, params)
        if cached:
            return {
                'response': cached,
                'source': 'cache',
                'cost': 0.0,
                'latency': 0.02  # ~20ms cache lookup
            }
        # Route to the appropriate complexity tier
        complexity = self.router.classify_complexity(prompt)
        model_name = self.router.route(prompt)
        model = self.models[complexity]
        # Generate (enable streaming at the API layer for user-facing traffic)
        start_time = time.time()
        response = model.generate([prompt], SamplingParams(**params))[0].outputs[0].text
        latency = time.time() - start_time
        # Estimate cost from a rough token count (~1.3 tokens per word)
        input_tokens = len(prompt.split()) * 1.3
        output_tokens = len(response.split()) * 1.3
        cost = self._calculate_cost(complexity, input_tokens, output_tokens)
        # Store in the cache for future queries
        self.cache.store(prompt, params, response)
        return {
            'response': response,
            'source': f'{model_name}_model',
            'cost': cost,
            'latency': latency
        }

    def _calculate_cost(self, complexity, input_tokens, output_tokens):
        # Approximate cost per 1K tokens for each tier
        pricing = {
            'simple': 0.0001,
            'medium': 0.0003,
            'complex': 0.001
        }
        cost_per_token = pricing[complexity] / 1_000
        return (input_tokens + output_tokens) * cost_per_token

# Usage and results
service = ProductionLLMService()

# Example query
result = service.generate(
    "What is Python?",
    {"temperature": 0.7, "max_tokens": 200}
)

print(f"Response: {result['response'][:100]}...")
print(f"Source: {result['source']}")
print(f"Cost: ${result['cost']:.6f}")
print(f"Latency: {result['latency']:.3f}s")
Performance Metrics
| Metric | Baseline (No Optimization) | With All Optimizations | Improvement |
|---|---|---|---|
| Avg Cost/Query | $0.0020 | $0.0002 | 10x cheaper |
| Avg Latency (TTFT) | 2.5s | 0.4s | 6x faster |
| Monthly Cost (1M queries) | $2,000 | $200 | $1,800 saved |
| User Satisfaction | 60% | 92% | +32 points |
Summary: Building Cost-Effective, Fast Systems
🔑 Key Takeaways
- Semantic caching: Single biggest cost saver—30-70% reduction with high cache hit rates
- Intelligent routing: Use cheap models for simple queries—80%+ cost reduction possible
- Streaming: Most impactful UX improvement—perceived latency drops from 5s to <0.5s
- Quantization: 2-3x faster with minimal quality loss—use AWQ 4-bit for best balance
- Combine strategies: 10-100x cost reduction and 5-10x latency improvement achievable
- Measure everything: Track cost per query, TTFT, cache hit rate, model distribution
In the final chapter, we'll explore production monitoring and observability—how to track these metrics, identify bottlenecks, and continuously optimize your LLM infrastructure.