Building a production LLM system involves navigating two competing priorities: minimizing operational costs while delivering excellent user experience through low latency. These goals are often in tension—the fastest solution may be prohibitively expensive, while the cheapest option may provide unacceptably slow responses.
This chapter explores proven techniques for optimizing both cost and latency. We'll cover semantic caching, intelligent model routing, quantization, streaming, prompt optimization, and advanced techniques like speculative decoding. By combining these strategies, you can build systems that are both affordable and performant.
Cost Optimization Strategies
1. Semantic Caching: The 30-70% Solution
Caching is the single most impactful cost optimization technique. By storing and reusing responses to common or similar queries, you can reduce API calls by 30-70% in most applications.
Exact Caching (Simple but Limited)
import redis
import hashlib
import json

class ExactCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.ttl = 3600  # 1 hour expiration

    def _make_key(self, prompt, params):
        """Create a cache key from the prompt and parameters"""
        cache_input = {
            'prompt': prompt,
            'temperature': params.get('temperature'),
            'max_tokens': params.get('max_tokens'),
            # Include other params that affect output
        }
        key_str = json.dumps(cache_input, sort_keys=True)
        return hashlib.sha256(key_str.encode()).hexdigest()

    def get(self, prompt, params):
        """Check if a response is cached"""
        key = self._make_key(prompt, params)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, prompt, params, response):
        """Store a response in the cache with expiration"""
        key = self._make_key(prompt, params)
        self.redis.setex(key, self.ttl, json.dumps(response))

# Usage
redis_client = redis.Redis(host='localhost', port=6379, db=0)
cache = ExactCache(redis_client)

def generate_with_cache(prompt, params):
    # Check the cache first
    cached_response = cache.get(prompt, params)
    if cached_response:
        print("Cache hit! Saved an API call.")
        return cached_response
    # Cache miss - call the LLM
    response = llm.generate(prompt, params)
    # Store in the cache for next time
    cache.set(prompt, params, response)
    return response

# Example: identical prompt and params, so the second call hits the cache
generate_with_cache("What is 2+2?", {"temperature": 0.7})
generate_with_cache("What is 2+2?", {"temperature": 0.7})  # Cache hit!
⚠️ Exact Caching Limitation
Exact caching only works for identical prompts. Small variations—even a single character difference—result in cache misses. This severely limits effectiveness for natural language queries where users express the same intent in different ways.
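To see the limitation concretely, here is a quick check using the ExactCache key function from above; the second prompt is a hypothetical paraphrase of the first, yet it produces a different SHA-256 key and therefore misses the cache.

cache = ExactCache(redis_client)

key_a = cache._make_key("What is 2+2?", {"temperature": 0.7})
key_b = cache._make_key("What's 2+2?", {"temperature": 0.7})  # same intent, slightly different wording

print(key_a == key_b)  # False: the hashes differ, so the second query is a cache miss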
Semantic Caching (Powerful)
Semantic caching uses embeddings to match queries by meaning, not exact text. This dramatically increases cache hit rates.
from sentence_transformers import SentenceTransformer
import numpy as np
import pickle
import hashlib

class SemanticCache:
    def __init__(self, redis_client, similarity_threshold=0.95):
        self.redis = redis_client
        self.threshold = similarity_threshold
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

    def _get_embedding(self, text):
        """Generate an embedding for the text"""
        return self.encoder.encode(text, convert_to_numpy=True)

    def _cosine_similarity(self, vec1, vec2):
        """Calculate cosine similarity between two vectors"""
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    def search(self, prompt, params):
        """Search for semantically similar cached responses"""
        query_embedding = self._get_embedding(prompt)
        # Linear scan over all cache entries
        # (in production, use a vector DB like Pinecone/Weaviate)
        cache_keys = self.redis.keys('semantic_cache:*')
        best_match = None
        best_similarity = 0.0
        for key in cache_keys:
            cached_data = pickle.loads(self.redis.get(key))
            # Parameters must match exactly; only the prompt is matched semantically
            if cached_data['params'] != params:
                continue
            # Calculate semantic similarity
            similarity = self._cosine_similarity(
                query_embedding,
                cached_data['embedding']
            )
            if similarity > best_similarity:
                best_similarity = similarity
                best_match = cached_data
        # Return only if similarity exceeds the threshold
        if best_match is not None and best_similarity >= self.threshold:
            print(f"Semantic cache hit! Similarity: {best_similarity:.3f}")
            return best_match['response']
        return None

    def store(self, prompt, params, response):
        """Store prompt, embedding, and response"""
        embedding = self._get_embedding(prompt)
        cache_data = {
            'prompt': prompt,
            'embedding': embedding,
            'params': params,
            'response': response
        }
        # Use the embedding hash as the key
        key = f"semantic_cache:{hashlib.sha256(embedding.tobytes()).hexdigest()}"
        self.redis.setex(key, 3600, pickle.dumps(cache_data))

# Usage
semantic_cache = SemanticCache(redis_client, similarity_threshold=0.95)

def generate_with_semantic_cache(prompt, params):
    # Check the semantic cache first
    cached = semantic_cache.search(prompt, params)
    if cached:
        return cached
    # Cache miss - generate and store for future queries
    response = llm.generate(prompt, params)
    semantic_cache.store(prompt, params, response)
    return response

# These are semantically similar and will hit the cache!
generate_with_semantic_cache("What's 2 plus 2?", {"temperature": 0.7})
generate_with_semantic_cache("What is two plus two?", {"temperature": 0.7})  # Hit!
generate_with_semantic_cache("Calculate 2+2", {"temperature": 0.7})  # Hit!
💰 Cost Impact
Example Application: Customer Support Chatbot
- 1 million queries/month
- Average cost: $0.002/query = $2,000/month
- 50% semantic cache hit rate
- New cost: $1,000/month
- Savings: $1,000/month or $12,000/year
2. Intelligent Model Routing
Not all queries require your most powerful (and expensive) model. Route simple queries to cheaper models and complex ones to premium models.
class IntelligentRouter:
    def __init__(self):
        # Approximate model costs (per 1K tokens) and rough capability scores
        self.models = {
            'gpt-4-turbo': {'cost': 0.01, 'capability': 10},
            'gpt-3.5-turbo': {'cost': 0.0005, 'capability': 7},
            'llama-2-70b': {'cost': 0.0003, 'capability': 6},
            'llama-2-13b': {'cost': 0.0001, 'capability': 4},
        }

    def classify_complexity(self, prompt):
        """
        Classify query complexity.
        Returns: 'simple', 'medium', or 'complex'
        """
        # Simple heuristics (use a classifier model for better accuracy)
        prompt_lower = prompt.lower()
        # Simple queries
        simple_patterns = [
            'what is', 'who is', 'when is', 'where is',
            'define', 'meaning of', 'explain in simple terms'
        ]
        if any(pattern in prompt_lower for pattern in simple_patterns):
            return 'simple'
        # Complex queries
        complex_patterns = [
            'analyze', 'compare and contrast', 'evaluate',
            'multi-step', 'reasoning', 'code review'
        ]
        if any(pattern in prompt_lower for pattern in complex_patterns):
            return 'complex'
        # Fall back to length-based classification
        if len(prompt.split()) > 100:
            return 'complex'
        elif len(prompt.split()) < 20:
            return 'simple'
        return 'medium'

    def route(self, prompt):
        """Route the query to an appropriate model"""
        complexity = self.classify_complexity(prompt)
        routing_map = {
            'simple': 'llama-2-13b',
            'medium': 'gpt-3.5-turbo',
            'complex': 'gpt-4-turbo'
        }
        model = routing_map[complexity]
        print(f"Routing to {model} (complexity: {complexity})")
        return model

# Usage
router = IntelligentRouter()

def generate_with_routing(prompt, params):
    model = router.route(prompt)
    # Generate with the selected model
    response = llm.generate(prompt, params, model=model)
    return response

# Examples
generate_with_routing("What is Python?", {})
# Output: Routing to llama-2-13b (complexity: simple)

generate_with_routing(
    "Analyze the architectural trade-offs between microservices and monoliths",
    {}
)
# Output: Routing to gpt-4-turbo (complexity: complex)
| Query Type | Model | Cost/1K tokens | Token Volume | Total Cost |
|---|---|---|---|---|
| Simple (50%) | Llama 2 13B | $0.0001 | 500K tokens | $0.05 |
| Medium (35%) | GPT-3.5 Turbo | $0.0005 | 350K tokens | $0.18 |
| Complex (15%) | GPT-4 Turbo | $0.01 | 150K tokens | $1.50 |
| Total (Routed) | Mixed | - | 1M tokens | $1.73 |
| All GPT-4 (baseline) | GPT-4 Turbo | $0.01 | 1M tokens | $10.00 |
Savings: 82.7% cost reduction ($10.00 → $1.73)
3. Prompt Optimization
Shorter prompts = lower costs. Every token you send (input) and receive (output) increases cost.
# BAD: Verbose prompt (234 tokens)
verbose_prompt = """
You are an AI assistant designed to help users with their questions.
Please read the following user query carefully and provide a detailed,
comprehensive, and accurate response. Make sure to consider all aspects
of the question and provide examples where appropriate. Your response
should be informative and helpful.
Here are some guidelines to follow:
1. Always be polite and professional
2. Provide clear explanations
3. Use examples to illustrate your points
4. Keep your response well-structured
Now, here is the user's question:
What is machine learning?
Please provide your response below:
"""
# GOOD: Concise prompt (11 tokens)
concise_prompt = "Explain machine learning in 2-3 sentences."
# Cost comparison (GPT-4 Turbo input pricing, $0.01 per 1K tokens):
# Verbose: 234 input tokens × $0.01/1K = $0.00234
# Concise: 11 input tokens × $0.01/1K = $0.00011
# Savings per query: 95.3%
# At 1M queries: $2,340 vs $110 = $2,230 saved!
🎯 Prompt Optimization Techniques
- Remove fluff: Cut unnecessary politeness, disclaimers, and context
- Use system prompts: Place instructions once in the system message instead of repeating them in every query (see the sketch after this list)
- Constrain output: Use max_tokens to prevent excessively long responses
- Test variations: Iterate to find shortest prompt that maintains quality
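Here is a minimal sketch of the system-prompt technique, assuming an OpenAI-compatible chat API; the model name, guideline text, and helper function are placeholders. The standing guidelines live in one short system message rather than being restated inside every user prompt, and max_tokens caps the output cost.

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "Be concise. Answer in 2-3 sentences with one example."

def ask(question):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # instructions kept here, not in the user text
            {"role": "user", "content": question},         # the user turn stays short
        ],
        max_tokens=150,  # constrain output length to cap output-token cost
    )
    return response.choices[0].message.content

print(ask("What is machine learning?"))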
4. Batching Requests
For self-hosted models, batch processing dramatically reduces cost per token by maximizing GPU utilization.
# Cost comparison: Self-hosted Llama 2 70B on A100 ($3/hour)
# No batching: 1 request at a time
# Throughput: 5 requests/sec
# Hourly throughput: 18,000 requests
# Cost per request: $3 / 18,000 = $0.000167
# With batching (batch_size=32):
# Throughput: 40 requests/sec
# Hourly throughput: 144,000 requests
# Cost per request: $3 / 144,000 = $0.000021
# Savings: 88% cost reduction per request!
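The sketch below shows batched offline inference with vLLM, which is one way to get the throughput assumed above; the model name, tensor_parallel_size, and prompts are illustrative. Passing a list of prompts in a single call lets the engine's continuous batching keep the GPU busy instead of serving requests one at a time.

from vllm import LLM, SamplingParams

# Illustrative: load the model once, then submit many prompts in one call
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.7, max_tokens=200)

prompts = [f"Summarize ticket #{i} in one sentence." for i in range(32)]

# One call, 32 prompts: the scheduler batches them internally
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)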
Latency Optimization Strategies
1. Streaming: The Perception Hack
Streaming is the single most impactful latency optimization for user-facing applications. It doesn't make generation faster, but it dramatically improves perceived latency.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Non-streaming (poor UX)
def generate_blocking(prompt):
    start = time.time()
    response = client.chat.completions.create(
        model="llama-2-7b",
        messages=[{"role": "user", "content": prompt}],
        stream=False
    )
    elapsed = time.time() - start
    print(f"Time to first response: {elapsed:.2f}s")  # Could be 5-10 seconds!
    print(response.choices[0].message.content)

# Streaming (excellent UX)
def generate_streaming(prompt):
    start = time.time()
    first_token_time = None
    stream = client.chat.completions.create(
        model="llama-2-7b",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time() - start
                print(f"Time to first token: {first_token_time:.2f}s")  # ~0.5s!
            print(chunk.choices[0].delta.content, end="", flush=True)

# Example
generate_streaming("Write a story about AI")
# Output:
# Time to first token: 0.48s  ← User sees output almost immediately!
# Once upon a time, in a world where artificial intelligence...
⚡ Latency Metrics
- TTFT (Time-To-First-Token): Most critical for UX - aim for <500ms
- TPOT (Time-Per-Output-Token): Throughput metric - aim for <50ms (see the measurement sketch after this list)
- End-to-end latency: Total time - less important than TTFT for streaming
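To make these metrics concrete, here is a small measurement sketch against the same local OpenAI-compatible endpoint used in the streaming example; it approximates the token count by the number of streamed chunks, which is close enough for monitoring purposes.

import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_latency(prompt, model="llama-2-7b"):
    """Measure TTFT and TPOT from a streaming response (each chunk ≈ one token)."""
    start = time.time()
    first_token_time = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time() - start  # TTFT
            chunks += 1
    total = time.time() - start
    # Average time per output token after the first one
    tpot = (total - first_token_time) / max(chunks - 1, 1)
    print(f"TTFT: {first_token_time * 1000:.0f}ms | TPOT: {tpot * 1000:.1f}ms | total: {total:.2f}s")

measure_latency("Explain streaming in one paragraph.")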
2. Model Quantization
Quantization reduces model precision (e.g., float16 → int8), which speeds up inference and reduces memory usage.
# Load quantized models with vLLM
# (illustrative: in practice you would load only one of these per process)
from vllm import LLM

# FP16 (baseline)
llm_fp16 = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    dtype="float16"  # default precision
)

# AWQ quantization (4-bit)
llm_awq = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq"
)

# GPTQ quantization (4-bit)
llm_gptq = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization="gptq"
)
| Precision | Memory | Speed | Quality Loss |
|---|---|---|---|
| FP16 (float16) | 14 GB | 1.0x (baseline) | 0% |
| INT8 (8-bit) | 7 GB | 1.5x faster | 1-2% |
| AWQ (4-bit) | 4 GB | 2-3x faster | 2-3% |
| GPTQ (4-bit) | 4 GB | 2-3x faster | 3-5% |
For most applications, AWQ 4-bit quantization provides the best balance: 2-3x faster inference, 70% memory reduction, and minimal (2-3%) quality loss.
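The memory column follows from simple arithmetic: parameter count times bytes per parameter, plus some overhead for quantization metadata, activations, and the KV cache. A rough back-of-the-envelope check for a 7B model:

# Rough memory estimate for model weights only (excludes KV cache and activations)
def weight_memory_gb(n_params_billion, bits_per_param):
    return n_params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(weight_memory_gb(7, 16))  # FP16: ~14 GB
print(weight_memory_gb(7, 8))   # INT8: ~7 GB
print(weight_memory_gb(7, 4))   # 4-bit (AWQ/GPTQ): ~3.5 GB, ~4 GB with overhead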
3. Speculative Decoding (Advanced)
Speculative decoding is a cutting-edge technique that can speed up generation by 2-3x. It uses a small "draft" model to predict multiple tokens ahead, then verifies them with the large model in a single pass.
from vllm import LLM, SamplingParams

# Load the target model (large, accurate)
target_model = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4
)

# Load the draft model (small, fast)
draft_model = LLM(
    model="meta-llama/Llama-2-7b-chat-hf"
)

def speculative_decoding(prompt, num_speculative=4):
    """
    Illustrative sketch of speculative decoding.
    How it works:
    1. The draft model generates a few candidate tokens quickly
    2. The target model verifies all candidates in a single pass
    3. Correct tokens are accepted; generation resumes from the first mismatch
    4. Repeat until the sequence is complete
    """
    current_sequence = prompt
    generated = []
    while len(generated) < 100:  # cap the number of accepted chunks
        # Draft phase: the small model proposes candidate tokens (fast)
        draft_output = draft_model.generate(
            [current_sequence],
            SamplingParams(max_tokens=num_speculative)
        )[0].outputs[0].text
        # Target phase: the large model scores the candidates in one pass
        verification_prompt = current_sequence + draft_output
        target_output = target_model.generate(
            [verification_prompt],
            SamplingParams(max_tokens=1)
        )[0].outputs[0].text
        # Accept tokens up to the first disagreement
        # (simplified here: a real implementation compares token-level outputs;
        # in practice 80-90% of draft tokens are accepted)
        accepted_tokens = draft_output
        generated.append(accepted_tokens)
        current_sequence += accepted_tokens
    return ''.join(generated)

# Performance comparison:
# Normal decoding: 100 tokens × 50ms/token = 5000ms
# Speculative (4 ahead, 85% acceptance): 100 tokens × 20ms/token = 2000ms
# Speedup: 2.5x faster!
⚠️ Speculative Decoding Trade-offs
- Complexity: Requires running two models simultaneously
- Memory: Both models must fit in VRAM
- Compatibility: Not all frameworks support it yet
- Best for: High-throughput scenarios where 2-3x speedup justifies complexity
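If your serving stack already supports speculative decoding, you usually do not need to hand-roll the loop above. The sketch below shows roughly how built-in speculative decoding is configured in recent vLLM releases; the argument names are an assumption and have changed between versions, so verify them against the documentation for your installed version.

from vllm import LLM, SamplingParams

# Assumed configuration for vLLM's built-in speculative decoding;
# check your vLLM version's docs for the exact argument names.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-2-7b-chat-hf",  # small draft model
    num_speculative_tokens=4,                           # tokens proposed per step
)

outputs = llm.generate(
    ["Write a story about AI"],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)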
4. Hardware Acceleration
| Hardware | Cost/Hour | Throughput | Cost per 1M Tokens |
|---|---|---|---|
| NVIDIA H100 | $4.00 | 80K tokens/sec | $0.014 |
| NVIDIA A100 | $3.00 | 50K tokens/sec | $0.017 |
| NVIDIA L4 | $0.80 | 10K tokens/sec | $0.022 |
| NVIDIA T4 | $0.50 | 5K tokens/sec | $0.028 |
Counterintuitively, more expensive GPUs can be cheaper per token due to higher throughput. Always calculate cost per token, not just hourly cost.
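The cost-per-token column is just hourly cost divided by hourly throughput, so it is worth recomputing with your own hardware quotes; a quick helper reproduces the table above:

def cost_per_million_tokens(cost_per_hour, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(4.00, 80_000))  # H100: ~$0.014
print(cost_per_million_tokens(3.00, 50_000))  # A100: ~$0.017
print(cost_per_million_tokens(0.50, 5_000))   # T4:   ~$0.028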
Combining Strategies: Real-World Architecture
The most effective production systems combine multiple optimization techniques. Here's a reference architecture:
import time

from vllm import LLM, SamplingParams

class ProductionLLMService:
    def __init__(self):
        # Layer 1: Semantic cache (fastest, cheapest)
        self.cache = SemanticCache(redis_client)
        # Layer 2: Model router
        self.router = IntelligentRouter()
        # Layer 3: One model endpoint per complexity tier
        self.models = {
            'simple': LLM(model="llama-2-7b", quantization="awq"),
            'medium': LLM(model="llama-2-13b", quantization="awq"),
            'complex': LLM(model="llama-2-70b", dtype="float16")
        }

    def generate(self, prompt, params):
        # Try the cache first (saves roughly half of the calls)
        cached = self.cache.search(prompt, params)
        if cached:
            return {
                'response': cached,
                'source': 'cache',
                'cost': 0.0,
                'latency': 0.02  # ~20ms cache lookup
            }
        # Route to the appropriate complexity tier
        complexity = self.router.classify_complexity(prompt)
        model_name = self.router.route(prompt)
        model = self.models[complexity]
        # Generate (enable streaming at the API layer for user-facing traffic)
        start_time = time.time()
        response = model.generate([prompt], SamplingParams(**params))[0].outputs[0].text
        latency = time.time() - start_time
        # Estimate cost from a rough token count (~1.3 tokens per word)
        input_tokens = len(prompt.split()) * 1.3
        output_tokens = len(response.split()) * 1.3
        cost = self._calculate_cost(complexity, input_tokens, output_tokens)
        # Store in the cache for future queries
        self.cache.store(prompt, params, response)
        return {
            'response': response,
            'source': f'{model_name}_model',
            'cost': cost,
            'latency': latency
        }

    def _calculate_cost(self, complexity, input_tokens, output_tokens):
        # Approximate cost per 1K tokens for each tier
        pricing = {
            'simple': 0.0001,
            'medium': 0.0003,
            'complex': 0.001
        }
        cost_per_token = pricing[complexity] / 1_000
        return (input_tokens + output_tokens) * cost_per_token

# Usage and results
service = ProductionLLMService()

# Example query
result = service.generate(
    "What is Python?",
    {"temperature": 0.7, "max_tokens": 200}
)

print(f"Response: {result['response'][:100]}...")
print(f"Source: {result['source']}")
print(f"Cost: ${result['cost']:.6f}")
print(f"Latency: {result['latency']:.3f}s")
Performance Metrics
| Metric | Baseline (No Optimization) | With All Optimizations | Improvement |
|---|---|---|---|
| Avg Cost/Query | $0.0020 | $0.0002 | 10x cheaper |
| Avg Latency (TTFT) | 2.5s | 0.4s | 6x faster |
| Monthly Cost (1M queries) | $2,000 | $200 | $1,800 saved |
| User Satisfaction | 60% | 92% | +32 points |
Summary: Building Cost-Effective, Fast Systems
🔑 Key Takeaways
- Semantic caching: Single biggest cost saver—30-70% reduction with high cache hit rates
- Intelligent routing: Use cheap models for simple queries—80%+ cost reduction possible
- Streaming: Most impactful UX improvement—perceived latency drops from 5s to <0.5s
- Quantization: 2-3x faster with minimal quality loss—use AWQ 4-bit for best balance
- Combine strategies: 10-100x cost reduction and 5-10x latency improvement achievable
- Measure everything: Track cost per query, TTFT, cache hit rate, model distribution
In the final chapter, we'll explore production monitoring and observability—how to track these metrics, identify bottlenecks, and continuously optimize your LLM infrastructure.