"You can't improve what you can't measure." This age-old adage is especially true for LLM systems, where opaque model behavior, variable costs, and complex performance characteristics make monitoring essential. Without proper observability, you're flying blind—unable to detect issues, optimize performance, or make data-driven decisions.
This chapter covers the three pillars of LLM observability (Cost, Latency, Quality), shows you how to build production monitoring systems with Prometheus and Grafana, and provides real-world examples of tracking, alerting, and debugging LLM applications.
The Three Pillars of LLM Observability
LLM observability differs from traditional application monitoring. You need to track not just system metrics (CPU, memory) but also LLM-specific concerns: token usage, generation quality, and model behavior.
Pillar 1: Cost Tracking
Cost is often the most immediate concern for production LLM systems. Without tracking, expenses can spiral out of control.
- Token consumption: Input and output tokens per request, per user, per feature
- Cost per query: Actual dollar cost of each request
- Cost by model: Which models are responsible for most spending?
- Daily/monthly spend: Track against budget limits
- Cache hit rate: Percentage of requests served from cache (cost-free)
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
token_counter = Counter(
    'llm_tokens_total',
    'Total tokens consumed',
    ['model', 'type']  # type: input or output
)

cost_histogram = Histogram(
    'llm_cost_per_request_usd',
    'Cost per request in USD',
    ['model', 'complexity']
)

daily_spend_gauge = Gauge(
    'llm_daily_spend_usd',
    'Total spend today in USD'
)

cache_hit_counter = Counter(
    'llm_cache_hits_total',
    'Cache hit count',
    ['cache_type']  # exact or semantic
)

class CostTracker:
    def __init__(self):
        self.daily_spend = 0.0

    def track_request(self, model, input_tokens, output_tokens, complexity):
        # Update token counters
        token_counter.labels(model=model, type='input').inc(input_tokens)
        token_counter.labels(model=model, type='output').inc(output_tokens)

        # Calculate cost
        cost = self._calculate_cost(model, input_tokens, output_tokens)

        # Update cost metrics
        cost_histogram.labels(model=model, complexity=complexity).observe(cost)
        self.daily_spend += cost
        daily_spend_gauge.set(self.daily_spend)

        return cost

    def _calculate_cost(self, model, input_tokens, output_tokens):
        # Pricing per 1M tokens
        pricing = {
            'gpt-4-turbo': {'input': 10.0, 'output': 30.0},
            'gpt-3.5-turbo': {'input': 0.5, 'output': 1.5},
            'llama-2-70b': {'input': 0.3, 'output': 0.3},
        }
        if model not in pricing:
            return 0.0

        cost = (
            (input_tokens / 1_000_000) * pricing[model]['input'] +
            (output_tokens / 1_000_000) * pricing[model]['output']
        )
        return cost

    def track_cache_hit(self, cache_type):
        cache_hit_counter.labels(cache_type=cache_type).inc()

# Usage
tracker = CostTracker()

# Track an expensive request
tracker.track_request(
    model='gpt-4-turbo',
    input_tokens=1500,
    output_tokens=800,
    complexity='complex'
)

# Track a cache hit
tracker.track_cache_hit(cache_type='semantic')
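The bullet list above also calls for tracking spend against budget limits, which CostTracker does not enforce. Below is a minimal sketch of one way to layer that on, assuming a single-process service; DAILY_BUDGET_USD and the day-rollover reset are illustrative choices, not part of the tracker above.

import datetime

DAILY_BUDGET_USD = 200.0  # hypothetical daily budget -- tune for your workload

class BudgetAwareCostTracker(CostTracker):
    """Extends the CostTracker above with a naive daily reset and budget check."""

    def __init__(self):
        super().__init__()
        self._current_day = datetime.date.today()

    def track_request(self, model, input_tokens, output_tokens, complexity):
        # Reset the running total when the calendar day rolls over
        today = datetime.date.today()
        if today != self._current_day:
            self._current_day = today
            self.daily_spend = 0.0
            daily_spend_gauge.set(0.0)

        cost = super().track_request(model, input_tokens, output_tokens, complexity)

        # Surface a warning once spend crosses the budget; the alerting rules
        # later in this chapter can enforce the same threshold in Prometheus.
        if self.daily_spend > DAILY_BUDGET_USD:
            print(f"WARNING: daily spend ${self.daily_spend:.2f} "
                  f"exceeds budget ${DAILY_BUDGET_USD:.2f}")
        return cost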
Pillar 2: Latency Monitoring
Latency directly impacts user experience. Track multiple latency metrics to understand system performance.
- TTFT (Time-To-First-Token): Most important for streaming—target <500ms
- TPOT (Time-Per-Output-Token): Measures throughput—target <50ms
- End-to-end latency: Total request time
- Queue wait time: How long requests wait before processing
- P50, P95, P99 latencies: Distribution of latencies across all requests
from prometheus_client import Histogram
import time

# Latency metrics
ttft_histogram = Histogram(
    'llm_ttft_seconds',
    'Time to first token',
    ['model'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0]  # Focus on sub-second
)

tpot_histogram = Histogram(
    'llm_tpot_seconds',
    'Time per output token',
    ['model'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
)

e2e_latency_histogram = Histogram(
    'llm_e2e_latency_seconds',
    'End-to-end request latency',
    ['model', 'cache_status'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

class LatencyTracker:
    def track_streaming_request(self, model, prompt, generate_fn):
        """Track latency for a streaming request"""
        start_time = time.time()
        first_token_time = None
        token_count = 0
        last_token_time = start_time

        # Stream generation
        for token in generate_fn(prompt):
            token_count += 1
            current_time = time.time()

            if first_token_time is None:
                # Record TTFT for the first token
                first_token_time = current_time
                ttft = first_token_time - start_time
                ttft_histogram.labels(model=model).observe(ttft)
            else:
                # Record TPOT for every subsequent token
                tpot = current_time - last_token_time
                tpot_histogram.labels(model=model).observe(tpot)

            last_token_time = current_time
            yield token

        # Record end-to-end latency
        e2e_latency = time.time() - start_time
        e2e_latency_histogram.labels(
            model=model,
            cache_status='miss'
        ).observe(e2e_latency)

# Usage
tracker = LatencyTracker()

def my_generate_fn(prompt):
    # Your generation logic here
    for token in llm.generate_stream(prompt):
        yield token

# Track latency during generation
for token in tracker.track_streaming_request('llama-2-7b', prompt, my_generate_fn):
    print(token, end='', flush=True)
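The latency bullets also mention queue wait time, which the streaming tracker above does not capture. A minimal sketch follows, assuming your serving layer enqueues requests before a worker picks them up; the metric name and the QueuedRequest wrapper are illustrative, not part of the tracker above.

from prometheus_client import Histogram
import time

# Hypothetical metric for time spent waiting before processing starts
queue_wait_histogram = Histogram(
    'llm_queue_wait_seconds',
    'Time a request waits in the queue before processing starts',
    ['model'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

class QueuedRequest:
    """Record the enqueue time so the wait can be observed when work starts."""

    def __init__(self, model, prompt):
        self.model = model
        self.prompt = prompt
        self.enqueued_at = time.time()

    def mark_started(self):
        # Called by the worker the moment it picks the request up
        queue_wait_histogram.labels(model=self.model).observe(
            time.time() - self.enqueued_at
        )

# Usage (assuming a worker loop that pulls requests from a queue)
req = QueuedRequest('llama-2-7b', 'What is AI?')
# ... request sits in the queue ...
req.mark_started()  # observe the wait just before inference begins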
Pillar 3: Quality Monitoring
Quality is the hardest to measure but most important. A system that's fast and cheap but produces poor outputs is worthless.
- User feedback: Thumbs up/down, star ratings
- Error rates: Empty responses, timeouts, malformed outputs
- Hallucination detection: Factual accuracy checks
- Semantic similarity: Compare to golden answers (for known queries)
- Output length: Too short or too long indicates issues
from prometheus_client import Counter, Histogram

# Quality metrics
user_feedback_counter = Counter(
    'llm_user_feedback_total',
    'User feedback count',
    ['model', 'rating']  # rating: positive, negative, neutral
)

error_counter = Counter(
    'llm_errors_total',
    'Error count',
    ['model', 'error_type']  # timeout, empty, malformed, etc.
)

output_length_histogram = Histogram(
    'llm_output_length_tokens',
    'Output length in tokens',
    ['model'],
    buckets=[10, 50, 100, 250, 500, 1000, 2000]
)

class QualityTracker:
    def track_user_feedback(self, model, rating):
        """Record user feedback"""
        user_feedback_counter.labels(model=model, rating=rating).inc()

    def track_error(self, model, error_type):
        """Record errors"""
        error_counter.labels(model=model, error_type=error_type).inc()

    def track_output(self, model, output_text):
        """Track output metrics"""
        # Approximate token count by whitespace-splitting;
        # use the model's tokenizer if you need exact counts
        token_count = len(output_text.split())
        output_length_histogram.labels(model=model).observe(token_count)

        # Check for common issues
        if token_count == 0:
            self.track_error(model, 'empty_response')
        elif token_count < 5:
            self.track_error(model, 'too_short')
        elif token_count > 2000:
            self.track_error(model, 'too_long')

# Usage
quality_tracker = QualityTracker()

# Track user feedback
quality_tracker.track_user_feedback('gpt-4-turbo', 'positive')
quality_tracker.track_user_feedback('llama-2-7b', 'negative')

# Track output quality
output = "This is a sample model output with sufficient length."
quality_tracker.track_output('gpt-3.5-turbo', output)
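The quality bullets also mention semantic similarity against golden answers for known queries. Here is a hedged sketch using the sentence-transformers package (an extra dependency, not required by QualityTracker); the model name, threshold, and golden_answers mapping are placeholders, and it reuses the quality_tracker defined above.

from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer('all-MiniLM-L6-v2')

# Hypothetical golden answers for known queries
golden_answers = {
    "What is the capital of France?": "The capital of France is Paris.",
}

def check_against_golden(model, query, output_text, threshold=0.7):
    """Return cosine similarity to the golden answer, or None for unknown queries."""
    golden = golden_answers.get(query)
    if golden is None:
        return None
    embeddings = similarity_model.encode([output_text, golden], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    if score < threshold:
        # Count low-similarity outputs using the QualityTracker defined above
        quality_tracker.track_error(model, 'low_similarity')
    return score

# Usage
check_against_golden('gpt-3.5-turbo', "What is the capital of France?",
                     "Paris is the capital of France.")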
Building a Monitoring Stack: Prometheus + Grafana
Prometheus and Grafana form the industry-standard monitoring stack. Prometheus collects and stores metrics, while Grafana visualizes them with beautiful dashboards.
Step 1: Expose Metrics Endpoint
from prometheus_client import start_http_server, Counter, Histogram, Gauge
from flask import Flask, request, jsonify
import time

# Initialize Flask app
app = Flask(__name__)

# Start the Prometheus metrics server on port 8001
# (9090 is left free for the Prometheus server itself)
start_http_server(8001)

# Define all metrics
request_count = Counter('llm_requests_total', 'Total requests', ['model', 'status'])
latency_histogram = Histogram('llm_latency_seconds', 'Request latency', ['model'])
active_requests = Gauge('llm_active_requests', 'Currently active requests')
token_gauge = Gauge('llm_tokens_per_second', 'Current tokens/second throughput')

@app.route('/generate', methods=['POST'])
def generate():
    model = request.json.get('model', 'default')
    prompt = request.json.get('prompt')

    # Track request
    active_requests.inc()
    start_time = time.time()

    try:
        # Generate (your logic here)
        response = llm.generate(prompt, model=model)

        # Track success
        request_count.labels(model=model, status='success').inc()
        latency_histogram.labels(model=model).observe(time.time() - start_time)

        return jsonify({'response': response})
    except Exception as e:
        # Track error
        request_count.labels(model=model, status='error').inc()
        return jsonify({'error': str(e)}), 500
    finally:
        active_requests.dec()

if __name__ == '__main__':
    # Main app on port 8000
    app.run(host='0.0.0.0', port=8000)
    # Metrics exposed at http://localhost:8001/metrics
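start_http_server runs a second HTTP server just for metrics. If you prefer a single port, prometheus_client can render the metrics payload inside the Flask app itself; a minimal alternative sketch:

from flask import Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.route('/metrics')
def metrics():
    # Render the current state of all registered metrics in Prometheus text format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

With this route, the Prometheus scrape target would be localhost:8000 rather than a separate metrics port.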
Step 2: Configure Prometheus
# prometheus.yml
global:
  scrape_interval: 15s      # Scrape metrics every 15 seconds
  evaluation_interval: 15s

scrape_configs:
  # Scrape LLM service metrics
  - job_name: 'llm-service'
    static_configs:
      - targets: ['localhost:8001']  # Your metrics endpoint (see start_http_server above)

  # Scrape vLLM built-in metrics (if using vLLM)
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']  # vLLM exposes /metrics on its API port

# Alerting rules
rule_files:
  - 'alerts.yml'

# Alertmanager configuration (optional)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
Step 3: Start Prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64/
# Start Prometheus
./prometheus --config.file=prometheus.yml
# Prometheus UI available at http://localhost:9090
Step 4: Configure Grafana
# Install Grafana (Docker)
docker run -d \
--name=grafana \
-p 3000:3000 \
grafana/grafana
# Access Grafana at http://localhost:3000
# Default credentials: admin / admin
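Before building dashboards, Grafana needs Prometheus registered as a data source. You can do this through the UI, or script it against Grafana's HTTP API; a sketch assuming the fresh container above with its default admin/admin credentials:

import requests

GRAFANA_URL = "http://localhost:3000"  # from the docker run above

# Register Prometheus as a Grafana data source via the HTTP API
datasource = {
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": True,
}
resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=datasource,
    auth=("admin", "admin"),
    timeout=10,
)
resp.raise_for_status()
print("Data source created, status:", resp.status_code)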
Step 5: Create Dashboards
// Grafana Dashboard JSON (LLM Monitoring)
{
  "dashboard": {
    "title": "LLM Production Monitoring",
    "panels": [
      {
        "title": "Requests per Second",
        "targets": [{
          "expr": "rate(llm_requests_total[1m])"
        }],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m]))"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(llm_requests_total{status=\"error\"}[5m])) / sum(rate(llm_requests_total[5m]))"
        }],
        "type": "graph"
      },
      {
        "title": "Daily Spend",
        "targets": [{
          "expr": "llm_daily_spend_usd"
        }],
        "type": "stat"
      },
      {
        "title": "Cache Hit Rate",
        "targets": [{
          "expr": "sum(rate(llm_cache_hits_total[5m])) / sum(rate(llm_requests_total[5m]))"
        }],
        "type": "gauge"
      },
      {
        "title": "Tokens per Second",
        "targets": [{
          "expr": "rate(llm_tokens_total[1m])"
        }],
        "type": "graph"
      }
    ]
  }
}
Example Dashboard Queries
| Metric | PromQL Query |
|---|---|
| Requests/sec by model | sum by (model) (rate(llm_requests_total[1m])) |
| P95 latency | histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) |
| Error rate % | sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m])) * 100 |
| Cost per hour | sum(increase(llm_cost_per_request_usd_sum[1h])) |
| Cache hit rate | sum(rate(llm_cache_hits_total[5m])) / sum(rate(llm_requests_total[5m])) |
| Avg tokens/request | sum(rate(llm_tokens_total[5m])) / sum(rate(llm_requests_total[5m])) |
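These queries can also be pulled programmatically, for example to feed a weekly cost review. A minimal sketch against the Prometheus HTTP API, assuming Prometheus runs at localhost:9090 as configured above:

import requests

PROMETHEUS_URL = "http://localhost:9090"  # adjust to where Prometheus runs

def query_prometheus(promql):
    """Run an instant query and return a list of (labels, value) results."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    if data["status"] != "success":
        raise RuntimeError(f"Query failed: {data}")
    return [(r["metric"], float(r["value"][1])) for r in data["data"]["result"]]

# Example: total cost over the last hour
for labels, value in query_prometheus("sum(increase(llm_cost_per_request_usd_sum[1h]))"):
    print(f"Cost in the last hour: ${value:.2f}")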
Alerting: Detect Issues Before Users Do
Monitoring dashboards are passive—you have to look at them. Alerts are proactive, notifying you when things go wrong.
Alert Rules Configuration
# alerts.yml
groups:
  - name: llm_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }}s (threshold: 5s)"

      # Cost spike
      - alert: CostSpike
        expr: delta(llm_daily_spend_usd[1h]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cost spike detected"
          description: "Spend increased by ${{ $value }} in the last hour"

      # Low cache hit rate
      - alert: LowCacheHitRate
        expr: sum(rate(llm_cache_hits_total[10m])) / sum(rate(llm_requests_total[10m])) < 0.3
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Cache hit rate below target"
          description: "Cache hit rate: {{ $value | humanizePercentage }} (target: 30%+)"

      # Service down
      - alert: ServiceDown
        expr: up{job="llm-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LLM service is down"
          description: "The LLM service has been unreachable for 1 minute"

      # High queue depth
      - alert: HighQueueDepth
        expr: llm_active_requests > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request queue depth"
          description: "{{ $value }} requests in flight (threshold: 50)"
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'

  # Route critical alerts to PagerDuty
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  # Slack notifications
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llm-alerts'
        title: 'LLM Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  # PagerDuty for critical issues
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
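Once routing is configured, it helps to verify the Slack/PagerDuty path end to end before a real incident. Below is a hedged sketch that posts a synthetic alert to Alertmanager's v2 API, assuming it listens on localhost:9093 as in prometheus.yml; the label and annotation values are placeholders.

import datetime
import requests

ALERTMANAGER_URL = "http://localhost:9093"  # matches the target in prometheus.yml

def send_test_alert():
    """Post a synthetic alert to Alertmanager to verify notification routing."""
    now = datetime.datetime.now(datetime.timezone.utc)
    alert = [{
        "labels": {
            "alertname": "TestAlert",
            "severity": "warning",
            "job": "llm-service",
        },
        "annotations": {
            "summary": "Test alert from the LLM service",
            "description": "Manually generated to verify the notification pipeline",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=alert, timeout=10)
    resp.raise_for_status()
    print("Test alert accepted:", resp.status_code)

send_test_alert()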
Logging and Distributed Tracing
Structured Logging
import logging
import json
from datetime import datetime

class LLMLogger:
    def __init__(self):
        self.logger = logging.getLogger('llm_service')
        self.logger.setLevel(logging.INFO)

        # Plain stream handler; each record below is already a JSON string
        handler = logging.StreamHandler()
        self.logger.addHandler(handler)

    def log_request(self, request_id, model, prompt, user_id=None):
        """Log incoming request"""
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'request_received',
            'request_id': request_id,
            'model': model,
            'prompt_length': len(prompt),
            'user_id': user_id
        }
        self.logger.info(json.dumps(log_data))

    def log_response(self, request_id, model, response, latency, cost):
        """Log response"""
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'response_generated',
            'request_id': request_id,
            'model': model,
            'response_length': len(response),
            'latency_seconds': latency,
            'cost_usd': cost
        }
        self.logger.info(json.dumps(log_data))

    def log_error(self, request_id, model, error_type, error_message):
        """Log errors"""
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'error',
            'request_id': request_id,
            'model': model,
            'error_type': error_type,
            'error_message': error_message
        }
        self.logger.error(json.dumps(log_data))

# Usage
logger = LLMLogger()

request_id = "req_abc123"
logger.log_request(request_id, "gpt-4-turbo", "What is AI?", user_id="user_456")
# ... generate response ...
logger.log_response(request_id, "gpt-4-turbo", response, latency=1.2, cost=0.003)
Distributed Tracing (OpenTelemetry)
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Configure tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to Jaeger
jaeger_exporter = JaegerExporter(
    agent_host_name='localhost',
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

def generate_with_tracing(prompt, model):
    start_time = time.time()

    # Create parent span
    with tracer.start_as_current_span("llm_generation") as span:
        span.set_attribute("model", str(model))
        span.set_attribute("prompt_length", len(prompt))

        # Cache lookup span
        with tracer.start_as_current_span("cache_lookup"):
            cached = check_cache(prompt)  # your cache lookup here
            if cached:
                span.set_attribute("cache_hit", True)
                return cached

        # Inference span
        with tracer.start_as_current_span("model_inference") as inf_span:
            inf_span.set_attribute("cache_hit", False)

            # Track prefill phase
            with tracer.start_as_current_span("prefill"):
                embeddings = model.encode(prompt)

            # Track generation phase
            with tracer.start_as_current_span("generation"):
                response = model.generate(embeddings)

            inf_span.set_attribute("output_length", len(response))

        # Cache storage span
        with tracer.start_as_current_span("cache_store"):
            store_in_cache(prompt, response)  # your cache write here

        span.set_attribute("total_latency_seconds", time.time() - start_time)
        return response

# Traces visible in Jaeger UI: http://localhost:16686
Production Readiness Checklist
✅ Metrics & Monitoring
- ☐ Prometheus metrics exposed on /metrics endpoint
- ☐ Grafana dashboard showing cost, latency, quality
- ☐ Track TTFT, TPOT, end-to-end latency
- ☐ Monitor token consumption and daily spend
- ☐ Track cache hit rates
- ☐ Monitor error rates and types
✅ Alerting
- ☐ Alert on high error rate (>5%)
- ☐ Alert on high latency (P95 > 5s)
- ☐ Alert on cost spikes
- ☐ Alert when service is down
- ☐ Slack/PagerDuty integration configured
- ☐ On-call rotation established
✅ Logging
- ☐ Structured JSON logging
- ☐ Log every request with unique request_id
- ☐ Log errors with full stack traces
- ☐ Centralized log aggregation (ELK, Datadog, etc.)
- ☐ Log retention policy defined
✅ Performance
- ☐ Semantic caching implemented
- ☐ Model routing based on complexity
- ☐ Streaming enabled for all endpoints
- ☐ Load balancing across multiple instances
- ☐ Auto-scaling configured
- ☐ Rate limiting to prevent abuse
✅ Cost Management
- ☐ Budget alerts configured
- ☐ Cost per query tracked
- ☐ Expensive queries identified and optimized
- ☐ Prompt optimization ongoing
- ☐ Weekly cost review process
✅ Quality Assurance
- ☐ User feedback mechanism implemented
- ☐ A/B testing framework ready
- ☐ Regression testing for model changes
- ☐ Output validation rules
- ☐ Hallucination detection (for critical use cases)
Summary: Building Observable LLM Systems
🔑 Key Takeaways
- Monitor the three pillars: Cost, latency, and quality—optimize the balance based on your use case
- Prometheus + Grafana is the standard: Easy to set up, scales to enterprise, rich visualization
- Alert proactively: Don't wait for bug reports; detect and fix problems before users feel the impact
- Track TTFT obsessively: Time-to-first-token is the most critical user experience metric
- Log everything: Structured logs enable debugging, auditing, and compliance
- Iterate based on data: Use metrics to make informed optimization decisions, not guesses
- Build checklists: Production readiness isn't optional—systematic monitoring prevents disasters
Congratulations! You've completed Module 5: LLMs in Production. You now have the knowledge to build, deploy, optimize, and monitor production-grade LLM systems that are fast, cost-effective, and reliable.
🎓 What You've Learned
- Why naive serving fails and how to fix it
- KV caching, PagedAttention, and continuous batching
- Production frameworks: vLLM, TGI, TensorRT-LLM
- Cost optimization: caching, routing, quantization
- Latency optimization: streaming, quantization, speculative decoding
- Monitoring: Prometheus, Grafana, alerting, logging
Next steps: Apply these techniques to your own projects. Start with vLLM and semantic caching for immediate 10-20x performance improvements. Build monitoring dashboards to track your progress. Iterate based on real data. Good luck building production LLM systems!