"You can't improve what you can't measure." This age-old adage is especially true for LLM systems, where opaque model behavior, variable costs, and complex performance characteristics make monitoring essential. Without proper observability, you're flying blind—unable to detect issues, optimize performance, or make data-driven decisions.
This chapter covers the three pillars of LLM observability (Cost, Latency, Quality), shows you how to build production monitoring systems with Prometheus and Grafana, and provides real-world examples of tracking, alerting, and debugging LLM applications.
The Three Pillars of LLM Observability
LLM observability differs from traditional application monitoring. You need to track not just system metrics (CPU, memory) but also LLM-specific concerns: token usage, generation quality, and model behavior.
Pillar 1: Cost Tracking
Cost is often the most immediate concern for production LLM systems. Without tracking, expenses can spiral out of control.
- Token consumption: Input and output tokens per request, per user, per feature
- Cost per query: Actual dollar cost of each request
- Cost by model: Which models are responsible for most spending?
- Daily/monthly spend: Track against budget limits
- Cache hit rate: Percentage of requests served from cache (cost-free)
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
token_counter = Counter(
    'llm_tokens_total',
    'Total tokens consumed',
    ['model', 'type']  # type: input or output
)

cost_histogram = Histogram(
    'llm_cost_per_request_usd',
    'Cost per request in USD',
    ['model', 'complexity']
)

daily_spend_gauge = Gauge(
    'llm_daily_spend_usd',
    'Total spend today in USD'
)

cache_hit_counter = Counter(
    'llm_cache_hits_total',
    'Cache hit count',
    ['cache_type']  # exact or semantic
)

class CostTracker:
    def __init__(self):
        self.daily_spend = 0.0

    def track_request(self, model, input_tokens, output_tokens, complexity):
        # Update token counters
        token_counter.labels(model=model, type='input').inc(input_tokens)
        token_counter.labels(model=model, type='output').inc(output_tokens)

        # Calculate cost
        cost = self._calculate_cost(model, input_tokens, output_tokens)

        # Update cost metrics
        cost_histogram.labels(model=model, complexity=complexity).observe(cost)
        self.daily_spend += cost
        daily_spend_gauge.set(self.daily_spend)

        return cost

    def _calculate_cost(self, model, input_tokens, output_tokens):
        # Pricing per 1M tokens
        pricing = {
            'gpt-4-turbo': {'input': 10.0, 'output': 30.0},
            'gpt-3.5-turbo': {'input': 0.5, 'output': 1.5},
            'llama-2-70b': {'input': 0.3, 'output': 0.3},
        }
        if model not in pricing:
            return 0.0

        cost = (
            (input_tokens / 1_000_000) * pricing[model]['input'] +
            (output_tokens / 1_000_000) * pricing[model]['output']
        )
        return cost

    def track_cache_hit(self, cache_type):
        cache_hit_counter.labels(cache_type=cache_type).inc()

# Usage
tracker = CostTracker()

# Track an expensive request
tracker.track_request(
    model='gpt-4-turbo',
    input_tokens=1500,
    output_tokens=800,
    complexity='complex'
)

# Track a cache hit
tracker.track_cache_hit(cache_type='semantic')
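The bullet list above also calls for tracking spend against budget limits, which CostTracker does not enforce. Below is a minimal sketch of one way to layer that on, assuming a single-process service; DAILY_BUDGET_USD and the day-rollover reset are illustrative choices, not part of the tracker above.

import datetime

DAILY_BUDGET_USD = 200.0  # hypothetical daily budget -- tune for your workload

class BudgetAwareCostTracker(CostTracker):
    """Extends the CostTracker above with a naive daily reset and budget check."""

    def __init__(self):
        super().__init__()
        self._current_day = datetime.date.today()

    def track_request(self, model, input_tokens, output_tokens, complexity):
        # Reset the running total when the calendar day rolls over
        today = datetime.date.today()
        if today != self._current_day:
            self._current_day = today
            self.daily_spend = 0.0
            daily_spend_gauge.set(0.0)

        cost = super().track_request(model, input_tokens, output_tokens, complexity)

        # Surface a warning once spend crosses the budget; the alerting rules
        # later in this chapter can enforce the same threshold in Prometheus.
        if self.daily_spend > DAILY_BUDGET_USD:
            print(f"WARNING: daily spend ${self.daily_spend:.2f} "
                  f"exceeds budget ${DAILY_BUDGET_USD:.2f}")
        return cost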
Pillar 2: Latency Monitoring
Latency directly impacts user experience. Track multiple latency metrics to understand system performance.
- TTFT (Time-To-First-Token): Most important for streaming—target <500ms
- TPOT (Time-Per-Output-Token): Measures throughput—target <50ms
- End-to-end latency: Total request time
- Queue wait time: How long requests wait before processing
- P50, P95, P99 latencies: Distribution of latencies across all requests
from prometheus_client import Histogram
import time

# Latency metrics
ttft_histogram = Histogram(
    'llm_ttft_seconds',
    'Time to first token',
    ['model'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0]  # Focus on sub-second
)

tpot_histogram = Histogram(
    'llm_tpot_seconds',
    'Time per output token',
    ['model'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
)

e2e_latency_histogram = Histogram(
    'llm_e2e_latency_seconds',
    'End-to-end request latency',
    ['model', 'cache_status'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

class LatencyTracker:
    def track_streaming_request(self, model, prompt, generate_fn):
        """Track latency for a streaming request"""
        start_time = time.time()
        first_token_time = None
        token_count = 0
        last_token_time = start_time

        # Stream generation
        for token in generate_fn(prompt):
            token_count += 1
            current_time = time.time()

            if first_token_time is None:
                # Record TTFT for the first token
                first_token_time = current_time
                ttft = first_token_time - start_time
                ttft_histogram.labels(model=model).observe(ttft)
            else:
                # Record TPOT for every subsequent token
                tpot = current_time - last_token_time
                tpot_histogram.labels(model=model).observe(tpot)

            last_token_time = current_time
            yield token

        # Record end-to-end latency
        e2e_latency = time.time() - start_time
        e2e_latency_histogram.labels(
            model=model,
            cache_status='miss'
        ).observe(e2e_latency)

# Usage
tracker = LatencyTracker()

def my_generate_fn(prompt):
    # Your generation logic here
    for token in llm.generate_stream(prompt):
        yield token

# Track latency during generation
for token in tracker.track_streaming_request('llama-2-7b', prompt, my_generate_fn):
    print(token, end='', flush=True)
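The latency bullets also mention queue wait time, which the streaming tracker above does not capture. A minimal sketch follows, assuming your serving layer enqueues requests before a worker picks them up; the metric name and the QueuedRequest wrapper are illustrative, not part of the tracker above.

from prometheus_client import Histogram
import time

# Hypothetical metric for time spent waiting before processing starts
queue_wait_histogram = Histogram(
    'llm_queue_wait_seconds',
    'Time a request waits in the queue before processing starts',
    ['model'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

class QueuedRequest:
    """Record the enqueue time so the wait can be observed when work starts."""

    def __init__(self, model, prompt):
        self.model = model
        self.prompt = prompt
        self.enqueued_at = time.time()

    def mark_started(self):
        # Called by the worker the moment it picks the request up
        queue_wait_histogram.labels(model=self.model).observe(
            time.time() - self.enqueued_at
        )

# Usage (assuming a worker loop that pulls requests from a queue)
req = QueuedRequest('llama-2-7b', 'What is AI?')
# ... request sits in the queue ...
req.mark_started()  # observe the wait just before inference begins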
Pillar 3: Quality Monitoring
Quality is the hardest to measure but most important. A system that's fast and cheap but produces poor outputs is worthless.
- User feedback: Thumbs up/down, star ratings
- Error rates: Empty responses, timeouts, malformed outputs
- Hallucination detection: Factual accuracy checks
- Semantic similarity: Compare to golden answers (for known queries)
- Output length: Too short or too long indicates issues
from prometheus_client import Counter, Histogram

# Quality metrics
user_feedback_counter = Counter(
    'llm_user_feedback_total',
    'User feedback count',
    ['model', 'rating']  # rating: positive, negative, neutral
)

error_counter = Counter(
    'llm_errors_total',
    'Error count',
    ['model', 'error_type']  # timeout, empty, malformed, etc.
)

output_length_histogram = Histogram(
    'llm_output_length_tokens',
    'Output length in tokens',
    ['model'],
    buckets=[10, 50, 100, 250, 500, 1000, 2000]
)

class QualityTracker:
    def track_user_feedback(self, model, rating):
        """Record user feedback"""
        user_feedback_counter.labels(model=model, rating=rating).inc()

    def track_error(self, model, error_type):
        """Record errors"""
        error_counter.labels(model=model, error_type=error_type).inc()

    def track_output(self, model, output_text):
        """Track output metrics"""
        # Approximate token count by whitespace-splitting;
        # use the model's tokenizer if you need exact counts
        token_count = len(output_text.split())
        output_length_histogram.labels(model=model).observe(token_count)

        # Check for common issues
        if token_count == 0:
            self.track_error(model, 'empty_response')
        elif token_count < 5:
            self.track_error(model, 'too_short')
        elif token_count > 2000:
            self.track_error(model, 'too_long')

# Usage
quality_tracker = QualityTracker()

# Track user feedback
quality_tracker.track_user_feedback('gpt-4-turbo', 'positive')
quality_tracker.track_user_feedback('llama-2-7b', 'negative')

# Track output quality
output = "This is a sample model output with sufficient length."
quality_tracker.track_output('gpt-3.5-turbo', output)
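The quality bullets also mention semantic similarity against golden answers for known queries. Here is a hedged sketch using the sentence-transformers package (an extra dependency, not required by QualityTracker); the model name, threshold, and golden_answers mapping are placeholders, and it reuses the quality_tracker defined above.

from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer('all-MiniLM-L6-v2')

# Hypothetical golden answers for known queries
golden_answers = {
    "What is the capital of France?": "The capital of France is Paris.",
}

def check_against_golden(model, query, output_text, threshold=0.7):
    """Return cosine similarity to the golden answer, or None for unknown queries."""
    golden = golden_answers.get(query)
    if golden is None:
        return None
    embeddings = similarity_model.encode([output_text, golden], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    if score < threshold:
        # Count low-similarity outputs using the QualityTracker defined above
        quality_tracker.track_error(model, 'low_similarity')
    return score

# Usage
check_against_golden('gpt-3.5-turbo', "What is the capital of France?",
                     "Paris is the capital of France.")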
Building a Monitoring Stack: Prometheus + Grafana
Prometheus and Grafana form the industry-standard monitoring stack. Prometheus collects and stores metrics, while Grafana visualizes them with beautiful dashboards.
Step 1: Expose Metrics Endpoint
from prometheus_client import start_http_server, Counter, Histogram, Gauge
from flask import Flask, request, jsonify
import time

# Initialize Flask app
app = Flask(__name__)

# Start the Prometheus metrics server on port 8001
# (9090 is left free for the Prometheus server itself)
start_http_server(8001)

# Define all metrics
request_count = Counter('llm_requests_total', 'Total requests', ['model', 'status'])
latency_histogram = Histogram('llm_latency_seconds', 'Request latency', ['model'])
active_requests = Gauge('llm_active_requests', 'Currently active requests')
token_gauge = Gauge('llm_tokens_per_second', 'Current tokens/second throughput')

@app.route('/generate', methods=['POST'])
def generate():
    model = request.json.get('model', 'default')
    prompt = request.json.get('prompt')

    # Track request
    active_requests.inc()
    start_time = time.time()

    try:
        # Generate (your logic here)
        response = llm.generate(prompt, model=model)

        # Track success
        request_count.labels(model=model, status='success').inc()
        latency_histogram.labels(model=model).observe(time.time() - start_time)

        return jsonify({'response': response})
    except Exception as e:
        # Track error
        request_count.labels(model=model, status='error').inc()
        return jsonify({'error': str(e)}), 500
    finally:
        active_requests.dec()

if __name__ == '__main__':
    # Main app on port 8000
    app.run(host='0.0.0.0', port=8000)
    # Metrics exposed at http://localhost:8001/metrics
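start_http_server runs a second HTTP server just for metrics. If you prefer a single port, prometheus_client can render the metrics payload inside the Flask app itself; a minimal alternative sketch:

from flask import Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.route('/metrics')
def metrics():
    # Render the current state of all registered metrics in Prometheus text format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

With this route, the Prometheus scrape target would be localhost:8000 rather than a separate metrics port.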
Step 2: Configure Prometheus
# prometheus.yml
global:
  scrape_interval: 15s      # Scrape metrics every 15 seconds
  evaluation_interval: 15s

scrape_configs:
  # Scrape LLM service metrics
  - job_name: 'llm-service'
    static_configs:
      - targets: ['localhost:8001']  # Your metrics endpoint (see start_http_server above)

  # Scrape vLLM built-in metrics (if using vLLM)
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']  # vLLM exposes /metrics on its API port

# Alerting rules
rule_files:
  - 'alerts.yml'

# Alertmanager configuration (optional)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
Step 3: Start Prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64/
# Start Prometheus
./prometheus --config.file=prometheus.yml
# Prometheus UI available at http://localhost:9090
Step 4: Configure Grafana
# Install Grafana (Docker)
docker run -d \
--name=grafana \
-p 3000:3000 \
grafana/grafana
# Access Grafana at http://localhost:3000
# Default credentials: admin / admin
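Before building dashboards, Grafana needs Prometheus registered as a data source. You can do this through the UI, or script it against Grafana's HTTP API; a sketch assuming the fresh container above with its default admin/admin credentials:

import requests

GRAFANA_URL = "http://localhost:3000"  # from the docker run above

# Register Prometheus as a Grafana data source via the HTTP API
datasource = {
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": True,
}
resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=datasource,
    auth=("admin", "admin"),
    timeout=10,
)
resp.raise_for_status()
print("Data source created, status:", resp.status_code)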
Step 5: Create Dashboards
// Grafana Dashboard JSON (LLM Monitoring)
{
  "dashboard": {
    "title": "LLM Production Monitoring",
    "panels": [
      {
        "title": "Requests per Second",
        "targets": [{
          "expr": "rate(llm_requests_total[1m])"
        }],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m]))"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(llm_requests_total{status=\"error\"}[5m])) / sum(rate(llm_requests_total[5m]))"
        }],
        "type": "graph"
      },
      {
        "title": "Daily Spend",
        "targets": [{
          "expr": "llm_daily_spend_usd"
        }],
        "type": "stat"
      },
      {
        "title": "Cache Hit Rate",
        "targets": [{
          "expr": "sum(rate(llm_cache_hits_total[5m])) / sum(rate(llm_requests_total[5m]))"
        }],
        "type": "gauge"
      },
      {
        "title": "Tokens per Second",
        "targets": [{
          "expr": "rate(llm_tokens_total[1m])"
        }],
        "type": "graph"
      }
    ]
  }
}
Example Dashboard Queries
| Metric | PromQL Query |
|---|---|
| Requests/sec by model | sum by (model) (rate(llm_requests_total[1m])) |
| P95 latency | histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) |
| Error rate % | sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m])) * 100 |
| Cost per hour | sum(increase(llm_cost_per_request_usd_sum[1h])) |
| Cache hit rate | sum(rate(llm_cache_hits_total[5m])) / sum(rate(llm_requests_total[5m])) |
| Avg tokens/request | sum(rate(llm_tokens_total[5m])) / sum(rate(llm_requests_total[5m])) |
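These queries can also be pulled programmatically, for example to feed a weekly cost review. A minimal sketch against the Prometheus HTTP API, assuming Prometheus runs at localhost:9090 as configured above:

import requests

PROMETHEUS_URL = "http://localhost:9090"  # adjust to where Prometheus runs

def query_prometheus(promql):
    """Run an instant query and return a list of (labels, value) results."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    if data["status"] != "success":
        raise RuntimeError(f"Query failed: {data}")
    return [(r["metric"], float(r["value"][1])) for r in data["data"]["result"]]

# Example: total cost over the last hour
for labels, value in query_prometheus("sum(increase(llm_cost_per_request_usd_sum[1h]))"):
    print(f"Cost in the last hour: ${value:.2f}")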
Alerting: Detect Issues Before Users Do
Monitoring dashboards are passive—you have to look at them. Alerts are proactive, notifying you when things go wrong.
Alert Rules Configuration
# alerts.yml
groups:
  - name: llm_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }}s (threshold: 5s)"

      # Cost spike
      - alert: CostSpike
        expr: delta(llm_daily_spend_usd[1h]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cost spike detected"
          description: "Spend increased by ${{ $value }} in the last hour"

      # Low cache hit rate
      - alert: LowCacheHitRate
        expr: sum(rate(llm_cache_hits_total[10m])) / sum(rate(llm_requests_total[10m])) < 0.3
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Cache hit rate below target"
          description: "Cache hit rate: {{ $value | humanizePercentage }} (target: 30%+)"

      # Service down
      - alert: ServiceDown
        expr: up{job="llm-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LLM service is down"
          description: "The LLM service has been unreachable for 1 minute"

      # High queue depth
      - alert: HighQueueDepth
        expr: llm_active_requests > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request queue depth"
          description: "{{ $value }} requests in flight (threshold: 50)"
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'

  # Route critical alerts to PagerDuty
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  # Slack notifications
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llm-alerts'
        title: 'LLM Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  # PagerDuty for critical issues
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
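Once routing is configured, it helps to verify the Slack/PagerDuty path end to end before a real incident. Below is a hedged sketch that posts a synthetic alert to Alertmanager's v2 API, assuming it listens on localhost:9093 as in prometheus.yml; the label and annotation values are placeholders.

import datetime
import requests

ALERTMANAGER_URL = "http://localhost:9093"  # matches the target in prometheus.yml

def send_test_alert():
    """Post a synthetic alert to Alertmanager to verify notification routing."""
    now = datetime.datetime.now(datetime.timezone.utc)
    alert = [{
        "labels": {
            "alertname": "TestAlert",
            "severity": "warning",
            "job": "llm-service",
        },
        "annotations": {
            "summary": "Test alert from the LLM service",
            "description": "Manually generated to verify the notification pipeline",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=alert, timeout=10)
    resp.raise_for_status()
    print("Test alert accepted:", resp.status_code)

send_test_alert()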
Logging and Distributed Tracing
Structured Logging
import logging
import json
from datetime import datetime

class LLMLogger:
    def __init__(self):
        self.logger = logging.getLogger('llm_service')
        self.logger.setLevel(logging.INFO)

        # Plain stream handler; each record below is already a JSON string
        handler = logging.StreamHandler()
        self.logger.addHandler(handler)

    def log_request(self, request_id, model, prompt, user_id=None):
        """Log incoming request"""
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'request_received',
            'request_id': request_id,
            'model': model,
            'prompt_length': len(prompt),
            'user_id': user_id
        }
        self.logger.info(json.dumps(log_data))

    def log_response(self, request_id, model, response, latency, cost):
        """Log response"""
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'response_generated',
            'request_id': request_id,
            'model': model,
            'response_length': len(response),
            'latency_seconds': latency,
            'cost_usd': cost
        }
        self.logger.info(json.dumps(log_data))

    def log_error(self, request_id, model, error_type, error_message):
        """Log errors"""
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'event': 'error',
            'request_id': request_id,
            'model': model,
            'error_type': error_type,
            'error_message': error_message
        }
        self.logger.error(json.dumps(log_data))

# Usage
logger = LLMLogger()

request_id = "req_abc123"
logger.log_request(request_id, "gpt-4-turbo", "What is AI?", user_id="user_456")
# ... generate response ...
logger.log_response(request_id, "gpt-4-turbo", response, latency=1.2, cost=0.003)
Distributed Tracing (OpenTelemetry)
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Configure tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to Jaeger
jaeger_exporter = JaegerExporter(
    agent_host_name='localhost',
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

def generate_with_tracing(prompt, model):
    start_time = time.time()

    # Create parent span
    with tracer.start_as_current_span("llm_generation") as span:
        span.set_attribute("model", str(model))
        span.set_attribute("prompt_length", len(prompt))

        # Cache lookup span
        with tracer.start_as_current_span("cache_lookup"):
            cached = check_cache(prompt)  # your cache lookup here
            if cached:
                span.set_attribute("cache_hit", True)
                return cached

        # Inference span
        with tracer.start_as_current_span("model_inference") as inf_span:
            inf_span.set_attribute("cache_hit", False)

            # Track prefill phase
            with tracer.start_as_current_span("prefill"):
                embeddings = model.encode(prompt)

            # Track generation phase
            with tracer.start_as_current_span("generation"):
                response = model.generate(embeddings)

            inf_span.set_attribute("output_length", len(response))

        # Cache storage span
        with tracer.start_as_current_span("cache_store"):
            store_in_cache(prompt, response)  # your cache write here

        span.set_attribute("total_latency_seconds", time.time() - start_time)
        return response

# Traces visible in Jaeger UI: http://localhost:16686
Production Readiness Checklist
✅ Metrics & Monitoring
- ☐ Prometheus metrics exposed on /metrics endpoint
- ☐ Grafana dashboard showing cost, latency, quality
- ☐ Track TTFT, TPOT, end-to-end latency
- ☐ Monitor token consumption and daily spend
- ☐ Track cache hit rates
- ☐ Monitor error rates and types
✅ Alerting
- ☐ Alert on high error rate (>5%)
- ☐ Alert on high latency (P95 > 5s)
- ☐ Alert on cost spikes
- ☐ Alert when service is down
- ☐ Slack/PagerDuty integration configured
- ☐ On-call rotation established
✅ Logging
- ☐ Structured JSON logging
- ☐ Log every request with unique request_id
- ☐ Log errors with full stack traces
- ☐ Centralized log aggregation (ELK, Datadog, etc.)
- ☐ Log retention policy defined
✅ Performance
- ☐ Semantic caching implemented
- ☐ Model routing based on complexity
- ☐ Streaming enabled for all endpoints
- ☐ Load balancing across multiple instances
- ☐ Auto-scaling configured
- ☐ Rate limiting to prevent abuse
✅ Cost Management
- ☐ Budget alerts configured
- ☐ Cost per query tracked
- ☐ Expensive queries identified and optimized
- ☐ Prompt optimization ongoing
- ☐ Weekly cost review process
✅ Quality Assurance
- ☐ User feedback mechanism implemented
- ☐ A/B testing framework ready
- ☐ Regression testing for model changes
- ☐ Output validation rules
- ☐ Hallucination detection (for critical use cases)
Summary: Building Observable LLM Systems
🔑 Key Takeaways
- Monitor the three pillars: Cost, latency, and quality—optimize the balance based on your use case
- Prometheus + Grafana is the standard: Easy to set up, scales to enterprise, rich visualization
- Alert proactively: Don't wait for bug reports; detect and fix problems before users feel the impact
- Track TTFT obsessively: Time-to-first-token is the most critical user experience metric
- Log everything: Structured logs enable debugging, auditing, and compliance
- Iterate based on data: Use metrics to make informed optimization decisions, not guesses
- Build checklists: Production readiness isn't optional—systematic monitoring prevents disasters
Congratulations! You've completed Module 5: LLMs in Production. You now have the knowledge to build, deploy, optimize, and monitor production-grade LLM systems that are fast, cost-effective, and reliable.
🎓 What You've Learned
- Why naive serving fails and how to fix it
- KV caching, PagedAttention, and continuous batching
- Production frameworks: vLLM, TGI, TensorRT-LLM
- Cost optimization: caching, routing, quantization
- Latency optimization: streaming, quantization, speculative decoding
- Monitoring: Prometheus, Grafana, alerting, logging
Next steps: Apply these techniques to your own projects. Start with vLLM and semantic caching for immediate 10-20x performance improvements. Build monitoring dashboards to track your progress. Iterate based on real data. Good luck building production LLM systems!