Beginner · 3 min read

What Metrics Would You Track for an LLM in Production?

A comprehensive framework for monitoring LLMs in production — from latency and cost to output quality and user satisfaction signals.


Daily tips, confessions & AI news. Unsubscribe anytime. Questions? [email protected]

Why This Is Asked

Deploying an LLM without proper monitoring is flying blind. Interviewers ask this to see if you think about AI systems operationally — not just "does it work" but "how do you know when it stops working?"

Key Concepts to Cover

  • Latency metrics — p50, p90, p99 response times, time-to-first-token
  • Cost metrics — token usage, cost per request, cost per user
  • Error rates — API failures, timeout rates, refusal rates
  • Quality metrics — output length, format validity, LLM-as-judge scores
  • User signals — thumbs up/down, follow-up query rate, session abandonment
  • Drift detection — monitoring for changes in input distribution

How to Approach This

1. Operational Metrics (Infrastructure Health)

Latency:

  • p50, p90, p99 end-to-end response time
  • Time-to-first-token (for streaming) — users perceive this as responsiveness
  • LLM API call latency separate from total request latency
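Percentiles like these can be computed from raw request timings with a nearest-rank calculation. A minimal sketch (the sample latencies are made up):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    ordered = sorted(samples)
    rank = round(p / 100 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

# Made-up end-to-end timings for one monitoring window.
latencies_ms = [120, 135, 150, 180, 210, 240, 300, 450, 900, 2400]
p50 = percentile(latencies_ms, 50)  # 210
p90 = percentile(latencies_ms, 90)  # 900
p99 = percentile(latencies_ms, 99)  # 2400
```

In practice you would compute this over a sliding window per endpoint, and track time-to-first-token as its own series alongside end-to-end latency.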

Availability:

  • API error rate (5xx from provider)
  • Timeout rate
  • Rate limit hit frequency

Cost:

  • Prompt tokens per request (input cost)
  • Completion tokens per request (output cost)
  • Cost per request, cost per day, cost per user
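These cost figures fall straight out of the token counts your provider returns with each response. A sketch, using placeholder per-token prices that you would substitute with your provider's actual current rates:

```python
# Placeholder prices in USD per 1K tokens -- substitute your
# provider's actual, current rates.
PRICE_PER_1K = {"prompt": 0.0005, "completion": 0.0015}

def request_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one request from its token counts."""
    return (prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
            + completion_tokens / 1000 * PRICE_PER_1K["completion"])

def cost_per_user(log):
    """Aggregate cost from (user_id, prompt_tokens, completion_tokens) rows."""
    totals = {}
    for user, p_toks, c_toks in log:
        totals[user] = totals.get(user, 0.0) + request_cost(p_toks, c_toks)
    return totals
```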

2. Output Quality Metrics

Format validity:

  • If you expect JSON, what % of responses are valid JSON?
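Format validity is one of the cheapest quality checks to run. A sketch of a JSON-validity rate over a batch of logged responses:

```python
import json

def json_validity_rate(responses):
    """Fraction of responses that parse as valid JSON."""
    if not responses:
        return 0.0
    valid = 0
    for text in responses:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(responses)
```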

Length distribution:

  • Very short responses may indicate refusal or failure
  • Very long responses may indicate runaway generation

Refusal rate:

  • How often does the model refuse to answer?
  • Sudden spikes may indicate a model update or prompt issue
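One simple way to estimate refusal rate is phrase matching on responses. A sketch — the marker phrases here are hypothetical and should be tuned against transcripts from the specific model you deploy:

```python
# Hypothetical refusal markers -- tune these against real transcripts
# from the model you actually run.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm unable to")

def is_refusal(response):
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses that look like refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Phrase matching is crude but fast enough to run on every response; an LLM-based classifier on a sample can backstop it.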

LLM-as-judge:

  • Run a sample of production outputs through an evaluator daily
  • Track quality scores over time with alerts on degradation
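The daily judge pass can be as simple as sampling outputs and averaging scores. A sketch, in which `judge` is a stand-in for your actual evaluator call (e.g. an LLM prompted to score an output 1–5):

```python
import random

def sample_for_judging(outputs, k=100, seed=None):
    """Uniform random sample of the day's outputs to send to the judge."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))

def daily_quality_score(sampled, judge):
    """Average score from `judge`, a placeholder for your evaluator call."""
    scores = [judge(output) for output in sampled]
    return sum(scores) / len(scores)
```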

3. User Satisfaction Signals

Explicit signals: Thumbs up/down ratings, "Was this helpful?" prompts.

Implicit signals:

  • Follow-up query rate (users re-asking suggests the first answer failed)
  • Session abandonment after AI response
  • Copy rate (users copy the output — signals it was useful)
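The follow-up query rate can be derived from session logs. A minimal sketch, assuming each session is logged as the list of that user's queries:

```python
def follow_up_rate(sessions):
    """Fraction of sessions where the user asked again after the first
    answer -- each session is a list of that user's queries."""
    if not sessions:
        return 0.0
    re_asked = sum(1 for queries in sessions if len(queries) > 1)
    return re_asked / len(sessions)
```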

4. Drift Detection

Model behavior can change without any code change:

  • LLM providers silently update models
  • Input distribution shifts as users adopt the feature

Monitor rolling averages of quality metrics, the response-length distribution, and the topic distribution of incoming queries.
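A rolling mean with a relative threshold is the simplest version of this. A sketch (the window size and tolerance are illustrative):

```python
from collections import deque

class RollingMean:
    """Mean over the most recent `window` observations of a metric."""
    def __init__(self, window):
        self.values = deque(maxlen=window)

    def add(self, value):
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values) if self.values else 0.0

def drifted(baseline, current, tolerance=0.10):
    """Flag drift when the rolling mean drops more than `tolerance`
    (relative) below the baseline."""
    return current < baseline * (1 - tolerance)
```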

5. Dashboard and Alerting

  • Page immediately: error rate > 5%, p99 latency > 10s, cost spike > 300%
  • Investigate same day: quality score drop > 10%, refusal rate spike
  • Weekly review: cost trends, satisfaction trends, new failure patterns
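The tiers above map directly onto an alert rule. A sketch using the example thresholds from this section — the snapshot field names are hypothetical, and `cost_ratio` is today's spend over the trailing average:

```python
def alert_level(snapshot):
    """Map a metrics snapshot onto alert tiers. Field names are
    hypothetical; `cost_ratio` is today's spend / trailing average."""
    if (snapshot["error_rate"] > 0.05
            or snapshot["p99_latency_s"] > 10
            or snapshot["cost_ratio"] > 3.0):
        return "page"
    if snapshot["quality_drop"] > 0.10 or snapshot["refusal_spike"]:
        return "investigate"
    return "ok"
```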

Common Follow-ups

  1. "How do you monitor for prompt injection attacks?" Log and scan inputs for common injection patterns, monitor for unusual output patterns, rate-limit aggressive users, and run anomaly detection on output content.
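A basic input scanner for the logging step might look like this — the patterns are illustrative, and in practice the list grows from incidents you actually observe:

```python
import re

# Illustrative patterns -- extend this list from real incidents.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
    re.compile(r"you are now [a-z]+", re.I),
]

def flag_injection(user_input):
    """True if the input matches a known injection pattern; flagged
    inputs get logged for review, not necessarily blocked."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```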

  2. "How do you attribute costs to specific features or teams?" Tag every LLM API call with metadata (feature name, team, user segment). Calculate cost by tag. Show each team their own cost dashboard.
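Tag-based attribution reduces to a grouped sum over the call log. A sketch, assuming each logged call carries its cost plus metadata tags (the tag names here are hypothetical):

```python
from collections import defaultdict

def cost_by_tag(calls, tag="feature"):
    """Sum cost over any metadata tag attached to logged LLM calls."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[tag]] += call["cost"]
    return dict(totals)

# Hypothetical call log: every API call is tagged at the call site.
calls = [
    {"feature": "search", "team": "growth", "cost": 1.00},
    {"feature": "chat", "team": "growth", "cost": 2.00},
    {"feature": "search", "team": "platform", "cost": 0.50},
]
```

The same function answers both the per-feature and per-team question, which is the payoff of tagging at the call site rather than trying to reconstruct attribution later.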

  3. "What is the most important metric to watch if you could only pick one?" User satisfaction signal (thumbs down rate or follow-up query rate) — it is the closest proxy for real-world quality.

