Beginner · 3 min read

What Metrics Would You Track for an LLM in Production?

A comprehensive framework for monitoring LLMs in production — from latency and cost to output quality and user satisfaction signals.


Daily tips, confessions & AI news. Unsubscribe anytime. Questions? [email protected]

Why This Is Asked

Deploying an LLM without proper monitoring is flying blind. Interviewers ask this to see if you think about AI systems operationally — not just "does it work" but "how do you know when it stops working?"

Key Concepts to Cover

  • Latency metrics — p50, p90, p99 response times, time-to-first-token
  • Cost metrics — token usage, cost per request, cost per user
  • Error rates — API failures, timeout rates, refusal rates
  • Quality metrics — output length, format validity, LLM-as-judge scores
  • User signals — thumbs up/down, follow-up query rate, session abandonment
  • Drift detection — monitoring for changes in input distribution

How to Approach This

1. Operational Metrics (Infrastructure Health)

Latency:

  • p50, p90, p99 end-to-end response time
  • Time-to-first-token (for streaming) — users perceive this as responsiveness
  • LLM API call latency separate from total request latency
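Percentiles like these can be computed from raw request timings with a nearest-rank calculation. A minimal sketch (the sample latencies are made up):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    ordered = sorted(samples)
    rank = round(p / 100 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

# Made-up end-to-end timings for one monitoring window.
latencies_ms = [120, 135, 150, 180, 210, 240, 300, 450, 900, 2400]
p50 = percentile(latencies_ms, 50)  # 210
p90 = percentile(latencies_ms, 90)  # 900
p99 = percentile(latencies_ms, 99)  # 2400
```

In practice you would compute this over a sliding window per endpoint, and track time-to-first-token as its own series alongside end-to-end latency.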

Availability:

  • API error rate (5xx from provider)
  • Timeout rate
  • Rate limit hit frequency

Cost:

  • Prompt tokens per request (input cost)
  • Completion tokens per request (output cost)
  • Cost per request, cost per day, cost per user
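These cost figures fall straight out of the token counts your provider returns with each response. A sketch, using placeholder per-token prices that you would substitute with your provider's actual current rates:

```python
# Placeholder prices in USD per 1K tokens -- substitute your
# provider's actual, current rates.
PRICE_PER_1K = {"prompt": 0.0005, "completion": 0.0015}

def request_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one request from its token counts."""
    return (prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
            + completion_tokens / 1000 * PRICE_PER_1K["completion"])

def cost_per_user(log):
    """Aggregate cost from (user_id, prompt_tokens, completion_tokens) rows."""
    totals = {}
    for user, p_toks, c_toks in log:
        totals[user] = totals.get(user, 0.0) + request_cost(p_toks, c_toks)
    return totals
```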

2. Output Quality Metrics

Format validity:

  • If you expect JSON, what % of responses are valid JSON?
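Format validity is one of the cheapest quality checks to run. A sketch of a JSON-validity rate over a batch of logged responses:

```python
import json

def json_validity_rate(responses):
    """Fraction of responses that parse as valid JSON."""
    if not responses:
        return 0.0
    valid = 0
    for text in responses:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(responses)
```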

Length distribution:

  • Very short responses may indicate refusal or failure
  • Very long responses may indicate runaway generation

Refusal rate:

  • How often does the model refuse to answer?
  • Sudden spikes may indicate a model update or prompt issue
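One simple way to estimate refusal rate is phrase matching on responses. A sketch — the marker phrases here are hypothetical and should be tuned against transcripts from the specific model you deploy:

```python
# Hypothetical refusal markers -- tune these against real transcripts
# from the model you actually run.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm unable to")

def is_refusal(response):
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses that look like refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Phrase matching is crude but fast enough to run on every response; an LLM-based classifier on a sample can backstop it.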

LLM-as-judge:

  • Run a sample of production outputs through an evaluator daily
  • Track quality scores over time with alerts on degradation
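The daily judge pass can be as simple as sampling outputs and averaging scores. A sketch, in which `judge` is a stand-in for your actual evaluator call (e.g. an LLM prompted to score an output 1–5):

```python
import random

def sample_for_judging(outputs, k=100, seed=None):
    """Uniform random sample of the day's outputs to send to the judge."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))

def daily_quality_score(sampled, judge):
    """Average score from `judge`, a placeholder for your evaluator call."""
    scores = [judge(output) for output in sampled]
    return sum(scores) / len(scores)
```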

3. User Satisfaction Signals

Explicit signals: Thumbs up/down ratings, "Was this helpful?" prompts.

Implicit signals:

  • Follow-up query rate (users re-asking suggests the first answer failed)
  • Session abandonment after AI response
  • Copy rate (users copy the output — signals it was useful)
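The follow-up query rate can be derived from session logs. A minimal sketch, assuming each session is logged as the list of that user's queries:

```python
def follow_up_rate(sessions):
    """Fraction of sessions where the user asked again after the first
    answer -- each session is a list of that user's queries."""
    if not sessions:
        return 0.0
    re_asked = sum(1 for queries in sessions if len(queries) > 1)
    return re_asked / len(sessions)
```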

4. Drift Detection

Model behavior can change without any code change:

  • LLM providers silently update models
  • Input distribution shifts as users adopt the feature

Monitor rolling averages of quality metrics, the response-length distribution, and the topic distribution of incoming queries.
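A rolling mean with a relative threshold is the simplest version of this. A sketch (the window size and tolerance are illustrative):

```python
from collections import deque

class RollingMean:
    """Mean over the most recent `window` observations of a metric."""
    def __init__(self, window):
        self.values = deque(maxlen=window)

    def add(self, value):
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values) if self.values else 0.0

def drifted(baseline, current, tolerance=0.10):
    """Flag drift when the rolling mean drops more than `tolerance`
    (relative) below the baseline."""
    return current < baseline * (1 - tolerance)
```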

5. Dashboard and Alerting

  • Page immediately: error rate > 5%, p99 latency > 10s, cost spike > 300%
  • Investigate same day: quality score drop > 10%, refusal rate spike
  • Weekly review: cost trends, satisfaction trends, new failure patterns
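The tiers above map directly onto an alert rule. A sketch using the example thresholds from this section — the snapshot field names are hypothetical, and `cost_ratio` is today's spend over the trailing average:

```python
def alert_level(snapshot):
    """Map a metrics snapshot onto alert tiers. Field names are
    hypothetical; `cost_ratio` is today's spend / trailing average."""
    if (snapshot["error_rate"] > 0.05
            or snapshot["p99_latency_s"] > 10
            or snapshot["cost_ratio"] > 3.0):
        return "page"
    if snapshot["quality_drop"] > 0.10 or snapshot["refusal_spike"]:
        return "investigate"
    return "ok"
```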

Common Follow-ups

  1. "How do you monitor for prompt injection attacks?" Log and scan inputs for common injection patterns, monitor for unusual output patterns, rate-limit aggressive users, and run anomaly detection on output content.
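A basic input scanner for the logging step might look like this — the patterns are illustrative, and in practice the list grows from incidents you actually observe:

```python
import re

# Illustrative patterns -- extend this list from real incidents.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
    re.compile(r"you are now [a-z]+", re.I),
]

def flag_injection(user_input):
    """True if the input matches a known injection pattern; flagged
    inputs get logged for review, not necessarily blocked."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```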

  2. "How do you attribute costs to specific features or teams?" Tag every LLM API call with metadata (feature name, team, user segment). Calculate cost by tag. Show each team their own cost dashboard.
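Tag-based attribution reduces to a grouped sum over the call log. A sketch, assuming each logged call carries its cost plus metadata tags (the tag names here are hypothetical):

```python
from collections import defaultdict

def cost_by_tag(calls, tag="feature"):
    """Sum cost over any metadata tag attached to logged LLM calls."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[tag]] += call["cost"]
    return dict(totals)

# Hypothetical call log: every API call is tagged at the call site.
calls = [
    {"feature": "search", "team": "growth", "cost": 1.00},
    {"feature": "chat", "team": "growth", "cost": 2.00},
    {"feature": "search", "team": "platform", "cost": 0.50},
]
```

The same function answers both the per-feature and per-team question, which is the payoff of tagging at the call site rather than trying to reconstruct attribution later.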

  3. "What is the most important metric to watch if you could only pick one?" User satisfaction signal (thumbs down rate or follow-up query rate) — it is the closest proxy for real-world quality.

