
How Do You Handle Model Version Upgrades Without Breaking Production?

A safe, systematic approach to upgrading LLM model versions in production — from pre-upgrade evaluation to canary deployment and rollback.

Why This Is Asked

LLM providers regularly update models — sometimes silently. A model update can change output format, tone, reasoning quality, or safety behavior in ways that break your application. Interviewers want to see if you treat model upgrades with the same rigor as software deployments.

Key Concepts to Cover

  • Eval gate — run your test suite against the new model before any production traffic
  • Shadow testing — run new model in parallel, compare outputs, do not serve to users
  • Canary deployment — route a small % of traffic to the new model, monitor metrics
  • Prompt compatibility — new models may respond differently to the same prompts
  • Rollback plan — always be able to revert to the previous model version
  • Pinned model versions — use explicit version IDs, not "latest"

How to Approach This

1. Never Use "latest" in Production

Always pin to a specific model version:

  • Bad: model: "latest" (or any floating alias) — this can change without notice
  • Good: model: "provider-model-YYYY-MM-DD" (or exact immutable version ID) — pinned and reproducible

Monitor provider announcements and deprecation notices.
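As a minimal sketch (the provider/model IDs here are placeholders, not real ones), a config-load guard can reject floating aliases before they ever reach production:

```python
# Reject floating model aliases at config-load time.
# Model IDs below are illustrative placeholders.
FLOATING_ALIASES = {"latest", "stable", "preview"}

BAD_CONFIG = {"model": "latest"}                     # can change without notice
GOOD_CONFIG = {"model": "provider-model-2024-06-01"}  # pinned and reproducible

def is_pinned(model_id: str) -> bool:
    """True only for explicit, immutable version IDs."""
    return model_id not in FLOATING_ALIASES

def load_model_config(config: dict) -> str:
    model_id = config["model"]
    if not is_pinned(model_id):
        raise ValueError(f"Refusing floating model alias: {model_id!r}")
    return model_id
```

A guard like this turns an accidental "latest" into a startup failure instead of a silent behavior change weeks later.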

2. Pre-Upgrade: Eval Gate

Before touching production, run your full eval suite against the new model version:

# Run the same eval suite against both the pinned and the candidate model.
old_results = run_eval_suite(model="current_pinned_model_version")
new_results = run_eval_suite(model="candidate_model_version")

# Block the upgrade if any critical eval case regresses.
regression = compare_results(old_results, new_results)
if regression.any_critical_failures:
    raise RuntimeError("New model failed critical eval cases — block upgrade")

3. Shadow Testing

Run the new model in parallel with production, but do not serve its responses to users:

  • Route 100% of traffic through the current model (users see this)
  • Also route 100% through the new model (log results, do not serve)
  • Compare outputs side by side
  • Run for 24-48 hours to get a representative sample
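The request path above can be sketched as follows. This is illustrative only: `call_model` is a stub standing in for a real provider client, and in production the shadow call would run asynchronously so it adds no user-facing latency.

```python
# Shadow-testing sketch: serve the current model, log the candidate.
import difflib

CURRENT_MODEL = "provider-model-2024-01-15"    # placeholder pinned version
CANDIDATE_MODEL = "provider-model-2024-06-01"  # placeholder candidate

comparison_log = []

def call_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would call the provider's API.
    return f"[{model}] answer to: {prompt}"

def handle_request(prompt: str) -> str:
    served = call_model(CURRENT_MODEL, prompt)  # users see only this
    try:
        shadow = call_model(CANDIDATE_MODEL, prompt)  # logged, never served
        similarity = difflib.SequenceMatcher(None, served, shadow).ratio()
        comparison_log.append({"prompt": prompt, "similarity": similarity})
    except Exception:
        pass  # a shadow failure must never affect the user's response
    return served
```

The key invariant: any error in the candidate path is swallowed and logged, so the shadow run can never degrade the production response.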

4. Canary Deployment

If shadow testing looks good, route a small percentage to the new model:

  • Start at 1-5% for 24 hours
  • Monitor quality metrics, error rates, user satisfaction
  • Gradually increase: 5% → 10% → 25% → 50% → 100%
  • Roll back immediately if any metric degrades
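A common way to implement the percentage split (one option among several — sticky sessions or per-request sampling also work) is deterministic hash-based bucketing, so each user consistently sees the same model throughout the rollout:

```python
# Deterministic canary routing: hash the user ID into one of 100 buckets.
import hashlib

CANARY_PERCENT = 5  # start small, then increase: 5 -> 10 -> 25 -> 50 -> 100

def route_model(user_id: str, current: str, candidate: str,
                canary_percent: int = CANARY_PERCENT) -> str:
    """Route a stable canary_percent% of users to the candidate model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return candidate if bucket < canary_percent else current
```

Because the bucket is derived from the user ID rather than random sampling, raising the percentage only moves new users onto the candidate; no one flip-flops between models mid-rollout.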

5. Rollback Plan

Always have an instant rollback path:

  • Feature flag to switch model version without a code deploy
  • Keep the previous model pinned in your config for 30+ days after upgrade
  • Document which model version was used when, for debugging historical issues
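A feature-flag rollback can be as simple as the sketch below (the flag store here is an in-memory dict for illustration; in practice it would live in a runtime config service so the swap needs no code deploy):

```python
# Feature-flag rollback sketch: flip active/fallback at runtime.
# Model IDs are illustrative placeholders.
MODEL_FLAGS = {
    "active_model": "provider-model-2024-06-01",    # newly upgraded version
    "fallback_model": "provider-model-2024-01-15",  # keep pinned 30+ days
}

def get_active_model(flags: dict) -> str:
    return flags["active_model"]

def rollback(flags: dict) -> None:
    """Instant revert: swap the active and fallback model versions."""
    flags["active_model"], flags["fallback_model"] = (
        flags["fallback_model"], flags["active_model"])

rollback(MODEL_FLAGS)  # one call reverts production to the previous version
```

Keeping the previous version in the flag store (rather than only in git history) is what makes the rollback instant: it is a config flip, not a redeploy.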

Common Follow-ups

  1. "How do you handle provider-forced upgrades when an old model is deprecated?" Start the upgrade process 2-3 months before deprecation. Use the deadline as a forcing function for the eval gate and canary process.

  2. "What if the new model is better on most metrics but worse on one critical dimension?" The critical dimension wins. Fix the regression (update the prompt, add guardrails) or do not upgrade.

  3. "How do you manage model upgrades across multiple features using the same LLM?" Each feature should have its own eval suite. Run all evals in parallel before any upgrade. Different features may be ready to upgrade on different timelines.
