Why This Is Asked
Choosing the right LLM for a given task is a fundamental engineering decision. Interviewers ask this to see if you understand the practical constraints of production AI systems — not just which model is "best" on benchmarks.
Key Concepts to Cover
- The tradeoff triangle — you cannot maximize all three simultaneously
- Model tiers — small/fast/cheap vs. large/slow/expensive
- Task routing — using different models for different task complexity levels
- Caching — eliminating redundant LLM calls
- Streaming — reducing perceived latency with incremental output
- Measurement — you must measure on your actual workload, not benchmarks
How to Approach This
1. The Core Tradeoff
Larger models generally produce higher quality output but are slower and more expensive:
| Model Tier | Examples | Latency (typical) | Best For |
|------------|----------|-------------------|----------|
| Small | Mini/nano instruction models | ~sub-second to 1s | Classification, extraction, routing |
| Medium | General-purpose chat models | ~1-3s | Most production tasks |
| Large | Frontier reasoning models | ~several seconds to tens of seconds | Complex reasoning, high-stakes decisions |
2. Match Model to Task Complexity
Use a small model for:
- Intent classification
- Simple extraction
- Routing decisions
- Summarization of short text
Use a medium model for:
- Most user-facing conversational tasks
- Moderate-complexity generation
Use a large model for:
- Multi-step reasoning
- High-stakes outputs
- Tasks where quality matters far more than cost or speed
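The tiered routing above can be sketched as a simple dispatch function. A minimal sketch — the model names, intent labels, and the `pick_model` helper are all hypothetical placeholders, not any specific provider's API:

```python
# Hypothetical tier -> model-name mapping; substitute your provider's models.
MODEL_TIERS = {
    "small": "small-instruct",
    "medium": "general-chat",
    "large": "frontier-reasoner",
}

# Intents simple enough for the small tier.
SIMPLE_INTENTS = {"classify", "extract", "route", "summarize_short"}

def pick_model(intent: str, high_stakes: bool = False) -> str:
    """Route each request to the cheapest tier that can handle it."""
    if high_stakes:
        return MODEL_TIERS["large"]       # quality dominates cost and speed
    if intent in SIMPLE_INTENTS:
        return MODEL_TIERS["small"]       # fast and cheap is good enough
    return MODEL_TIERS["medium"]          # sensible default for everything else
```

The key design choice is defaulting to the medium tier and escalating only on explicit signals, so the expensive model is the exception rather than the rule.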
3. Optimize for Latency
- Streaming: Start rendering tokens to the user as they arrive
- Async processing: For non-real-time tasks, run LLM inference asynchronously
- Parallel calls: If a task requires multiple independent LLM calls, run them in parallel
- Prompt optimization: Shorter prompts = lower time-to-first-token
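The parallel-calls point can be sketched with `asyncio.gather`. A minimal sketch — `call_llm` is a stand-in that simulates latency rather than a real client call:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Placeholder for a real async client call; simulate 100 ms of latency.
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def answer_in_parallel(prompts: list[str]) -> list[str]:
    # Independent calls run concurrently, so wall-clock time is roughly
    # the slowest single call, not the sum of all calls.
    return await asyncio.gather(*(call_llm(p) for p in prompts))

results = asyncio.run(
    answer_in_parallel(["summarize A", "summarize B", "summarize C"])
)
```

Three sequential calls here would take ~300 ms; gathered, they take ~100 ms.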
4. Optimize for Cost
- Caching: Cache identical or near-identical prompts
- Prompt compression: Use shorter prompts where possible
- Model downgrade for simple tasks: If the small model handles 80% of traffic adequately, only escalate the complex 20%
- Batching: For async workloads, batch requests for provider batch pricing discounts
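Exact-match caching from the list above can be sketched in a few lines. Assumptions: `call_fn` is a hypothetical function that actually hits the LLM API, and a plain in-process dict stands in for whatever cache store you use:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, model: str, call_fn) -> str:
    """Return a cached response for identical (model, prompt) pairs.

    call_fn(prompt, model) is a hypothetical stand-in for the real API call;
    it is only invoked on a cache miss.
    """
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(prompt, model)
    return _cache[key]
```

Note the cache key includes the model name: the same prompt sent to a different model must not hit the same entry. Near-identical prompts need embedding-based (semantic) lookup instead of a hash, which this sketch does not cover.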
5. Measure on Your Actual Workload
Benchmark results do not predict your production performance. Measure:
- Your actual latency distribution (p90 and p99 matter more than average)
- Your actual cost per feature
- Your actual quality on your eval suite
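Computing tail latencies from your own traces can be sketched as below — a nearest-rank percentile over a list of samples, kept dependency-free for illustration (in practice a metrics library does this for you):

```python
def latency_percentiles(samples_ms: list[float],
                        qs: tuple[float, ...] = (0.5, 0.9, 0.99)) -> dict[str, float]:
    """Nearest-rank percentiles over raw latency samples (milliseconds)."""
    s = sorted(samples_ms)
    def pct(q: float) -> float:
        # Clamp the rank so q=1.0 maps to the last sample.
        idx = min(len(s) - 1, max(0, round(q * (len(s) - 1))))
        return s[idx]
    return {f"p{int(q * 100)}": pct(q) for q in qs}
```

Reporting p90/p99 alongside the median makes the long tail visible: a mean of 800 ms can hide a p99 of 8 s, and it is the tail your users complain about.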
Common Follow-ups
- "How do you handle a use case where all three constraints are tight?" Fine-tune a small model on your specific task, use heavy caching, invest in prompt optimization. There is no free lunch — you may need to redefine the problem.
- "How do you justify LLM costs to stakeholders?" Frame cost in terms of value delivered: cost per resolved support ticket, cost per successful code review. Compare to the alternative (human labor cost).
- "What is speculative decoding and how does it affect the tradeoffs?" A small draft model proposes several tokens ahead; the large model then verifies all of them in a single parallel forward pass, accepting or rejecting each. Because verification happens in one pass instead of one autoregressive step per token, throughput improves when the draft model's acceptance rate is high — typically 2-3x for predictable text, with 4x near the upper bound of published results. Quality is essentially unchanged, since rejected tokens are resampled from the large model. Many providers use this internally, so you may already be benefiting without configuring it.
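The relationship between acceptance rate and speedup in that answer can be made concrete with a back-of-the-envelope formula. A rough sketch under simplifying assumptions stated in the docstring (independent per-token acceptance, negligible draft-model cost), not a precise model of any implementation:

```python
def expected_speedup(draft_len: int, accept_rate: float) -> float:
    """Rough expected speedup from speculative decoding.

    Assumes each of the draft_len proposed tokens is accepted independently
    with probability accept_rate, and drafting cost is negligible next to
    the large model's forward pass. Each large-model pass emits the accepted
    prefix of draft tokens plus one token of its own, versus exactly one
    token per pass without speculation.
    """
    # Expected length of the accepted prefix: sum of P(first k tokens accepted).
    expected_accepted = sum(accept_rate ** k for k in range(1, draft_len + 1))
    return expected_accepted + 1
```

With a draft length of 4 and an 80% acceptance rate this gives roughly 3.4x, consistent with the 2-3x typical / ~4x upper-bound figures above; at 0% acceptance it degenerates to 1x, i.e. no benefit.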