
Explain the Tradeoffs Between Latency, Cost, and Quality in LLM Selection

Navigate the three-way tradeoff between LLM latency, cost, and quality — and learn how to make the right selection for different use cases.


Why This Is Asked

Choosing the right LLM for a given task is a fundamental engineering decision. Interviewers ask this to see if you understand the practical constraints of production AI systems — not just which model is "best" in benchmarks.

Key Concepts to Cover

  • The tradeoff triangle — you cannot maximize all three simultaneously
  • Model tiers — small/fast/cheap vs. large/slow/expensive
  • Task routing — using different models for different task complexity levels
  • Caching — eliminating redundant LLM calls
  • Streaming — reducing perceived latency with incremental output
  • Measurement — you must measure on your actual workload, not benchmarks

How to Approach This

1. The Core Tradeoff

Larger models generally produce higher quality output but are slower and more expensive:

| Model Tier | Examples | Latency (typical) | Best For |
|------------|----------|-------------------|----------|
| Small | Mini/nano instruction models | ~sub-second to 1s | Classification, extraction, routing |
| Medium | General-purpose chat models | ~1-3s | Most production tasks |
| Large | Frontier reasoning models | ~several seconds to tens of seconds | Complex reasoning, high-stakes decisions |
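To make the cost axis concrete, here is a back-of-envelope sketch comparing per-request cost across tiers. The per-million-token prices are made up for illustration — real pricing varies by provider and changes frequently:

```python
# Illustrative per-million-token prices (input, output) in USD.
# These numbers are hypothetical -- check your provider's current pricing.
PRICE_PER_M_TOKENS = {
    "small": (0.15, 0.60),
    "medium": (2.50, 10.00),
    "large": (15.00, 60.00),
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request for a given model tier."""
    in_price, out_price = PRICE_PER_M_TOKENS[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 1,000-token prompt with a 300-token answer, per tier:
for tier in PRICE_PER_M_TOKENS:
    print(f"{tier}: ${request_cost(tier, 1_000, 300):.5f}")
```

At millions of requests per day, the two-orders-of-magnitude gap between tiers is what makes routing and caching worthwhile.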

2. Match Model to Task Complexity

Use a small model for:

  • Intent classification
  • Simple extraction
  • Routing decisions
  • Summarization of short text

Use a medium model for:

  • Most user-facing conversational tasks
  • Moderate-complexity generation

Use a large model for:

  • Multi-step reasoning
  • High-stakes outputs
  • Tasks where quality matters far more than cost or speed
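The routing idea above can be sketched in a few lines. The classifier here is a trivial offline heuristic standing in for a small-model call, and the tier names are placeholders rather than a real provider API:

```python
def classify_complexity(prompt: str) -> str:
    """Stand-in for a small, cheap classifier model.
    A trivial heuristic so this sketch runs offline."""
    if len(prompt.split()) > 50 or "step by step" in prompt.lower():
        return "complex"
    return "simple"

def route(prompt: str) -> str:
    """Pick a model tier based on estimated task complexity:
    escalate only the requests the small model can't handle well."""
    return "large" if classify_complexity(prompt) == "complex" else "small"

print(route("What is 2+2?"))                        # -> "small"
print(route("Explain this proof step by step..."))  # -> "large"
```

In production, the classifier is typically itself a small model (or a fine-tuned one), so routing adds a little latency to every request in exchange for a large cost reduction on the simple majority.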

3. Optimize for Latency

  • Streaming: Start rendering tokens to the user as they arrive
  • Async processing: For non-real-time tasks, run LLM inference asynchronously
  • Parallel calls: If a task requires multiple independent LLM calls, run them in parallel
  • Prompt optimization: Shorter prompts mean less prefill work, so a lower time-to-first-token
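The parallel-calls point is easy to get wrong by calling sequentially out of habit. A minimal sketch with `asyncio`, where `call_llm` is a stand-in for a real async client call:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """Placeholder for a real async LLM client call."""
    await asyncio.sleep(0.1)  # simulate network + inference latency
    return f"answer to: {prompt}"

async def answer_all(prompts: list[str]) -> list[str]:
    # gather() runs the independent calls concurrently, so total wall
    # time is roughly the slowest single call, not the sum of all calls.
    return await asyncio.gather(*(call_llm(p) for p in prompts))

results = asyncio.run(answer_all(["summarize A", "summarize B", "classify C"]))
```

Three sequential 0.1s calls would take ~0.3s; concurrently they take ~0.1s, and the gap grows with real-world latencies.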

4. Optimize for Cost

  • Caching: Cache identical or near-identical prompts
  • Prompt compression: Use shorter prompts where possible
  • Model downgrade for simple tasks: If the small model handles 80% of traffic adequately, only escalate the complex 20%
  • Batching: For async workloads, batch requests for provider batch pricing discounts
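The simplest caching win is exact-match caching: hash the prompt and skip the LLM call on a hit. A minimal sketch (the `call_llm` function is again a placeholder for the real client call; near-identical matching would need embedding similarity on top of this):

```python
import hashlib

_cache: dict[str, str] = {}
calls_made = 0  # track how many real LLM calls we avoid

def call_llm(prompt: str) -> str:
    """Placeholder for a real (and billed) LLM call."""
    global calls_made
    calls_made += 1
    return f"response for: {prompt}"

def cached_call(prompt: str) -> str:
    """Exact-match cache: identical prompts cost nothing after the first."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

cached_call("What is your refund policy?")
cached_call("What is your refund policy?")  # cache hit, no second call
```

For FAQ-style traffic, hit rates can be substantial, and a cache hit is both free and near-instant — the one lever that improves latency and cost at once.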

5. Measure on Your Actual Workload

Benchmark results do not predict your production performance. Measure:

  • Your actual latency distribution (p90 and p99 matter more than average)
  • Your actual cost per feature
  • Your actual quality on your eval suite
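To see why p90/p99 matter more than the average, consider a dependency-free nearest-rank percentile over a sample of request timings (the latency numbers below are invented for illustration):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile -- simple and dependency-free."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical request latencies in seconds: mostly ~1s, with a heavy tail.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.1, 4.5, 9.0]

mean = sum(latencies) / len(latencies)   # 2.15 -- looks "okay"
p90 = percentile(latencies, 90)          # 4.5  -- 1 in 10 users waits 4.5s+
p99 = percentile(latencies, 99)          # 9.0  -- the worst-case experience
```

The mean (2.15s) hides that the slowest users wait 4–9 seconds; tail percentiles are what those users actually experience.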

Common Follow-ups

  1. "How do you handle a use case where all three constraints are tight?" Fine-tune a small model on your specific task, use heavy caching, invest in prompt optimization. There is no free lunch — you may need to redefine the problem.

  2. "How do you justify LLM costs to stakeholders?" Frame cost in terms of value delivered: cost per resolved support ticket, cost per successful code review. Compare to the alternative (human labor cost).

  3. "What is speculative decoding and how does it affect the tradeoffs?" A small draft model proposes multiple tokens ahead, then the large model runs a single parallel forward pass over all draft tokens simultaneously to accept or reject them. Because the large model evaluates all draft tokens in one pass (instead of generating one token at a time autoregressively), throughput increases significantly when the draft model has high acceptance rates — typically 2-3x speedup for predictable text, with 4x near the upper bound of published results. Quality is essentially unchanged since rejected tokens are resampled from the large model. Many providers use this internally, so you may already be benefiting without configuring it.
