Intermediate · 3 min read

How Would You Evaluate Retrieval Quality in a RAG System?

Walk through metrics and methods for evaluating retrieval quality in a RAG pipeline — from offline metrics to end-to-end answer quality.


Why This Is Asked

Knowing how to build a RAG pipeline is one thing; knowing how to measure whether it's actually working is what separates senior engineers from everyone else. Interviewers want to see if you think empirically and can identify problems at the retrieval layer before they cause generation failures.

Key Concepts to Cover

  • Precision@k — of the top-k retrieved chunks, how many are relevant?
  • Recall@k — of all relevant chunks, how many are in the top-k?
  • MRR (Mean Reciprocal Rank) — how highly ranked is the first relevant result?
  • NDCG — normalized discounted cumulative gain
  • Context relevance — is what was retrieved actually useful?
  • Faithfulness — does the generated answer stick to the retrieved context?
  • Answer relevance — does the answer address the question?

How to Approach This

1. Separate Retrieval from Generation Evaluation

RAG has two components that can fail independently:

  • Retrieval: Did we find the right chunks?
  • Generation: Did the LLM use those chunks well?

Evaluate them separately so you know where to fix problems.

2. Offline Retrieval Metrics

You need a labeled dataset: pairs of (query, set of relevant document IDs).

Precision@k: Of the k chunks retrieved, what fraction are relevant?

Recall@k: Of all relevant chunks, what fraction are in the top k?

MRR: Averages 1/rank of the first relevant result across queries. Good for tasks where finding one relevant result is enough.

NDCG: Discounts each relevant result by its rank and normalizes against the ideal ordering, so relevant results retrieved early count for more. Useful when relevance is graded rather than binary.
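The four offline metrics above reduce to a few lines each. A minimal sketch over one query, assuming `retrieved` is the retriever's ranked list of chunk IDs and `relevant` is the labeled ground-truth set (both invented here for illustration); average over all labeled queries in practice:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result; 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: hits are discounted by log2(rank + 1),
    then normalized against the best possible ordering."""
    dcg = sum(1 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d7", "d1", "d9", "d2"]  # ranked retriever output
relevant = {"d1", "d2", "d5"}               # ground-truth labels

print(precision_at_k(retrieved, relevant, 5))  # 0.4 (2 of 5 are relevant)
print(recall_at_k(retrieved, relevant, 5))     # 0.666… (2 of 3 found)
print(reciprocal_rank(retrieved, relevant))    # 0.333… (first hit at rank 3)
```

Note how the same ranked list scores differently on each metric: MRR only cares about the first hit, while NDCG rewards every hit but discounts late ones.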

3. Context Relevance (LLM-as-Judge)

For each retrieved chunk, ask an LLM: "Given this query, is this passage relevant?" and have it score the pair from 1 to 3. Because this needs no ground-truth labels, it scales to queries you have never annotated.
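The deterministic parts of an LLM-as-judge setup, prompt construction and score parsing, can be sketched like this. The actual LLM call is left out: send the built prompt through whatever client you use and pass its text reply to the parser (the prompt wording and the `parse_score` fallback below are illustrative choices, not a standard):

```python
import re

JUDGE_PROMPT = """You are grading retrieval quality.
Query: {query}
Passage: {passage}
On a scale of 1-3, how relevant is the passage to the query?
1 = not relevant, 2 = partially relevant, 3 = directly relevant.
Reply with a single digit."""

def build_judge_prompt(query: str, passage: str) -> str:
    """Fill the grading template for one (query, passage) pair."""
    return JUDGE_PROMPT.format(query=query, passage=passage)

def parse_score(response: str) -> int:
    """Pull the first 1-3 digit out of the judge's reply.
    Falls back to 1 (worst) if the reply is unparseable, so a
    misbehaving judge can never inflate the score."""
    match = re.search(r"[123]", response)
    return int(match.group()) if match else 1

print(parse_score("Score: 3 - the passage answers the query directly"))  # 3
print(parse_score("no idea"))                                            # 1
```

Constraining the judge to a tiny discrete scale and parsing defensively keeps noisy free-text replies from silently corrupting your metrics.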

4. End-to-End Metrics (RAG Triad)

  • Context relevance: Are retrieved chunks relevant to the query?
  • Faithfulness: Does the answer only use information from the context?
  • Answer relevance: Does the answer address the question?

Frameworks like RAGAS automate this evaluation.
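The triad amounts to three judge prompts over the same (query, contexts, answer) tuple. A minimal orchestration sketch, where `judge` is a hypothetical stand-in for an LLM-as-judge call returning a score in [0, 1] (frameworks like RAGAS package this pattern with tuned prompts):

```python
from statistics import mean

def rag_triad(query, contexts, answer, judge):
    """Score the RAG triad. `judge(prompt) -> float in [0, 1]` is a
    hypothetical callable wrapping your LLM client of choice."""
    # Context relevance: judge each retrieved chunk against the query.
    context_relevance = mean(
        judge(f"Query: {query}\nPassage: {c}\n"
              "Is the passage relevant to the query? Score 0-1.")
        for c in contexts
    )
    # Faithfulness: judge the answer against the retrieved context only.
    faithfulness = judge(
        f"Context: {' '.join(contexts)}\nAnswer: {answer}\n"
        "Is every claim in the answer supported by the context? Score 0-1."
    )
    # Answer relevance: judge the answer against the original question.
    answer_relevance = judge(
        f"Question: {query}\nAnswer: {answer}\n"
        "Does the answer address the question? Score 0-1."
    )
    return {
        "context_relevance": context_relevance,
        "faithfulness": faithfulness,
        "answer_relevance": answer_relevance,
    }

# Illustrative run with a trivially optimistic fake judge:
scores = rag_triad(
    query="What is our refund window?",
    contexts=["Refunds are accepted within 30 days."],
    answer="The refund window is 30 days.",
    judge=lambda prompt: 1.0,
)
print(scores)
```

Keeping the three scores separate (rather than averaging them) is the point: low context relevance implicates the retriever, low faithfulness with high context relevance implicates the generator.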

5. Building Your Evaluation Dataset Without Labels

  • Take documents from your corpus
  • Ask an LLM to generate questions that would be answered by each document
  • Use those (question, document) pairs as ground truth
  • This "reverse generation" approach bootstraps labels quickly

Common Follow-ups

  1. "What is a reasonable target for P@5?" Targets depend heavily on corpus quality and query distribution. For a well-tuned system over a clean, well-labeled corpus with clear queries, P@5 > 0.7 is achievable. For large, noisy production corpora (mixed quality docs, ambiguous queries), P@5 of 0.5–0.6 is often considered good. Establish your baseline first, then track improvement over time rather than targeting a fixed number in the abstract.

  2. "How do you detect retrieval failures in production without ground truth labels?" Monitor proxy signals: short answers, "I don't know" response rate, thumbs-down rate, follow-up query rate.

  3. "How would you compare two retrieval approaches?" A/B test on your evaluation dataset: run both on the same queries, measure metrics, then compare end-to-end generation quality.
