Intermediate · 3 min read

How Would You Evaluate Retrieval Quality in a RAG System?

Walk through metrics and methods for evaluating retrieval quality in a RAG pipeline — from offline metrics to end-to-end answer quality.


Why This Is Asked

Knowing how to build a RAG pipeline is one thing; knowing how to measure whether it's actually working is what separates senior engineers from everyone else. Interviewers want to see if you think empirically and can identify problems at the retrieval layer before they cause generation failures.

Key Concepts to Cover

  • Precision@k — of the top-k retrieved chunks, how many are relevant?
  • Recall@k — of all relevant chunks, how many are in the top-k?
  • MRR (Mean Reciprocal Rank) — how highly ranked is the first relevant result?
  • NDCG — normalized discounted cumulative gain
  • Context relevance — is what was retrieved actually useful?
  • Faithfulness — does the generated answer stick to the retrieved context?
  • Answer relevance — does the answer address the question?

How to Approach This

1. Separate Retrieval from Generation Evaluation

RAG has two components that can fail independently:

  • Retrieval: Did we find the right chunks?
  • Generation: Did the LLM use those chunks well?

Evaluate them separately so you know where to fix problems.

2. Offline Retrieval Metrics

You need a labeled dataset: pairs of (query, set of relevant document IDs).

Precision@k: Of the k chunks retrieved, what fraction are relevant?

Recall@k: Of all relevant chunks, what fraction are in the top k?

MRR: Averages 1/rank of the first relevant result across queries. Good for tasks where finding one relevant result is enough.

NDCG: Discounts each relevant result by its rank and normalizes against the ideal ordering, so relevant results retrieved early count for more. Useful when relevance is graded rather than binary.
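The four offline metrics above reduce to a few lines each. A minimal sketch over one query, assuming `retrieved` is the retriever's ranked list of chunk IDs and `relevant` is the labeled ground-truth set (both invented here for illustration); average over all labeled queries in practice:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result; 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: hits are discounted by log2(rank + 1),
    then normalized against the best possible ordering."""
    dcg = sum(1 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d7", "d1", "d9", "d2"]  # ranked retriever output
relevant = {"d1", "d2", "d5"}               # ground-truth labels

print(precision_at_k(retrieved, relevant, 5))  # 0.4 (2 of 5 are relevant)
print(recall_at_k(retrieved, relevant, 5))     # 0.666… (2 of 3 found)
print(reciprocal_rank(retrieved, relevant))    # 0.333… (first hit at rank 3)
```

Note how the same ranked list scores differently on each metric: MRR only cares about the first hit, while NDCG rewards every hit but discounts late ones.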

3. Context Relevance (LLM-as-Judge)

For each retrieved chunk, ask an LLM: "Given this query, is this passage relevant?" and have it score the pair from 1 to 3. Because this needs no ground-truth labels, it scales to queries you have never annotated.
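The deterministic parts of an LLM-as-judge setup, prompt construction and score parsing, can be sketched like this. The actual LLM call is left out: send the built prompt through whatever client you use and pass its text reply to the parser (the prompt wording and the `parse_score` fallback below are illustrative choices, not a standard):

```python
import re

JUDGE_PROMPT = """You are grading retrieval quality.
Query: {query}
Passage: {passage}
On a scale of 1-3, how relevant is the passage to the query?
1 = not relevant, 2 = partially relevant, 3 = directly relevant.
Reply with a single digit."""

def build_judge_prompt(query: str, passage: str) -> str:
    """Fill the grading template for one (query, passage) pair."""
    return JUDGE_PROMPT.format(query=query, passage=passage)

def parse_score(response: str) -> int:
    """Pull the first 1-3 digit out of the judge's reply.
    Falls back to 1 (worst) if the reply is unparseable, so a
    misbehaving judge can never inflate the score."""
    match = re.search(r"[123]", response)
    return int(match.group()) if match else 1

print(parse_score("Score: 3 - the passage answers the query directly"))  # 3
print(parse_score("no idea"))                                            # 1
```

Constraining the judge to a tiny discrete scale and parsing defensively keeps noisy free-text replies from silently corrupting your metrics.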

4. End-to-End Metrics (RAG Triad)

  • Context relevance: Are retrieved chunks relevant to the query?
  • Faithfulness: Does the answer only use information from the context?
  • Answer relevance: Does the answer address the question?

Frameworks like RAGAS automate this evaluation.
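The triad amounts to three judge prompts over the same (query, contexts, answer) tuple. A minimal orchestration sketch, where `judge` is a hypothetical stand-in for an LLM-as-judge call returning a score in [0, 1] (frameworks like RAGAS package this pattern with tuned prompts):

```python
from statistics import mean

def rag_triad(query, contexts, answer, judge):
    """Score the RAG triad. `judge(prompt) -> float in [0, 1]` is a
    hypothetical callable wrapping your LLM client of choice."""
    # Context relevance: judge each retrieved chunk against the query.
    context_relevance = mean(
        judge(f"Query: {query}\nPassage: {c}\n"
              "Is the passage relevant to the query? Score 0-1.")
        for c in contexts
    )
    # Faithfulness: judge the answer against the retrieved context only.
    faithfulness = judge(
        f"Context: {' '.join(contexts)}\nAnswer: {answer}\n"
        "Is every claim in the answer supported by the context? Score 0-1."
    )
    # Answer relevance: judge the answer against the original question.
    answer_relevance = judge(
        f"Question: {query}\nAnswer: {answer}\n"
        "Does the answer address the question? Score 0-1."
    )
    return {
        "context_relevance": context_relevance,
        "faithfulness": faithfulness,
        "answer_relevance": answer_relevance,
    }

# Illustrative run with a trivially optimistic fake judge:
scores = rag_triad(
    query="What is our refund window?",
    contexts=["Refunds are accepted within 30 days."],
    answer="The refund window is 30 days.",
    judge=lambda prompt: 1.0,
)
print(scores)
```

Keeping the three scores separate (rather than averaging them) is the point: low context relevance implicates the retriever, low faithfulness with high context relevance implicates the generator.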

5. Building Your Evaluation Dataset Without Labels

  • Take documents from your corpus
  • Ask an LLM to generate questions that would be answered by each document
  • Use those (question, document) pairs as ground truth
  • This "reverse generation" approach bootstraps labels quickly

Common Follow-ups

  1. "What is a reasonable target for P@5?" Targets depend heavily on corpus quality and query distribution. For a well-tuned system over a clean, well-labeled corpus with clear queries, P@5 > 0.7 is achievable. For large, noisy production corpora (mixed quality docs, ambiguous queries), P@5 of 0.5–0.6 is often considered good. Establish your baseline first, then track improvement over time rather than targeting a fixed number in the abstract.

  2. "How do you detect retrieval failures in production without ground truth labels?" Monitor proxy signals: short answers, "I don't know" response rate, thumbs-down rate, follow-up query rate.

  3. "How would you compare two retrieval approaches?" A/B test on your evaluation dataset: run both on the same queries, measure metrics, then compare end-to-end generation quality.
