Why This Is Asked
Interviewers ask this to evaluate your understanding of end-to-end AI system design. They want to see if you can reason about tradeoffs between retrieval accuracy, latency, cost, and complexity. RAG is one of the most common production AI patterns, so deep familiarity signals real-world experience.
Key Concepts to Cover
- Document ingestion — how documents enter the system
- Chunking — splitting documents into retrievable units
- Embedding models — converting text to vectors
- Vector database — storing and searching embeddings
- Retrieval strategies — similarity search, re-ranking, hybrid search
- Context assembly — building the LLM prompt with retrieved chunks
- Generation — LLM inference with retrieved context
- Evaluation — measuring retrieval and generation quality
How to Approach This
1. Clarify Requirements
Start by asking the interviewer:
- What type of documents? (PDFs, web pages, code, structured data?)
- What scale? (1K docs vs 10M docs changes the architecture)
- Latency requirements? (Real-time chat vs batch processing)
- Accuracy requirements? (Customer-facing vs internal tool)
2. High-Level Architecture
Walk through the two main pipelines:
Ingestion pipeline (offline): Documents → Preprocessing → Chunking → Embedding → Vector DB
Query pipeline (online): User query → Embedding → Vector search → Re-ranking → Context assembly → LLM → Response
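The two pipelines above can be sketched end to end in a few lines. This is a toy illustration, not a real implementation: the bag-of-words `embed()` and overlap-count `similarity()` stand in for an embedding model and a vector database, and `build_prompt()` stops just before the LLM call.

```python
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words counter (stand-in for a real model)."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Toy similarity: shared-word count (stand-in for cosine similarity)."""
    return sum((a & b).values())

def ingest(documents, chunk_size=200):
    """Offline pipeline: preprocess -> chunk -> embed -> index."""
    index = []                                   # stands in for a vector DB
    for doc in documents:
        text = " ".join(doc.split())             # preprocessing: normalize whitespace
        chunks = [text[i:i + chunk_size]         # naive fixed-size chunking
                  for i in range(0, len(text), chunk_size)]
        index.extend((embed(c), c) for c in chunks)
    return index

def build_prompt(query, index, top_k=2):
    """Online pipeline: embed query -> vector search -> assemble context."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: similarity(q, e[0]), reverse=True)
    context = "\n\n".join(chunk for _, chunk in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In an interview, the point of a sketch like this is to show where each real component plugs in: `embed` becomes a model call, `index` becomes a vector store, and the sort becomes an approximate-nearest-neighbor search.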
3. Deep Dive: Chunking
This is where most candidates differentiate themselves. Discuss:
- Fixed-size chunking (simple but loses context)
- Semantic chunking (split on topic boundaries)
- Recursive character splitting (practical middle ground)
- Chunk overlap (prevents information loss at boundaries)
- Metadata preservation (source, page number, section title)
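A minimal sketch of fixed-size chunking with overlap and metadata preservation (the dict fields `source` and `offset` are illustrative choices, not a standard schema); semantic or recursive splitters refine where the boundaries fall, but keep the same output shape:

```python
def chunk_with_overlap(text, source, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks that overlap at the boundaries.

    Overlap prevents a sentence straddling a boundary from being lost;
    each chunk carries metadata so retrieved results can be traced back.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"text": piece, "source": source, "offset": start})
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

For example, `chunk_with_overlap("abcdefghij", "doc1", chunk_size=6, overlap=2)` yields `"abcdef"` and `"efghij"`, with the two shared characters preserving context across the boundary.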
4. Deep Dive: Retrieval
Cover the retrieval strategy:
- Dense retrieval (embedding similarity)
- Sparse retrieval (BM25 keyword matching)
- Hybrid approach (combine both with reciprocal rank fusion)
- Re-ranking the top-k results with a cross-encoder. Cross-encoders score the query and document together, so they are far more accurate than embedding similarity alone, but their cost scales linearly with the number of candidates scored. Apply them only to a shortlist of 20-50 results from the initial retrieval, never the entire corpus.
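Reciprocal rank fusion, the standard way to combine dense and sparse result lists, is simple enough to write out in an interview. Each document's fused score is the sum of 1/(k + rank) over every list it appears in, with k conventionally around 60:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (e.g. dense and BM25 results) into one.

    rankings: list of ranked lists of document IDs, best first.
    Returns document IDs sorted by RRF score: sum of 1/(k + rank) per list.
    The constant k damps the influence of any single list's top result.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that embedding similarities and BM25 scores live on incomparable scales.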
5. Discuss Tradeoffs
- Chunk size: smaller = more precise retrieval, larger = more context
- Number of retrieved chunks: more = better recall but higher cost and latency
- Embedding model choice: larger models = better quality, higher latency
- Caching: cache frequent queries to reduce latency and cost
Common Follow-ups
- "How would you handle documents that update frequently?" Discuss incremental ingestion, document versioning, and cache invalidation.
- "How do you evaluate whether the RAG system is working well?" Cover retrieval metrics (precision@k, recall@k, MRR) and generation metrics (faithfulness, relevance, answer correctness). Mention human evaluation loops.
- "What happens when the retrieved context contradicts itself?" Discuss conflict resolution strategies, source prioritization, and prompting the LLM to acknowledge uncertainty.
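The retrieval metrics mentioned above are worth being able to define precisely. A minimal sketch, assuming relevance judgments are available as sets of relevant document IDs per query:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit.

    queries: list of (retrieved_ids, relevant_id_set) pairs.
    """
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

For instance, if two queries place their first relevant document at ranks 2 and 1, MRR is (0.5 + 1.0) / 2 = 0.75. Generation metrics like faithfulness typically require an LLM judge or human raters rather than a closed-form formula.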