Why This Is Asked
Interviewers ask this to evaluate your understanding of end-to-end AI system design. They want to see if you can reason about tradeoffs between retrieval accuracy, latency, cost, and complexity. RAG is one of the most common production AI patterns, so deep familiarity signals real-world experience.
Key Concepts to Cover
- Document ingestion — how documents enter the system
- Chunking — splitting documents into retrievable units
- Embedding models — converting text to vectors
- Vector database — storing and searching embeddings
- Retrieval strategies — similarity search, re-ranking, hybrid search
- Context assembly — building the LLM prompt with retrieved chunks
- Generation — LLM inference with retrieved context
- Evaluation — measuring retrieval and generation quality
How to Approach This
1. Clarify Requirements
Start by asking the interviewer:
- What type of documents? (PDFs, web pages, code, structured data?)
- What scale? (1K docs vs 10M docs changes the architecture)
- Latency requirements? (Real-time chat vs batch processing)
- Accuracy requirements? (Customer-facing vs internal tool)
2. High-Level Architecture
Walk through the two main pipelines:
Ingestion pipeline (offline): Documents → Preprocessing → Chunking → Embedding → Vector DB
Query pipeline (online): User query → Embedding → Vector search → Re-ranking → Context assembly → LLM → Response
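The two pipelines above can be sketched end to end in a few lines. This is a toy illustration, not a real implementation: the bag-of-words `embed()` and overlap-count `similarity()` stand in for an embedding model and a vector database, and `build_prompt()` stops just before the LLM call.

```python
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words counter (stand-in for a real model)."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Toy similarity: shared-word count (stand-in for cosine similarity)."""
    return sum((a & b).values())

def ingest(documents, chunk_size=200):
    """Offline pipeline: preprocess -> chunk -> embed -> index."""
    index = []                                   # stands in for a vector DB
    for doc in documents:
        text = " ".join(doc.split())             # preprocessing: normalize whitespace
        chunks = [text[i:i + chunk_size]         # naive fixed-size chunking
                  for i in range(0, len(text), chunk_size)]
        index.extend((embed(c), c) for c in chunks)
    return index

def build_prompt(query, index, top_k=2):
    """Online pipeline: embed query -> vector search -> assemble context."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: similarity(q, e[0]), reverse=True)
    context = "\n\n".join(chunk for _, chunk in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In an interview, the point of a sketch like this is to show where each real component plugs in: `embed` becomes a model call, `index` becomes a vector store, and the sort becomes an approximate-nearest-neighbor search.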
3. Deep Dive: Chunking
This is where most candidates differentiate themselves. Discuss:
- Fixed-size chunking (simple but loses context)
- Semantic chunking (split on topic boundaries)
- Recursive character splitting (practical middle ground)
- Chunk overlap (prevents information loss at boundaries)
- Metadata preservation (source, page number, section title)
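A minimal sketch of fixed-size chunking with overlap and metadata preservation (the dict fields `source` and `offset` are illustrative choices, not a standard schema); semantic or recursive splitters refine where the boundaries fall, but keep the same output shape:

```python
def chunk_with_overlap(text, source, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks that overlap at the boundaries.

    Overlap prevents a sentence straddling a boundary from being lost;
    each chunk carries metadata so retrieved results can be traced back.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"text": piece, "source": source, "offset": start})
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

For example, `chunk_with_overlap("abcdefghij", "doc1", chunk_size=6, overlap=2)` yields `"abcdef"` and `"efghij"`, with the two shared characters preserving context across the boundary.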
4. Deep Dive: Retrieval
Cover the retrieval strategy:
- Dense retrieval (embedding similarity)
- Sparse retrieval (BM25 keyword matching)
- Hybrid approach (combine both with reciprocal rank fusion)
- Re-ranking the top-k results with a cross-encoder. Cross-encoders score the query and document together, so they are far more accurate than embedding similarity alone, but their cost scales linearly with the number of candidates scored. Apply them only to a shortlist of 20-50 results from the initial retrieval, never the entire corpus.
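Reciprocal rank fusion, the standard way to combine dense and sparse result lists, is simple enough to write out in an interview. Each document's fused score is the sum of 1/(k + rank) over every list it appears in, with k conventionally around 60:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (e.g. dense and BM25 results) into one.

    rankings: list of ranked lists of document IDs, best first.
    Returns document IDs sorted by RRF score: sum of 1/(k + rank) per list.
    The constant k damps the influence of any single list's top result.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that embedding similarities and BM25 scores live on incomparable scales.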
5. Discuss Tradeoffs
- Chunk size: smaller = more precise retrieval, larger = more context
- Number of retrieved chunks: more = better recall but higher cost and latency
- Embedding model choice: larger models = better quality, higher latency
- Caching: cache frequent queries to reduce latency and cost
Common Follow-ups
- "How would you handle documents that update frequently?" Discuss incremental ingestion, document versioning, and cache invalidation.
- "How do you evaluate whether the RAG system is working well?" Cover retrieval metrics (precision@k, recall@k, MRR) and generation metrics (faithfulness, relevance, answer correctness). Mention human evaluation loops.
- "What happens when the retrieved context contradicts itself?" Discuss conflict resolution strategies, source prioritization, and prompting the LLM to acknowledge uncertainty.
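The retrieval metrics mentioned above are worth being able to define precisely. A minimal sketch, assuming relevance judgments are available as sets of relevant document IDs per query:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit.

    queries: list of (retrieved_ids, relevant_id_set) pairs.
    """
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

For instance, if two queries place their first relevant document at ranks 2 and 1, MRR is (0.5 + 1.0) / 2 = 0.75. Generation metrics like faithfulness typically require an LLM judge or human raters rather than a closed-form formula.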