Why This Is Asked
Document Q&A is one of the most practical and frequently built AI features. It tests your ability to apply RAG to a concrete product requirement, handle real-world complications, and think about UX.
Key Concepts to Cover
- Ingestion pipeline — parsing, cleaning, chunking documents at scale
- Multi-document retrieval — finding relevant content across many sources
- Citation and sourcing — attributing answers to specific documents
- Conflicting information — handling contradictions across documents
- Query understanding — rewriting vague queries
- Access control — users should only see answers drawn from documents they are authorized to read
How to Approach This
1. Clarify Requirements
- What document types? (PDFs, Word, HTML, code?)
- How large is the corpus? (100 docs vs. 10M?)
- What latency is acceptable?
- Do users need citations?
- Any access control?
2. High-Level Architecture
Documents → Ingestion Pipeline → Vector Store
User Query → Query Processor → Retriever → Context Builder → LLM → Answer + Citations
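The query path above can be sketched as a thin orchestration function. This is a minimal sketch only; `rewrite`, `retrieve`, and `llm` are hypothetical stand-ins for whatever components you actually choose:

```python
def answer_question(query: str, rewrite, retrieve, llm) -> str:
    """Query path: Query Processor -> Retriever -> Context Builder -> LLM.
    rewrite/retrieve/llm are hypothetical stand-ins for real components."""
    clean_query = rewrite(query)                        # query understanding
    chunks = retrieve(clean_query, top_k=5)             # vector-store lookup
    context = "\n\n".join(c["text"] for c in chunks)    # context builder
    prompt = (f"Answer using only this context:\n{context}"
              f"\n\nQuestion: {clean_query}")
    return llm(prompt)
```

Keeping the stages behind plain callables like this makes it easy to swap the retriever or add a re-ranker without touching the rest of the pipeline.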
3. Ingestion at Scale
For 10M+ documents:
- Distributed ingestion workers
- Document deduplication (hash-based)
- Incremental updates (re-process only changed documents)
- Metadata extraction: author, date, source, document type
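Hash-based deduplication and incremental updates can share one mechanism: store a content hash per document and skip re-processing when it has not changed. A minimal sketch, where the in-memory `seen_hashes` dict stands in for a shared metadata store:

```python
import hashlib

# stands in for a shared metadata store (e.g. a database) in production
seen_hashes: dict[str, str] = {}

def content_hash(text: str) -> str:
    """Stable hash of whitespace/case-normalized text for dedup and
    change detection."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def should_ingest(doc_id: str, text: str) -> bool:
    """Skip unchanged documents; re-process only when content changes."""
    h = content_hash(text)
    if seen_hashes.get(doc_id) == h:
        return False  # unchanged since last ingestion
    seen_hashes[doc_id] = h
    return True
```

The same hash, keyed by content rather than `doc_id`, also catches duplicate documents uploaded under different names.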
4. Citations
- Chunk-level citation: tag each chunk with source document and page
- Post-hoc attribution: ask the LLM to annotate which sentence came from which source
- Inline citation format: instruct the LLM to output [1], [2] references
- Validation: verify that each citation actually supports the claim
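Chunk-level citation and the inline [1], [2] format can be wired together when building the context: number each retrieved chunk and keep a map from citation number to source for rendering the final answer. A minimal sketch, assuming each chunk carries `doc` and `page` metadata:

```python
def build_cited_context(chunks: list[dict]) -> tuple[str, dict[int, str]]:
    """Number retrieved chunks so the LLM can emit [n] inline citations,
    and return a map from citation number to source for rendering."""
    context_parts = []
    citation_map = {}
    for i, chunk in enumerate(chunks, start=1):
        citation_map[i] = f'{chunk["doc"]}, p.{chunk["page"]}'
        context_parts.append(f'[{i}] {chunk["text"]}')
    return "\n\n".join(context_parts), citation_map
```

The citation map is what lets you validate afterwards: for each [n] the model emits, check the cited chunk's text actually supports the adjacent claim.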
5. Query Understanding
- Query rewriting: expand vague queries with context
- HyDE (Hypothetical Document Embeddings): instead of embedding the query directly, ask the LLM to generate a hypothetical answer to the query, then embed that answer and use it for retrieval. Because the hypothetical answer shares the vocabulary and style of relevant documents, its embedding lands closer to real answers than the question's does. This is especially useful when question and answer vocabulary differ significantly.
- Multi-query retrieval: generate multiple phrasings and union results
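Multi-query retrieval is a straightforward union with deduplication by chunk id. A sketch under stated assumptions: `rewrite_fn` and `retrieve_fn` are hypothetical stand-ins for an LLM rewriter and a vector-store search:

```python
def multi_query_retrieve(query, rewrite_fn, retrieve_fn, n_variants=3):
    """Generate alternative phrasings, retrieve for each, and union the
    results, deduplicating chunks by id in first-seen order."""
    variants = [query] + rewrite_fn(query, n_variants)
    seen, results = set(), []
    for variant in variants:
        for chunk in retrieve_fn(variant):
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                results.append(chunk)
    return results
```

In practice you would cap the union size or re-rank it, since three variants can triple the candidate pool.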
Common Follow-ups
- "How would you handle a question that spans multiple documents?" Retrieving from multiple sources, map-reduce summarization, iterative retrieval.
- "How would you measure accuracy?" Golden Q&A dataset, exact match, semantic similarity, citation accuracy, human evaluation.
- "How do you handle documents that are updated or deleted?" Re-indexing on change events, chunk-level versioning, invalidating cached answers.