Why This Is Asked
Structured extraction is one of the most common production LLM use cases. Interviewers ask this to see whether you can design prompts that produce reliably correct output, with graceful handling of edge cases and failures.
Key Concepts to Cover
- Output schema definition — specifying exactly what JSON to return
- Handling missing fields — when source text lacks expected information
- Ambiguity resolution — when the same concept appears multiple times
- Validation — parsing and validating LLM output before using it
- Structured output APIs — provider-level JSON mode or function calling
- Error recovery — retry strategies when output is malformed
How to Approach This
1. Define the Output Schema Precisely
{
  "invoice_number": "string | null",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string (ISO 4217 code)",
  "due_date": "string (ISO 8601) | null"
}
2. Prompt Structure
- Role + task: "You are a data extraction assistant."
- Schema: Provide the JSON schema with field descriptions
- Rules for missing/ambiguous data: "If a field is not present, use null."
- Output instruction: "Return ONLY valid JSON. No explanation or other text."
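The four parts above can be assembled into a single prompt template. A minimal sketch (the exact wording and the extra ambiguity rule are illustrative, not a fixed recipe):

```python
def build_extraction_prompt(schema_json: str, document_text: str) -> str:
    """Assemble role, schema, rules, and output instruction into one prompt."""
    return (
        "You are a data extraction assistant.\n\n"
        "Extract the fields below from the document and return them as JSON "
        f"matching this schema:\n{schema_json}\n\n"
        "Rules:\n"
        "- If a field is not present in the document, use null.\n"
        "- If a value appears more than once, use the most specific occurrence.\n\n"
        "Return ONLY valid JSON. No explanation or other text.\n\n"
        f"Document:\n{document_text}"
    )
```

Keeping the document last means the instructions are never buried under a long input, and the template makes the rules easy to version and test.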
3. Use Provider-Level Structured Output
Modern LLM APIs support structured output natively:
- OpenAI:
- OpenAI: `response_format: { type: "json_schema", json_schema: {...} }` enforces valid JSON. The newer `strict: true` mode (added August 2024) additionally enforces that the output matches your schema exactly, with no extra fields; prefer it when available.
- Anthropic: Define your schema as a tool with a JSON Schema in the `input_schema` field and instruct the model to call that tool. The model's tool call arguments are your structured output. This is the recommended pattern for guaranteed structured output with Anthropic models.
Use these when available — they enforce schema conformance at the API level, eliminating the most common causes of parse failures.
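As a concrete illustration of the OpenAI option, here is a sketch of building the `response_format` payload (field names follow OpenAI's Structured Outputs API as introduced in August 2024; verify against current docs, since the invoice schema shown is a trimmed example):

```python
def build_response_format(name: str, schema: dict) -> dict:
    """Wrap a JSON Schema in OpenAI's strict structured-output envelope."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "strict": True,  # reject outputs that deviate from the schema
            # strict mode requires additionalProperties: false on objects
            "schema": {**schema, "additionalProperties": False},
        },
    }

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"]},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "vendor_name", "total_amount"],
}
response_format = build_response_format("invoice_extraction", invoice_schema)
```

This dict is passed as the `response_format` argument of a chat completion request; the API then guarantees the returned message parses against `invoice_schema`.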
4. Few-Shot Examples
For complex extractions, include 1-2 examples. Skip them for simple schemas — they add tokens without benefit.
5. Validation and Error Recovery
import json
from pydantic import ValidationError

try:
    result = json.loads(llm_response)
    validated = ExtractionSchema.model_validate(result)
except (json.JSONDecodeError, ValidationError) as e:
    # Feed the parse/validation error back to the model; cap retries in production
    retry_response = call_llm(f"Previous response was invalid: {e}. Try again.")
    validated = ExtractionSchema.model_validate(json.loads(retry_response))
Common Follow-ups
- "What do you do when the LLM extracts a field incorrectly with high confidence?" Human review queues, confidence scoring, ensemble approaches with multiple extraction passes.
- "How do you handle documents in multiple languages?" Language detection pre-processing, language-specific prompts, multilingual models, canonical format normalization.
- "How would you scale to 100,000 documents per day?" Async batch processing, cheaper model for simple extractions, escalate complex ones, caching identical documents.