Intermediate · 3 min read

How Would You Design a Prompt for Structured Data Extraction?

Design a prompt that reliably extracts structured data (JSON, tables) from unstructured text — handling missing fields, ambiguity, and format errors.


Why This Is Asked

Structured extraction is one of the most common production LLM use cases. Interviewers ask this to see if you can design prompts that are reliably correct, with graceful handling of edge cases and failures.

Key Concepts to Cover

  • Output schema definition — specifying exactly what JSON to return
  • Handling missing fields — when source text lacks expected information
  • Ambiguity resolution — when the same concept appears multiple times
  • Validation — parsing and validating LLM output before using it
  • Structured output APIs — provider-level JSON mode or function calling
  • Error recovery — retry strategies when output is malformed

How to Approach This

1. Define the Output Schema Precisely

{
  "invoice_number": "string | null",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string (ISO 4217 code)",
  "due_date": "string (ISO 8601) | null"
}

2. Prompt Structure

  1. Role + task: "You are a data extraction assistant."
  2. Schema: Provide the JSON schema with field descriptions
  3. Rules for missing/ambiguous data: "If a field is not present, use null."
  4. Output instruction: "Return ONLY valid JSON. No explanation or other text."
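The four parts above can be assembled into a single prompt. A minimal sketch in Python; the `SCHEMA` string and the `build_extraction_prompt` helper name are illustrative, not a fixed API:

```python
# Illustrative schema string mirroring the invoice schema defined earlier.
SCHEMA = """{
  "invoice_number": "string | null",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string (ISO 4217 code)",
  "due_date": "string (ISO 8601) | null"
}"""

def build_extraction_prompt(document_text: str) -> str:
    # Role + task, schema, rules for missing/ambiguous data, output instruction.
    return (
        "You are a data extraction assistant.\n\n"
        f"Extract the following fields as JSON matching this schema:\n{SCHEMA}\n\n"
        "Rules: If a field is not present in the text, use null. "
        "If a value appears multiple times, use the most specific occurrence.\n\n"
        "Return ONLY valid JSON. No explanation or other text.\n\n"
        f"Text:\n{document_text}"
    )
```

Keeping the schema and rules in fixed positions makes the prompt easy to version and test as the schema evolves.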

3. Use Provider-Level Structured Output

Modern LLM APIs support structured output natively:

  • OpenAI: response_format: { type: "json_schema", json_schema: {...} } enforces valid JSON. The newer strict: true mode (added August 2024) additionally enforces that the output matches your schema exactly, with no extra fields — prefer this when available.
  • Anthropic: Define your schema as a tool with a JSON Schema in the input_schema field and instruct the model to call that tool. The model's tool call arguments are your structured output. This is the recommended pattern for guaranteed structured output with Anthropic models.

Use these when available — they enforce schema conformance at the API level, eliminating the most common causes of parse failures.
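For the OpenAI route, the schema travels in the request payload. A sketch of building that payload for the invoice schema, assuming the `response_format: json_schema` shape described above; the schema name is invented, and the exact client call (commented out) should be checked against current API docs:

```python
# JSON Schema for the invoice fields; null-able fields use a union type.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"]},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string", "description": "ISO 4217 code"},
        "due_date": {"type": ["string", "null"], "description": "ISO 8601 date"},
    },
    "required": [
        "invoice_number", "vendor_name", "total_amount", "currency", "due_date",
    ],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice_extraction",      # arbitrary identifier
        "strict": True,                    # exact schema conformance where supported
        "schema": invoice_schema,
    },
}

# The payload would then be passed to the chat completion call, e.g.:
# client.chat.completions.create(model=..., messages=[...],
#                                response_format=response_format)
```

With `strict: True`, `required` listing every field plus `additionalProperties: False` is what lets the API guarantee the output matches the schema exactly.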

4. Few-Shot Examples

For complex extractions, include 1-2 examples. Skip them for simple schemas — they add tokens without benefit.

5. Validation and Error Recovery

import json
from pydantic import ValidationError

try:
    result = json.loads(llm_response)
    validated = ExtractionSchema.model_validate(result)
except (json.JSONDecodeError, ValidationError) as e:
    # Feed the error message back so the model can self-correct
    retry_response = call_llm(
        f"Previous response was invalid: {e}. Return corrected JSON only."
    )
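The parse-validate-retry pattern above generalizes to a bounded loop. A self-contained sketch using only the standard library: `extract_with_retry` and the field-set check are illustrative stand-ins for a real schema validator such as pydantic, and `call_llm` is assumed to be any prompt-to-string callable:

```python
import json

# Fields the extraction must contain (mirrors the invoice schema).
REQUIRED_FIELDS = {
    "invoice_number", "vendor_name", "total_amount", "currency", "due_date",
}

def extract_with_retry(call_llm, prompt, max_attempts=3):
    """Parse and minimally validate LLM output, retrying with the
    error message appended so the model can self-correct."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            missing = REQUIRED_FIELDS - data.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            return data
        except (json.JSONDecodeError, ValueError) as e:
            last_error = e
            prompt = (
                f"Previous response was invalid: {e}. "
                "Return corrected JSON only."
            )
    raise RuntimeError(
        f"extraction failed after {max_attempts} attempts: {last_error}"
    )
```

Bounding the retries matters in production: a document the model cannot extract should fail loudly (or fall to a review queue) rather than loop forever.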

Common Follow-ups

  1. "What do you do when the LLM extracts a field incorrectly with high confidence?" Human review queues, confidence scoring, ensemble approaches with multiple extraction passes.

  2. "How do you handle documents in multiple languages?" Language detection pre-processing, language-specific prompts, multilingual models, canonical format normalization.

  3. "How would you scale to 100,000 documents per day?" Async batch processing, cheaper model for simple extractions, escalate complex ones, caching identical documents.

