Intermediate · 3 min read

How Would You Design a Prompt for Structured Data Extraction?

Design a prompt that reliably extracts structured data (JSON, tables) from unstructured text — handling missing fields, ambiguity, and format errors.


Why This Is Asked

Structured extraction is one of the most common production LLM use cases. Interviewers ask this to see if you can design prompts that are reliably correct, with graceful handling of edge cases and failures.

Key Concepts to Cover

  • Output schema definition — specifying exactly what JSON to return
  • Handling missing fields — when source text lacks expected information
  • Ambiguity resolution — when the same concept appears multiple times
  • Validation — parsing and validating LLM output before using it
  • Structured output APIs — provider-level JSON mode or function calling
  • Error recovery — retry strategies when output is malformed

How to Approach This

1. Define the Output Schema Precisely

{
  "invoice_number": "string | null",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string (ISO 4217 code)",
  "due_date": "string (ISO 8601) | null"
}

2. Prompt Structure

  1. Role + task: "You are a data extraction assistant."
  2. Schema: Provide the JSON schema with field descriptions
  3. Rules for missing/ambiguous data: "If a field is not present, use null."
  4. Output instruction: "Return ONLY valid JSON. No explanation or other text."
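The four parts above can be assembled into a single prompt. A minimal sketch in Python; the `SCHEMA` string and the `build_extraction_prompt` helper name are illustrative, not a fixed API:

```python
# Illustrative schema string mirroring the invoice schema defined earlier.
SCHEMA = """{
  "invoice_number": "string | null",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string (ISO 4217 code)",
  "due_date": "string (ISO 8601) | null"
}"""

def build_extraction_prompt(document_text: str) -> str:
    # Role + task, schema, rules for missing/ambiguous data, output instruction.
    return (
        "You are a data extraction assistant.\n\n"
        f"Extract the following fields as JSON matching this schema:\n{SCHEMA}\n\n"
        "Rules: If a field is not present in the text, use null. "
        "If a value appears multiple times, use the most specific occurrence.\n\n"
        "Return ONLY valid JSON. No explanation or other text.\n\n"
        f"Text:\n{document_text}"
    )
```

Keeping the schema and rules in fixed positions makes the prompt easy to version and test as the schema evolves.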

3. Use Provider-Level Structured Output

Modern LLM APIs support structured output natively:

  • OpenAI: response_format: { type: "json_schema", json_schema: {...} } enforces valid JSON. The newer strict: true mode (added August 2024) additionally enforces that the output matches your schema exactly, with no extra fields — prefer this when available.
  • Anthropic: Define your schema as a tool with a JSON Schema in the input_schema field and instruct the model to call that tool. The model's tool call arguments are your structured output. This is the recommended pattern for guaranteed structured output with Anthropic models.

Use these when available — they enforce schema conformance at the API level, eliminating the most common causes of parse failures.
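For the OpenAI route, the schema travels in the request payload. A sketch of building that payload for the invoice schema, assuming the `response_format: json_schema` shape described above; the schema name is invented, and the exact client call (commented out) should be checked against current API docs:

```python
# JSON Schema for the invoice fields; null-able fields use a union type.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"]},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string", "description": "ISO 4217 code"},
        "due_date": {"type": ["string", "null"], "description": "ISO 8601 date"},
    },
    "required": [
        "invoice_number", "vendor_name", "total_amount", "currency", "due_date",
    ],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice_extraction",      # arbitrary identifier
        "strict": True,                    # exact schema conformance where supported
        "schema": invoice_schema,
    },
}

# The payload would then be passed to the chat completion call, e.g.:
# client.chat.completions.create(model=..., messages=[...],
#                                response_format=response_format)
```

With `strict: True`, `required` listing every field plus `additionalProperties: False` is what lets the API guarantee the output matches the schema exactly.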

4. Few-Shot Examples

For complex extractions, include 1-2 examples. Skip them for simple schemas — they add tokens without benefit.

5. Validation and Error Recovery

import json
from pydantic import ValidationError

try:
    result = json.loads(llm_response)
    validated = ExtractionSchema.model_validate(result)
except (json.JSONDecodeError, ValidationError) as e:
    # Feed the error message back so the model can self-correct
    retry_response = call_llm(
        f"Previous response was invalid: {e}. Return corrected JSON only."
    )
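The parse-validate-retry pattern above generalizes to a bounded loop. A self-contained sketch using only the standard library: `extract_with_retry` and the field-set check are illustrative stand-ins for a real schema validator such as pydantic, and `call_llm` is assumed to be any prompt-to-string callable:

```python
import json

# Fields the extraction must contain (mirrors the invoice schema).
REQUIRED_FIELDS = {
    "invoice_number", "vendor_name", "total_amount", "currency", "due_date",
}

def extract_with_retry(call_llm, prompt, max_attempts=3):
    """Parse and minimally validate LLM output, retrying with the
    error message appended so the model can self-correct."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            missing = REQUIRED_FIELDS - data.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            return data
        except (json.JSONDecodeError, ValueError) as e:
            last_error = e
            prompt = (
                f"Previous response was invalid: {e}. "
                "Return corrected JSON only."
            )
    raise RuntimeError(
        f"extraction failed after {max_attempts} attempts: {last_error}"
    )
```

Bounding the retries matters in production: a document the model cannot extract should fail loudly (or fall to a review queue) rather than loop forever.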

Common Follow-ups

  1. "What do you do when the LLM extracts a field incorrectly with high confidence?" Human review queues, confidence scoring, ensemble approaches with multiple extraction passes.

  2. "How do you handle documents in multiple languages?" Language detection pre-processing, language-specific prompts, multilingual models, canonical format normalization.

  3. "How would you scale to 100,000 documents per day?" Async batch processing, cheaper model for simple extractions, escalate complex ones, caching identical documents.

