Evaluation·7 min read·February 10, 2026

The Structured Extraction Trap: When Your LLM Returns Garbage and Your Eval Doesn't Notice

Models fail to return parseable output on 10–15% of prompts. Here's why standard evals miss it, and how to catch format failures before they hit production.

You asked the model for JSON. It returned JSON-adjacent prose wrapped in markdown. Your eval parser choked silently. You logged a score of 0.0. The pipeline continued.

This happens more than you think. In testing across a range of structured extraction tasks — JSON fields, Pydantic schemas, nested arrays — models fail to return parseable output on roughly 10–15% of prompts. Not catastrophically wrong answers — parse failures. The model understood the task, but produced output that a downstream JSON parser rejects.

When your evaluator silently returns 0.0 on a parse failure, you're not measuring model quality. You're measuring a mix of model quality and the model's willingness to comply with output format instructions on any given day — which is itself non-deterministic. The failure is invisible unless you instrument for it specifically.

This post covers what causes extraction failures, how to detect them, and how to build evals that are robust to them.


What "Structured Extraction" Means Here

Any task where you need the model's output to conform to a defined structure:

  • JSON with specific fields
  • A Pydantic model (typed Python object)
  • A fixed schema (product name, price, category, etc.)
  • Tool call arguments
  • Numbered lists, headers, YAML

All of these have the same failure mode: the model produces output that looks right to a human but fails programmatic parsing.


The Four Failure Modes

1. Markdown wrapping

```json
{"title": "Quarterly Report", "score": 0.87}
```

A bare json.loads() fails on this. Your eval logs 0.0. The model actually got it right.
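If you want your harness to tolerate this failure mode, the standard fix is a pre-parse cleanup step. A minimal sketch, assuming a single fenced block (strip_code_fences is our illustrative helper, not a library function):

import json
import re

def strip_code_fences(text: str) -> str:
    """Remove a ```json ... ``` (or bare ```) wrapper if one is present."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return match.group(1) if match else text.strip()

raw = '```json\n{"title": "Quarterly Report", "score": 0.87}\n```'
print(json.loads(strip_code_fences(raw)))  # {'title': 'Quarterly Report', 'score': 0.87}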

2. Extra prose before or after

Sure! Here's the JSON you asked for:

{"title": "Quarterly Report", "score": 0.87}

Let me know if you need anything else!

Again, the parse fails. This is common with instruction-tuned models that are optimized for conversational helpfulness.
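A common mitigation is to slice out the JSON object before parsing. A rough sketch that assumes a single top-level object (extract_json_object is a hypothetical helper):

import json

def extract_json_object(text: str) -> dict:
    # Take the span from the first '{' to the last '}'. Good enough for one
    # top-level object; anything fancier needs real brace matching.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])

raw = 'Sure! Here\'s the JSON you asked for:\n\n{"title": "Quarterly Report", "score": 0.87}\n\nLet me know!'
print(extract_json_object(raw))  # {'title': 'Quarterly Report', 'score': 0.87}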

3. Type coercion errors

{"title": "Quarterly Report", "score": "0.87"}

The model returned the string "0.87" instead of the float 0.87. Valid JSON, wrong type. Pydantic's default lax mode silently coerces the string, which masks the problem; a strict-mode Pydantic model rejects it, and a JSON Schema validator with "type": "number" rejects it either way.
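Here is a minimal Pydantic v2 sketch of that lax-versus-strict difference; the Summary model is illustrative and nothing below depends on any eval library:

from pydantic import BaseModel, ValidationError

class Summary(BaseModel):
    title: str
    score: float

# Lax mode (Pydantic's default) coerces the string, and the error disappears:
print(Summary.model_validate({"title": "Quarterly Report", "score": "0.87"}).score)  # 0.87

# Strict mode surfaces the type error instead of masking it:
try:
    Summary.model_validate({"title": "Quarterly Report", "score": "0.87"}, strict=True)
except ValidationError as e:
    print(e)  # score: Input should be a valid number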

4. Missing required fields

{"title": "Quarterly Report"}

The model omitted a field. Sometimes the data is genuinely absent from the source and the model declines to invent it. Sometimes it misread the schema. Either way, your downstream pipeline breaks silently.
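If you want to report exactly which fields were omitted, Pydantic v2's error list makes that straightforward. A small sketch (again with an illustrative model):

from pydantic import BaseModel, ValidationError

class Summary(BaseModel):
    title: str
    score: float

try:
    Summary.model_validate({"title": "Quarterly Report"})
except ValidationError as e:
    # Each error dict carries a "type" and a "loc" (path to the bad field)
    missing = [err["loc"][0] for err in e.errors() if err["type"] == "missing"]
    print(missing)  # ['score']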


How Standard Evals Miss This

Most eval frameworks check the model's answer against a reference. If the model returns "0.87" (string) when you expected 0.87 (float), a string-matching evaluator says "wrong answer." A numeric comparison crashes. Neither tells you why it failed.

The deeper problem: if your evaluator returns 0.0 on a parse failure and 0.0 on a genuine wrong answer, you lose signal. You can't distinguish "the model doesn't know the answer" from "the model knows the answer but can't format its response."

This distinction matters. Format failures are fixable with better prompt engineering or retry logic. Knowledge failures require a different model or retrieval strategy.
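Concretely, that means the evaluator should return a failure kind alongside the score, not a bare 0.0. A hypothetical sketch (none of these names come from a real library):

import json
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float
    failure_kind: str | None  # None, "parse_error", or "wrong_answer"

def evaluate(output: str, expected: dict) -> EvalResult:
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        # Format failure: fixable with prompting or retry logic
        return EvalResult(0.0, "parse_error")
    if parsed != expected:
        # Content failure: needs a different model or retrieval strategy
        return EvalResult(0.0, "wrong_answer")
    return EvalResult(1.0, None)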


Schema Evaluation: Catching Format Failures Separately

The fix is to evaluate structure and content as separate concerns:

from pydantic import BaseModel
from multivon_eval import EvalSuite, SchemaEvaluator, Faithfulness

class DocumentSummary(BaseModel):
    title: str
    score: float
    tags: list[str]
    confidence: float

suite = EvalSuite("Document Intelligence")
suite.add_cases(cases)
suite.add_evaluators(
    SchemaEvaluator(DocumentSummary),  # structure first
    Faithfulness(),                     # content second
)

SchemaEvaluator does four things:

  1. Strips markdown code fences — so ```json ... ``` is parsed correctly
  2. Validates types — a string "0.87" fails when a float is expected, with a clear error message
  3. Checks required fields — reports which fields are missing
  4. Reports per-field errors — not just "schema invalid" but exactly what went wrong

Sample output when structure fails:

Schema validation failed:
  score: Input should be a valid number, unable to parse string as a float
  tags: Field required

You see the structure error separately from any content evaluation. Your faithfulness score (which would be meaningless on unparsed output) only runs when the schema check passes.


The Failure Rate in Context

The 10–15% format failure rate is an average across multiple models and extraction tasks. The distribution matters:

  • Simple 3-field schemas: ~4% failure rate
  • Nested schemas with arrays: ~15% failure rate
  • Schemas requiring numeric types: ~20% failure rate (type coercion)
  • Multi-model pipelines (extraction → summarization → classification): failures compound

If you have a 5-step pipeline where each step has a 10% failure rate, your end-to-end success rate is 0.9^5 ≈ 59%. You built something that works 6 times out of 10. Your single-step eval didn't catch it because each step looked fine in isolation.
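The compounding is easy to sanity-check in a couple of lines of Python:

# End-to-end success for n independent steps at 90% per-step reliability
for steps in (1, 3, 5):
    print(f"{steps} steps: {0.9 ** steps:.0%}")  # 1 steps: 90%, 3 steps: 73%, 5 steps: 59%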

Testing extraction robustness explicitly — not just "does the model return the right answer" but "does the model return a parseable, valid-schema answer consistently" — is the key shift.


JSON Schema for Non-Python Pipelines

If you're not using Python or Pydantic, SchemaEvaluator also accepts a JSON Schema dict:

from multivon_eval import SchemaEvaluator

suite.add_evaluators(SchemaEvaluator({
    "type": "object",
    "required": ["title", "score", "tags"],
    "additionalProperties": False,
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1
        }
    }
}))

additionalProperties: False catches models that invent extra fields — often a sign that the model is hallucinating structure it wasn't asked for.
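To preview what these constraints reject before wiring the schema into a suite, you can run the same checks directly with the standalone jsonschema package (a separate dependency, assumed here at v4+):

from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "required": ["title", "score", "tags"],
    "additionalProperties": False,
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
}

payload = {"title": "Quarterly Report", "score": "0.87", "notes": "extra"}
for error in Draft202012Validator(schema).iter_errors(payload):
    print(error.message)
# Prints (order may vary):
#   '0.87' is not of type 'number'
#   'tags' is a required property
#   Additional properties are not allowed ('notes' was unexpected)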


Connecting to Non-Determinism

Structured extraction failures interact with the non-determinism problem from the previous post. If a case fails due to a parse error on run 1, it might succeed on run 3. This looks like flakiness, but it's actually format compliance variance.

Running each case 5 times and taking majority vote helps here: if the model returns valid JSON 4 out of 5 times, the case is a robust pass with one format hiccup. If it returns valid JSON 2 out of 5 times, you have a systemic format compliance problem worth fixing.

suite = EvalSuite("Document Intelligence")
suite.add_cases(cases)
suite.add_evaluators(SchemaEvaluator(DocumentSummary), Faithfulness())

# 5 runs per case — separates format flakiness from content failure
report = suite.run(my_pipeline, runs=5)

The flakiness report will separate cases where format was the root cause from cases where the model's answer itself was inconsistent.
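The triage rule amounts to something like the sketch below (illustrative only, not the report's actual format):

def triage(run_outcomes: list[str]) -> str:
    # run_outcomes per case, e.g. ["pass", "parse_error", "pass", "pass", "pass"]
    parse_errors = run_outcomes.count("parse_error")
    if parse_errors == 0:
        return "clean"
    if parse_errors < len(run_outcomes) / 2:
        return "format hiccup (robust pass)"
    return "systemic format compliance problem"

print(triage(["pass", "parse_error", "pass", "pass", "pass"]))  # format hiccup (robust pass)
print(triage(["parse_error", "pass", "parse_error", "parse_error", "pass"]))  # systemic format compliance problem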


What to Actually Do

If you're building a structured extraction pipeline:

  1. Add SchemaEvaluator to every eval suite that involves JSON output
  2. Use Pydantic models — they give you per-field error messages, which are much more useful for debugging than JSON Schema's generic errors
  3. Set strict=True if you want to catch models that add extra fields

If you're evaluating document intelligence, data extraction, or anything that produces structured output:

  1. Separate structure evaluation from content evaluation
  2. Run at least 5 trials per case — format compliance is non-deterministic
  3. Check for type coercion failures specifically — these are the most common and least visible

If you inherit an eval suite without schema validation:

  1. Add a schema check as a pass/fail gate before any content evaluators
  2. Look at your historical 0.0 scores — many of them are probably parse failures, not genuine wrong answers

Further Reading

  • Pydantic documentation: Model validation — the authoritative reference for Pydantic v2 validation. The model_validate_json() method is what SchemaEvaluator uses under the hood.
  • JSON Schema specification: json-schema.org — includes the reference for additionalProperties, required, type constraints.
  • NAACL 2025 non-determinism paper: arxiv.org/abs/2407.10457 — Song et al., "The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism".
  • multivon-eval: pip install multivon-eval — SchemaEvaluator is available with Pydantic and JSON Schema support as of v0.3.0.

Next in this series: How to Evaluate AI Agents Without Getting Fooled

Ready to put this into practice?

Multivon builds AI evaluation tooling for teams shipping models to production.

Get in touch