PDF Hell

Adversarial PDFs that break AI document readers.

Every case is a PDF generated from code. The answer is exactly known. No LLM-as-judge, no circular assurance — just procedural ground truth and a model that either passes or fails. The mini suite is 30 cases across 3 trap families today; more trap families land as they're written and tested.

Quickstart

1. Install (zero setup with uv)

2. Generate one trap PDF for inspection

3. Run the mini suite against your favorite vision model

Models are selected with a provider prefix: anthropic:, openai:, or google:. The API key is read from the environment (ANTHROPIC_API_KEY, etc.).

The traps in detail

hidden_ocr_mismatch

Hidden OCR mismatch

Invisible text layer disagrees with the rendered page.

How it's generated

An invoice with realistic line items and totals. The visible TOTAL DUE is set by the seed. A second copy of the total, with a different amount, is written into the page's content stream using PDF text rendering mode 3 (invisible: the glyphs are neither filled nor stroked, so the text is extractable but never painted). A human sees one amount; a text-extraction pipeline sees another.
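To make the mechanism concrete, here is a minimal sketch of the content-stream operators involved. It is not the pdfhell generator (which the text says uses reportlab); the function name, the coordinates, and the /F1 font resource are assumptions for illustration. The only PDF facts relied on are the standard operators: Tf selects a font, Td moves the text position, Tj shows a string, and Tr sets the text rendering mode, where 0 fills the glyphs and 3 paints nothing.

```python
def invoice_content_stream(visible_total: str, hidden_total: str) -> bytes:
    """Build a text block with one painted total and one invisible total.

    Hypothetical helper: assumes the page resources define a font named /F1.
    """
    def esc(s: str) -> str:
        # Escape the characters that are special inside a PDF literal string.
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    ops = [
        "BT",                                      # begin text object
        "/F1 12 Tf",                               # font and size
        "0 Tr",                                    # mode 0: fill glyphs (painted)
        "72 700 Td",                               # move to the first line
        f"(TOTAL DUE: {esc(visible_total)}) Tj",   # what a human sees
        "3 Tr",                                    # mode 3: invisible text
        "0 -14 Td",                                # move down one line
        f"(TOTAL DUE: {esc(hidden_total)}) Tj",    # what text extraction also sees
        "ET",                                      # end text object
    ]
    return "\n".join(ops).encode("latin-1")


stream = invoice_content_stream("$1,240.00", "$1,420.00")
```

Both strings are present in the stream, so a text extractor can return either or both, while a renderer paints only the first.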

What it detects

Separates vision models that trust the rendered page (correct) from agents that fuse a text-extraction layer with vision output and silently prefer the layer (incorrect). This is the most common silent failure in production pipelines that handle scanned-then-OCR'd documents.

Expected failure mode

Model answers the hidden amount instead of the visible amount — diagnosable from the recorded forbidden-answer match.

footnote_override

Footnote override

A 6pt footnote overrides the body clause.

How it's generated

A short legal/contract document (MSA, DPA, SOW, Order Form) with a confident body clause — e.g. 'liability is capped at 12 months of fees' — and a 6pt footnote on the same page that overrides it: 'Notwithstanding the foregoing, liability for breaches of Sections 4.2 and 7.1 shall be uncapped.' Three clause families: liability caps, termination notice periods, data residency.

What it detects

Contract-analysis agents that read the body and skip the footnote. RAG pipelines that drop low-font-size text on ingest. Compliance summarisers that produce confident, partially-right answers.
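The ingest-time variant of this failure can be sketched in a few lines. The Span type, the ingest function, and the 8pt threshold below are all hypothetical, chosen only to show how a font-size filter removes the carve-out before any model sees it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A run of extracted text with its font size in points (hypothetical type)."""
    text: str
    font_size: float


def ingest(spans, min_font_size=0.0):
    """Join span text, dropping anything smaller than min_font_size."""
    return " ".join(s.text for s in spans if s.font_size >= min_font_size)


PAGE = [
    Span("Liability is capped at 12 months of fees.", 11.0),
    Span("Notwithstanding the foregoing, liability for breaches of "
         "Sections 4.2 and 7.1 shall be uncapped.", 6.0),
]

filtered = ingest(PAGE, min_font_size=8.0)  # the footnote is silently dropped
complete = ingest(PAGE)                     # body and footnote both kept
```

With the filter on, the downstream model can only produce the body answer, however capable it is.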

Expected failure mode

Model returns only the body answer, missing the carve-out — a malpractice-grade failure for any legal-AI vendor.

split_table_across_pages

Split table across pages

Header on page 1, body rows on page 2 — no header repeat.

How it's generated

A financial-results table with 6 columns (Region, Quarter, Gross Revenue, COGS, Operating Income, Net Revenue). The column-header row sits at the bottom of page 1, surrounded by realistic filler text. The 8 data rows sit at the top of page 2, headerless. The case asks for one specific cell (e.g. 'Net Revenue for the Northwest region in Q3').

What it detects

Any RAG loader that paginates documents independently and loses cross-page table context. Table-extraction models that don't persist column headers when a table spans a page break.
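A loader avoids this failure by carrying the most recent header row forward across page breaks. The sketch below assumes a hypothetical page structure (a dict with "header" and "rows") and made-up cell values; it illustrates the idea and is not part of pdfhell.

```python
HEADER = ["Region", "Quarter", "Gross Revenue", "COGS",
          "Operating Income", "Net Revenue"]

# Hypothetical extraction result: header on page 1, headerless rows on page 2.
PAGES = [
    {"header": HEADER, "rows": []},
    {"header": None, "rows": [
        ["Northwest", "Q2", "880", "510", "140", "370"],
        ["Northwest", "Q3", "910", "520", "150", "390"],
    ]},
]


def label_rows(pages):
    """Attach column names to every row, reusing the last header seen."""
    header = None
    labelled = []
    for page in pages:
        if page["header"] is not None:
            header = page["header"]
        for row in page["rows"]:
            if header is None:
                raise ValueError("data row encountered before any header")
            labelled.append(dict(zip(header, row)))
    return labelled


def lookup(rows, region, quarter, column):
    """Return one cell, identified by region, quarter, and column name."""
    for row in rows:
        if row["Region"] == region and row["Quarter"] == quarter:
            return row[column]
    raise KeyError((region, quarter))


rows = label_rows(PAGES)
net_revenue = lookup(rows, "Northwest", "Q3", "Net Revenue")
```

A loader that processes page 2 on its own has no header to zip against, which is where the adjacent-column answers come from.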

Expected failure mode

Model returns a value from an adjacent column in the correct row — column confusion. The diagnostic distinguishes this from outright hallucination.

Methodology

Procedural ground truth. Not vibes.

01. Generated from code

Every trap is a Python generator with a deterministic seed. The PDF is constructed cell-by-cell with reportlab, and the answer key is whatever literal value the generator chose. Re-running with the same seed produces byte-identical PDFs and identical answer keys.
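The seed-to-answer-key relationship can be sketched without any PDF library. Everything below is illustrative: make_case is a hypothetical name, and the payload is a stand-in for real PDF bytes. With reportlab, byte-identical output typically also depends on suppressing embedded timestamps and document IDs (its canvas offers an invariant option for this); whether pdfhell relies on that is an assumption here.

```python
import hashlib
import random


def make_case(seed: int):
    """Return (payload_bytes, answer_key) derived only from the seed."""
    rng = random.Random(seed)                 # private RNG, no global state
    cents = rng.randrange(10_000, 1_000_000)  # total between $100.00 and $9,999.99
    answer = f"${cents // 100:,}.{cents % 100:02d}"
    payload = f"TOTAL DUE: {answer}".encode("ascii")  # stand-in for PDF bytes
    return payload, answer


def digest(seed: int) -> str:
    """Hash the payload so two runs can be compared byte for byte."""
    payload, _ = make_case(seed)
    return hashlib.sha256(payload).hexdigest()


first = make_case(1001)
second = make_case(1001)
```

Because the answer key is the literal value the generator drew, there is nothing to label after the fact: the same call yields both the document and its ground truth.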

02. Scored by string match

The headline correctness signal is contains-match (whitespace-tolerant, case-insensitive, currency-prefix-tolerant) between the model's free-text answer and the expected value. No LLM judges. No circular assurance.
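One plausible reading of that matching rule is sketched below. The normalisation choices (which currency markers are stripped, removing all whitespace) are assumptions; the actual pdfhell scorer may differ in its details.

```python
import re

# Assumed set of currency markers; the real scorer's list is not given.
_CURRENCY = re.compile(r"[$€£]|\b(?:usd|eur|gbp)\b")


def normalize(text: str) -> str:
    """Case-fold, strip currency markers, and remove all whitespace."""
    text = text.casefold()
    text = _CURRENCY.sub("", text)
    return re.sub(r"\s+", "", text)


def contains_match(answer: str, expected: str) -> bool:
    """True if the normalised expected value occurs in the normalised answer."""
    return normalize(expected) in normalize(answer)


ok = contains_match("The total due is USD 1,240.00.", "$1,240.00")
wrong = contains_match("The total is $1,420.00", "$1,240.00")
```

A contains-match is deliberately lenient about surrounding prose, at the cost that a short expected value can match inside a longer number, so answer keys need to be specific enough to avoid that.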

03. Designed failure modes

Each trap names the specific failure it elicits (e.g. 'model trusted hidden OCR over visible page'). Forbidden-answer detection records when a model fell into the designed trap vs. hallucinated some third value. The diagnostic is the product, not the score.
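The three-way outcome described above might be computed as follows. The Outcome names and the classify function are hypothetical, and the ordering (expected answer checked before forbidden answers) is an assumption about how an answer containing both would be treated.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    DESIGNED_TRAP = "designed_trap"   # model gave a forbidden answer
    OTHER_FAILURE = "other_failure"   # model gave some third value


def _norm(text: str) -> str:
    """Case-fold and remove whitespace before comparing."""
    return "".join(text.casefold().split())


def classify(answer: str, expected: str, forbidden: list[str]) -> Outcome:
    """Label an answer as correct, trapped, or wrong in some other way."""
    a = _norm(answer)
    if _norm(expected) in a:
        return Outcome.PASS
    if any(_norm(f) in a for f in forbidden):
        return Outcome.DESIGNED_TRAP
    return Outcome.OTHER_FAILURE


trapped = classify("The total due is $1,420.00", "$1,240.00", ["$1,420.00"])
```

Keeping the trap outcome separate from other failures is what lets a report say which mechanism failed, not just that the answer was wrong.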

04. QAG is the explanation, not the score

multivon-eval's DocumentGrounding (QAG-based) is available as a separate layer for users who want a human-readable 'why did this fail' breakdown. It runs after primary scoring, so a judge-model failure cannot change the pass/fail signal.

Reproducible by design

The mini suite’s 30 cases are produced from these seeds. Anyone can re-derive byte-identical PDFs with pdfhell make --trap X --seed N. No private holdout; the methodology is its own holdout.

Trap family                 Seeds          Cases
hidden_ocr_mismatch         1001 – 1010    10
footnote_override           2001 – 2010    10
split_table_across_pages    3001 – 3010    10
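The full case list follows mechanically from the seed ranges. The sketch below only builds the command strings in the form quoted above (pdfhell make --trap X --seed N); whether the command needs further arguments, such as an output path, is not stated here and is left out.

```python
# Seed ranges copied from the table; range() excludes its upper bound.
SEED_RANGES = {
    "hidden_ocr_mismatch": range(1001, 1011),
    "footnote_override": range(2001, 2011),
    "split_table_across_pages": range(3001, 3011),
}


def make_commands():
    """Return one 'pdfhell make' command string per mini-suite case."""
    return [
        f"pdfhell make --trap {trap} --seed {seed}"
        for trap, seeds in SEED_RANGES.items()
        for seed in seeds
    ]


commands = make_commands()
```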

PDF Hell on your data

Want PDF Hell on your document templates?

The OSS suite covers three trap families against generic invoices, contracts, and financial tables. If you need adversarial variants of your own MSAs / claim forms / EOBs / medical records, or a custom trap family for your vertical, drop us a line — inbound only, no fake pricing.

FAQ

How is this different from DocVQA / MMBench / SWE-Bench / etc.?

Existing document benchmarks measure correctness on clean, naturally-occurring PDFs. PDF Hell measures correctness under adversarial PDF structures that are common in production but absent in academic benchmarks (hidden OCR layers, footnote overrides, page-broken tables). It's a stress test, not a sufficient eval — pair it with a domain benchmark for full coverage.

Why no LLM-as-judge?

Because the same complexity that fools a document AI also fools a judge model. We've watched customers buy 'Evidence Pack' reports and discover the judge passed an answer that contradicted the source. PDF Hell removes the judge from the scoring path. The answer is set by code; the model's output is matched against a known string.

Can I add my own traps?

Yes. Each generator is a single Python file under pdfhell/generators/. Pattern: take a seed, draw a PDF with reportlab, return (pdf_bytes, HellCase) with the answer key. Submit a PR; if the trap surfaces a new failure mode the existing families don't catch, we'll land it.

What about voice / images / code?

PDF Hell is the first wedge: documents are the most concrete entry point. The pattern (procedural adversarial input + code-based ground truth + designed failure modes) generalises. code-hell, voice-hell, and image-hell are on the roadmap. Watch the GitHub repo for releases.

Is this Multivon's product or just a research project?

Both. PDF Hell is open-source under Apache 2.0 (the public benchmark). Multivon's commercial layer is the hosted generator (Pro, $49/mo) and the enterprise tier ($25k+/yr) with custom trap families, CI integration, and on-prem deployment.