multivon-eval is a Python framework for evaluating LLMs, agents, and RAG pipelines.
It puts a confidence interval on every number — a Wilson 95% CI on pass rate and a bootstrap 95% CI on average score — so you can tell whether a change actually moved the metric or just moved the noise. 44 evaluators across 7 tiers, one API. Judge thresholds are calibrated against human-labeled data, not guessed. Runs locally against Ollama, LM Studio, vLLM, or any OpenAI-compatible server. No account, no telemetry.
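As a rough illustration of the two intervals named above, here is a minimal standard-library sketch. It is not multivon-eval's implementation: the function names, the percentile bootstrap variant, the resample count, and the seed are all assumptions made for this example.

```python
import math
import random


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score (assumed variant)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return (lo, hi)


if __name__ == "__main__":
    # Illustrative numbers only, not results from any real run.
    print(wilson_interval(41, 50))
    print(bootstrap_mean_ci([0.9, 0.8, 0.7, 1.0, 0.6, 0.85, 0.95, 0.75, 0.8, 0.9]))
```

Two runs whose intervals overlap heavily are a hint that the difference may be noise rather than a real change.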
An eval vendor should be auditable. Every entry below links to an artifact we can’t edit after the fact — PyPI release history, GitHub issues, dated posts.
2026-05-13
Cross-framework disagreement study published: κ = 0.03
multivon-eval, DeepEval, and RAGAS — same judge, same dataset, same seed (n=50) — disagree on 56% of cases; the worst pair agrees barely better than chance. A result that indicts the category we sell in, published anyway.
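For readers unfamiliar with κ, the sketch below computes Cohen's kappa for two sets of pass/fail verdicts. It assumes the study's κ is the standard two-rater Cohen statistic, which the text does not confirm, and the verdict lists are made up for illustration rather than taken from the study.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts from two frameworks on the same four cases.
    framework_a = [1, 1, 0, 0]
    framework_b = [1, 0, 1, 0]
    print(cohens_kappa(framework_a, framework_b))
```

A kappa near 0 means the two frameworks agree about as often as two coin flips with the same pass rates would.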
A provider temperature bug had been silently scored as wrong answers; the “0% on all seven trap families” claim was an eval artifact, not a model failure. Retracted, leaderboard re-run and corrected, notice left permanently in the repo README.
A peer review caught the 0.9.4 “held-out” claim being in-distribution — the evaluator's threshold was calibrated on the same dataset it was tested on. 0.9.5 corrected the claim, 0.9.6 fixed runtime bugs the same review caught in the generated bootstrap template, 0.9.7 fixed a threshold-vs-default mismatch in the held-out reproducer. All four releases left on PyPI — yanking them would erase the record.
Determinacy gate run on real repos — failed at 20.9%
Scanner v3 (0.10.1) measured 278 prompt call sites across aider, gpt-researcher, open-interpreter, letta, and pr-agent: 20.9% statically resolvable, below the 50% gate we set ourselves. Published with the per-repo table on the epic; the runtime recorder was promoted to the priority path past the static ceiling.
Runtime recorder shipped (0.11.0) — the answer to the 20.9% ceiling
Opt-in pytest --record-prompts captures rendered prompt fingerprints at the same three SDK surfaces the static scanner reads. Three trust tiers, never collapsed: static scan proves prompt text; runtime recordings prove only the k-of-N renderings observed; templates stay honestly deferred. Case-to-site bindings are propose-only — observation replaced fabrication.
Our own pixels modality caught a bug in our own benchmark
The cross-modality run exposed that two autoresearch trap families (zero_width_space_split, unicode_confusable_total) rendered visible tofu boxes where they claimed visual normality — the “visually identical” premise was false. Both were redesigned in pdfhell 0.6.1 (adjacent-text-run fragmentation; a digit-zero T0TAL confusable) and a sixth glyph_clean validation gate now pins the invariant. Alongside it, multivon-eval 0.11.1 shipped scanner v4 with an explicit UNSCANNABLE tier — honest UNKNOWN over confident wrong.
Each package solves a piece of the AI-evaluation problem and composes with the others. All five are Apache 2.0 — three on PyPI, eval-action and eval-framework-benchmark on GitHub; multivon-guard is in private beta. No telemetry.
multivon-eval v0.12.0 ✓ Stable
the SDK
44 evaluators across deterministic, LLM-judge (QAG), agent-trace, conversation, compliance, and multimodal tiers, plus a new bootstrap CLI that proposes a tuned eval suite from your product description + sample traces. Calibrated thresholds, hash-chained audit logs, pytest plugin.
510 adversarial PDFs across 17 trap families that fool AI document readers — like an invoice whose visible total differs from its hidden text layer. The correct answer comes from code, so no AI is grading another AI. Drops straight into CI.
MCP server exposing 22 evaluation tools. Drop into Claude Desktop, Claude Code, Cursor, Cline, or OpenCode and your AI coding agent calls evals directly mid-edit.
Head-to-head benchmark vs DeepEval + RAGAS on hallucination detection. Same judge, same dataset, same seed, fully reproducible. multivon-eval F1 0.79 vs DeepEval 0.0 at defaults.
Evaluation alone isn't enough. multivon-guard catches what your AI is about to send before it hits the wire: secrets, PII, API keys. Local proxy, auto-detects Claude Code / Codex / Cursor / Aider / Continue. Now in private beta.
Bring your own API key. Local-first by default. Hash-chained audit logs ship with every run.
What's new across the stack
Recent ships
Persona simulator + gated case generation (0.12.0), the runtime prompt recorder (0.11.0), prompt-drift staleness (0.10.0), and the intelligent-eval layer underneath. See the full roadmap →
0.12.0
Persona simulator — adaptive multi-turn eval
multivon-eval simulate drives your bot through persona-driven, adaptive multi-turn conversations: a simulator LLM plays the user, reacts to each reply, and stops at a hard budget ceiling. Honesty contract built in — every transcript carries provenance and is labeled simulated, not real traffic, so simulated results can never blend into production metrics.
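The loop described above can be pictured roughly as follows. This is a hypothetical sketch, not the simulate command's code: the function signature, the transcript layout, and the use of a turn count as the budget are all assumptions (the real ceiling may be measured in tokens or cost).

```python
def simulate(bot, simulated_user, persona, max_turns=6):
    """Drive `bot` with a simulated user until the user stops or the budget is hit.

    `bot` maps a user message to a reply. `simulated_user` maps
    (persona, turns so far) to the next user message, or None to end.
    """
    # Provenance is attached up front so the transcript can never pass as real traffic.
    transcript = {"provenance": "simulated", "persona": persona, "turns": []}
    user_msg = simulated_user(persona, [])
    while user_msg is not None and len(transcript["turns"]) < max_turns:
        reply = bot(user_msg)
        transcript["turns"].append({"user": user_msg, "bot": reply})
        # The simulator sees the whole conversation, so it can react to each reply.
        user_msg = simulated_user(persona, transcript["turns"])
    return transcript


if __name__ == "__main__":
    # Stub callables stand in for the bot under test and the simulator LLM.
    echo_bot = lambda msg: "echo: " + msg
    chatty_user = lambda persona, turns: f"message {len(turns) + 1}"
    result = simulate(echo_bot, chatty_user, "impatient customer", max_turns=3)
    print(len(result["turns"]), result["provenance"])
```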
bootstrap --n-seed-cases scales seed-case generation to 500 behind duplicate and hardness gates, and the generation report states exactly what survived — "generated 500, accepted 431 — no silent caps." Rejected cases are counted and named, never silently dropped, so a thin suite can't masquerade as a big one.
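A minimal sketch of what "counted and named" rejection could look like. The case shape, the precomputed hardness score, the 0.5 cutoff, and the whitespace/case-insensitive duplicate key are hypothetical choices for illustration; the actual gates are not described in enough detail here to reproduce.

```python
def gate_cases(cases, min_hardness=0.5):
    """Filter generated cases and count every rejection by reason."""
    accepted = []
    rejected = {"duplicate": 0, "too_easy": 0}
    seen = set()
    for case in cases:
        key = " ".join(case["input"].lower().split())  # normalise case and whitespace
        if key in seen:
            rejected["duplicate"] += 1
            continue
        seen.add(key)
        if case["hardness"] < min_hardness:
            rejected["too_easy"] += 1
            continue
        accepted.append(case)
    return accepted, rejected


def report(cases, accepted, rejected):
    """One-line summary that states what survived and why the rest did not."""
    reasons = ", ".join(f"{count} {name}" for name, count in rejected.items())
    return f"generated {len(cases)}, accepted {len(accepted)} (rejected: {reasons})"


if __name__ == "__main__":
    generated = [
        {"input": "Refund my order", "hardness": 0.9},
        {"input": "refund  my ORDER", "hardness": 0.8},
        {"input": "Hello", "hardness": 0.1},
        {"input": "Cancel and rebill twice", "hardness": 0.7},
    ]
    kept, dropped = gate_cases(generated)
    print(report(generated, kept, dropped))
```

Because the rejection counts always sum with the accepted count to the generated total, a suite cannot shrink without the report showing where the cases went.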
Opt-in pytest --record-prompts captures rendered prompt fingerprints at call time — the honest path past the 20.9% static-resolvability ceiling. Three trust tiers, never collapsed: static scan proves prompt text, runtime recordings prove only the k-of-N renderings observed, templates stay honestly deferred. Case-to-site bindings are propose-only — observation, never fabrication.
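To make "rendered prompt fingerprints" concrete, here is a hypothetical sketch of a recorder that hashes each rendered prompt and tracks how many distinct renderings it has seen per call site. The class, the whitespace normalisation, the truncated SHA-256, and the tier label are assumptions for illustration, not the plugin's internals.

```python
import hashlib
from collections import defaultdict


def fingerprint(rendered_prompt: str) -> str:
    """Stable short hash of a rendered prompt, ignoring whitespace differences."""
    normalized = " ".join(rendered_prompt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


class PromptRecorder:
    """Records what was actually observed at each call site, and nothing more."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.fingerprints = defaultdict(set)

    def record(self, site: str, rendered_prompt: str) -> None:
        self.calls[site] += 1
        self.fingerprints[site].add(fingerprint(rendered_prompt))

    def summary(self, site: str) -> dict:
        return {
            "site": site,
            "calls": self.calls[site],
            "distinct_renderings": len(self.fingerprints[site]),
            "tier": "runtime-observed",  # hypothetical label, kept apart from static proof
        }


if __name__ == "__main__":
    recorder = PromptRecorder()
    recorder.record("chat.py:42", "Summarise: report A")
    recorder.record("chat.py:42", "Summarise: report B")
    recorder.record("chat.py:42", "Summarise:   report A")
    print(recorder.summary("chat.py:42"))
```

The summary only claims the renderings that occurred during the test run, which is the limit a runtime recording can honestly state.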
multivon-eval staleness diffs a live scan of your repo's prompt call sites against a committed prompt_baseline.json: CHANGED (with before/after fingerprints and the cases bound to that prompt), REMOVED, ADDED, and UNKNOWN for dynamic prompts static analysis can't read — honestly unknown, never fake-fresh. staleness stamp binds cases to the prompts they exercise; bootstrap writes the baseline automatically. Every report opens with a determinacy headline and closes with a blind-spots footer. --fail-on gates CI.
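The four statuses can be sketched as a plain dictionary diff. The mapping shape used here (call-site id to fingerprint, with None for a prompt static analysis cannot read) is an assumption for illustration; the real prompt_baseline.json schema is not shown in this text.

```python
def diff_prompts(baseline: dict, live: dict) -> dict:
    """Classify call sites as CHANGED, REMOVED, ADDED, or UNKNOWN."""
    report = {"CHANGED": [], "REMOVED": [], "ADDED": [], "UNKNOWN": []}
    for site, old in baseline.items():
        if site not in live:
            report["REMOVED"].append(site)
        elif old is None or live[site] is None:
            # A dynamic prompt on either side cannot be compared, so say so.
            report["UNKNOWN"].append(site)
        elif live[site] != old:
            report["CHANGED"].append({"site": site, "before": old, "after": live[site]})
    for site, new in live.items():
        if site not in baseline:
            report["UNKNOWN" if new is None else "ADDED"].append(site)
    return report


if __name__ == "__main__":
    baseline = {"a.py:10": "f1", "b.py:20": "f2", "c.py:5": "f3", "d.py:7": None}
    live = {"a.py:10": "f1", "b.py:20": "f9", "d.py:7": None, "e.py:1": "f5"}
    for status, entries in diff_prompts(baseline, live).items():
        print(status, entries)
```

Unchanged sites produce no entry, and an unreadable prompt is reported as UNKNOWN rather than being counted as fresh.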
multivon-eval bootstrap --product PRODUCT.md --traces TRACES.jsonl emits a tuned EvalSuite + adversarial seed cases + calibrated thresholds + a forwardable DISCOVERY_REPORT.md. Single LLM call, ~$0.12 per bootstrap. The fastest path from "I don't know what to eval" to a runnable suite — and 0.12.0's --n-seed-cases scales it to 500 gated cases.
The 0.10-era substrate under bootstrap, in one card: auto_evaluators(case) ranks evaluators heuristically in microseconds (zero LLM cost); generate_adversarial_cases targets 10 failure modes with stress_test routing metadata; an N-shot judge-noise filter keeps only validated-hard cases (+0.80 mean failure-rate separation); a local PII/secret scan redacts before any trace leaves your machine; thresholds calibrate from your own traces (p25 of baseline scores).
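The last item, calibrating a threshold as the 25th percentile of baseline scores, is simple enough to sketch with the standard library. The function name and the inclusive quantile method are assumptions; the text does not say which percentile interpolation the package uses.

```python
import statistics


def calibrate_threshold(baseline_scores):
    """Return the 25th percentile (p25) of baseline scores as the pass threshold."""
    # quantiles(n=4) returns the three quartile cut points; the first is p25.
    q1, _median, _q3 = statistics.quantiles(baseline_scores, n=4, method="inclusive")
    return q1


if __name__ == "__main__":
    # Illustrative scores from a known-good baseline run, not real data.
    scores = [0.6, 0.7, 0.8, 0.9, 1.0]
    print(calibrate_threshold(scores))
```

Under this rule roughly a quarter of known-good baseline outputs would score below the threshold, so it flags regressions relative to your own traces rather than an arbitrary fixed cutoff.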
install-skills
multivon-eval install-skills (Claude Code)
One-command installer for the three bundled Claude Code skills (eval-bootstrap, eval-audit, eval-explain). Symlinks SKILL.md files from the wheel into ~/.claude/skills/ so pip install -U multivon-eval propagates skill edits without re-running the installer.
pdfhell.research — autoresearch loop for trap discovery
An autoresearch loop (Karpathy pattern) where a rotation of Opus 4-7, GPT-5, and Gemini 2.5 Pro propose adversarial PDF traps against an 8-model eval panel. Six validation gates filter candidates before any vision-eval spend (glyph_clean added in 0.6.1 after the tofu-box bug). $88 total across two overnight runs produced 11 surviving trap families — now in mini-v4 (510 cases). The agent does not merge its own work; a human curator promotes from keep/.
The full mini-v4 suite is 510 cases across 17 trap families; the public leaderboard runs the 170-case mini-v4-sample (first 10 seeds/family). The sample surfaces three real per-trap blind spots: GPT-4o on hidden-OCR invoices (0/10), Anthropic's premium tier on a 3.5pt-footnote trap (0/10), and three models on text-run fragmentation (0/10, measured pre-0.6.1-redesign). The 2026-06-12 pixels runs traced all three to PDF ingestion, not vision.
I ship AI that reads PDFs / contracts / claims / medical records.
PDF Hell is an adversarial benchmark — 510 procedurally-generated cases across 17 trap families that break the assumptions document-AI vendors silently rely on. Six hand-authored, 11 discovered by an autoresearch loop. Code-based ground truth, not LLM-as-judge. The current leaderboard exposes per-trap blind spots in GPT-4o (hidden-OCR text-layer trust), Anthropic's premium tier (a 3.5pt-footnote trap that inverts on locally-rasterised pixels — a PDF-ingestion failure, not vision), and three different models (text-run fragmentation).
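"Code-based ground truth" means the generator already knows the right answer, so grading is a comparison, not a judgement call. The sketch below is a hypothetical grader for the invoice-style trap, assuming the correct answer is the visible total and the trap answer is the hidden text-layer total; the function name and answer parsing are illustrative and not pdfhell's API.

```python
import re
from decimal import Decimal


def grade_invoice_total(model_answer: str, visible_total: str, hidden_total: str) -> dict:
    """Grade a model's answer against totals the PDF generator wrote itself."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", model_answer)  # first number in the answer
    if match is None:
        return {"passed": False, "fell_for_trap": False}
    answer = Decimal(match.group().replace(",", ""))
    return {
        "passed": answer == Decimal(visible_total),
        "fell_for_trap": answer == Decimal(hidden_total),
    }


if __name__ == "__main__":
    # Hypothetical case: the page shows 1250.00, the hidden layer says 1520.00.
    print(grade_invoice_total("The total is $1,250.00", "1250.00", "1520.00"))
    print(grade_invoice_total("Total due: 1520", "1250.00", "1520.00"))
```

No second model is involved at grading time, so the verdict cannot drift with a judge's temperature or prompt.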
multivon-eval ships Faithfulness, ContextPrecision, and ContextRecall as QAG-graded evaluators with calibrated thresholds per judge. Pass-rate gates your CI without you babysitting an LLM judge. Bootstrap CLI generates a runnable suite + 30 adversarial seed cases + calibrated thresholds from your own traces in one command — and when your prompts change later, the staleness report tells you which cases were authored against the old ones.
Procurement needs an audit pack our enterprise customers will accept.
Hash-chained NDJSON, SHA-256 manifest, EU AI Act / NIST AI RMF / HIPAA / DPDP India mappings. Generate one from any pdfhell or multivon-eval run — free, no signup. JudgeConfig(base_url=…) routes judging through any OpenAI-compatible endpoint (Ollama, vLLM, on-prem) so production data never leaves the VPC.
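A hash chain makes an NDJSON log tamper-evident because each record's hash covers the previous record's hash. The sketch below shows the general technique with the standard library; the field names, the all-zero genesis value, and the canonical JSON encoding are assumptions, not the audit pack's actual format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first record's "prev" field


def _digest(prev: str, payload: dict) -> str:
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(lines: list, payload: dict) -> None:
    """Append one NDJSON line whose hash covers the previous line's hash."""
    prev = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    lines.append(json.dumps(record, sort_keys=True))


def verify_chain(lines: list) -> bool:
    """Recompute every hash; any edited, reordered, or dropped line breaks the chain."""
    prev = GENESIS
    for line in lines:
        record = json.loads(line)
        if record["prev"] != prev or _digest(record["prev"], record["payload"]) != record["hash"]:
            return False
        prev = record["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_record(log, {"case": "c1", "passed": True})
    append_record(log, {"case": "c2", "passed": False})
    print(verify_chain(log))
```

Verifying only needs the log itself, so a reviewer can check integrity without trusting whoever produced the run.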
My LLM coding agent is sending PII / secrets upstream.
multivon-guard is a local proxy that intercepts your agent's outbound traffic and redacts secrets before they leave your machine. Auto-detects Claude Code, Codex, Cursor, Aider, Continue. Now in private beta.
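The redaction step can be pictured as pattern substitution on outbound text before it is forwarded. This is a deliberately small sketch: the two regular expressions and the placeholder format are hypothetical examples, and multivon-guard's real detectors are not described here.

```python
import re

# Hypothetical patterns for illustration; real secret detection needs many more.
PATTERNS = {
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text: str):
    """Replace each match with a labelled placeholder and count what was removed."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED_{label}]", text)
        counts[label] = n
    return text, counts


if __name__ == "__main__":
    outbound = "use key sk-abcdefghijklmnopqrstuvwx and mail a@b.com"
    clean, found = redact(outbound)
    print(clean)
    print(found)
```

Returning the counts alongside the cleaned text lets the proxy log that something was removed without logging the secret itself.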
The five public packages above are Apache 2.0 and free forever. They share the same engine (multivon-eval) and the same methodology (QAG + code-based ground truth). multivon-guard is in private beta — email for access.
Statistical rigor + procurement-grade artifacts, by default.
Frameworks that show a single number lie by omission. Multivon’s default output ships with confidence intervals, power analysis, and a tamper-evident audit trail — so a shipping decision actually corresponds to real signal.
One real run on the left tab — GPT-4o vs the current pdfhell mini-v4-sample suite (170 cases, 17 trap families) — plus three illustrative shapes for faithfulness, experiment comparison, and plain-English checks. Real numbers and reproducer commands for every model we've scored live on the leaderboard.
JUnit XML output renders natively in every CI runner. A 10-line YAML wires PDF Hell into a pull-request gate. multivon-eval’s pytest plugin slots into your existing test suite without a special harness.
✓ JUnit XML renders inline in the GitHub Actions / GitLab / Jenkins PR test panel
✓ --fail-threshold 0.85 gates the build on pass-rate; distinct exit codes for quality vs infra failures
✓ --audit-pack out.zip bundles every test PDF, answer key, and a SHA-256 manifest into a downloadable ZIP; attach it to a procurement diligence appendix without post-processing
✓ Cost-controlled: run smoke (3 cases, ~$0.001) on every PR, full mini suite on merge to main, full custom suite nightly
✓ multivon-eval staleness --format markdown drops a prompt-drift summary into $GITHUB_STEP_SUMMARY; warn-only by default, --fail-on changed,removed when you want a gate. How staleness works →
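The pass-rate gate with separate exit codes for quality and infrastructure failures can be sketched as below. The exit code values, the use of None to mark a case that errored before it could be scored, and the rule that any unscored case counts as an infra failure are all assumptions for illustration; the CLI's actual codes are not listed here.

```python
import sys

# Hypothetical exit codes; check the CLI's own documentation for the real values.
EXIT_OK, EXIT_QUALITY, EXIT_INFRA = 0, 1, 2


def gate(results, fail_threshold=0.85):
    """Map per-case results to an exit code.

    `results` holds True/False per scored case, or None when a case could not
    be scored (for example a provider timeout).
    """
    if not results or any(r is None for r in results):
        return EXIT_INFRA  # nothing trustworthy to compare against the threshold
    pass_rate = sum(results) / len(results)
    return EXIT_OK if pass_rate >= fail_threshold else EXIT_QUALITY


if __name__ == "__main__":
    outcomes = [True] * 9 + [False]  # illustrative: 90% pass rate
    sys.exit(gate(outcomes))
```

Keeping the two failure kinds apart lets CI retry an infra failure automatically while treating a quality failure as a real regression.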
Three of the public packages are on PyPI today; eval-action and eval-framework-benchmark are on GitHub. Stars and issues welcome on GitHub. For custom adversarial trap families, on-prem deployment, or paid integration support, see /commercial (inbound only, no fake tiers).
Real run · reproducible · openai:gpt-4o · mini-v4-sample
GPT-4o still falls for the hidden-OCR trap, 10 out of 10
Run the current pdfhell mini-v4-sample suite (170 procedurally generated PDFs, 17 trap families) against gpt-4o. The model passes 81% of the suite overall but never catches the hidden-OCR mismatch: the visible page text says one number, the buried OCR layer says another, and the model returns the OCR layer every time. The numbers below are a real run: same suite, same seeds, reproducible.