Apache 2.0 · Python · local-first · v0.12.0

multivon-eval is a Python framework for evaluating LLMs, agents, and RAG pipelines.

It puts a confidence interval on every number — a Wilson 95% CI on pass rate and a bootstrap 95% CI on average score — so you can tell whether a change actually moved the metric or just moved the noise. 44 evaluators across 7 tiers, one API. Judge thresholds are calibrated against human-labeled data, not guessed. Runs locally against Ollama, LM Studio, vLLM, or any OpenAI-compatible server. No account, no telemetry.

Read the Quickstart →

No signup · Local-first · First eval in ~3 min

multivon-eval · ~3s · no API key (offline demo)
$ pip install multivon-eval
$ python -m multivon_eval demo

  multivon-eval demo · customer support bot
  6 cases · deterministic tier · no API key

  Pass Rate: 83.3%  [44%–97%  Wilson 95% CI]
  Avg Score: 0.83  [0.50–1.00  bootstrap 95% CI]
  Score dist:  p10 0.50  ·  p50 1.00  ·  p90 1.00

  ⚡ Power warning: 6 cases — min detectable change
     at 80% power is ~74%. Add cases to catch less.
→ a CI on every number · real offline run · no key
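The two intervals in the demo output can be sketched with the standard library alone. This is a minimal illustration of a Wilson score interval and a percentile bootstrap, not multivon-eval's own implementation: the function names, the resample count, the seed, and the six scores (five passes, one fail, chosen to match the 83.3% pass rate) are all assumptions.

```python
import random
from statistics import NormalDist, mean


def wilson_interval(passes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate."""
    if n == 0:
        raise ValueError("need at least one case")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * ((p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return max(0.0, center - half), min(1.0, center + half)


def bootstrap_mean_interval(scores, confidence=0.95, resamples=10_000, seed=0):
    """Percentile bootstrap interval for the mean score."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(resamples))
    alpha = (1 - confidence) / 2
    lo = means[int(alpha * resamples)]
    hi = means[min(resamples - 1, int((1 - alpha) * resamples))]
    return lo, hi


# Hypothetical scores: 5 passes and 1 fail out of 6 cases.
scores = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
lo, hi = wilson_interval(passes=5, n=6)
print(f"Pass rate {5 / 6:.1%}  [{lo:.0%}-{hi:.0%}  Wilson 95% CI]")
b_lo, b_hi = bootstrap_mean_interval(scores)
print(f"Avg score {mean(scores):.2f}  [{b_lo:.2f}-{b_hi:.2f}  bootstrap 95% CI]")
```

With only six cases the Wilson interval spans roughly 44% to 97%, which is why the demo prints a power warning: a suite this small cannot separate a real change from noise.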

The track record

An eval vendor should be auditable. Every entry below links to an artifact we can’t edit after the fact — PyPI release history, GitHub issues, dated posts.

  1. 2026-05-13

    Cross-framework disagreement study published: κ = 0.03

    multivon-eval, DeepEval, and RAGAS — same judge, same dataset, same seed (n=50) — disagree on 56% of cases; the worst pair agrees barely better than chance. A result that indicts the category we sell in, published anyway.

  2. 2026-05-24

    pdfhell Opus 4-7 headline finding retracted

    A provider temperature bug had been silently scored as wrong answers; the “0% on all seven trap families” claim was an eval artifact, not a model failure. Retracted, leaderboard re-run and corrected, notice left permanently in the repo README.

  3. 2026-06-03

    Four releases in one day: 0.9.4 → 0.9.7

    A peer review caught the 0.9.4 “held-out” claim being in-distribution — the evaluator's threshold was calibrated on the same dataset it was tested on. 0.9.5 corrected the claim, 0.9.6 fixed runtime bugs the same review caught in the generated bootstrap template, 0.9.7 fixed a threshold-vs-default mismatch in the held-out reproducer. All four releases left on PyPI — yanking them would erase the record.

  4. 2026-06-11

    Determinacy gate run on real repos — failed at 20.9%

    Scanner v3 (0.10.1) measured 278 prompt call sites across aider, gpt-researcher, open-interpreter, letta, and pr-agent: 20.9% statically resolvable, below the 50% gate we set ourselves. Published with the per-repo table on the epic; the runtime recorder was promoted to the priority path past the static ceiling.

  5. 2026-06-11

    Runtime recorder shipped (0.11.0) — the answer to the 20.9% ceiling

    Opt-in pytest --record-prompts captures rendered prompt fingerprints at the same three SDK surfaces the static scanner reads. Three trust tiers, never collapsed: static scan proves prompt text; runtime recordings prove only the k-of-N renderings observed; templates stay honestly deferred. Case-to-site bindings are propose-only — observation replaced fabrication.

  6. 2026-06-12

    Our own pixels modality caught a bug in our own benchmark

    The cross-modality run exposed that two autoresearch trap families (zero_width_space_split, unicode_confusable_total) rendered visible tofu boxes where they claimed visual normality — the “visually identical” premise was false. Both were redesigned in pdfhell 0.6.1 (adjacent-text-run fragmentation; a digit-zero T0TAL confusable) and a sixth glyph_clean validation gate now pins the invariant. Alongside it, multivon-eval 0.11.1 shipped scanner v4 with an explicit UNSCANNABLE tier — honest UNKNOWN over confident wrong.
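The κ figure in the first entry is a chance-corrected agreement statistic. As a rough sketch, assuming each framework's output is reduced to one pass/fail verdict per case and that agreement is measured pairwise, Cohen's kappa can be computed as below. The function name and the example verdicts are hypothetical and are not taken from the published study.

```python
from collections import Counter


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two raters' binary labels on the same cases."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in (True, False)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two frameworks on the same four cases.
framework_a = [True, True, False, False]
framework_b = [True, False, True, False]
print(cohens_kappa(framework_a, framework_b))  # agreement no better than chance
```

A kappa near 0 means the two raters agree about as often as independent labeling would produce, which is how a pair can agree on a fair share of cases and still score close to zero.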

Five Apache 2.0 packages + one in private beta.

Each package solves a piece of the AI-evaluation problem and composes with the others. All five are Apache 2.0 — three on PyPI, eval-action and eval-framework-benchmark on GitHub; multivon-guard is in private beta. No telemetry.

multivon-eval v0.12.0 Stable

the SDK

44 evaluators across deterministic, LLM-judge (QAG), agent-trace, conversation, compliance, multimodal, and consistency tiers, plus a new bootstrap CLI that proposes a tuned eval suite from your product description + sample traces. Calibrated thresholds, hash-chained audit logs, pytest plugin.

pip install multivon-eval
Explore the SDK →
pdfhell v0.6.1 Early Preview

the benchmark

510 adversarial PDFs across 17 trap families that fool AI document readers — like an invoice whose visible total differs from its hidden text layer. The correct answer comes from code, so no AI is grading another AI. Drops straight into CI.

pip install pdfhell
PDF Hell docs →
multivon-mcp v0.3.1 Early Preview

the agent surface

MCP server exposing 22 evaluation tools. Drop into Claude Desktop, Claude Code, Cursor, Cline, or OpenCode and your AI coding agent calls evals directly mid-edit.

pip install multivon-mcp
Wire it up →
eval-action v1.0.1 Stable

the CI gate

GitHub Action that runs multivon-eval on PRs, posts a diff comment with regressions, and gates merges on safety-class failures.

uses: multivon-ai/eval-action@v1
Install from Marketplace →
eval-framework-benchmark v0.1.0 Stable

the receipts

Head-to-head benchmark vs DeepEval + RAGAS on hallucination detection. Same judge, same dataset, same seed, fully reproducible. multivon-eval F1 0.79 vs DeepEval 0.0 at defaults.

git clone https://github.com/multivon-ai/eval-framework-benchmark
See how we compare →
multivon-guard Early Access

the runtime gatekeeper

Evaluation alone isn't enough. multivon-guard catches what your AI is about to send before it hits the wire: secrets, PII, API keys. Local proxy, auto-detects Claude Code / Codex / Cursor / Aider / Continue. Now in private beta.

mailto:hello@multivon.ai
Request early access →

Bring your own API key. Local-first by default. Hash-chained audit logs ship with every run.

What's new across the stack

Recent ships

Persona simulator + gated case generation (0.12.0), the runtime prompt recorder (0.11.0), prompt-drift staleness (0.10.0), and the intelligent-eval layer underneath. See the full roadmap →

0.12.0

Persona simulator — adaptive multi-turn eval

multivon-eval simulate drives your bot through persona-driven, adaptive multi-turn conversations: a simulator LLM plays the user, reacts to each reply, and stops at a hard budget ceiling. Honesty contract built in — every transcript carries provenance and is labeled simulated, not real traffic, so simulated results can never blend into production metrics.

Read more →
0.12.0

Scaled, gated case generation

bootstrap --n-seed-cases scales seed-case generation to 500 behind duplicate and hardness gates, and the generation report states exactly what survived — "generated 500, accepted 431 — no silent caps." Rejected cases are counted and named, never silently dropped, so a thin suite can't masquerade as a big one.

Read more →
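The "counted and named, never silently dropped" idea can be sketched as a filter that records a reason for every rejection. This is an illustration only: the case fields (input, hardness), the normalization rule, the 0.5 cut-off, and the rejection labels are hypothetical and do not describe the actual gates behind bootstrap --n-seed-cases.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse case and whitespace so near-identical inputs compare equal."""
    return " ".join(text.lower().split())


def gate_cases(candidates: list[dict], min_hardness: float = 0.5):
    """Return (accepted cases, rejection counts keyed by reason)."""
    accepted, rejected, seen = [], Counter(), set()
    for case in candidates:
        key = normalize(case["input"])
        if key in seen:
            rejected["duplicate"] += 1   # duplicate gate
        elif case["hardness"] < min_hardness:
            rejected["too_easy"] += 1    # hardness gate (hypothetical score)
        else:
            seen.add(key)
            accepted.append(case)
    return accepted, rejected


candidates = [
    {"input": "Refund my order", "hardness": 0.9},
    {"input": "refund  my ORDER", "hardness": 0.8},  # duplicate after normalizing
    {"input": "hi", "hardness": 0.1},                # below the hardness cut-off
]
accepted, rejected = gate_cases(candidates)
print(f"generated {len(candidates)}, accepted {len(accepted)}, rejected {dict(rejected)}")
```

Keeping the rejection counts next to the accepted list is what lets a report state exactly how many candidates survived and why the rest did not.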
0.11.0

Runtime prompt recorder

Opt-in pytest --record-prompts captures rendered prompt fingerprints at call time — the honest path past the 20.9% static-resolvability ceiling. Three trust tiers, never collapsed: static scan proves prompt text, runtime recordings prove only the k-of-N renderings observed, templates stay honestly deferred. Case-to-site bindings are propose-only — observation, never fabrication.

Read more →
0.10.0

Evals drift as code changes — now you can see it

multivon-eval staleness diffs a live scan of your repo's prompt call sites against a committed prompt_baseline.json: CHANGED (with before/after fingerprints and the cases bound to that prompt), REMOVED, ADDED, and UNKNOWN for dynamic prompts static analysis can't read — honestly unknown, never fake-fresh. staleness stamp binds cases to the prompts they exercise; bootstrap writes the baseline automatically. Every report opens with a determinacy headline and closes with a blind-spots footer. --fail-on gates CI.

Read more →
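The staleness categories amount to a diff between two fingerprint maps. The sketch below assumes a simplified shape, a dict from call-site name to fingerprint with None for prompts static analysis cannot read. That shape is not the real prompt_baseline.json schema, and the UNCHANGED label and the function names are hypothetical.

```python
import hashlib
from typing import Optional


def fingerprint(prompt_text: str) -> str:
    """Short SHA-256 fingerprint of a prompt string."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def diff_sites(baseline: dict[str, Optional[str]],
               live: dict[str, Optional[str]]) -> dict[str, str]:
    """Classify every call site seen in either the baseline or the live scan."""
    report = {}
    for site in sorted(baseline.keys() | live.keys()):
        if site not in live:
            report[site] = "REMOVED"
        elif site not in baseline:
            report[site] = "ADDED"
        elif baseline[site] is None or live[site] is None:
            report[site] = "UNKNOWN"    # dynamic prompt: never reported as fresh
        elif baseline[site] != live[site]:
            report[site] = "CHANGED"
        else:
            report[site] = "UNCHANGED"  # hypothetical label for a matching fingerprint
    return report


baseline = {"support.py:42": fingerprint("You are a support bot."), "search.py:7": None}
live = {"support.py:42": fingerprint("You are a terse support bot."), "search.py:7": None}
print(diff_sites(baseline, live))
```

Checking for None before comparing fingerprints is the design point: two unreadable prompts would otherwise compare equal and look unchanged.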
bootstrap

Cold-start eval generator

multivon-eval bootstrap --product PRODUCT.md --traces TRACES.jsonl emits a tuned EvalSuite + adversarial seed cases + calibrated thresholds + a forwardable DISCOVERY_REPORT.md. Single LLM call, ~$0.12 per bootstrap. The fastest path from "I don't know what to eval" to a runnable suite — and 0.12.0's --n-seed-cases scales it to 500 gated cases.

Read more →
intelligent-eval

The intelligent-eval layer

The 0.10-era substrate under bootstrap, in one card: auto_evaluators(case) ranks evaluators heuristically in microseconds (zero LLM cost); generate_adversarial_cases targets 10 failure modes with stress_test routing metadata; an N-shot judge-noise filter keeps only validated-hard cases (+0.80 mean failure-rate separation); a local PII/secret scan redacts before any trace leaves your machine; thresholds calibrate from your own traces (p25 of baseline scores).
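Calibrating a threshold at the 25th percentile of baseline scores is simple to sketch. The function name and the sample scores are hypothetical, and the interpolation rule (the standard library's "inclusive" method) is an assumption; the SDK may define the percentile differently.

```python
from statistics import quantiles


def calibrate_threshold(baseline_scores: list[float]) -> float:
    """Return the 25th percentile of baseline scores as a pass threshold."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline scores")
    # n=4 yields the three quartile cut points; the first one is p25.
    return quantiles(baseline_scores, n=4, method="inclusive")[0]


# Hypothetical judge scores from traces of a known-good baseline.
baseline = [0.2, 0.4, 0.6, 0.8, 1.0]
print(calibrate_threshold(baseline))
```

Anchoring the threshold to the lower quartile of your own traces means about three quarters of known-good outputs pass by construction, so a drop in pass rate signals a shift from that baseline.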

install-skills

multivon-eval install-skills (Claude Code)

One-command installer for the three bundled Claude Code skills (eval-bootstrap, eval-audit, eval-explain). Symlinks SKILL.md files from the wheel into ~/.claude/skills/ so pip install -U multivon-eval propagates skill edits without re-running the installer.

Read more →
research

pdfhell.research — autoresearch loop for trap discovery

An autoresearch loop (Karpathy pattern) where a rotation of Opus 4-7, GPT-5, and Gemini 2.5 Pro propose adversarial PDF traps against an 8-model eval panel. Six validation gates filter candidates before any vision-eval spend (glyph_clean added in 0.6.1 after the tofu-box bug). $88 total across two overnight runs produced 11 surviving trap families — now in mini-v4 (510 cases). The agent does not merge its own work; a human curator promotes from keep/.

Read more →
mini-v4

mini-v4 leaderboard ships

The full mini-v4 suite is 510 cases across 17 trap families; the public leaderboard runs the 170-case mini-v4-sample (first 10 seeds/family). The sample surfaces three real per-trap blind spots: GPT-4o on hidden-OCR invoices (0/10), Anthropic's premium tier on a 3.5pt-footnote trap (0/10), and three models on text-run fragmentation (0/10, measured pre-0.6.1-redesign). The 2026-06-12 pixels runs traced all three to PDF ingestion, not vision.

Read more →

What are you trying to evaluate?

Document AI

I ship AI that reads PDFs / contracts / claims / medical records.

PDF Hell is an adversarial benchmark — 510 procedurally-generated cases across 17 trap families that break the assumptions document-AI vendors silently rely on. Six hand-authored, 11 discovered by an autoresearch loop. Code-based ground truth, not LLM-as-judge. The current leaderboard exposes per-trap blind spots in GPT-4o (hidden-OCR text-layer trust), Anthropic's premium tier (a 3.5pt-footnote trap that inverts on locally-rasterised pixels — a PDF-ingestion failure, not vision), and three different models (text-run fragmentation).

RAG / faithfulness

My RAG model hallucinates from retrieved context.

multivon-eval ships Faithfulness, ContextPrecision, and ContextRecall as QAG-graded evaluators with calibrated thresholds per judge. Pass-rate gates your CI without you babysitting an LLM judge. Bootstrap CLI generates a runnable suite + 30 adversarial seed cases + calibrated thresholds from your own traces in one command — and when your prompts change later, the staleness report tells you which cases were authored against the old ones.

Agents

I need to grade my agent's tool calls and trajectories.

8 agent-native evaluators, including ToolCallAccuracy, ToolCallNecessity, TrajectoryEfficiency, StepFaithfulness, and AgentMemoryEval. LangGraph + OpenAI Agents SDK tracers wired natively. Plus 22 MCP-callable tools so Claude Code / Cursor / Cline run the eval mid-edit.

Compliance

Procurement needs an audit pack our enterprise customers will accept.

Hash-chained NDJSON, SHA-256 manifest, EU AI Act / NIST AI RMF / HIPAA / DPDP India mappings. Generate one from any pdfhell or multivon-eval run — free, no signup. JudgeConfig(base_url=…) routes judging through any OpenAI-compatible endpoint (Ollama, vLLM, on-prem) so production data never leaves the VPC.
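A hash-chained NDJSON log can be sketched in a few lines: each entry stores the hash of the previous entry, so editing any earlier line breaks every hash after it. The field names (prev, record, hash), the all-zero genesis value, and the helper names below are assumptions for illustration, not the format multivon-eval writes.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(lines: list[str], record: dict) -> None:
    """Append one NDJSON line that is chained to the line before it."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    entry = {"prev": prev_hash, "record": record, "hash": _digest(prev_hash, record)}
    lines.append(json.dumps(entry, sort_keys=True))


def verify_chain(lines: list[str]) -> bool:
    """Recompute every hash in order; any edit to an earlier line fails the check."""
    prev_hash = GENESIS
    for line in lines:
        entry = json.loads(line)
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[str] = []
append_entry(log, {"case": "refund-01", "score": 1.0})
append_entry(log, {"case": "refund-02", "score": 0.5})
print(verify_chain(log))
```

Serializing with sorted keys and fixed separators matters here: the verifier must rebuild byte-identical input, or an untouched log would fail its own check.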

Agent safety

My LLM coding agent is sending PII / secrets upstream.

multivon-guard is a local proxy that intercepts your agent's outbound traffic and redacts secrets before they leave your machine. Auto-detects Claude Code, Codex, Cursor, Aider, Continue. Now in private beta.

Request early access →

The five public packages above are Apache 2.0 and free forever. They share the same engine (multivon-eval) and the same methodology (QAG + code-based ground truth). multivon-guard is in private beta — email for access.

44 evaluators across 7 categories.

Pick the ones your use case needs, ignore the rest. All Apache 2.0, all available via Python SDK and MCP.

Deterministic

14

No LLM judge needed. Cheap, fast, reproducible.

NotEmpty · ExactMatch · RegexMatch · JSONSchema · +6 more

LLM-as-judge

13

QAG decomposition. Wilson 95% CIs on every aggregate. Thresholds calibrated against human labels; inter-judge agreement κ 0.60-0.80.

Faithfulness · Hallucination · Relevance · AnswerAccuracy · +4 more

Agent

8

Trajectory-level scoring for multi-step agents.

ToolCallAccuracy · PlanQuality · TaskCompletion · StepFaithfulness · +2 more

Compliance

2

Procurement-grade checks. PII redaction + schema enforcement.

PIIEvaluator · SchemaEvaluator

Multimodal

2

Image-grounded faithfulness + document QA scoring.

VQAFaithfulness · DocumentGrounding

Conversation

4

Multi-turn dialog quality, not just single-turn answers.

ConversationRelevance · KnowledgeRetention · ConversationCompleteness · TurnConsistency

Consistency

1

Run-to-run stability across N samples.

SelfConsistency

Full list: docs.multivon.ai/evaluators · agent-callable via MCP

What does an actual eval look like?

One real run on the left tab — GPT-4o vs the current pdfhell mini-v4-sample suite (170 cases, 17 trap families) — plus three illustrative shapes for faithfulness, experiment comparison, and plain-English checks. Real numbers and reproducer commands for every model we've scored live on the leaderboard.

Drop into the CI you already have.

JUnit XML output renders natively in every CI runner. A short YAML workflow wires PDF Hell into a pull-request gate. multivon-eval’s pytest plugin slots into your existing test suite without a special harness.

  • JUnit XML renders inline in the GitHub Actions / GitLab / Jenkins PR test panel
  • --fail-threshold 0.85 gates the build on pass-rate; distinct exit codes for quality vs infra failures
  • --audit-pack out.zip bundles every test PDF, answer key, and a SHA-256 manifest into a downloadable ZIP — attach to a procurement diligence appendix without post-processing
  • Cost-controlled — run smoke (3 cases, ~$0.001) on every PR, full mini suite on merge to main, full custom suite nightly
  • multivon-eval staleness --format markdown drops a prompt-drift summary into $GITHUB_STEP_SUMMARY — warn-only by default, --fail-on changed,removed when you want a gate. How staleness works →
  • No telemetry, local-first, zero data egress. See security & data-handling →
# .github/workflows/pdfhell.yml
name: PDF Hell
on: [pull_request]

jobs:
  pdfhell:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      # Block scalar (|) keeps the backslash line continuations intact for the shell
      - run: |
          uvx pdfhell run \
            --model anthropic:claude-sonnet-4-6 \
            --suite mini \
            --junit results.xml \
            --audit-pack audit.zip \
            --fail-threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: pdfhell
          path: |
            results.xml
            audit.zip
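For readers wiring their own gate around the JUnit output, the pass-rate check with separate exit codes might look like the sketch below. It is not how pdfhell implements --fail-threshold: the exit-code values, the choice to treat <error> elements as infrastructure failures, and the function name are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical exit codes: quality and infra failures stay distinguishable in CI.
EXIT_OK, EXIT_QUALITY, EXIT_INFRA = 0, 1, 2


def gate(junit_xml: str, fail_threshold: float) -> int:
    """Return an exit code from a JUnit XML report and a pass-rate threshold."""
    root = ET.fromstring(junit_xml)
    cases = list(root.iter("testcase"))
    if not cases:
        return EXIT_INFRA  # nothing ran, so there is no pass rate to judge
    if any(case.find("error") is not None for case in cases):
        return EXIT_INFRA  # assumption: <error> marks a crash, not a wrong answer
    failures = sum(1 for case in cases if case.find("failure") is not None)
    pass_rate = (len(cases) - failures) / len(cases)
    return EXIT_OK if pass_rate >= fail_threshold else EXIT_QUALITY


report = """
<testsuite name="pdfhell" tests="4">
  <testcase name="case-1"/>
  <testcase name="case-2"/>
  <testcase name="case-3"/>
  <testcase name="case-4"><failure message="wrong total"/></testcase>
</testsuite>
"""
print(gate(report, fail_threshold=0.85))  # 3 of 4 pass, below 0.85
```

Separating the two failure kinds lets a pipeline retry on infrastructure problems while still blocking the merge on a genuine quality regression.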
Open-source momentum · 7 GitHub stars · 4 contributors

Apache 2.0. Free forever. No telemetry.

Three of the five public packages are on PyPI today; eval-action and eval-framework-benchmark live on GitHub. Stars and issues welcome on GitHub. For custom adversarial trap families, on-prem deployment, or paid integration support, see /commercial (inbound only, no fake tiers).