The open-source SDK that powers PDF Hell — with 44 evaluators for the rest of what your AI ships.
multivon-eval is the Python framework that powers PDF Hell, the public leaderboard at /leaderboard, and the audit packs every paid customer downloads. It ships 44 evaluators across seven categories — deterministic, LLM-judge (QAG), agent-trace, compliance, multimodal, conversation, and consistency — plus a calibration system, cost tracking, hash-chained audit logs, and a pytest plugin. Apache 2.0 on PyPI; no telemetry, no signup.
New in 0.8.0
The fastest path: multivon-eval bootstrap
Hand over a one-paragraph product description and a few sample traces — the bootstrap CLI infers your product shape, picks 4-6 evaluators tuned to it, calibrates thresholds from your traces, and emits a runnable suite plus 30 adversarial seed cases. PII is redacted locally before any LLM call. Total cost: ~$0.01 per bootstrap.
1. Install
2. Bootstrap from your product + traces
3. Review the discovery report, then run the generated suite
Outputs four files: eval_suite.py (runnable suite), seed_cases.jsonl (30 adversarial cases), thresholds.yaml (calibrated from your traces), and DISCOVERY_REPORT.md (a forwardable eval design review).
Or scaffold from a template
If you'd rather write your own cases from a runnable starter, the init scaffolder gives you a clean blueprint.
1. Scaffold a starter project (pick a template)
2. Run it
Templates: quickstart, rag, agent, agent-langgraph, agent-openai-sdk, regulated, conversation. Each is a runnable starter — eval.py, requirements.txt, optional CI workflow.
44 evaluators across 7 categories
Pick the category that fits your output. Deterministic for cheap, fast gates. QAG when you need an LLM judge but don't want vibes-based scoring. Agent for tool-use evaluation. Conversation for dialogue. Compliance + multimodal for regulated / document-AI workflows.
Deterministic
14 evaluators
Pure-Python checks that don't call an LLM. Cheap to run, cheap to gate CI on. Use as a first-pass filter or for outputs that have a single correct answer.
▸ ExactMatch, Contains, RegexMatch, StartsWith
▸ JSONSchemaEval (Pydantic validation)
▸ BLEU, ROUGE, BERTScore, Levenshtein, ChrfScore
▸ Latency, MaxLatency, WordCount, NotEmpty
LLM-judge (QAG)
13 evaluators
Question-Answer-Generation: instead of asking an LLM to rate 1–10 (noisy), generate yes/no questions about the output and score by fraction answered correctly. Calibrated thresholds per judge model.
▸ Faithfulness, Hallucination, Relevance
▸ AnswerAccuracy, Coherence
▸ ContextPrecision, ContextRecall
▸ Toxicity, Bias, Summarization
▸ CustomRubric, GEval, CheckEvaluator
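The QAG idea described above can be sketched in a few lines. This is a minimal illustration, not multivon-eval's implementation: `qag_score`, the `verdicts` table, and the 0.7 threshold are all hypothetical, and the lookup table stands in for a real LLM judge answering each yes/no question.

```python
from typing import Callable, List


def qag_score(questions: List[str], answer_fn: Callable[[str], bool]) -> float:
    """Return the fraction of yes/no questions the judge answers 'yes'."""
    if not questions:
        raise ValueError("need at least one question")
    yes_count = sum(1 for question in questions if answer_fn(question))
    return yes_count / len(questions)


# Hypothetical stand-in for an LLM judge: a fixed table of verdicts.
verdicts = {
    "Does the output state that the agreement renews annually?": True,
    "Is every claim in the output supported by the context?": True,
    "Does the output avoid inventing a notice period?": False,
    "Does the output answer the question that was asked?": True,
}

score = qag_score(list(verdicts), verdicts.__getitem__)
passed = score >= 0.7  # illustrative threshold, not a calibrated one
```

Because each question is binary, the score is a simple ratio, which tends to be easier to calibrate than a free-form 1–10 rating.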
Agent-trace
8 evaluators
Evaluate the trajectory, not just the final answer. Pairs with LangGraph + OpenAI Agents SDK tracers. Did the agent call the right tools, in the right order, with the right arguments?
Conversation
4 evaluators
Multi-turn dialogue evaluators. Useful for support chatbots and any system where context accumulates across turns.
▸ ConversationRelevance, ConversationCompleteness
▸ KnowledgeRetention, TurnConsistency
Compliance
2 evaluators
Audit-grade evaluators that run locally — no LLM judge needed. PII detection with HIPAA, GDPR, CCPA, and DPDP India jurisdictions. JSON Schema validation for structured outputs.
▸ PIIEvaluator (regex, zero egress)
▸ SchemaEvaluator (Pydantic / JSON Schema)
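To make "regex, zero egress" concrete, here is a rough sketch of local, regex-only PII detection. The patterns and the `find_pii` helper are assumptions for illustration and are not PIIEvaluator's actual rules; the PAN and IFSC patterns follow the publicly documented formats of those identifiers.

```python
import re
from typing import Dict, List

# Illustrative patterns only; a production detector needs many more.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),   # 5 letters, 4 digits, 1 letter
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),   # 4 letters, a zero, 6 alphanumerics
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every match per PII type; nothing leaves the process."""
    hits = {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


sample = "Refund to ABCDE1234F via HDFC0001234, receipt to jane@example.com."
found = find_pii(sample)
```

Everything runs in-process with the standard library, so no text is sent to a model or external service.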
Multimodal
2 evaluators
Image-grounded and document-grounded faithfulness scoring — the eval layer behind the PDF Hell benchmark. Pairs with any vision model that can accept image inputs.
▸ VQAFaithfulness (image-grounded QAG)
▸ DocumentGrounding (multi-page docs)
Consistency
1 evaluator
Run-to-run stability across N samples — flags non-determinism in stochastic models or pipelines.
▸ SelfConsistency
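One simple way to measure run-to-run stability is the share of N samples that agree with the most common output. The sketch below assumes that definition; `self_consistency` and `flaky_model` are hypothetical names, and the SDK's SelfConsistency evaluator may score differently.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Hashable


def self_consistency(model: Callable[[str], Hashable], prompt: str, n: int = 5) -> float:
    """Sample the model n times and return the share of the modal output."""
    if n < 1:
        raise ValueError("n must be at least 1")
    outputs = [model(prompt) for _ in range(n)]
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / n


# Hypothetical stand-in for a stochastic model: replays a fixed answer sequence.
_answers = cycle(["annual", "annual", "monthly", "annual", "annual"])


def flaky_model(prompt: str) -> str:
    return next(_answers)


rate = self_consistency(flaky_model, "What is the renewal period?", n=5)
```

A rate of 1.0 means every sample agreed; lower values flag non-determinism worth investigating.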
What people use it for
RAG faithfulness
Verify the model didn't invent claims not in the retrieved context.
from multivon_eval import EvalSuite, EvalCase, Faithfulness

suite = EvalSuite("rag-faithfulness")
suite.add_evaluators(Faithfulness())
suite.add_cases([EvalCase(
    input="What is the renewal period?",
    context="The agreement renews annually unless terminated.",
)])

# fail_threshold exits 1 in CI if pass_rate < 0.95
report = suite.run(my_rag_model, fail_threshold=0.95)
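The pass-rate gate mentioned in the comment above can be approximated as follows. This is a sketch of the logic only: `ci_exit_code` is a hypothetical helper, and how the SDK actually computes `pass_rate` or exits is not shown in this page.

```python
from typing import List


def ci_exit_code(case_results: List[bool], fail_threshold: float) -> int:
    """Return 1 if the pass rate falls below fail_threshold, else 0."""
    if not case_results:
        raise ValueError("no cases were run")
    pass_rate = sum(case_results) / len(case_results)
    return 1 if pass_rate < fail_threshold else 0


# 19 of 20 cases pass: exactly at the threshold, so the gate stays green.
at_threshold = ci_exit_code([True] * 19 + [False], fail_threshold=0.95)
# 18 of 20 cases pass: below the threshold, so CI should fail.
below_threshold = ci_exit_code([True] * 18 + [False] * 2, fail_threshold=0.95)
```

In a CI script the returned value would typically be passed to `sys.exit`.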
Agent trajectory eval
Did the agent call the right tool, in the right order, with the right args?
from multivon_eval import EvalSuite, ToolCallAccuracy, ToolCallNecessity
from multivon_eval.integrations import LangGraphTracer

suite = EvalSuite("agent-trajectory")
suite.add_evaluators(
    ToolCallAccuracy(),
    ToolCallNecessity(),  # penalises redundant tool calls
)
report = suite.run(my_agent, tracer=LangGraphTracer())
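For intuition, a trajectory check can be as simple as comparing expected and actual tool calls position by position. The sketch below assumes that strict definition; `ToolCall` and `trajectory_accuracy` are hypothetical names, and the SDK's ToolCallAccuracy may use a different matching rule.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def trajectory_accuracy(expected: List[ToolCall], actual: List[ToolCall]) -> float:
    """Share of steps where tool name and arguments match, in order.

    Dividing by the longer trajectory also penalises extra or missing calls.
    """
    if not expected and not actual:
        return 1.0
    matched = sum(1 for want, got in zip(expected, actual) if want == got)
    return matched / max(len(expected), len(actual))


expected = [
    ToolCall("search", {"query": "renewal period"}),
    ToolCall("fetch", {"doc_id": 7}),
]
actual = [
    ToolCall("search", {"query": "renewal period"}),
    ToolCall("fetch", {"doc_id": 8}),  # right tool, wrong argument
]

score = trajectory_accuracy(expected, actual)
```

Here the second step uses the right tool with the wrong argument, so only one of two steps counts as correct.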
Compliance + audit pack
Generate a hash-chained audit log for procurement / SOC2 / EU AI Act.
from multivon_eval import EvalSuite, Faithfulness
from multivon_eval.compliance import ComplianceReporter

reporter = ComplianceReporter(output_dir="./audit-logs", framework="eu-ai-act")

suite = EvalSuite("regulated-eval")
suite.add_evaluators(Faithfulness())
report = suite.run(my_model)
reporter.record(report, tags={"system": "my-product"})

# Then bundle as audit ZIP via:
#   multivon-eval audit-package --logs audit-logs \
#       --suite regulated-eval --framework eu-ai-act --out audit-pack.zip
Numbers you can verify in the repo
Every published figure comes from a reproducer in the public benchmarks/ directory, with the raw JSON alongside it.
The compliance machinery isn't marketing — it's load-bearing for buyers in healthcare, insurance, legal, and financial services. Every paid PDF Hell engagement ships an audit pack generated by these primitives.
Hash-chained audit log
Every evaluation is appended to an NDJSON log, with a SHA-256 hash linking each entry to the prior one. The log is tamper-evident: changing any entry invalidates every subsequent hash. Auditors verify with `reporter.verify()`.
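The hash-chaining technique itself is small enough to sketch with the standard library. The entry layout (`prev`, `payload`, `hash`), the all-zero genesis value, and the helper names below are assumptions for illustration; they are not the on-disk format that `reporter.verify()` checks.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, payload: Dict) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[str], payload: Dict) -> None:
    """Append one NDJSON line whose hash covers the previous entry's hash."""
    prev_hash = json.loads(log[-1])["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    log.append(json.dumps(entry, sort_keys=True))


def verify(log: List[str]) -> bool:
    """Recompute the chain from the start; any edit breaks it."""
    prev_hash = GENESIS
    for line in log:
        entry = json.loads(line)
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log: List[str] = []
append_entry(log, {"suite": "regulated-eval", "pass_rate": 0.97})
append_entry(log, {"suite": "regulated-eval", "pass_rate": 0.99})
intact = verify(log)

# Tamper with the first entry's payload and re-check.
tampered = list(log)
first = json.loads(tampered[0])
first["payload"]["pass_rate"] = 1.0
tampered[0] = json.dumps(first, sort_keys=True)
tamper_detected = not verify(tampered)
```

Because each hash covers the previous one, an attacker who rewrites an early entry would have to recompute every later hash as well, which is what makes the log tamper-evident rather than tamper-proof.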
Compliance framework mappings
EU AI Act Articles 9, 10, 14; NIST AI RMF Govern-2 / Map-5 / Measure-3; HIPAA Safe Harbor (PII redaction); DPDP India (Aadhaar / PAN / GSTIN / IFSC). Each evaluator carries the framework controls it satisfies.
No data egress (on-prem judges)
JudgeConfig(base_url=…) routes any judge call to an OpenAI-compatible endpoint (Ollama, vLLM, LM Studio, on-prem). Production data never leaves the VPC.
Suite lock + evaluator fingerprinting
SuiteLock + EvaluatorFingerprint detect prompt drift and judge config changes between runs. A failing lock check flags that something changed; a passing one lets an auditor confirm that the eval that ran is the eval that was contracted.
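Drift detection of this kind usually reduces to hashing whatever defines an evaluator and comparing against stored values. The following is a sketch under that assumption; `fingerprint`, `drifted`, and the config fields are hypothetical and do not reflect the real SuiteLock or EvaluatorFingerprint API.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(prompt_template: str, judge_config: Dict) -> str:
    """Hash the prompt and judge settings into one stable identifier."""
    canonical = json.dumps(
        {"prompt": prompt_template, "judge": judge_config}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted(lock: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Names of evaluators whose fingerprint changed, appeared, or vanished."""
    names = lock.keys() | current.keys()
    return sorted(name for name in names if lock.get(name) != current.get(name))


prompt = "Is the claim supported by the context? {claim}"
lock = {"Faithfulness": fingerprint(prompt, {"model": "judge-a", "temperature": 0})}

# Same prompt, but someone raised the judge temperature between runs.
current = {"Faithfulness": fingerprint(prompt, {"model": "judge-a", "temperature": 0.2})}

changed = drifted(lock, current)
```

An empty result means the suite matches its lock; any listed name points at the evaluator whose prompt or judge settings moved.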
Where to next
The SDK docs cover every evaluator, every CLI command, every integration point. The benchmark repo has reproducers for every published number. PDF Hell is the benchmark-as-product layer on top.