multivon-eval

The open-source SDK that powers PDF Hell — with 44 evaluators for the rest of what your AI ships.

multivon-eval is the Python framework that powers PDF Hell, the public leaderboard at /leaderboard, and the audit packs every paid customer downloads. It ships 44 evaluators across seven categories — deterministic, LLM-judge (QAG), agent-trace, compliance, multimodal, conversation, and consistency — plus a calibration system, cost tracking, hash-chained audit logs, and a pytest plugin. Apache 2.0 on PyPI; no telemetry, no signup.

New in 0.8.0

The fastest path: multivon-eval bootstrap

Hand over a one-paragraph product description and a few sample traces — the bootstrap CLI infers your product shape, picks 4-6 evaluators tuned to it, calibrates thresholds from your traces, and emits a runnable suite plus 30 adversarial seed cases. PII is redacted locally before any LLM call. Total cost: ~$0.01 per bootstrap.

1. Install

2. Bootstrap from your product + traces

3. Review the discovery report, then run the generated suite

The bootstrap writes four files: eval_suite.py (runnable suite), seed_cases.jsonl (30 adversarial cases), thresholds.yaml (calibrated from your traces), and DISCOVERY_REPORT.md (a forwardable eval design review).

Or scaffold from a template

If you'd rather write your own cases from a runnable starter, the init scaffolder gives you a clean blueprint.

1. Scaffold a starter project and pick a template

2. Run it

Templates: quickstart, rag, agent, agent-langgraph, agent-openai-sdk, regulated, conversation. Each is a runnable starter — eval.py, requirements.txt, optional CI workflow.

44 evaluators across 7 categories

Pick the category that fits your output. Deterministic for cheap, fast gates. QAG when you need an LLM judge but don't want vibes-based scoring. Agent for tool-use evaluation. Conversation for dialogue. Compliance + multimodal for regulated / document-AI workflows.

Deterministic

14 evaluators

Pure-Python checks that don't call an LLM. Cheap to run, cheap to gate CI on. Use as a first-pass filter or for outputs that have a single correct answer.

  • ExactMatch, Contains, RegexMatch, StartsWith
  • JSONSchemaEval (Pydantic validation)
  • BLEU, ROUGE, BERTScore, Levenshtein, ChrfScore
  • Latency, MaxLatency, WordCount, NotEmpty
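To illustrate what a deterministic check amounts to, here is a minimal sketch of exact-match, substring, and regex checks written as plain functions. It does not use the multivon-eval API: the function names below are hypothetical stand-ins chosen for illustration, and the SDK's own evaluators may normalize or score differently.

```python
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass only if output equals expected after trimming whitespace."""
    return output.strip() == expected.strip()


def contains(output: str, needle: str) -> bool:
    """Case-insensitive substring check."""
    return needle.lower() in output.lower()


def regex_match(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def pass_rate(results: list[bool]) -> float:
    """Fraction of checks that passed (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


output = "The agreement renews annually."
checks = [
    exact_match(output, "The agreement renews annually."),
    contains(output, "annually"),
    regex_match(output, r"\brenews\b"),
    contains(output, "monthly"),  # expected to fail
]
print(checks, pass_rate(checks))
```

Because none of these functions calls a model, they run in microseconds and give the same result every time, which is what makes this category suitable for gating CI.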

LLM-judge (QAG)

13 evaluators

Question-Answer-Generation: instead of asking an LLM to rate 1–10 (noisy), generate yes/no questions about the output and score by fraction answered correctly. Calibrated thresholds per judge model.

  • Faithfulness, Hallucination, Relevance
  • AnswerAccuracy, Coherence
  • ContextPrecision, ContextRecall
  • Toxicity, Bias, Summarization
  • CustomRubric, GEval, CheckEvaluator
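The scoring step of QAG can be sketched in a few lines. In the sketch below the judge is a deterministic stub that checks whether a claim appears verbatim in the context; in a real QAG evaluator an LLM would generate the questions and answer them. The names `qag_score` and `stub_judge` are hypothetical and are not part of the multivon-eval API.

```python
from typing import Callable


def qag_score(questions: list[str], judge: Callable[[str], bool]) -> float:
    """Fraction of yes/no questions that the judge answers 'yes' to."""
    if not questions:
        return 0.0
    verdicts = [judge(q) for q in questions]
    return sum(verdicts) / len(verdicts)


context = "The agreement renews annually unless terminated."

# Claims extracted from a model output (hard-coded here for illustration).
claims = ["renews annually", "unless terminated", "renews monthly", "30-day notice"]
questions = [f"Does the context support the claim '{c}'?" for c in claims]


def stub_judge(question: str) -> bool:
    """Stand-in for an LLM judge: 'yes' only if the quoted claim is in the context."""
    claim = question.split("'")[1]
    return claim in context.lower()


score = qag_score(questions, stub_judge)
print(score)
```

Each verdict is binary, so the final score is a simple ratio that can be compared against a threshold, rather than a free-form 1–10 rating.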

Agent-trace

8 evaluators

Evaluate the trajectory, not just the final answer. Pairs with LangGraph + OpenAI Agents SDK tracers. Did the agent call the right tools, in the right order, with the right arguments?

  • ToolCallAccuracy, ToolArgumentAccuracy, ToolCallNecessity
  • PlanQuality, TaskCompletion, StepFaithfulness
  • TrajectoryEfficiency, AgentMemoryEval
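As a rough sketch of what trajectory scoring involves, the code below compares an expected tool-call sequence against the one an agent actually produced. The `ToolCall` dataclass and the three scoring functions are hypothetical and assume strict position-by-position matching; the SDK's evaluators may align and weight calls differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def tool_name_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected positions where the actual call uses the same tool."""
    if not expected:
        return 1.0
    hits = sum(1 for e, a in zip(expected, actual) if e.name == a.name)
    return hits / len(expected)


def tool_argument_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected positions where both tool name and arguments match."""
    if not expected:
        return 1.0
    hits = sum(1 for e, a in zip(expected, actual) if e == a)
    return hits / len(expected)


def redundant_calls(actual: list[ToolCall]) -> int:
    """Count calls that exactly repeat an earlier call in the trajectory."""
    seen: list[ToolCall] = []
    repeats = 0
    for call in actual:
        if call in seen:
            repeats += 1
        else:
            seen.append(call)
    return repeats


expected = [ToolCall("search", {"q": "renewal period"}),
            ToolCall("summarize", {"max_words": 50})]
actual = [ToolCall("search", {"q": "renewal period"}),
          ToolCall("summarize", {"max_words": 100}),   # right tool, wrong argument
          ToolCall("summarize", {"max_words": 100})]   # redundant repeat

print(tool_name_accuracy(expected, actual),
      tool_argument_accuracy(expected, actual),
      redundant_calls(actual))
```

Separating name accuracy from argument accuracy shows whether an agent picked the wrong tool or picked the right tool and called it incorrectly.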

Conversation

4 evaluators

Multi-turn dialogue evaluators. Useful for support chatbots and any system where context accumulates across turns.

  • ConversationRelevance, ConversationCompleteness
  • KnowledgeRetention, TurnConsistency

Compliance

2 evaluators

Audit-grade evaluators that run locally — no LLM judge needed. PII detection with HIPAA, GDPR, CCPA, and DPDP India jurisdictions. JSON Schema validation for structured outputs.

  • PIIEvaluator (regex, zero egress)
  • SchemaEvaluator (Pydantic / JSON Schema)
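Regex-based PII detection with zero egress can be sketched as follows. The patterns are illustrative assumptions, not the SDK's rule set: an email pattern, the Indian PAN format (five letters, four digits, one letter), and the IFSC format (four letters, a zero, six alphanumerics). The `find_pii` and `redact` helpers are hypothetical names.

```python
import re

# Illustrative patterns only; production detection needs broader coverage
# and validation (checksums, context words, jurisdiction-specific rules).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII label; labels with no matches are omitted."""
    hits: dict[str, list[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


sample = "Contact asha@example.com, PAN ABCDE1234F, IFSC HDFC0001234."
print(find_pii(sample))
print(redact(sample))
```

Everything here runs in-process with the standard library, so no text leaves the machine during detection or redaction.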

Multimodal

2 evaluators

Image-grounded and document-grounded faithfulness scoring — the eval layer behind the PDF Hell benchmark. Pairs with any vision model that can accept image inputs.

  • VQAFaithfulness (image-grounded QAG)
  • DocumentGrounding (multi-page docs)

Consistency

1 evaluator

Run-to-run stability across N samples — flags non-determinism in stochastic models or pipelines.

  • SelfConsistency
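One simple way to quantify run-to-run stability is the share of samples that agree with the most common answer. The sketch below assumes agreement means an exact match after trimming and lowercasing, which is a simplification; the `self_consistency` function is hypothetical and the SDK's SelfConsistency evaluator may compare samples differently.

```python
from collections import Counter


def self_consistency(samples: list[str]) -> float:
    """Share of samples equal to the most common answer after normalization."""
    if not samples:
        return 0.0
    normalized = [s.strip().lower() for s in samples]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


# Five outputs from the same prompt (hard-coded here for illustration).
samples = ["Annually", "annually ", "Every year", "annually", "Monthly"]
score = self_consistency(samples)
print(score)
```

Note that "Every year" counts as a disagreement under exact matching even though it means the same thing, which is why a semantic comparison is often preferable for free-text outputs.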

What people use it for

RAG faithfulness

Verify the model didn't invent claims not in the retrieved context.

from multivon_eval import EvalSuite, EvalCase, Faithfulness

suite = EvalSuite("rag-faithfulness")
suite.add_evaluators(Faithfulness())
suite.add_cases([EvalCase(
    input="What is the renewal period?",
    context="The agreement renews annually unless terminated.",
)])
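# my_rag_model is a placeholder for your own model callable, defined elsewhere.
# fail_threshold makes the run exit with status 1 in CI if pass_rate < 0.95.
report = suite.run(my_rag_model, fail_threshold=0.95)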

Agent trajectory eval

Did the agent call the right tool, in the right order, with the right args?

from multivon_eval import EvalSuite, ToolCallAccuracy, ToolCallNecessity
from multivon_eval.integrations import LangGraphTracer

suite = EvalSuite("agent-trajectory")
suite.add_evaluators(
    ToolCallAccuracy(),
    ToolCallNecessity(),  # penalises redundant tool calls
)
report = suite.run(my_agent, tracer=LangGraphTracer())

Compliance + audit pack

Generate a hash-chained audit log for procurement, SOC 2, or EU AI Act reviews.

from multivon_eval import EvalSuite, Faithfulness
from multivon_eval.compliance import ComplianceReporter

reporter = ComplianceReporter(output_dir="./audit-logs", framework="eu-ai-act")
suite = EvalSuite("regulated-eval")
suite.add_evaluators(Faithfulness())
report = suite.run(my_model)
reporter.record(report, tags={"system": "my-product"})
# Then bundle as audit ZIP via:
#   multivon-eval audit-package --logs audit-logs \
#     --suite regulated-eval --framework eu-ai-act --out audit-pack.zip

Built for regulated AI

The compliance machinery isn't marketing — it's load-bearing for buyers in healthcare, insurance, legal, and financial services. Every paid PDF Hell engagement ships an audit pack generated by these primitives.

Hash-chained audit log

Every evaluation is appended to an NDJSON log, and each entry carries a SHA-256 hash that links it to the prior entry. The log is tamper-evident: changing any entry invalidates every subsequent hash. Auditors verify with `reporter.verify()`.
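The hash-chaining idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the SDK's implementation: the record layout (`prev_hash`, `payload`, `hash`), the all-zero genesis hash, and the `append_entry` / `verify_chain` helpers are all hypothetical, and the log is kept as an in-memory list of NDJSON lines rather than a file.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[str], payload: dict) -> None:
    """Append one NDJSON line whose hash links back to the previous line."""
    prev_hash = json.loads(log[-1])["hash"] if log else GENESIS
    record = {"prev_hash": prev_hash, "payload": payload,
              "hash": entry_hash(prev_hash, payload)}
    log.append(json.dumps(record, sort_keys=True))


def verify_chain(log: list[str]) -> bool:
    """Recompute every hash in order; any edit breaks the chain from that point on."""
    prev_hash = GENESIS
    for line in log:
        record = json.loads(line)
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != entry_hash(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True


log: list[str] = []
append_entry(log, {"suite": "regulated-eval", "pass_rate": 0.97})
append_entry(log, {"suite": "regulated-eval", "pass_rate": 0.95})

# Simulate tampering with the first entry's payload.
tampered = list(log)
first = json.loads(tampered[0])
first["payload"]["pass_rate"] = 1.0
tampered[0] = json.dumps(first, sort_keys=True)

print(verify_chain(log), verify_chain(tampered))
```

Because each hash covers the previous hash, an attacker who edits one entry would have to recompute every later entry as well, and any copy of the final hash held by a third party would still expose the change.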

Compliance framework mappings

EU AI Act Articles 9, 10, 14; NIST AI RMF Govern-2 / Map-5 / Measure-3; HIPAA Safe Harbor (PII redaction); DPDP India (Aadhaar / PAN / GSTIN / IFSC). Each evaluator carries the framework controls it satisfies.

No data egress (on-prem judges)

JudgeConfig(base_url=…) routes any judge call to an OpenAI-compatible endpoint (Ollama, vLLM, LM Studio, on-prem). Production data never leaves the VPC.

Suite lock + evaluator fingerprinting

SuiteLock + EvaluatorFingerprint detect prompt drift and judge config changes between runs. A failing lock check flags that the suite has changed; a passing one lets an auditor confirm that the eval that ran is the eval that was contracted.
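Fingerprinting of this kind can be approximated by hashing a canonical encoding of each evaluator's prompt and judge settings, then comparing against stored values. The sketch below is an assumption about the general technique, not the behavior of SuiteLock or EvaluatorFingerprint; the `fingerprint` and `check_lock` functions, the prompt text, and the judge model names are hypothetical.

```python
import hashlib
import json


def fingerprint(prompt: str, judge_config: dict) -> str:
    """SHA-256 of a canonical JSON encoding of the prompt and judge settings."""
    canonical = json.dumps({"prompt": prompt, "judge": judge_config},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_lock(lock: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return evaluator names whose fingerprint changed, appeared, or disappeared."""
    names = lock.keys() | current.keys()
    return sorted(n for n in names if lock.get(n) != current.get(n))


prompt = "Is each claim in the output supported by the context?"

# Fingerprints recorded when the suite was agreed (the "lockfile").
locked = {"Faithfulness": fingerprint(prompt, {"model": "judge-a", "temperature": 0})}

# Same prompt and config on a later run: no drift.
unchanged = {"Faithfulness": fingerprint(prompt, {"model": "judge-a", "temperature": 0})}

# Judge model swapped on a later run: drift.
drifted = {"Faithfulness": fingerprint(prompt, {"model": "judge-b", "temperature": 0})}

print(check_lock(locked, unchanged), check_lock(locked, drifted))
```

Sorting keys before hashing matters: without a canonical encoding, two configs with the same settings in a different order would produce different fingerprints and raise false drift alarms.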

Where to next

The SDK docs cover every evaluator, every CLI command, every integration point. The benchmark repo has reproducers for every published number. PDF Hell is the benchmark-as-product layer on top.