Multivon
Open Source · v0.3.0

Know if your AI actually works.

Evaluate models, agents, and RAG pipelines with statistical rigor, agent trace analysis, and local-first compliance tooling — all in one open source SDK. No data leaves your environment.

$pip install multivon-eval
Works with: Claude · OpenAI · Gemini · LangChain · LlamaIndex · Any model

Built differently

Five decisions that separate multivon-eval from every other eval library.

p < 0.05 or it's noise

Built for non-determinism

Run each case N times, detect flaky cases automatically, and get Wilson score confidence intervals + statistical significance on every comparison. Methodology backed by NAACL 2025 research.

Statistical rigor guide →
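A minimal sketch of what this might look like in practice. The n_runs parameter, compare() method, and p_value / wilson_interval attributes are illustrative names, not confirmed API:

from multivon_eval import EvalSuite

def model_fn(prompt: str) -> str:
    ...  # your model call goes here

suite = EvalSuite.for_rag()

# Assumed parameter: run every case 5 times to surface non-determinism
baseline = suite.run(model_fn, n_runs=5)   # e.g. on the main branch
candidate = suite.run(model_fn, n_runs=5)  # on your candidate change

# Assumed method: Wilson confidence intervals and a p-value tell you
# whether the difference between runs is real or noise
diff = candidate.compare(baseline)
print(diff.p_value, diff.wilson_interval)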
65% fewer false positives

QAG scoring

Binary yes/no questions instead of a 1-10 numeric judge. Same model, better signal. Benchmarked on HaluEval QA with human-labeled ground truth.

See benchmark results →
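The core idea fits in a few lines of plain Python. This is a sketch of the QAG arithmetic, not multivon-eval's internals:

# Each claim in the answer becomes one yes/no question for the judge:
# "Is the claim 'Refunds take 5 days' supported by the context?" -> True/False
def qag_score(claim_verdicts: list[bool]) -> float:
    # Fraction of claims the judge marked as supported
    return sum(claim_verdicts) / len(claim_verdicts)

print(qag_score([True, True, False]))  # ~0.67: two of three claims supported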
0 config needed to start

No cold start

EvalSuite.for_rag(), .for_agents(), .for_regulated() — pre-built suites for common use cases. One line, right evaluators, sensible defaults.

Factory suites guide →
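For example (the factory names are from the docs above; treating a factory suite as customizable afterward is an assumption):

from multivon_eval import EvalSuite, NotEmpty

suite = EvalSuite.for_rag()  # or .for_agents(), .for_regulated()

# Assumption: factory suites accept extra evaluators on top of their defaults
suite.add_evaluators(NotEmpty())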
37 built-in evaluators

Agent-native

Purpose-built evaluators for tool call accuracy, tool necessity, trajectory efficiency, and multi-session memory. Works with any framework.

Agent evaluators →
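A sketch using the agent evaluator names listed on this page; the zero-argument constructors are an assumption:

from multivon_eval import (
    EvalSuite,
    ToolCallAccuracy,
    ToolCallNecessity,
    TrajectoryEfficiency,
)

suite = EvalSuite.for_agents()
suite.add_evaluators(
    ToolCallAccuracy(),      # right tool, right arguments?
    ToolCallNecessity(),     # was each call actually needed?
    TrajectoryEfficiency(),  # did the agent take a reasonable path?
)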
0 bytes leave your environment

Local-first & compliant

PII detection with zero API calls. Tamper-evident audit trails mapped to EU AI Act Article 9 and NIST AI RMF. Built for teams that can't send traces to the cloud.

Compliance guide →
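A sketch of the compliance workflow. PIIEvaluator and ComplianceReporter are named on this page, but the write() call and its arguments are assumptions:

from multivon_eval import EvalSuite, PIIEvaluator, ComplianceReporter

def model_fn(prompt: str) -> str:
    ...  # your model call goes here

suite = EvalSuite.for_regulated()
suite.add_evaluators(PIIEvaluator())  # runs locally, zero API calls

report = suite.run(model_fn)

# Assumed method: emit a tamper-evident, SHA-256-hashed NDJSON audit trail
ComplianceReporter().write(report, path="audit/run-001.ndjson")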

Up and running in minutes

No infrastructure. No account. No labeled dataset to start.

01

Generate cases from your docs

No labeled data needed. Point at any text file and get eval cases immediately.

from multivon_eval import generate_from_file

cases = generate_from_file(
    "docs/faq.md",
    n=20,
    task="qa",
)
02

Pick your evaluators

Mix deterministic checks (free, instant) with LLM judges where it actually matters.

suite.add_evaluators(
    NotEmpty(),           # free
    Faithfulness(),       # LLM judge
    ToolCallAccuracy(),
)
03

Block deploys, track every run

Fail CI on regression. Compare runs with p-values — know if a drop is real or noise.

# exits 1 if pass rate < threshold
report = suite.run(
    model_fn,
    fail_threshold=0.85,
)

# track every run locally
exp.record(report, tags={
    "model": "gpt-4o",
    "branch": "main",
})

See it in the real world

Three teams, three problems, one SDK.

Customer Support

Catch faithfulness failures before users do

A support bot that answers from a knowledge base. One case fails — it tells the user to 'contact billing' when the answer is right there in the docs. Faithfulness catches it. CI blocks the deploy.

Evaluators used
NotEmpty() · Faithfulness() · Relevance()
Full walkthrough with code →
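A sketch of how this suite could be wired up, assuming support_bot is your own question-to-answer function; the evaluator names are from the list above:

from multivon_eval import EvalSuite, NotEmpty, Faithfulness, Relevance

def support_bot(question: str) -> str:
    ...  # retrieve from the knowledge base, then answer

suite = EvalSuite.for_rag()
suite.add_evaluators(NotEmpty(), Faithfulness(), Relevance())
report = suite.run(support_bot, fail_threshold=0.85)  # fails CI on regression

Running the suite produces a report like the one below.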
multivon-eval
── Support Bot Eval ──
Model: gpt-4o-mini
 #   Input                    Output               Score   Status    ms
 1   How do I reset my…       Click 'Forgot Pa…     0.94    PASS    743
 2   What are your hours?     We're open Mon–F…     0.88    PASS    812
 3   Can I change my email?   Yes, go to Accou…     0.91    PASS    698
 4   How do I cancel…         Please contact b…     0.41    FAIL    924
 5   What payment meth…       We accept Visa, …     0.96    PASS    651

Evaluator       Avg Score   Pass Rate
not_empty       1.00        100%
faithfulness    0.82        80%
relevance       0.91        90%

Total: 5 · Passed: 4 · Failed: 1 · Pass Rate: 80.0% · Avg Score: 0.82

Better signal, same model

QAG vs numeric scoring — same judge, different methodology. Binary questions produce fewer false positives.

65%
fewer false positives than a 1-10 numeric judge
0.804
F1 on HaluEval QA · 100 cases · human labels
28%
higher precision with the same claude-haiku judge
Hallucination Detection · HaluEval QA · 100 cases · human-labeled
All runs: claude-haiku-4-5
Evaluator                       Precision   False positives   F1
multivon-eval (QAG)             0.788       11                0.804
Simple LLM judge (1–10 score)   0.617       31                0.763
Keyword overlap (no LLM)        0.605       15                0.523
Flaky case detection · Same input · same model · 5 runs · illustrative scenario

Eval approach            Run 1   Run 2   Run 3   Run 4   Run 5   Verdict
Single run (standard)    PASS                                    ✓ shipped
multivon-eval (5 runs)   PASS    FAIL    PASS    FAIL    PASS    ⚠ flaky — blocked

How we compare

multivon-eval is built for teams that need statistical rigor, agent evaluation, and local-first compliance — not just another evaluator catalog.

Features compared across multivon-eval, DeepEval, RAGAS, and Promptfoo:

· Multi-run + flakiness detection
· Wilson CI + power analysis
· QAG as user-facing primitive
· Agent-native evaluators
· Local-first, no data egress
· OSS compliance audit trail
· No account or API key to start

Based on public documentation as of April 2025. “QAG as user-facing primitive” — DeepEval and RAGAS use QAG internally; multivon-eval exposes it as a configurable API. See GitHub discussions for corrections.

Built for how AI ships today

One SDK covering every layer of your AI stack — retrieval to agents to safety.

RAG Pipelines

Catch hallucinations before they reach users

Evaluate faithfulness and context precision on every retrieval step. Know when your retriever returns the wrong chunk and when the model ignores the right one.

Faithfulness · ContextPrecision · ContextRecall
🤖 AI Agents

Verify tool calls, trajectories, and memory

Evaluate tool call accuracy, argument quality, step necessity, trajectory efficiency, and multi-session memory. Beyond task completion — did the agent take the right path?

ToolCallNecessity · TrajectoryEfficiency · AgentMemoryEval
💬 Conversational AI

Measure quality across multi-turn dialogue

Track relevance, knowledge retention, and consistency across turns. Catch drift before your chatbot starts contradicting itself mid-conversation.

ConversationRelevance · KnowledgeRetention · TurnConsistency
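A sketch of a multi-turn setup. The evaluator names are from this page; the bare EvalSuite() constructor is an assumption:

from multivon_eval import (
    EvalSuite,
    ConversationRelevance,
    KnowledgeRetention,
    TurnConsistency,
)

suite = EvalSuite()  # assumed: a blank suite you populate yourself
suite.add_evaluators(
    ConversationRelevance(),  # is each reply on-topic for its turn?
    KnowledgeRetention(),     # does turn 5 remember what turn 1 established?
    TurnConsistency(),        # does the bot contradict earlier answers?
)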
🔍 LLM Fine-Tuning

Regression testing across model versions

Run the same eval suite before and after every training run. Block deployment automatically when pass rate drops below threshold.

AnswerAccuracy · ROUGE · GEval
📄 Document Intelligence

Validate extraction against your schema

Check that extracted fields match Pydantic models or JSON Schema, with per-field failure reports. Verify summaries stay faithful to the source. Catch structured-output failures before they reach downstream systems.

SchemaEvaluator · Faithfulness · Summarization
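A sketch of schema validation. SchemaEvaluator is named on this page; the blank-suite constructor and the schema= keyword are assumptions:

from pydantic import BaseModel
from multivon_eval import EvalSuite, SchemaEvaluator

class Invoice(BaseModel):
    invoice_id: str
    total: float
    currency: str

suite = EvalSuite()  # assumed blank-suite constructor
# Assumed keyword: validate extracted fields against the Pydantic model,
# producing per-field failure reports on mismatch
suite.add_evaluators(SchemaEvaluator(schema=Invoice))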
🛡️ Regulated AI

Audit-ready evals for healthcare, finance, and legal

Local PII detection with zero API calls. Tamper-evident NDJSON audit trails with SHA-256 hashing. EU AI Act Article 9 and NIST AI RMF control mappings per evaluator. No data leaves your environment.

PIIEvaluator · SchemaEvaluator · ComplianceReporter
Why We Exist
“Shipping AI without evals is flying blind. We're building the instruments.”

Most teams building with AI have no reliable way to know if their model is getting better or worse across releases. Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. Multivon exists to fix that — with evaluation tooling that is fast to adopt, cheap to run, and honest about what it measures.

Start evaluating today

Open source and free. Need enterprise support, custom evaluators, or a hosted dashboard? Talk to us.