Open Source · v0.3.0

Know if your AI actually works.

Evaluate models, agents, and RAG pipelines with statistical rigor, agent trace analysis, and local-first compliance tooling — all in one open source SDK. No data leaves your environment.

$ pip install multivon-eval

Read the docs → · View on GitHub

Works with Claude · OpenAI · Gemini · LangChain · LlamaIndex · Any model

Built differently

Five decisions that separate multivon-eval from every other eval library.

p < 0.05 or it's noise

Built for non-determinism

Run each case N times, detect flaky cases automatically, and get Wilson score confidence intervals + statistical significance on every comparison. Backed by NAACL 2025.

Statistical rigor guide →
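
For a feel of what a Wilson score interval says about a small eval, here is a minimal sketch in plain Python. It is not multivon-eval's implementation; the function name `wilson_interval` and the 95% default (`z=1.96`) are assumptions made for illustration.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = passes / runs
    denom = 1 + z**2 / runs
    centre = (p_hat + z**2 / (2 * runs)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)
    )
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical example: 4 passes out of 5 runs.
low, high = wilson_interval(passes=4, runs=5)
print(f"observed pass rate 0.80, 95% CI [{low:.2f}, {high:.2f}]")
```

With only five runs the interval is wide, which is the argument for running each case more than once before trusting a pass rate.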

65% fewer false positives

QAG scoring

Binary yes/no questions instead of a 1-10 numeric judge. Same model, better signal. Benchmarked on HaluEval QA with human-labeled ground truth.

See benchmark results →
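
The idea behind binary-question scoring can be sketched in a few lines. This is an illustration only: how multivon-eval generates questions and aggregates verdicts is not described here, and `qag_score`, `stub_judge`, and the canned verdicts are hypothetical stand-ins for a real LLM judge.

```python
from typing import Callable

def qag_score(questions: list[str], judge: Callable[[str], bool]) -> float:
    """Return the fraction of binary questions the judge answers 'yes'."""
    if not questions:
        raise ValueError("need at least one question")
    return sum(1 for q in questions if judge(q)) / len(questions)

# Hypothetical canned verdicts standing in for an LLM judge call.
CANNED_VERDICTS = {
    "Is the 30-day refund window supported by the context?": True,
    "Is the support email address supported by the context?": True,
    "Is the claim about free shipping supported by the context?": False,
}

def stub_judge(question: str) -> bool:
    """Look up a canned yes/no verdict instead of calling a model."""
    return CANNED_VERDICTS[question]

score = qag_score(list(CANNED_VERDICTS), stub_judge)
print(f"faithfulness score: {score:.2f}")
```

Each verdict is a yes or a no, so the judge never has to pick a point on a 1-10 scale, and a failing score points at the specific claim that was not supported.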

0 config needed to start

No cold start

EvalSuite.for_rag(), .for_agents(), .for_regulated() — pre-built suites for common use cases. One line, right evaluators, sensible defaults.

Factory suites guide →

37 built-in evaluators

Agent-native

Purpose-built evaluators for tool call accuracy, tool necessity, trajectory efficiency, and multi-session memory. Works with any framework.

Agent evaluators →

0 bytes leave your environment

Local-first & compliant

PII detection with zero API calls. Tamper-evident audit trails mapped to EU AI Act Article 9 and NIST AI RMF. Built for teams that can't send traces to the cloud.

Compliance guide →
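
Local PII detection can be as simple as pattern matching that never touches the network. The sketch below is an assumption about the general approach, not multivon-eval's detector: `PII_PATTERNS` and `find_pii` are hypothetical names, and two regexes are nowhere near complete coverage.

```python
import re

# Hypothetical, deliberately small pattern set; real detectors cover far more.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII label, omitting labels with no matches."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found

print(find_pii("Contact jane@example.com, SSN 123-45-6789"))
```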

Up and running in minutes

No infrastructure. No account. No labeled dataset to start.

Generate cases from your docs

No labeled data needed. Point at any text file and get eval cases immediately.

from multivon_eval import generate_from_file

cases = generate_from_file(
    "docs/faq.md",
    n=20,
    task="qa",
)

Pick your evaluators

Mix deterministic checks (free, instant) with LLM judges where it actually matters.

suite.add_evaluators(
    NotEmpty(),          # free
    Faithfulness(),      # LLM judge
    ToolCallAccuracy(),
)

Block deploys, track every run

Fail CI on regression. Compare runs with p-values — know if a drop is real or noise.

# exits 1 if pass rate < threshold
report = suite.run(
    model_fn,
    fail_threshold=0.85,
)

# track every run locally
exp.record(report, tags={
    "model": "gpt-4o",
    "branch": "main",
})
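
To see what "real or noise" means in practice, here is a sketch of a standard pooled two-proportion z-test in plain Python. Which test multivon-eval applies is not stated here, so treat this as one common option; `two_proportion_p_value` and the run counts are hypothetical, and the normal approximation is only reasonable for fairly large samples.

```python
import math

def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both runs all-pass or all-fail: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up counts for illustration, not results from any benchmark.
small_drop = two_proportion_p_value(90, 100, 85, 100)
large_drop = two_proportion_p_value(90, 100, 70, 100)
print(f"90% -> 85%: p = {small_drop:.3f}")
print(f"90% -> 70%: p = {large_drop:.4f}")
```

On 100 cases per run, a five-point drop is not distinguishable from noise at p < 0.05, while a twenty-point drop clearly is.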

Full quickstart guide → · Try in Colab →

See it in the real world

Three teams, three problems, one SDK.

Customer Support

Catch faithfulness failures before users do

A support bot that answers from a knowledge base. One case fails — it tells the user to 'contact billing' when the answer is right there in the docs. Faithfulness catches it. CI blocks the deploy.

Evaluators used

NotEmpty() · Faithfulness() · Relevance()

Full walkthrough with code →

multivon-eval

── Support Bot Eval ──

Model: gpt-4o-mini

#	Input	Output	Score	Status	ms
1	How do I reset my…	Click 'Forgot Pa…	0.94	PASS	743
2	What are your hours?	We're open Mon–F…	0.88	PASS	812
3	Can I change my email?	Yes, go to Accou…	0.91	PASS	698
4	How do I cancel…	Please contact b…	0.41	FAIL	924
5	What payment meth…	We accept Visa, …	0.96	PASS	651

Evaluator	Avg Score	Pass Rate
not_empty	1.00	100%
faithfulness	0.82	80%
relevance	0.91	90%

Total: 5 Passed: 4 Failed: 1 Pass Rate: 80.0% Avg Score: 0.82

Better signal, same model

QAG vs numeric scoring — same judge, different methodology. Binary questions produce fewer false positives.

65%

fewer false positives than a 1-10 numeric judge

0.804

F1 on HaluEval QA · 100 cases · human labels

28%

higher precision with the same claude-haiku judge

Hallucination Detection · HaluEval QA · 100 cases · human-labeled

All runs: claude-haiku-4-5

Evaluator	Precision	False positives	F1
multivon-eval (QAG)	0.788	11	0.804
Simple LLM judge (1–10 score)	0.617	31	0.763
Keyword overlap (no LLM)	0.605	15	0.523
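
The headline figures can be checked against the table with two lines of arithmetic. That the 65% and 28% claims are derived exactly this way is an assumption; the inputs are the precision and false-positive values from the rows above.

```python
# Values copied from the benchmark table above.
qag = {"precision": 0.788, "false_positives": 11}
numeric_judge = {"precision": 0.617, "false_positives": 31}

# Relative reduction in false positives, QAG vs. the 1-10 numeric judge.
fp_reduction = (
    numeric_judge["false_positives"] - qag["false_positives"]
) / numeric_judge["false_positives"]

# Relative precision gain, QAG vs. the 1-10 numeric judge.
precision_gain = qag["precision"] / numeric_judge["precision"] - 1

print(f"false-positive reduction: {fp_reduction:.1%}")
print(f"relative precision gain: {precision_gain:.1%}")
```

Both results round to the figures quoted above.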

Flaky case detection · Same input · same model · 5 runs · illustrative scenario

Eval approach	Run 1	Run 2	Run 3	Run 4	Run 5	Verdict
Single run (standard)	PASS	—	—	—	—	✓ shipped
multivon-eval (5 runs)	PASS	FAIL	PASS	FAIL	PASS	⚠ flaky — blocked
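
The scenario in the table reduces to a simple rule: a case whose repeated runs disagree is flaky. The sketch below illustrates that rule only; `classify_case` is a hypothetical helper, and multivon-eval's actual flakiness criteria may be more nuanced.

```python
def classify_case(results: list[bool]) -> str:
    """Label a case from repeated pass/fail outcomes on the same input."""
    if not results:
        raise ValueError("need at least one run")
    if all(results):
        return "stable_pass"
    if not any(results):
        return "stable_fail"
    return "flaky"

single_run = [True]                             # what a one-shot eval sees
five_runs = [True, False, True, False, True]    # the five runs from the table

print(classify_case(single_run))
print(classify_case(five_runs))
```

A single run cannot tell a stable pass from a flaky one, because disagreement only shows up once there are at least two outcomes to compare.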

Full methodology, all datasets, reproduction instructions →

How we compare

multivon-eval is built for teams that need statistical rigor, agent evaluation, and local-first compliance — not just another evaluator catalog.

Feature	multivon-eval	DeepEval	RAGAS	Promptfoo
Multi-run + flakiness detection	✓	—	—	—
Wilson CI + power analysis	✓	—	—	—
QAG as user-facing primitive	✓	∼	∼	—
Agent-native evaluators	✓	✓	∼	∼
Local-first, no data egress	✓	✓	∼	∼
OSS compliance audit trail	✓	—	—	∼
No account or API key to start	✓	—	—	∼

✓ supported · ∼ partial · — not available · Based on public documentation as of April 2025. “QAG as user-facing primitive” — DeepEval and RAGAS use QAG internally; multivon-eval exposes it as a configurable API. See GitHub discussions for corrections.

Built for how AI ships today

One SDK covering every layer of your AI stack — retrieval to agents to safety.

⚡ RAG Pipelines

Catch hallucinations before they reach users

Evaluate faithfulness and context precision on every retrieval step. Know when your retriever returns the wrong chunk and when the model ignores the right one.

Faithfulness · ContextPrecision · ContextRecall

🤖 AI Agents

Verify tool calls, trajectories, and memory

Evaluate tool call accuracy, argument quality, step necessity, trajectory efficiency, and multi-session memory. Beyond task completion — did the agent take the right path?

ToolCallNecessity · TrajectoryEfficiency · AgentMemoryEval
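
To make "did the agent take the right path?" concrete, here is one plausible way to score a trace. These definitions are assumptions chosen for illustration, not the formulas multivon-eval uses; `tool_call_accuracy`, `trajectory_efficiency`, and the example tool names are all hypothetical.

```python
ToolCall = tuple[str, dict]  # (tool name, arguments)

def tool_call_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected calls (name and arguments) found in the actual trace."""
    if not expected:
        return 1.0
    matched = sum(1 for call in expected if call in actual)
    return matched / len(expected)

def trajectory_efficiency(reference_steps: int, actual_steps: int) -> float:
    """Ratio of reference path length to actual path length, capped at 1.0."""
    if actual_steps <= 0:
        raise ValueError("actual_steps must be positive")
    return min(1.0, reference_steps / actual_steps)

expected = [("search_orders", {"email": "a@b.com"}), ("refund", {"order_id": 42})]
actual = [
    ("search_orders", {"email": "a@b.com"}),
    ("search_orders", {"email": "a@b.com"}),  # redundant repeat
    ("refund", {"order_id": 42}),
]

print(tool_call_accuracy(expected, actual))
print(round(trajectory_efficiency(len(expected), len(actual)), 2))
```

Here the agent made every expected call, so accuracy is perfect, but the redundant search lowers efficiency. Task completion alone would hide that.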

💬 Conversational AI

Measure quality across multi-turn dialogue

Track relevance, knowledge retention, and consistency across turns. Catch drift before your chatbot starts contradicting itself mid-conversation.

ConversationRelevance · KnowledgeRetention · TurnConsistency

🔍 LLM Fine-Tuning

Regression testing across model versions

Run the same eval suite before and after every training run. Block deployment automatically when pass rate drops below threshold.

AnswerAccuracy · ROUGE · GEval

📄 Document Intelligence

Validate extraction against your schema

Check that extracted fields match Pydantic models or JSON Schema with per-field failure reports. Summaries stay faithful to source. Structured output failures caught before they reach downstream systems.

SchemaEvaluator · Faithfulness · Summarization
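
A per-field failure report can be pictured with a small stdlib-only stand-in. This sketch does not use Pydantic or JSON Schema and is not how SchemaEvaluator works internally; `validate_fields` and the invoice fields are hypothetical.

```python
def validate_fields(record: dict, schema: dict[str, type]) -> dict[str, str]:
    """Return a {field: problem} report; an empty dict means the record conforms."""
    failures = {}
    for field, expected_type in schema.items():
        if field not in record:
            failures[field] = "missing"
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            failures[field] = f"expected {expected_type.__name__}, got {actual}"
    return failures

# Hypothetical schema and extracted record.
invoice_schema = {"invoice_id": str, "total": float, "due_date": str}
extracted = {"invoice_id": "INV-1", "total": "12.50"}

print(validate_fields(extracted, invoice_schema))
```

Reporting each field separately shows which part of the extraction broke, which a single pass/fail on the whole document does not.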

🛡️ Regulated AI

Audit-ready evals for healthcare, finance, and legal

Local PII detection with zero API calls. Tamper-evident NDJSON audit trails with SHA-256 hashing. EU AI Act Article 9 and NIST AI RMF control mappings per evaluator. No data leaves your environment.

PIIEvaluator · SchemaEvaluator · ComplianceReporter
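
One common way to make an NDJSON log tamper-evident is to chain SHA-256 hashes, so each record commits to the one before it. The sketch below assumes that design; the actual record layout used by multivon-eval is not described here, and `append_record`, `verify`, and the field names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record

def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the previous hash and payload."""
    body = json.dumps({"prev_hash": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_record(log_lines: list[str], payload: dict) -> None:
    """Append one NDJSON line whose hash covers the previous line's hash."""
    prev_hash = json.loads(log_lines[-1])["hash"] if log_lines else GENESIS_HASH
    record = {"prev_hash": prev_hash, "payload": payload,
              "hash": _digest(prev_hash, payload)}
    log_lines.append(json.dumps(record, sort_keys=True))

def verify(log_lines: list[str]) -> bool:
    """Recompute every hash and check that the chain links are intact."""
    prev_hash = GENESIS_HASH
    for line in log_lines:
        record = json.loads(line)
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(record["prev_hash"], record["payload"]):
            return False
        prev_hash = record["hash"]
    return True

log: list[str] = []
append_record(log, {"case": 4, "evaluator": "faithfulness", "score": 0.41})
append_record(log, {"case": 5, "evaluator": "faithfulness", "score": 0.96})
print(verify(log))
```

Editing any earlier line changes its hash and breaks the link to the next line, so the edit is caught when the log is verified.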

Why We Exist

“Shipping AI without evals is flying blind. We're building the instruments.”

Most teams building with AI have no reliable way to know if their model is getting better or worse across releases. Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. Multivon exists to fix that — with evaluation tooling that is fast to adopt, cheap to run, and honest about what it measures.

Start evaluating today

Open source and free. Need enterprise support, custom evaluators, or a hosted dashboard? Talk to us.

Read the docs → · View on GitHub · hello@multivon.ai