Know if your AI actually works.
Evaluate models, agents, and RAG pipelines with statistical rigor, agent trace analysis, and local-first compliance tooling — all in one open source SDK. No data leaves your environment.
Built differently
Five decisions that separate multivon-eval from every other eval library.
Built for non-determinism
Run each case N times, detect flaky cases automatically, and get Wilson score confidence intervals + statistical significance on every comparison. Backed by NAACL 2025.
Statistical rigor guide →
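For intuition, here is a minimal sketch of the Wilson score interval behind those confidence bounds; the function is illustrative arithmetic, not the SDK's API.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate observed over repeated runs."""
    if runs == 0:
        return (0.0, 1.0)
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return (max(0.0, centre - margin), min(1.0, centre + margin))

# 3 passes out of 5 runs: a wide interval, so treat any single verdict with caution.
print(wilson_interval(3, 5))  # roughly (0.23, 0.88)
```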
QAG scoring
Binary yes/no questions instead of a 1-10 numeric judge. Same model, better signal. Benchmarked on HaluEval QA with human-labeled ground truth.
See benchmark results →
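To make the methodology concrete, a QAG-style faithfulness score reduces to binary questions over individual claims. This is an illustrative sketch, not the SDK's internals; `ask_judge` is a hypothetical callable that sends one yes/no question to a judge model, and claim extraction is assumed to happen upstream.

```python
def qag_faithfulness(claims: list[str], context: str, ask_judge) -> float:
    """Fraction of claims the judge marks as supported by the context."""
    if not claims:
        return 1.0
    verdicts = [
        ask_judge(
            "Is the following claim supported by the context?\n"
            f"Context: {context}\nClaim: {claim}\nAnswer yes or no."
        )
        for claim in claims
    ]
    supported = sum(1 for v in verdicts if v.strip().lower().startswith("yes"))
    return supported / len(claims)
```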
No cold start
EvalSuite.for_rag(), .for_agents(), .for_regulated() — pre-built suites for common use cases. One line, right evaluators, sensible defaults.
Factory suites guide →
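A usage sketch of the factory constructors named above. The method names come from the feature description; the import path and the commented run call are assumptions about packaging, so check the docs for the exact shape.

```python
from multivon_eval import EvalSuite  # import path is an assumption

rag_suite = EvalSuite.for_rag()          # retrieval + faithfulness evaluators
agent_suite = EvalSuite.for_agents()     # tool call and trajectory evaluators
audit_suite = EvalSuite.for_regulated()  # adds PII detection and audit logging

# results = rag_suite.run(cases)         # hypothetical run call; exact API may differ
```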
Agent-native
Purpose-built evaluators for tool call accuracy, tool necessity, trajectory efficiency, and multi-session memory. Works with any framework.
Agent evaluators →
Local-first & compliant
PII detection with zero API calls. Tamper-evident audit trails mapped to EU AI Act Article 9 and NIST AI RMF. Built for teams that can't send traces to the cloud.
Compliance guide →
Up and running in minutes
No infrastructure. No account. No labeled dataset to start.
Generate cases from your docs
No labeled data needed. Point at any text file and get eval cases immediately.
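A deliberately naive illustration of what "cases from your docs" means: split a file into chunks and wrap each one as a draft case. The SDK's generator is more capable, and the field names here are assumptions, not its case schema.

```python
from pathlib import Path

text = Path("docs/billing-faq.md").read_text()
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
cases = [
    {"id": f"doc-{i}", "context": para, "input": "Answer the user's question from this context."}
    for i, para in enumerate(paragraphs)
]
print(f"{len(cases)} draft eval cases from one file")
```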
Pick your evaluators
Mix deterministic checks (free, instant) with LLM judges where it actually matters.
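A sketch of that split: a deterministic check is an ordinary function with no model call, while an LLM judge is reserved for the judgments that need one. The names below are illustrative, not the SDK's evaluator registry.

```python
import re

def cites_a_source(output: str) -> bool:
    """Deterministic check: regex only, free and instant, runs on every case."""
    return bool(re.search(r"\[\d+\]", output))

def stays_under_limit(output: str) -> bool:
    """Another cheap check, no model call needed."""
    return len(output) <= 1200

evaluators = [
    cites_a_source,
    stays_under_limit,
    # FaithfulnessJudge(),  # hypothetical LLM judge, used only where it matters
]
```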
Block deploys, track every run
Fail CI on regression. Compare runs with p-values — know if a drop is real or noise.
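A self-contained sketch of the statistics behind that gate: a two-proportion z-test on pass rates from two runs. The SDK's comparison and its CI integration may use a different test, so treat this as the idea rather than the implementation.

```python
import math
import sys

def pass_rate_drop_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test on pass rates from two eval runs."""
    p_pool = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (pass_a / n_a - pass_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Baseline run: 182/200 passed. Candidate run: 168/200 passed.
# Block the deploy only when the drop is statistically significant, not noise.
p = pass_rate_drop_p_value(pass_a=182, n_a=200, pass_b=168, n_b=200)
if 168 / 200 < 182 / 200 and p < 0.05:
    sys.exit(f"Regression: pass rate drop is significant (p={p:.3f})")
```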
See it in the real world
Three teams, three problems, one SDK.
Catch faithfulness failures before users do
A support bot that answers from a knowledge base. One case fails — it tells the user to 'contact billing' when the answer is right there in the docs. Faithfulness catches it. CI blocks the deploy.
Better signal, same model
QAG vs numeric scoring — same judge, different methodology. Binary questions produce fewer false positives.
| Evaluator | Precision | False positives | F1 |
|---|---|---|---|
| multivon-eval (QAG) | 0.788 | 11 | 0.804 |
| Simple LLM judge (1–10 score) | 0.617 | 31 | 0.763 |
| Keyword overlap (no LLM) | 0.605 | 15 | 0.523 |
Flaky cases, caught before they ship
The same case, evaluated once vs. five times. A single run would have shipped it; multi-run detection flags the flake and blocks the deploy.
| Eval approach | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Verdict |
|---|---|---|---|---|---|---|
| Single run (standard) | PASS | — | — | — | — | ✓ shipped |
| multivon-eval (5 runs) | PASS | FAIL | PASS | FAIL | PASS | ⚠ flaky — blocked |
How we compare
multivon-eval is built for teams that need statistical rigor, agent evaluation, and local-first compliance — not just another evaluator catalog.
| Feature | multivon-eval | DeepEval | RAGAS | Promptfoo |
|---|---|---|---|---|
| Multi-run + flakiness detection | ✓ | — | — | — |
| Wilson CI + power analysis | ✓ | — | — | — |
| QAG as user-facing primitive | ✓ | ∼ | ∼ | — |
| Agent-native evaluators | ✓ | ✓ | ∼ | ∼ |
| Local-first, no data egress | ✓ | ✓ | ∼ | ∼ |
| OSS compliance audit trail | ✓ | — | — | ∼ |
| No account or API key to start | ✓ | — | — | ∼ |
Built for how AI ships today
One SDK covering every layer of your AI stack — retrieval to agents to safety.
Catch hallucinations before they reach users
Evaluate faithfulness and context precision on every retrieval step. Know when your retriever returns the wrong chunk and when the model ignores the right one.
Verify tool calls, trajectories, and memory
Evaluate tool call accuracy, argument quality, step necessity, trajectory efficiency, and multi-session memory. Beyond task completion — did the agent take the right path?
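One way to picture tool call accuracy as a metric: check whether each expected call appears in the trace with matching arguments. The trace and expectation formats below are assumptions for illustration, not the SDK's data model.

```python
def tool_call_accuracy(trace: list[dict], expected: list[dict]) -> float:
    """Fraction of expected tool calls found in the trace with matching arguments."""
    if not expected:
        return 1.0
    hits = sum(
        any(call["tool"] == exp["tool"] and call.get("args") == exp.get("args")
            for call in trace)
        for exp in expected
    )
    return hits / len(expected)

trace = [{"tool": "search_kb", "args": {"query": "refund policy"}},
         {"tool": "reply", "args": {"text": "Refunds take 5-7 business days."}}]
expected = [{"tool": "search_kb", "args": {"query": "refund policy"}}]
print(tool_call_accuracy(trace, expected))  # 1.0
```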
Measure quality across multi-turn dialogue
Track relevance, knowledge retention, and consistency across turns. Catch drift before your chatbot starts contradicting itself mid-conversation.
Regression testing across model versions
Run the same eval suite before and after every training run. Block deployment automatically when pass rate drops below threshold.
Validate extraction against your schema
Check that extracted fields match Pydantic models or JSON Schema, with per-field failure reports. Verify that summaries stay faithful to their source, and catch structured output failures before they reach downstream systems.
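A sketch of per-field failure reporting with Pydantic; the `Invoice` schema is an example for illustration, not something the SDK ships.

```python
# Requires Pydantic v2.
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_id: str
    total: float
    currency: str

extracted = {"invoice_id": "INV-1042", "total": "twelve"}  # parsed model output
try:
    Invoice.model_validate(extracted)
except ValidationError as err:
    for issue in err.errors():
        print(issue["loc"], issue["msg"])
        # ('total',) Input should be a valid number, unable to parse string as a number
        # ('currency',) Field required
```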
Audit-ready evals for healthcare, finance, and legal
Local PII detection with zero API calls. Tamper-evident NDJSON audit trails with SHA-256 hashing. EU AI Act Article 9 and NIST AI RMF control mappings per evaluator. No data leaves your environment.
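To make "tamper-evident" concrete, here is a minimal hash-chained NDJSON writer: each record stores the SHA-256 of the previous line, so editing any earlier record breaks the chain. Field names and file layout are illustrative, not the SDK's on-disk format.

```python
import hashlib
import json

def append_audit_record(path: str, record: dict) -> None:
    """Append a record whose prev_hash commits to the last line already on disk."""
    try:
        with open(path, "rb") as f:
            last_line = f.read().splitlines()[-1]
    except (FileNotFoundError, IndexError):
        last_line = b""  # empty chain start for a new trail
    record = {**record, "prev_hash": hashlib.sha256(last_line).hexdigest()}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

append_audit_record("audit.ndjson",
                    {"case": "rag-017", "evaluator": "faithfulness", "verdict": "fail"})
```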
“Shipping AI without evals is flying blind. We're building the instruments.”
Most teams building with AI have no reliable way to know if their model is getting better or worse across releases. Numeric scores drift, judges hallucinate, and regressions reach users before anyone notices. Multivon exists to fix that — with evaluation tooling that is fast to adopt, cheap to run, and honest about what it measures.
Start evaluating today
Open source and free. Need enterprise support, custom evaluators, or a hosted dashboard? Talk to us.