Benchmark · reproducible · 2026-06-05
multivon-eval vs DeepEval vs RAGAS on hallucination detection
Every framework claims accuracy on faithfulness and hallucination detection. None publishes a side-by-side comparison with the same judge, the same dataset, and the same seed. We do: same judge model, same 100 cases, same random seed. The code is in the repo, so you can re-run it yourself.
Results at default thresholds
Dataset: ragtruth-sum (n=100), the RAG-Truth summarization split, as a 100-case stratified sample with human labels. Each row is one framework's faithfulness/hallucination metric scored against the human labels at the framework's default threshold.
Judge: claude-haiku-4-5
| Framework | Threshold | F1 | Precision | Recall | Latency (ms) | Errors |
|---|---|---|---|---|---|---|
| multivon-eval | 0.90 | 0.673 | 0.613 | 0.745 | 9731 | 0 |
| DeepEval | 0.50 (default) | 0.000 | 0.000 | 0.000 | — | 100 |
| RAGAS | — | — | — | — | — | 0 |
Judge: gpt-4o-mini
| Framework | Threshold | F1 | Precision | Recall | Latency (ms) | Errors |
|---|---|---|---|---|---|---|
| multivon-eval | 0.90 | 0.744 | 0.914 | 0.628 | 9027 | 0 |
| DeepEval | 0.50 (default) | 0.038 | 1.000 | 0.020 | 11973 | 0 |
| RAGAS | — | — | — | — | — | 0 |
Reading the table. At default thresholds, DeepEval scores F1 0.000 with the claude-haiku judge (all 100 cases error out) and F1 0.038 with gpt-4o-mini (recall 0.020: it flags almost none of the labeled hallucinations). multivon-eval's F1 is 0.673 (claude-haiku) and 0.744 (gpt-4o-mini). The RAGAS rows are empty because RAGAS was not included in this run. Default-vs-default is the comparison most users get when they install each framework and run it with the documented configuration.
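For readers who want to check the table arithmetic, here is a minimal sketch of how precision, recall, and F1 can be computed at a given threshold. It assumes that hallucination is the positive class and that a case is flagged when its faithfulness score falls below the threshold. Both are assumptions, and the names `Metrics`, `score_at_threshold`, and `f1_from_pr` are hypothetical, not taken from the benchmark repo.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_at_threshold(scores, labels, threshold) -> Metrics:
    """Score one framework against human labels.

    scores: faithfulness scores in [0, 1] (assumed: higher = more faithful)
    labels: True where the human label marks a hallucination
    A case is flagged as a hallucination when score < threshold (assumed).
    """
    tp = fp = fn = 0
    for score, is_hallucination in zip(scores, labels):
        flagged = score < threshold
        if flagged and is_hallucination:
            tp += 1
        elif flagged and not is_hallucination:
            fp += 1
        elif not flagged and is_hallucination:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return Metrics(precision, recall, f1_from_pr(precision, recall))
```

The published precision and recall reproduce the published F1: `f1_from_pr(0.613, 0.745)` rounds to 0.673 and `f1_from_pr(0.914, 0.628)` rounds to 0.744.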
What happens if you tune the threshold?
Some of DeepEval's poor performance at default settings is a threshold issue. The table below shows F1 across a threshold sweep with the gpt-4o-mini judge. multivon-eval's best F1 is 0.833 at threshold 0.95. DeepEval's best F1 is 0.631, also at threshold 0.95. Even at best-tuned thresholds, multivon-eval keeps a relative F1 advantage of about 32% (0.833 vs 0.631, an absolute gap of 0.202). Threshold sweeps are computed on the test set, so read them as upper bounds, not held-out estimates.
| Threshold | multivon-eval F1 | DeepEval F1 |
|---|---|---|
| 0.30 | 0.000 | 0.000 |
| 0.50 | 0.000 | 0.038 |
| 0.60 | 0.111 | 0.071 |
| 0.70 | 0.210 | 0.222 |
| 0.80 | 0.355 | 0.329 |
| 0.90 | 0.744 | 0.566 |
| 0.95 | 0.833 | 0.631 |
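The best-threshold figures and the relative advantage can be re-derived from the sweep table alone. The sketch below hard-codes the published values; the names `SWEEP` and `best_row` are illustrative and do not come from the repo.

```python
# (threshold, multivon-eval F1, DeepEval F1), copied from the sweep table.
SWEEP = [
    (0.30, 0.000, 0.000),
    (0.50, 0.000, 0.038),
    (0.60, 0.111, 0.071),
    (0.70, 0.210, 0.222),
    (0.80, 0.355, 0.329),
    (0.90, 0.744, 0.566),
    (0.95, 0.833, 0.631),
]


def best_row(rows, column):
    """Return the row with the highest F1 in the given column index."""
    return max(rows, key=lambda row: row[column])


best_multivon = best_row(SWEEP, 1)  # column 1 = multivon-eval F1
best_deepeval = best_row(SWEEP, 2)  # column 2 = DeepEval F1

absolute_gap = best_multivon[1] - best_deepeval[2]
relative_gap = absolute_gap / best_deepeval[2]

print(f"multivon-eval best: F1 {best_multivon[1]:.3f} at {best_multivon[0]:.2f}")
print(f"DeepEval best:      F1 {best_deepeval[2]:.3f} at {best_deepeval[0]:.2f}")
print(f"absolute gap {absolute_gap:.3f}, relative gap {relative_gap:.1%}")
```

Because the best threshold is picked on the same 100 cases it is scored on, a held-out estimate would need the threshold chosen on a separate split.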
Run it yourself
Code in the repo. One git clone, one pip install, one OpenAI (or Anthropic) key. The benchmark runs in roughly 20 minutes on the default sample. Reproduce the numbers above, change the dataset, swap the judge, or add another framework.
git clone https://github.com/multivon-ai/eval-framework-benchmark
cd eval-framework-benchmark
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
python run.py
python analyze.py results/
Datasets: HaluEval QA and Summarization (100-case stratified samples each), plus the ragtruth-sum split. Judges tested: Claude Haiku 4.5, GPT-4o-mini. Same random seed across all runs. Full methodology in the repo README.
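A seeded, stratified sample of the kind described above can be sketched as follows. This is not the repo's sampling code: the function name `stratified_sample`, the `key` callback, and the proportional-allocation rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n, seed):
    """Draw about n rows, keeping each stratum's share of the full dataset.

    rows: list of cases
    key:  function mapping a case to its stratum (for example, its human label)
    seed: fixed seed so every run draws the same cases
    Note: proportional rounding can make the result differ from n by a row
    or two when the strata do not divide evenly.
    """
    rng = random.Random(seed)  # local RNG, leaves global random state alone
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for stratum in sorted(groups):  # sorted for a deterministic order
        members = groups[stratum]
        k = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

Calling it twice with the same seed returns the same cases in the same order, which is what lets every framework see an identical sample.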
Calls we made
- Same judge for all frameworks. We don't let each framework use a different judge model. If we did, the comparison would measure judge quality, not framework quality.
- Same threshold semantics. Each framework's documented "default threshold" was used as-is, then thresholds were swept in the second section so you can see how each behaves at its best.
- RAGAS not yet included in this run. The RAGAS harness needs a different test-data adapter that we haven't merged yet. We will run it in the next iteration.
- Where multivon-eval loses, it's documented. See COMMENTARY.md in the repo for cases where multivon-eval flagged a hallucination the human label said was correct, or vice versa. We don't hide them.
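The disagreement cases mentioned in the last point can be listed with a few lines of code. This is a hedged sketch, not the script behind COMMENTARY.md: `disagreements` and its arguments are hypothetical names.

```python
def disagreements(case_ids, flagged, labels):
    """Split disagreements between a framework and the human labels.

    case_ids: identifiers for each case
    flagged:  True where the framework flagged a hallucination
    labels:   True where the human label marks a hallucination
    """
    false_positives = []  # flagged, but the human label says faithful
    false_negatives = []  # not flagged, but the human label says hallucination
    for case_id, was_flagged, is_hallucination in zip(case_ids, flagged, labels):
        if was_flagged and not is_hallucination:
            false_positives.append(case_id)
        elif not was_flagged and is_hallucination:
            false_negatives.append(case_id)
    return {"false_positives": false_positives, "false_negatives": false_negatives}
```

For example, `disagreements(["a", "b", "c"], [True, False, True], [False, True, True])` returns `{"false_positives": ["a"], "false_negatives": ["b"]}`.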