Numbers in this post come from a 50-case HaluEval Summarization pilot with gpt-4o-mini as a shared judge, 3 repeated runs per case (1 for RAGAS — see why). Reproduce them yourself with the eval-framework-benchmark repo. Wilson 95% CI on F1 at n=50 is roughly ±10–12pp; sub-5pp differences are inside the noise margin. Important caveat: multivon-eval's shipped calibration was measured on a different sample of this same dataset. Read why that matters before drawing conclusions.
We build multivon-eval. We claim accuracy on hallucination detection. DeepEval and RAGAS make similar claims. None of us had ever published a side-by-side comparison with the same judge on the same dataset. We did one. The most useful findings, in order:
- multivon-eval and DeepEval disagree on the verdict for 28 of 50 cases (56%). Cohen's κ = 0.03, barely above chance. Switching frameworks is not a drop-in replacement: if you change evaluator stacks, re-baseline your CI.
- At each framework's default threshold, multivon-eval's F1 is 0.63, DeepEval's is 0.08, RAGAS's is 0.50. Most of that gap is calibration, not detection quality: DeepEval at threshold 0.9 reaches F1 = 0.71 — slightly ahead of multivon at any threshold. multivon-eval's contribution is shipping the threshold table, not a better prompt.
- Single-run scores are unreliable for every framework at temperature=0. multivon-eval's verdict flipped across runs on 4 of 50 cases. DeepEval's flipped on 1.
- Calibrated thresholds, statistical CIs, and hash-chained audit logs (sketched below) are the actual product. The detection prompts are commoditizing; the production evidence layer isn't.
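To be concrete about what "hash-chained audit logs" means: each evaluation record carries the hash of the record before it, so a retroactive edit anywhere breaks every later link. A minimal sketch of the idea in plain Python; multivon-eval's actual log format is richer than this:

```python
import hashlib
import json

def append_record(log: list[dict], record: dict) -> None:
    """Append a record whose hash covers the previous record's hash,
    so editing any past record invalidates every hash after it."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"prev_hash": prev_hash, **record}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    log.append(body)

log: list[dict] = []
append_record(log, {"case_id": 1, "score": 0.93, "verdict": "faithful"})
append_record(log, {"case_id": 2, "score": 0.41, "verdict": "hallucinated"})
```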
The circularity disclosure
multivon-eval ships a per-judge threshold table at _calibration_data/v1.json. The row for faithfulness × gpt-4o-mini was measured on a 60-case stratified sample of HaluEval Summarization.
This benchmark uses a different 50-case stratified sample of the same dataset. The two samples don't share any specific case IDs (we re-seed), but they're drawn from the same distribution.
That means multivon's 0.9 threshold is in-distribution for this benchmark by construction. If we'd benchmarked on RAGTruth or TruthfulQA — datasets multivon-eval's calibration table has never seen — our default-threshold F1 would almost certainly drop. We'd expect DeepEval and RAGAS to be unaffected (their defaults are uniform, so they don't benefit from in-distribution calibration).
We are publishing this benchmark anyway because (a) you can run it against any dataset you choose yourself in three minutes, (b) the threshold sweep below makes the circularity visible, and (c) the v2 benchmark on the roadmap will use RAGTruth specifically to remove it.
What we held constant
| Setting | Value |
|---|---|
| Judge | gpt-4o-mini, temperature=0, max_tokens=1024 |
| Dataset | HaluEval Summarization, 50-case stratified pilot (25 faithful, 25 hallucinated), seed=42 |
| Frameworks | multivon-eval 0.5.0 (Faithfulness, QAG), DeepEval 4.0 (FaithfulnessMetric), RAGAS 0.4 (faithfulness) |
| Default thresholds | multivon-eval 0.90 (calibrated for gpt-4o-mini), DeepEval 0.50 (uniform default), RAGAS 0.50 (uniform default) |
| Runs per case | 3 (multivon, DeepEval), 1 (RAGAS) |
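The pilot sample is deterministic given the seed. A minimal sketch of what "stratified, seed=42" means here, assuming each case carries a boolean is_hallucinated label; the repo's actual loader may differ:

```python
import random

def stratified_sample(cases: list[dict], n_per_class: int = 25,
                      seed: int = 42) -> list[dict]:
    """Balanced draw: n_per_class faithful + n_per_class hallucinated cases."""
    rng = random.Random(seed)
    faithful = [c for c in cases if not c["is_hallucinated"]]
    hallucinated = [c for c in cases if c["is_hallucinated"]]
    return rng.sample(faithful, n_per_class) + rng.sample(hallucinated, n_per_class)
```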
What we did not do:
- Let each framework use its own preferred judge. That would conflate framework accuracy with judge accuracy.
- Pick a benchmark that didn't appear in multivon's calibration. (See the disclosure above.)
- Use the optimal threshold per framework in the headline table — see the sweep below.
Results at each framework's default threshold
| Framework | Threshold | F1 | Precision | Recall | Cross-run score std | Flaky verdict rate | Median latency |
|---|---|---|---|---|---|---|---|
| multivon-eval | 0.90 | 0.63 | 0.59 | 0.68 | 0.027 | 8% | 6.6s |
| DeepEval | 0.50 | 0.08 | 1.00 | 0.04 | 0.054 | 2% | 11.7s |
| RAGAS | 0.50 | 0.50 | 0.82 | 0.36 | n/a | 0% | 129s |
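The F1 column follows mechanically from the precision and recall columns, so the table is easy to sanity-check:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f1(0.59, 0.68))  # ~0.632: multivon-eval row
print(f1(1.00, 0.04))  # ~0.077: DeepEval row, perfect precision but near-zero recall
print(f1(0.82, 0.36))  # ~0.500: RAGAS row
```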
multivon-eval looks dominant. It isn't. The 0.5 default is just a bad threshold for gpt-4o-mini on this task — DeepEval's scores cluster in the 0.6–1.0 range, so 0.5 misses almost everything. Look at the sweep:
Threshold sweep — F1 at each threshold
| Threshold | multivon-eval | DeepEval | RAGAS |
|---|---|---|---|
| 0.30 | 0.148 | 0.000 | 0.214 |
| 0.50 (DeepEval & RAGAS default) | 0.267 | 0.077 | 0.500 |
| 0.60 | 0.424 | 0.258 | 0.537 |
| 0.70 | 0.564 | 0.465 | 0.583 |
| 0.80 | 0.638 | 0.667 | 0.604 |
| 0.90 (multivon-eval default) | 0.630 | 0.706 | 0.610 |
| 0.95 | 0.679 | 0.696 | 0.610 |
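The sweep itself is mechanical once you have per-case scores and labels. A minimal version of what analyze.py computes, assuming scores where higher means more faithful and a case is flagged as hallucinated when its score falls below the threshold (names here are illustrative, not the repo's API):

```python
def sweep_f1(scores: list[float], labels: list[bool],
             thresholds=(0.30, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95)) -> dict[float, float]:
    """F1 at each threshold. labels[i] is True when case i is hallucinated."""
    out = {}
    for t in thresholds:
        preds = [s < t for s in scores]  # flag = score below threshold
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum(l and not p for p, l in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        out[t] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return out
```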
The framework rankings change dramatically depending on the threshold. The honest single-number summary:
| Framework | Best F1 | At threshold | Default-threshold F1 |
|---|---|---|---|
| multivon-eval | 0.679 | 0.95 | 0.630 (at 0.9) |
| DeepEval | 0.706 | 0.90 | 0.077 (at 0.5) |
| RAGAS | 0.610 | 0.90 | 0.500 (at 0.5) |
DeepEval beats us at best-F1. By a small amount, well inside the noise margin at n=50, but in the direction of "their detection prompt is at least as good as ours."
The 0.706 vs 0.679 gap is ~2.7pp, within Wilson CI overlap at this n. The point isn't that DeepEval is conclusively better — it's that multivon-eval is not detectably better, and any narrative that says otherwise is reading more into the data than it can support.
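The noise-margin figure comes from treating F1 roughly like a binomial proportion at n=50 and taking the Wilson score interval. That's a simplification (F1 is not literally a proportion), but it gives the right order of magnitude:

```python
from math import sqrt

def wilson_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the 95% Wilson score interval for a proportion p at sample size n."""
    denom = 1 + z**2 / n
    return (z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom

print(wilson_halfwidth(0.68, 50))  # ~0.125: roughly +/-12.5pp around a score near 0.68
```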
What we do contribute that DeepEval and RAGAS don't:
- A per-judge, per-evaluator calibrated threshold. Without this, DeepEval scores 0.077 on the same data. With multivon's threshold applied to DeepEval, DeepEval scores 0.706. The calibration is the value-add, not the prompt.
- A documented calibration provenance. The 0.9 we ship comes from a 60-case HaluEval Sum sweep, with dataset hash, N, and F1 recorded in _calibration_data/v1.json. Auditors can verify it.
- Multi-run flakiness detection out of the box. See below.
Frameworks disagree on which cases are hallucinated
| Pair | Cases that flip verdict | Cohen's κ |
|---|---|---|
| multivon-eval ↔ DeepEval | 28 / 50 (56%) | 0.03 (poor) |
| multivon-eval ↔ RAGAS | 20 / 50 (40%) | 0.27 (fair) |
| DeepEval ↔ RAGAS | 10 / 50 (20%) | 0.13 (poor) |
Cohen's κ of 0.03 means the two frameworks agree on the binary verdict at a rate barely above chance. On the standard interpretation scale, κ < 0.20 is "poor agreement," 0.21–0.40 is "fair," and 0.41–0.60 is "moderate."
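For binary verdicts, κ is a few lines: observed agreement corrected for the agreement two independent raters with the same flag rates would reach by chance. The verdict lists here are illustrative placeholders for the per-case outputs of any two frameworks:

```python
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two binary verdict lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_a, p_b = sum(a) / n, sum(b) / n            # marginal flag rates
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)      # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)
```

sklearn.metrics.cohen_kappa_score returns the same value if you'd rather not hand-roll it.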
If you replace one framework with another, you should expect ~half your existing verdicts to change. That's a much bigger signal than the F1 differences. Re-baseline your CI when you switch.
Single-run scores are unreliable
| Framework | Cross-run score std | Flaky verdict rate |
|---|---|---|
| multivon-eval | 0.027 | 8% (4 / 50 cases) |
| DeepEval | 0.054 | 2% (1 / 50 cases) |
Even at temperature=0, the same judge on the same input doesn't always return the same number. This reproduces the finding of the NAACL 2025 paper "Evaluation of LLMs Should Not Ignore Non-Determinism" in every framework we tested; it is not a multivon-eval-specific quirk.
Note the asymmetry: multivon has higher flakiness (8%), DeepEval has higher within-case score std (0.054 vs 0.027). DeepEval's scores move more across runs but the threshold of 0.5 is so far from where its scores live that it rarely flips a verdict. multivon's scores move less but live near its 0.9 threshold, so small moves flip the verdict.
This is one place where multivon-eval has a concrete production-CI recommendation: use runs=3 (the SDK has runs_per_case=N and SPRT early stopping built in). Other frameworks require you to build that loop yourself.
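If you're on a framework without that built in, the hand-rolled loop is small. A framework-agnostic sketch, where judge_score stands in for whichever framework you call per case:

```python
from statistics import pstdev
from typing import Callable

def flakiness(judge_score: Callable[[dict], float], case: dict,
              threshold: float, runs: int = 3) -> dict:
    """Score one case several times; report spread and whether the verdict flips."""
    scores = [judge_score(case) for _ in range(runs)]
    verdicts = [s < threshold for s in scores]  # True = flagged as hallucinated
    return {
        "mean": sum(scores) / runs,
        "std": pstdev(scores),
        "flaky": len(set(verdicts)) > 1,  # verdict changed between runs
    }
```

multivon-eval's runs_per_case machinery adds SPRT early stopping on top, so stable cases exit after fewer runs; this sketch always pays for all of them.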
Latency varies 19× across frameworks
Median per-case latency on gpt-4o-mini:
- multivon-eval: 6.6s
- DeepEval: 11.7s
- RAGAS: 129s
Same judge, same case. RAGAS is more thorough by design (more sub-calls per case, by inspection of its code); it pays for that in throughput. At production scale, if you're running 1k cases × 5 runs on every PR, even a 2× difference is significant.
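Concretely, at the pilot medians: 1,000 cases × 5 runs is 5,000 judge calls, which is about 9.2 hours of serial judge time at 6.6s per call versus roughly 179 hours at 129s. Concurrency shrinks the wall clock for everyone, but not the cost ratio.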
Where multivon-eval loses
Now that we're not pretending we won:
- Best-case F1 is not the highest of the three. DeepEval at its optimal threshold (0.706) edges multivon-eval at its optimal threshold (0.679). Inside the noise margin, but in the wrong direction for our marketing.
- Highest flaky-verdict rate (8%). Your CI will see occasional spurious flags from multivon on borderline cases. The runs=3 mitigation works but adds cost.
- In-distribution advantage at default thresholds. Our shipped 0.9 threshold is calibrated on HaluEval Sum. It would be honest to report a cross-dataset comparison too. We will, in v2.
Where DeepEval shines
- Best detection prompt at the right threshold. F1 = 0.706 at threshold 0.9 is the highest single result in this entire benchmark. Their faithfulness extraction is solid.
- 100% precision at default threshold. On this dataset, when DeepEval flags something, it is always actually hallucinated. That's an excellent property for "trust the flag" workflows, provided you can live with the 0.04 recall: the default misses 96% of the hallucinations.
Where RAGAS shines
- Highest precision at default threshold (0.82) for any framework catching a meaningful number of hallucinations. Roughly 1.4× our precision at our default. If you have to triage flagged cases manually, RAGAS gives you the smallest, most enriched flag pile.
- Zero flaky verdicts at n=50. Caveat: RAGAS got only 1 run, so this reflects unmeasured rather than demonstrated stability, and we can't compute cross-run std.
Why only one run for RAGAS
RAGAS's faithfulness runs more sub-calls per case than the others. At 129s median per case in our setup, 3 runs × 50 cases would have been roughly 5.4 hours of serial judge time. We dropped to 1 run for RAGAS and explicitly noted that the cross-run std and flaky-rate columns are based on a single run.
This is not a knock on RAGAS; they decompose more thoroughly by design. Faster runs are achievable (RAGAS supports concurrent execution upstream; we did not wire it in for this pilot). We'll re-test with a multi-run RAGAS in the next iteration.
What this benchmark is not
- Not a verdict. A 50-case pilot on one dataset with one judge is direction, not conclusion. With n=50, F1 differences under ~5pp are inside the noise margin.
- Not a cross-dataset test. multivon-eval's calibration was measured on the same dataset family — that's circular by construction. See The circularity disclosure.
- Not the only thing each framework does. DeepEval has G-Eval, red-teaming, conversation eval. RAGAS has retrieval-recall and context-precision. multivon-eval has agent trajectory evaluators and statistical primitives (Wilson CIs, SPRT, power analysis). A faithfulness-only comparison is the cleanest apples-to-apples test we could design.
Reproduce in three minutes
```bash
git clone https://github.com/multivon-ai/eval-framework-benchmark
cd eval-framework-benchmark
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
python run.py --task sum --n 50 --runs 3 --only multivon-eval deepeval
python run.py --task sum --n 50 --runs 1 --only ragas
python analyze.py
```
Or open colab.ipynb and run all cells. Roughly $0.20 in OpenAI spend.
If you re-run and get different numbers, please open an issue with your versions, OS, and the diff.
What's next
A larger benchmark: 1,000 cases per task on RAGTruth (so multivon isn't calibrated on the test distribution), multiple judges (claude-haiku-4-5, gpt-4o, llama-3.3-70b), the threshold sweeps already implemented here, and HaluEval QA with answer-similarity evaluators, since faithfulness is the wrong metric for short-form QA in all three frameworks. The repo is open; contributions and judge configurations welcome.
multivon-eval is the open-source LLM eval framework behind this post. We build it. We have an obvious interest in how this comparison lands. We tried to make it fair anyway, including by disclosing where it doesn't make us look good.