Numbers come from a 100-case RAGTruth Summary test split, shared judge (gpt-4o-mini), 1 run per framework. Reproduce with the eval-framework-benchmark repo. What's different from v1: RAGTruth is a dataset multivon-eval's calibration table has never seen — the v1 circularity caveat is now removed.
The v1 benchmark post had to disclose a circularity caveat: multivon-eval's 0.9 default threshold for gpt-4o-mini was measured on HaluEval Summarization, and v1 also tested on HaluEval Summarization. The numbers were in-distribution by construction.
v2 fixes that. RAGTruth is a real RAG-trace hallucination dataset (Niu et al., 2024) where human annotators marked hallucination spans inside model-generated summaries with the source document attached. multivon's calibration has never been measured against it. If our 0.9 threshold transfers, the "calibration is the value-add" claim survives the harder test. If it collapses, we have honest data and need to rethink.
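For orientation, a RAGTruth-style case reduced to the fields this benchmark cares about might look like the sketch below. The field names are illustrative, not the dataset's actual schema; the point is only that each case carries the source document, the model-generated summary, and human-marked spans.

```python
# Illustrative shape of one benchmark case; field names are assumptions,
# not RAGTruth's real schema.
case = {
    "source_document": "...retrieved passage(s) the summary was generated from...",
    "model_summary": "...model-generated summary being judged...",
    "hallucination_spans": [(120, 158)],   # human-annotated character offsets, if any
    "hallucinated": True,                  # binary label derived from the spans
}
```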
Spoiler: it transferred. multivon-eval's F1 on RAGTruth at the HaluEval-derived threshold is higher than on HaluEval — 0.787 vs 0.630.
Results (judge: gpt-4o-mini)
At each framework's default threshold:
| Framework | Default threshold | F1 | Precision | Recall | Median latency |
|---|---|---|---|---|---|
| multivon-eval | 0.90 (calibrated) | 0.787 | 0.921 | 0.686 | 10.7s |
| DeepEval | 0.50 (uniform default) | 0.000 | 0.000 | 0.000 | 13.5s |
DeepEval's score distribution clusters above 0.5 on RAGTruth too (the same calibration problem from v1) — at the default it flags nothing.
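To make that concrete, here is a minimal sketch of how per-case faithfulness scores turn into hallucination verdicts and F1 at a given cutoff, using toy scores clustered above 0.5 the way DeepEval's are. Illustrative only: the flag-below-threshold convention and the names are assumptions, not code from the benchmark repo.

```python
from dataclasses import dataclass

@dataclass
class Case:
    score: float        # judge's faithfulness score in [0, 1]; higher = more faithful
    hallucinated: bool  # human label from the dataset

def f1_at_threshold(cases: list[Case], threshold: float) -> float:
    """Flag a case as hallucinated when its faithfulness score falls below the cutoff,
    then compare the flags against the human labels."""
    tp = sum(1 for c in cases if c.score < threshold and c.hallucinated)
    fp = sum(1 for c in cases if c.score < threshold and not c.hallucinated)
    fn = sum(1 for c in cases if c.score >= threshold and c.hallucinated)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy scores clustered above 0.5: the 0.5 default flags nothing,
# while a 0.9 cutoff starts separating the hallucinated cases.
toy = [Case(0.97, False), Case(0.93, False), Case(0.88, True), Case(0.81, True), Case(0.95, False)]
for t in (0.5, 0.7, 0.9):
    print(t, f1_at_threshold(toy, t))
```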
The threshold sweep:
| Threshold | multivon-eval F1 | DeepEval F1 |
|---|---|---|
| 0.30 | 0.000 | 0.000 |
| 0.50 | 0.000 | 0.000 |
| 0.70 | 0.179 | 0.188 |
| 0.80 | 0.375 | 0.368 |
| 0.90 | 0.786 | 0.525 |
| 0.95 | 0.854 | 0.587 |
At each framework's best threshold:
| Framework | Best F1 | At threshold | Precision | Recall |
|---|---|---|---|---|
| multivon-eval | 0.854 | 0.95 | 0.911 | 0.804 |
| DeepEval | 0.587 | 0.95 | 0.552 | 0.627 |
multivon-eval and DeepEval disagree on the verdict for 38 of 100 cases on RAGTruth (down from 56% on HaluEval; the frameworks agree more on this distribution).
What the numbers tell us
1. Calibration generalizes — that was the bet
multivon-eval's 0.9 threshold was measured on HaluEval Summarization (60 cases) and shipped in _calibration_data/v1.json. We applied it unchanged to a different dataset (RAGTruth, 100 cases) and got F1 = 0.787 with precision 0.921.
On HaluEval Sum (the in-distribution test from v1), the same threshold got F1 = 0.630 with precision 0.586. The threshold transfers — and RAGTruth's structure is apparently easier for the QAG approach to handle: precision jumped from 0.59 to 0.92, recall held.
This is the actual product. The detection prompt itself is not mysterious — DeepEval's at threshold 0.95 is competitive (F1 0.587). The differentiator is the shipped threshold table that lets a user get F1 0.787 out of the box vs 0.000 at DeepEval's default.
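Since the shipped threshold table is the claimed product, here is a hypothetical sketch of what such a table and its lookup could look like. The layout and field names are assumptions (the real _calibration_data/v1.json schema may differ); the values echo the numbers above.

```python
# Hypothetical calibration table keyed by judge x evaluator; not the real v1.json schema.
EXAMPLE_TABLE = {
    ("gpt-4o-mini", "faithfulness"): {
        "threshold": 0.90,
        "calibration_dataset": "halueval-summarization (60 cases)",
        "dataset_hash": "sha256:<hash of the calibration split>",
        "f1_at_threshold": 0.630,   # in-distribution evidence for the threshold
    },
}

def lookup_threshold(table, judge: str, evaluator: str, fallback: float = 0.5) -> float:
    """Return the calibrated threshold for a judge/evaluator pair, or a fallback default."""
    entry = table.get((judge, evaluator))
    return entry["threshold"] if entry else fallback

print(lookup_threshold(EXAMPLE_TABLE, "gpt-4o-mini", "faithfulness"))  # 0.9
```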
2. DeepEval still ships the wrong default
DeepEval at threshold 0.5 caught 0 of 49 hallucinated cases. Same root cause as v1: their FaithfulnessMetric produces a score distribution that clusters above 0.5, so the default cutoff misses everything.
DeepEval at 0.95 (its empirical optimum here): F1 0.587. At its own optimum, DeepEval is meaningfully behind multivon-eval at its own optimum (0.854). This is a different story from v1, where DeepEval edged us. The QAG approach apparently transfers better to RAGTruth's real-RAG-trace data than DeepEval's prompt does. Honest take: this is one dataset; we'd need v3 on TruthfulQA or similar before declaring a general detection-quality gap.
3. The "we slightly lost the benchmark" framing from v1 doesn't apply here
multivon-eval's RAGTruth F1 of 0.854 is the headline. v1's HaluEval result (multivon 0.679, DeepEval 0.706) was inside the noise band. v2's RAGTruth result puts the optima about 27 pp apart (0.854 vs 0.587), well outside n=100 Wilson noise (~±10 pp).
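On the noise-band claim: treating an F1-like number at n=100 as a binomial proportion and computing its 95% Wilson score interval gives a rough sense of scale. That binomial treatment is an approximation (F1 is not a simple proportion), so read this as a back-of-envelope check, not the repo's analysis code.

```python
import math

def wilson_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion p observed over n trials."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Treating the two best-threshold F1 values as proportions over the 100 test cases:
print(wilson_interval(0.854, 100))  # roughly (0.77, 0.91)
print(wilson_interval(0.587, 100))  # roughly (0.49, 0.68)
```

The two intervals, roughly (0.77, 0.91) and (0.49, 0.68), don't overlap, which is consistent with calling the gap well outside the noise band.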
We didn't update the v1 numbers — they're still true on that dataset. What changes is the cross-dataset story.
4. Cross-framework agreement is task-dependent
multivon ↔ DeepEval verdict disagreement:
- HaluEval Sum (v1): 56% disagreement (κ = 0.03)
- RAGTruth (v2): 38% disagreement (κ = 0.0)
Less disagreement on RAGTruth. Possibly because annotated spans provide a stricter task definition. Don't read too much into the κ similarity — at 38% disagreement with the base rates we have, κ ≈ 0 just means agreement is what you'd expect by chance.
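To back up the chance-agreement reading, here is a minimal sketch of disagreement rate and Cohen's kappa on toy verdict lists. The lists are invented to hit 38 disagreements out of 100 with high flag rates on both sides; they are not the actual per-case verdicts.

```python
def disagreement_rate(a: list[bool], b: list[bool]) -> float:
    """Fraction of cases where the two frameworks return different verdicts."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two binary verdict lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n           # each framework's flag rate
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance alone
    return (observed - expected) / (1 - expected)

# Toy verdicts engineered to disagree on 38 of 100 cases while both frameworks
# flag most cases; with these marginals, kappa lands near zero.
a = [True] * 75 + [False] * 25
b = [True] * 55 + [False] * 20 + [True] * 18 + [False] * 7
print(disagreement_rate(a, b))        # 0.38
print(round(cohens_kappa(a, b), 2))   # ~0.01
```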
Where multivon-eval still loses
Honest section, again:
- DeepEval's adapter doesn't speak Anthropic by default. When we passed --judge claude-haiku-4-5, DeepEval routed it to OpenAI and got a 404. Real DeepEval limitation, not our bug; multivon-eval handles it correctly via the auto-provider detection in v0.6.
- multivon-eval at threshold 0.5 still gets F1 = 0.0 on RAGTruth. Our edge is the shipped threshold, not the prompt. Without the calibration table, we have no edge over DeepEval. Worth saying out loud.
- Single-judge result. We attempted claude-haiku-4-5 too. Hit the DeepEval Anthropic-routing issue above; multivon-eval ran cleanly but a single judge isn't enough to claim the result holds across providers. Multi-judge v2 results will be in a follow-up.
What this changes for the v1 disclosure
The v1 post said:
"multivon-eval's 0.9 threshold is in-distribution for this benchmark by construction. If we'd benchmarked on RAGTruth, our default-threshold F1 would almost certainly drop."
It didn't drop — it went up. F1 at threshold 0.9: 0.630 on HaluEval Sum, 0.787 on RAGTruth. Precision went up even more dramatically (0.586 → 0.921). RAGTruth's distribution apparently plays to QAG's strengths.
The right framing now:
"multivon-eval ships a calibrated threshold derived from HaluEval Sum. On a different distribution (RAGTruth Summary), the same threshold got F1 0.787 — higher than the in-distribution result. The calibration generalizes; this is the actual product."
What's next
A multi-judge calibration sweep landed alongside this benchmark. The shipped table now covers 6 judges including gpt-5.5 (F1 0.898 on HaluEval QA hallucination at threshold 0.55) and claude-opus-4-7 (F1 0.903, the best in the catalog). See the v0.6 release notes.
The v3 benchmark adds:
- A second cross-dataset test (TruthfulQA, RAGTruth-QA)
- A llama-3.3-70b judge via vLLM (the on-prem story)
- DeepEval against its own native Anthropic adapter so the claude-haiku comparison runs cleanly
- n=500 with multi-run flakiness measurement
Reproduce
git clone https://github.com/multivon-ai/eval-framework-benchmark
cd eval-framework-benchmark
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
python run.py --task ragtruth-sum --n 100 --runs 1 \
--only multivon-eval deepeval \
--judge gpt-4o-mini
python analyze.py --task ragtruth-sum --n 100
Cost about $3 on OpenAI. The Anthropic side requires a multivon-eval 0.6.0+ install (pip install "multivon-eval>=0.6.0") for the provider auto-detection to route claude-* model ids to Anthropic.
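The provider auto-detection mentioned above can be as simple as prefix matching on the judge model id. The sketch below is illustrative only, not multivon-eval's actual routing code; the function name and provider strings are assumptions.

```python
def detect_provider(model_id: str) -> str:
    """Illustrative prefix-based provider routing for judge model ids."""
    if model_id.startswith("claude-"):
        return "anthropic"
    if model_id.startswith("gpt-"):
        return "openai"
    raise ValueError(f"no provider rule for model id {model_id!r}")

print(detect_provider("claude-haiku-4-5"))  # anthropic
print(detect_provider("gpt-4o-mini"))       # openai
```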
multivon-eval v0.6.0 is the open-source LLM eval framework behind this post. It ships calibrated thresholds per judge × evaluator (with dataset hash and F1 as evidence), an audit-package CLI, suite.lock for drift detection, and a pytest plugin. We're trying to make our claims falsifiable — this benchmark exists so you can do that yourself.