Benchmarks · May 14, 2026

v2 benchmark: calibration transfers across datasets. multivon-eval F1 went up, not down.

Re-ran the head-to-head on RAGTruth — a dataset multivon-eval's calibration has never seen. At our HaluEval-derived threshold, F1 = 0.787, higher than the in-distribution 0.630. That removes the v1 circularity caveat with real numbers.

Numbers come from a 100-case RAGTruth Summary test split, shared judge (gpt-4o-mini), 1 run per framework. Reproduce with the eval-framework-benchmark repo. What's different from v1: RAGTruth is a dataset multivon-eval's calibration table has never seen — the v1 circularity caveat is now removed.

The v1 benchmark post had to disclose a circularity caveat: multivon-eval's 0.9 default threshold for gpt-4o-mini was measured on HaluEval Summarization, and v1 also tested on HaluEval Summarization. The numbers were in-distribution by construction.

v2 fixes that. RAGTruth is a real RAG-trace hallucination dataset (Niu et al., 2024) in which human annotators marked hallucination spans inside model-generated summaries, with the source document attached to each case. multivon-eval's calibration has never been measured against it. If our 0.9 threshold transfers, the "calibration is the value-add" claim survives the harder test. If it collapses, we have honest data and need to rethink.

Spoiler: it transferred. multivon-eval's F1 on RAGTruth at the HaluEval-derived threshold is higher than on HaluEval — 0.787 vs 0.630.

Results (judge: gpt-4o-mini)

At each framework's default threshold:

Framework       Default threshold        F1      Precision   Recall   Median latency
multivon-eval   0.90 (calibrated)        0.787   0.921       0.686    10.7s
DeepEval        0.50 (uniform default)   0.000   0.000       0.000    13.5s

DeepEval's score distribution clusters above 0.5 on RAGTruth too (the same calibration problem from v1) — at the default it flags nothing.

The threshold sweep:

Threshold   multivon-eval F1   DeepEval F1
0.30        0.000              0.000
0.50        0.000              0.000
0.70        0.179              0.188
0.80        0.375              0.368
0.90        0.786              0.525
0.95        0.854              0.587

At each framework's best threshold:

Framework       Best F1   At threshold   Precision   Recall
multivon-eval   0.854     0.95           0.911       0.804
DeepEval        0.587     0.95           0.552       0.627

multivon-eval and DeepEval disagree on the verdict for 38 of 100 cases on RAGTruth (down from 56% on HaluEval — the frameworks agree more on this distribution).

What the numbers tell us

1. Calibration generalizes — that was the bet

multivon-eval's 0.9 threshold was measured on HaluEval Summarization (60 cases) and shipped in _calibration_data/v1.json. We applied it unchanged to a different dataset (RAGTruth, 100 cases) and got F1 = 0.787 with precision 0.921.

On HaluEval Sum (the in-distribution test from v1), the same threshold got F1 = 0.630 with precision 0.586. The threshold transfers — and RAGTruth's structure is apparently easier for the QAG approach to handle: precision jumped from 0.59 to 0.92, recall held.

This is the actual product. The detection prompt itself is not mysterious — DeepEval's at threshold 0.95 is competitive (F1 0.587). The differentiator is the shipped threshold table that lets a user get F1 0.787 out of the box vs 0.000 at DeepEval's default.
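
To make "shipped threshold table" concrete, here is a minimal sketch of the lookup-and-flag pattern it enables. The schema and field names below are illustrative, not the actual contents of _calibration_data/v1.json, and it assumes (as the threshold sweep suggests) that a case is flagged when its faithfulness-style score falls below the cutoff:

# Hypothetical calibration table in the spirit of _calibration_data/v1.json;
# the layout and field names are illustrative, not the real schema.
CALIBRATION = {
    "gpt-4o-mini": {
        "faithfulness": {
            "threshold": 0.90,
            "evidence": {"dataset": "HaluEval-Summarization", "n": 60, "f1": 0.630},
        }
    }
}

def is_hallucination(score: float, judge: str, evaluator: str = "faithfulness") -> bool:
    """Flag a case when the judge's score falls below the calibrated cutoff
    for this judge/evaluator pair, rather than a uniform 0.5 default."""
    return score < CALIBRATION[judge][evaluator]["threshold"]

# With scores clustered in the 0.6-0.95 band, a 0.5 default flags almost nothing,
# while the calibrated 0.90 cutoff separates these two cases.
print(is_hallucination(0.72, "gpt-4o-mini"))  # True  (below 0.90)
print(is_hallucination(0.97, "gpt-4o-mini"))  # False (above 0.90)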

2. DeepEval still ships the wrong default

DeepEval at threshold 0.5 caught 0 of 49 hallucinated cases. Same root cause as v1: their FaithfulnessMetric produces a score distribution that clusters above 0.5, so the default cutoff misses everything.
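
For anyone reproducing the workaround rather than waiting on a changed default, the cutoff is just a constructor argument — a minimal sketch assuming DeepEval's current FaithfulnessMetric / LLMTestCase API, with a made-up test case (requires OPENAI_API_KEY):

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Raise the cutoff from DeepEval's 0.50 default to the 0.95 optimum we found on RAGTruth.
metric = FaithfulnessMetric(threshold=0.95, model="gpt-4o-mini")

# Made-up example: the summary inflates a figure the source states differently.
case = LLMTestCase(
    input="Summarize the document.",
    actual_output="The company reported a 40% revenue increase in 2023.",
    retrieval_context=["The company reported a 4% revenue increase in 2023."],
)

metric.measure(case)
print(metric.score, metric.is_successful())  # fails whenever the score dips below 0.95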

DeepEval at 0.95 (its empirical optimum here): F1 0.587. At its own optimum, DeepEval is meaningfully behind multivon-eval's optimum of 0.854. This is a different story from v1, where DeepEval edged us. The QAG approach apparently transfers better to RAGTruth's real-RAG-trace data than DeepEval's prompt does. Honest take: this is one dataset; we'd need v3 on TruthfulQA or similar before declaring a general detection-quality gap.

3. The "we slightly lost the benchmark" framing from v1 doesn't apply here

multivon-eval's RAGTruth F1 of 0.854 is the headline. v1's HaluEval result (multivon 0.679, DeepEval 0.706) was inside the noise band. v2's RAGTruth optima are 26 pp apart — well outside n=100 Wilson noise (~±10 pp).
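
For the noise-band figure: treating an F1 point estimate as if it were a simple proportion over the n=100 cases (the same simplification the ±10 pp number implies), the 95% Wilson half-widths come out under ±10 pp — a back-of-the-envelope sketch, not output from the benchmark harness:

from math import sqrt

def wilson_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the 95% Wilson score interval for a proportion p over n trials."""
    return (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))

# Treating each best-threshold F1 as a proportion over the 100 RAGTruth cases:
for label, p in [("multivon-eval", 0.854), ("DeepEval", 0.587)]:
    print(f"{label}: ±{wilson_halfwidth(p, 100):.2f}")
# -> about ±0.07 and ±0.09; the 0.854 vs 0.587 gap sits well outside either band.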

We didn't update the v1 numbers — they're still true on that dataset. What changes is the cross-dataset story.

4. Cross-framework agreement is task-dependent

multivon ↔ DeepEval flip:

  • HaluEval Sum (v1): 56% disagreement (κ = 0.03)
  • RAGTruth (v2): 38% disagreement (κ = 0.0)

Less disagreement on RAGTruth. Possibly because annotated spans provide a stricter task definition. Don't read too much into the κ similarity — at 38% disagreement with the base rates we have, κ ≈ 0 just means agreement is what you'd expect by chance.
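
To make the "expected by chance" point concrete: κ compares observed agreement with the agreement implied by each framework's flag rate, so a lopsided flag rate drives it to zero even with 62% raw agreement. A toy illustration — the verdict vectors below are fabricated to match the 38-case flip count, not pulled from the benchmark output:

from sklearn.metrics import cohen_kappa_score

# Toy verdicts over 100 cases: framework A flags 38, framework B flags none.
a = [1] * 38 + [0] * 62
b = [0] * 100

p_o = sum(x == y for x, y in zip(a, b)) / 100   # observed agreement: 0.62
p_e = 0.38 * 0.00 + 0.62 * 1.00                 # chance agreement:   0.62
print((p_o - p_e) / (1 - p_e))                  # kappa by hand -> 0.0
print(cohen_kappa_score(a, b))                  # same via scikit-learn -> 0.0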

Where multivon-eval still loses

Honest section, again:

  • DeepEval's adapter doesn't speak Anthropic by default. When we passed --judge claude-haiku-4-5, DeepEval routed it to OpenAI and got a 404. Real DeepEval limitation, not our bug; multivon-eval handles it via the provider auto-detection added in v0.6.
  • multivon-eval at threshold 0.5 still gets F1 = 0.0 on RAGTruth. Our edge is the shipped threshold, not the prompt — without the calibration table we'd be sitting at 0.0 right alongside DeepEval. Worth saying out loud.
  • Single-judge result. We attempted claude-haiku-4-5 too and hit the DeepEval Anthropic-routing issue above; multivon-eval ran cleanly, but a single judge isn't enough to claim the result holds across providers. Multi-judge v2 results will come in a follow-up.

What this changes for the v1 disclosure

The v1 post said:

"multivon-eval's 0.9 threshold is in-distribution for this benchmark by construction. If we'd benchmarked on RAGTruth, our default-threshold F1 would almost certainly drop."

It didn't drop — it went up. F1 at threshold 0.9: 0.630 on HaluEval Sum, 0.787 on RAGTruth. Precision went up even more dramatically (0.586 → 0.921). RAGTruth's distribution apparently plays to QAG's strengths.

The right framing now:

"multivon-eval ships a calibrated threshold derived from HaluEval Sum. On a different distribution (RAGTruth Summary), the same threshold got F1 0.787 — higher than the in-distribution result. The calibration generalizes; this is the actual product."

What's next

A multi-judge calibration sweep landed alongside this benchmark. The shipped table now covers 6 judges including gpt-5.5 (F1 0.898 on HaluEval QA hallucination at threshold 0.55) and claude-opus-4-7 (F1 0.903, the best in the catalog). See the v0.6 release notes.

The v3 benchmark adds:

  • A second cross-dataset test (TruthfulQA, RAGTruth-QA)
  • A llama-3.3-70b judge via vLLM (the on-prem story)
  • DeepEval against its own native Anthropic adapter so the claude-haiku comparison runs cleanly
  • n=500 with multi-run flakiness measurement

Reproduce

git clone https://github.com/multivon-ai/eval-framework-benchmark
cd eval-framework-benchmark
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...

python run.py --task ragtruth-sum --n 100 --runs 1 \
    --only multivon-eval deepeval \
    --judge gpt-4o-mini

python analyze.py --task ragtruth-sum --n 100

Costs about $3 on OpenAI. The Anthropic side requires a multivon-eval 0.6.0+ install (pip install "multivon-eval>=0.6.0") so the provider auto-detection routes claude-* model ids to Anthropic.


multivon-eval v0.6.0 is the open-source LLM eval framework behind this post. It ships calibrated thresholds per judge × evaluator (with dataset hash and F1 as evidence), an audit-package CLI, suite.lock for drift detection, and a pytest plugin. We're trying to make our claims falsifiable — this benchmark exists so you can do that yourself.

Multivon builds AI evaluation tooling for teams shipping models to production. Get in touch.