Reproducible benchmarks for LLM eval
Every number on this page is reproducible from the open-source repo. JSON results live in benchmarks/results/; this page reads them directly. Add a benchmark or judge via PR — see the submission note at the bottom.
Hallucination detection · head-to-head
halueval_qa · N=100 cases · claude-haiku-4-5 judge

| Evaluator | Precision | Recall | False positives | Latency | F1 |
|---|---|---|---|---|---|
| multivon_eval (QAG) | 0.788 | 0.820 | 11 | 2955ms | 0.804 |
| simple_judge (1-10) | 0.617 | 1.000 | 31 | 708ms | 0.763 |
| deepeval (GPT-4o-mini) | 0.456 | 0.820 | 49 | 1421ms | 0.586 |
| keyword_overlap | 0.605 | 0.460 | 15 | 0ms | 0.523 |
All four evaluators run against the same 50 HaluEval QA samples (×2 variants = 100 cases, balanced positive/negative). multivon-eval uses QAG (binary yes/no questions) instead of numeric scales — see the methodology page. Raw JSON: hallucination.json.
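As a sanity check, the F1 column is just the harmonic mean of the precision and recall columns. The sketch below recomputes it from the table values; the numbers are copied from the table above, and nothing here assumes the layout of hallucination.json.

```python
# Sketch: recompute F1 from the precision/recall columns above.
# Values are copied from the table, not read from hallucination.json.
rows = {
    "multivon_eval (QAG)": (0.788, 0.820),
    "simple_judge (1-10)": (0.617, 1.000),
    "deepeval (GPT-4o-mini)": (0.456, 0.820),
    "keyword_overlap": (0.605, 0.460),
}

for name, (precision, recall) in rows.items():
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:25s} F1 = {f1:.3f}")
```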
Multi-judge agreement · per-judge accuracy
halueval_qa · N=50 pairs · temperature=0

| # | Judge model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | gemini-2.5-flash | 0.860 | 0.950 | 0.760 | 0.844 |
| 2 | gpt-4o-mini | 0.820 | 0.900 | 0.720 | 0.800 |
| 3 | gpt-4o | 0.780 | 0.792 | 0.760 | 0.776 |
| 4 | claude-haiku-4-5 | 0.800 | 0.895 | 0.680 | 0.773 |
| 5 | claude-sonnet-4-6 | 0.720 | 0.720 | 0.720 | 0.720 |
Pairwise judge agreement (Cohen's κ)
| Judge pair | κ | Strength |
|---|---|---|
| claude-sonnet-4-6 ↔ gpt-4o | 0.800 | substantial |
| claude-haiku-4-5 ↔ gemini-2.5-flash | 0.790 | substantial |
| gpt-4o-mini ↔ gpt-4o | 0.758 | substantial |
| gpt-4o ↔ gemini-2.5-flash | 0.758 | substantial |
| gpt-4o-mini ↔ gemini-2.5-flash | 0.750 | substantial |
| claude-sonnet-4-6 ↔ gpt-4o-mini | 0.720 | substantial |
| claude-sonnet-4-6 ↔ gemini-2.5-flash | 0.720 | substantial |
| claude-haiku-4-5 ↔ gpt-4o | 0.717 | substantial |
| claude-haiku-4-5 ↔ gpt-4o-mini | 0.706 | substantial |
| claude-haiku-4-5 ↔ claude-sonnet-4-6 | 0.600 | moderate |
κ interpretation per Landis & Koch (1977). Pairs with κ < 0.61 indicate genuine judge disagreement on hard cases — the examples that benefit most from cross-judge calibration.
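For reference, Cohen's κ compares observed agreement between two judges with the agreement expected by chance. A minimal sketch for binary verdicts follows; the verdict lists it takes are illustrative and not tied to the actual result schema.

```python
# Sketch: Cohen's kappa for two judges' binary verdicts on the same cases.
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    p_both_pos = (sum(a) / n) * (sum(b) / n)               # chance both flag a hallucination
    p_both_neg = (1 - sum(a) / n) * (1 - sum(b) / n)       # chance both call it faithful
    p_e = p_both_pos + p_both_neg                          # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Landis & Koch bands: <=0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
# 0.61-0.80 substantial, 0.81-1.00 almost perfect.
```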
Cost & latency
50 cases · 4 LLM-judge evaluators · workers=1

| Metric | Value | Notes |
|---|---|---|
| Cost per case | $0.00127 | 17.1 judge calls per case |
| Total cost for the run | $0.0635 | 46,294 input + 6,605 output tokens |
| Wall clock | 15.0 min | 18.0s per case avg |
| Extrapolation to 5,000 cases | $6.35 | Linear extrapolation; workers=1 (single LLM call at a time) |
| Deterministic tier (no LLM) | $0.00 | 3 evaluators, instant |
Workers=1 is enforced for the cost benchmark so per-call usage records reach the active CostTracker. The 5,000-case extrapolation is linear and ignores judge-cache hits (see the next section for what caching does to re-runs). Raw JSON: cost_latency.json.
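The extrapolation is plain per-case arithmetic over the run totals. The sketch below reproduces the quoted figures from the table values; it does not read cost_latency.json, and a real re-run would come in lower once judge-cache hits kick in.

```python
# Sketch: the linear cost extrapolation quoted in the table above.
total_cost_usd = 0.0635   # total for the 50-case run
cases = 50

cost_per_case = total_cost_usd / cases
print(f"per case:    ${cost_per_case:.5f}")          # $0.00127
print(f"5,000 cases: ${cost_per_case * 5000:.2f}")   # $6.35 (ignores cache hits)
```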
Reproducibility & cache
halueval_qa (first 10) · 10 cases × 10 reps · claude-haiku-4-5

| Mode | Wall clock | pass_rate σ | avg_score σ | N reps |
|---|---|---|---|---|
| Cache ON (rep 1 cold, rest warm) | 8.5s | 0.0000 | 0.0000 | 10 |
| Cache OFF (every call hits the API) | 74.3s | 0.0422 | 0.0084 | 10 |
Cache hit speedup: 8.7× (cold → warm rerun). Cache ON reproduces identical scores rep-to-rep. Cache OFF exposes the irreducible judge variance at temperature=0. Raw JSON: reproducibility.json.
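The σ columns are the standard deviation of pass_rate and avg_score across the 10 repetitions. A small sketch of that computation follows; the rep values are illustrative, not taken from reproducibility.json, and whether the script uses the population or sample deviation is an assumption here.

```python
import statistics

# Sketch: spread of pass_rate across 10 reps. Lists are illustrative only.
cache_on  = [0.9] * 10                                            # cached verdicts repeat exactly
cache_off = [0.9, 0.9, 1.0, 0.9, 0.8, 0.9, 1.0, 0.9, 0.9, 0.9]    # judge variance at temperature=0

print(statistics.pstdev(cache_on))    # 0.0 -> identical scores rep-to-rep
print(statistics.pstdev(cache_off))   # > 0 -> irreducible API-side variance
```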
Reproduce locally
```bash
git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval/benchmarks
pip install -e .. anthropic openai google-genai
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...

# pick one (each writes to benchmarks/results/*.json):
python run_hallucination_benchmark.py
python run_multi_judge_agreement_benchmark.py
python run_cost_latency_benchmark.py
python run_reproducibility_benchmark.py
```
All four scripts run at temperature=0 and are deterministic modulo API-side non-determinism. Expect roughly 50–120s of wall clock each with the in-house judges, longer when external judges are added.
Submit a new judge
Open-source judges (Patronus Lynx, Prometheus-2, Vectara HHEM, others) are actively being added. PRs welcome: add a JudgeConfig entry in run_multi_judge_agreement_benchmark.py or wire an external-judge adapter in run_external_judges_benchmark.py, run the benchmark, commit the updated JSON. This page auto-updates on the next deploy.
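A minimal sketch of what such a PR adds is below. The field names are illustrative assumptions, not the actual JudgeConfig signature; check run_multi_judge_agreement_benchmark.py for the real fields before copying.

```python
# Hypothetical sketch only -- field names are assumptions, not the real
# JudgeConfig API; see run_multi_judge_agreement_benchmark.py.
new_judge = JudgeConfig(
    name="prometheus-2",                        # label that ends up in the results JSON
    model="prometheus-eval/prometheus-7b-v2.0", # model identifier for the adapter
    temperature=0,                              # keep it comparable to the existing judges
)
```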