multivon-bench · v0

Reproducible benchmarks for LLM eval

Every number on this page is reproducible from the open-source repo. JSON results live in benchmarks/results/; this page reads them directly. Add a benchmark or judge via PR — see the submission note at the bottom.

Hallucination detection · head-to-head

halueval_qa · N=100 cases · claude-haiku-4-5 judge
| Evaluator | Precision | Recall | False positives | Latency | F1 |
|---|---|---|---|---|---|
| multivon_eval (QAG) | 0.788 | 0.820 | 11 | 2955ms | 0.804 |
| simple_judge (1-10) | 0.617 | 1.000 | 31 | 708ms | 0.763 |
| deepeval (GPT-4o-mini) | 0.456 | 0.820 | 49 | 1421ms | 0.586 |
| keyword_overlap | 0.605 | 0.460 | 15 | 0ms | 0.523 |

All four evaluators run against the same 50 HaluEval QA samples (×2 variants = 100 cases, balanced positive/negative). multivon_eval uses QAG (binary yes/no questions) instead of numeric scales — see the methodology page. Raw JSON: hallucination.json.
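For intuition, here is a minimal sketch of the QAG loop; the function name and the judge callable are illustrative assumptions, not multivon-eval's actual API:

# Illustrative QAG sketch (not the library's API). Instead of asking a
# judge for a 1-10 faithfulness score, QAG decomposes the answer into
# closed yes/no questions and scores the fraction answerable from the
# source context.
def qag_score(answer: str, context: str, judge) -> float:
    raw = judge(f"Write one yes/no question per factual claim in:\n{answer}")
    questions = [q for q in raw.splitlines() if q.strip()]
    verdicts = [
        judge(f"Context:\n{context}\n\nAnswer strictly yes or no: {q}")
        for q in questions
    ]
    yes = sum(v.strip().lower().startswith("yes") for v in verdicts)
    return yes / max(len(verdicts), 1)  # 1.0 = fully grounded, 0.0 = hallucinated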

Multi-judge agreement · per-judge accuracy

halueval_qa · N=50 pairs · temperature=0
| # | Judge model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | gemini-2.5-flash | 0.860 | 0.950 | 0.760 | 0.844 |
| 2 | gpt-4o-mini | 0.820 | 0.900 | 0.720 | 0.800 |
| 3 | gpt-4o | 0.780 | 0.792 | 0.760 | 0.776 |
| 4 | claude-haiku-4-5 | 0.800 | 0.895 | 0.680 | 0.773 |
| 5 | claude-sonnet-4-6 | 0.720 | 0.720 | 0.720 | 0.720 |
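The columns follow the standard confusion-matrix definitions; as a quick sanity check against row 1 (gemini-2.5-flash), F1 is the harmonic mean of precision and recall:

# Sanity check for row 1, using values from the table above.
precision, recall = 0.950, 0.760
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.844, matching the table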

Pairwise judge agreement (Cohen's κ)

| Judge pair | κ | Strength |
|---|---|---|
| claude-sonnet-4-6 ↔ gpt-4o | 0.800 | substantial |
| claude-haiku-4-5 ↔ gemini-2.5-flash | 0.790 | substantial |
| gpt-4o-mini ↔ gpt-4o | 0.758 | substantial |
| gpt-4o ↔ gemini-2.5-flash | 0.758 | substantial |
| gpt-4o-mini ↔ gemini-2.5-flash | 0.750 | substantial |
| claude-sonnet-4-6 ↔ gpt-4o-mini | 0.720 | substantial |
| claude-sonnet-4-6 ↔ gemini-2.5-flash | 0.720 | substantial |
| claude-haiku-4-5 ↔ gpt-4o | 0.717 | substantial |
| claude-haiku-4-5 ↔ gpt-4o-mini | 0.706 | substantial |
| claude-haiku-4-5 ↔ claude-sonnet-4-6 | 0.600 | moderate |

κ interpretation per Landis & Koch (1977). Pairs with κ < 0.61 indicate genuine judge disagreement on hard cases — the examples that benefit most from cross-judge calibration.
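Cohen's κ discounts the raw agreement rate by the agreement two judges would reach by chance given their marginal verdict rates. A self-contained sketch (not the repo's implementation):

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary verdict lists (1 = hallucination flagged)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # marginal positive rates
    p_e = pa * pb + (1 - pa) * (1 - pb)          # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)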

Cost & latency

50 cases · 4 LLM-judge evaluators · workers=1
| Metric | Value | Notes |
|---|---|---|
| Cost per case | $0.00127 | 17.1 judge calls per case |
| Total cost for the run | $0.0635 | 46,294 input + 6,605 output tokens |
| Wall clock | 15.0 min | 18.0s per case avg |
| Extrapolation to 5,000 cases | $6.35 | Linear extrapolation; workers=1 (single LLM call at a time) |
| Deterministic tier (no LLM) | $0.00 | 3 evaluators, instant |

Workers=1 is enforced for the cost benchmark so per-call usage records reach the active CostTracker. The 5,000-case extrapolation is linear and ignores judge-cache hits (see the next section for what caching does to re-runs). Raw JSON: cost_latency.json.
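The extrapolation itself is plain arithmetic over the measured run (all numbers from the table above):

# Linear extrapolation from the measured 50-case run.
total_cost, cases = 0.0635, 50
per_case = total_cost / cases      # $0.00127
print(round(per_case * 5_000, 2))  # 6.35; ignores cache hits, so an upper bound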

Reproducibility & cache

halueval_qa (first 10) · 10 cases × 10 reps · claude-haiku-4-5
| Mode | Wall clock | pass_rate σ | avg_score σ | N reps |
|---|---|---|---|---|
| Cache ON (rep 1 cold, rest warm) | 8.5s | 0.0000 | 0.0000 | 10 |
| Cache OFF (every call hits the API) | 74.3s | 0.0422 | 0.0084 | 10 |

Cache hit speedup: 8.7× (cold first rep → warm reruns). With the cache ON, scores are identical rep-to-rep (both σ columns are zero); with it OFF, the σ columns expose the irreducible judge variance that remains even at temperature=0. Raw JSON: reproducibility.json.
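The determinism with the cache ON follows from how judge calls are keyed: everything that can influence a completion has to go into the key. A minimal sketch of one way to build such a key (illustrative, not the repo's scheme):

import hashlib, json

def judge_cache_key(model: str, prompt: str, temperature: float = 0.0) -> str:
    # Any field that can change the completion must be part of the key;
    # otherwise a warm hit could silently return a stale verdict.
    payload = json.dumps(
        {"model": model, "prompt": prompt, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()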

Reproduce locally

git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval/benchmarks
pip install -e .. anthropic openai google-genai  # package (editable) + the three judge SDKs
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...

# pick one (each writes to benchmarks/results/*.json):
python run_hallucination_benchmark.py
python run_multi_judge_agreement_benchmark.py
python run_cost_latency_benchmark.py
python run_reproducibility_benchmark.py

All four scripts run at temperature=0 and are deterministic up to provider-side API non-determinism. Wall-clock estimates: ~50–120s each for the in-house judges; longer when external judges are added.

Submit a new judge

Open-source judges (Patronus Lynx, Prometheus-2, Vectara HHEM, others) are actively being added. PRs welcome: add a JudgeConfig entry in run_multi_judge_agreement_benchmark.py or wire an external-judge adapter in run_external_judges_benchmark.py, run the benchmark, commit the updated JSON. This page auto-updates on the next deploy.
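Purely as an illustration, a new entry might look like the sketch below; the field names here are assumptions, so copy an existing JudgeConfig from the script rather than this:

# Hypothetical sketch: field names are assumptions, not the actual
# JudgeConfig signature. Adapt an existing entry from
# run_multi_judge_agreement_benchmark.py.
JUDGES.append(JudgeConfig(
    name="prometheus-2",                         # label used in the results JSON
    model="prometheus-eval/prometheus-7b-v2.0",  # judge model id
    provider="external",                         # routed via the external-judge adapter
))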

Open the external-judge harness →