Reproducible benchmarks for LLM eval
Every number on this page is reproducible from the open-source repo. JSON results live in benchmarks/results/; this page reads them directly. Add a benchmark or judge via PR — see the submission note at the bottom.
Hallucination detection · head-to-head
halueval_qa · N=100 cases · claude-haiku-4-5 judge

| Evaluator | Precision | Recall | False positives | Latency | F1 |
|---|---|---|---|---|---|
| multivon_eval (QAG) | 0.788 | 0.820 | 11 | 2955ms | 0.804 |
| simple_judge (1-10) | 0.617 | 1.000 | 31 | 708ms | 0.763 |
| deepeval (GPT-4o-mini) | 0.456 | 0.820 | 49 | 1421ms | 0.586 |
| keyword_overlap | 0.605 | 0.460 | 15 | 0ms | 0.523 |
All four evaluators run against the same 50 HaluEval QA samples (×2 variants = 100 cases, balanced positive/negative). multivon-eval uses QAG (binary yes/no questions) instead of numeric scales — see the methodology page. Raw JSON: hallucination.json.
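As a sanity check, the F1 column is just the harmonic mean of the precision and recall columns. The sketch below recomputes it from the table values; the numbers are copied from the table above, and nothing here assumes the layout of hallucination.json.

```python
# Sketch: recompute F1 from the precision/recall columns above.
# Values are copied from the table, not read from hallucination.json.
rows = {
    "multivon_eval (QAG)": (0.788, 0.820),
    "simple_judge (1-10)": (0.617, 1.000),
    "deepeval (GPT-4o-mini)": (0.456, 0.820),
    "keyword_overlap": (0.605, 0.460),
}

for name, (precision, recall) in rows.items():
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:25s} F1 = {f1:.3f}")
```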
Multi-judge agreement · per-judge accuracy
halueval_qa · N=50 pairs · temperature=0

| # | Judge model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | gemini-2.5-flash | 0.860 | 0.950 | 0.760 | 0.844 |
| 2 | gpt-4o-mini | 0.820 | 0.900 | 0.720 | 0.800 |
| 3 | gpt-4o | 0.780 | 0.792 | 0.760 | 0.776 |
| 4 | claude-haiku-4-5 | 0.800 | 0.895 | 0.680 | 0.773 |
| 5 | claude-sonnet-4-6 | 0.720 | 0.720 | 0.720 | 0.720 |
Pairwise judge agreement (Cohen's κ)
| Judge pair | κ | Strength |
|---|---|---|
| claude-sonnet-4-6 ↔ gpt-4o | 0.800 | substantial |
| claude-haiku-4-5 ↔ gemini-2.5-flash | 0.790 | substantial |
| gpt-4o-mini ↔ gpt-4o | 0.758 | substantial |
| gpt-4o ↔ gemini-2.5-flash | 0.758 | substantial |
| gpt-4o-mini ↔ gemini-2.5-flash | 0.750 | substantial |
| claude-sonnet-4-6 ↔ gpt-4o-mini | 0.720 | substantial |
| claude-sonnet-4-6 ↔ gemini-2.5-flash | 0.720 | substantial |
| claude-haiku-4-5 ↔ gpt-4o | 0.717 | substantial |
| claude-haiku-4-5 ↔ gpt-4o-mini | 0.706 | substantial |
| claude-haiku-4-5 ↔ claude-sonnet-4-6 | 0.600 | moderate |
κ interpretation per Landis & Koch (1977). Pairs with κ < 0.61 indicate genuine judge disagreement on hard cases — the examples that benefit most from cross-judge calibration.
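For reference, Cohen's κ compares observed agreement between two judges with the agreement expected by chance. A minimal sketch for binary verdicts follows; the verdict lists it takes are illustrative and not tied to the actual result schema.

```python
# Sketch: Cohen's kappa for two judges' binary verdicts on the same cases.
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    p_both_pos = (sum(a) / n) * (sum(b) / n)               # chance both flag a hallucination
    p_both_neg = (1 - sum(a) / n) * (1 - sum(b) / n)       # chance both call it faithful
    p_e = p_both_pos + p_both_neg                          # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Landis & Koch bands: <=0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
# 0.61-0.80 substantial, 0.81-1.00 almost perfect.
```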
Cost & latency
50 cases · 4 LLM-judge evaluators · workers=1

| Metric | Value | Notes |
|---|---|---|
| Cost per case | $0.00127 | 17.1 judge calls per case |
| Total cost for the run | $0.0635 | 46,294 input + 6,605 output tokens |
| Wall clock | 15.0 min | 18.0s per case avg |
| Extrapolation to 5,000 cases | $6.35 | Linear extrapolation; workers=1 (single LLM call at a time) |
| Deterministic tier (no LLM) | $0.00 | 3 evaluators, instant |
Workers=1 is enforced for the cost benchmark so per-call usage records reach the active CostTracker. The 5,000-case extrapolation is linear and ignores judge-cache hits (see the next section for what caching does to re-runs). Raw JSON: cost_latency.json.
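The extrapolation is plain per-case arithmetic over the run totals. The sketch below reproduces the quoted figures from the table values; it does not read cost_latency.json, and a real re-run would come in lower once judge-cache hits kick in.

```python
# Sketch: the linear cost extrapolation quoted in the table above.
total_cost_usd = 0.0635   # total for the 50-case run
cases = 50

cost_per_case = total_cost_usd / cases
print(f"per case:    ${cost_per_case:.5f}")          # $0.00127
print(f"5,000 cases: ${cost_per_case * 5000:.2f}")   # $6.35 (ignores cache hits)
```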
Reproducibility & cache
halueval_qa (first 10) · 10 cases × 10 reps · claude-haiku-4-5

| Mode | Wall clock | pass_rate σ | avg_score σ | N reps |
|---|---|---|---|---|
| Cache ON (rep 1 cold, rest warm) | 8.5s | 0.0000 | 0.0000 | 10 |
| Cache OFF (every call hits the API) | 74.3s | 0.0422 | 0.0084 | 10 |
Cache hit speedup: 8.7× (cold → warm rerun). Cache ON reproduces identical scores rep-to-rep. Cache OFF exposes the irreducible judge variance at temperature=0. Raw JSON: reproducibility.json.
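The σ columns are the standard deviation of pass_rate and avg_score across the 10 repetitions. A small sketch of that computation follows; the rep values are illustrative, not taken from reproducibility.json, and whether the script uses the population or sample deviation is an assumption here.

```python
import statistics

# Sketch: spread of pass_rate across 10 reps. Lists are illustrative only.
cache_on  = [0.9] * 10                                            # cached verdicts repeat exactly
cache_off = [0.9, 0.9, 1.0, 0.9, 0.8, 0.9, 1.0, 0.9, 0.9, 0.9]    # judge variance at temperature=0

print(statistics.pstdev(cache_on))    # 0.0 -> identical scores rep-to-rep
print(statistics.pstdev(cache_off))   # > 0 -> irreducible API-side variance
```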
Reproduce locally
```bash
git clone https://github.com/multivon-ai/multivon-eval
cd multivon-eval/benchmarks
pip install -e .. anthropic openai google-genai
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...

# pick one (each writes to benchmarks/results/*.json):
python run_hallucination_benchmark.py
python run_multi_judge_agreement_benchmark.py
python run_cost_latency_benchmark.py
python run_reproducibility_benchmark.py
```
All four scripts run at temperature=0 and are deterministic modulo API-side non-determinism. Expect roughly 50–120s of wall clock each with the in-house judges, longer when external judges are added.
Submit a new judge
Open-source judges (Patronus Lynx, Prometheus-2, Vectara HHEM, others) are actively being added. PRs welcome: add a JudgeConfig entry in run_multi_judge_agreement_benchmark.py or wire an external-judge adapter in run_external_judges_benchmark.py, run the benchmark, commit the updated JSON. This page auto-updates on the next deploy.
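A minimal sketch of what such a PR adds is below. The field names are illustrative assumptions, not the actual JudgeConfig signature; check run_multi_judge_agreement_benchmark.py for the real fields before copying.

```python
# Hypothetical sketch only -- field names are assumptions, not the real
# JudgeConfig API; see run_multi_judge_agreement_benchmark.py.
new_judge = JudgeConfig(
    name="prometheus-2",                        # label that ends up in the results JSON
    model="prometheus-eval/prometheus-7b-v2.0", # model identifier for the adapter
    temperature=0,                              # keep it comparable to the existing judges
)
```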