If you run your LLM test suite once and ship based on the result, your evaluation is lying to you.
This is not a theoretical concern. A paper published at NAACL 2025 — "Evaluation of LLMs Should Not Ignore Non-Determinism" — formally demonstrated what practitioners have quietly known for a while: single-run evaluation scores are so noisy they routinely misrank models and miss regressions. The authors ran the same evaluation 30 times per benchmark and found that model rankings swap depending on the run. A model that "wins" Monday might "lose" Thursday. Same models, same prompts, same benchmark. Different day.
At temperature=0, this is especially counterintuitive. Atil et al. (Eval4NLP 2025) found that even with temperature set to zero, different GPU configurations, batching strategies, and numerical precision choices produce different outputs for the same input. The seed you pass to your eval runner doesn't control what happens inside the model.
This post explains what's causing it, how bad the problem actually is, and how to detect it in your own eval pipeline.
The Problem in One Chart
Illustrative example — numbers chosen to demonstrate the pattern; the NAACL paper documents this phenomenon with real benchmark data across multiple model pairs.
Imagine you're comparing two versions of a RAG pipeline. Version A passes 82% of your faithfulness test cases. Version B passes 79%. You ship Version A.
But if you'd run the eval 10 times each:
- Version A: mean 79%, 95% CI [74%, 84%]
- Version B: mean 81%, 95% CI [76%, 86%]
The confidence intervals overlap completely. The 3-point difference you acted on was measurement noise, not a real improvement.
This is what NAACL 2025 found systematically across real benchmarks. Single-run evals don't have the statistical power to detect small-to-medium differences — which is exactly the range most real-world improvements fall into.
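If you want to see where an interval like that comes from, here's a quick sketch: a t-based 95% confidence interval over ten hypothetical per-run pass rates (the numbers are invented to match the illustration above, not real benchmark data):
import statistics

runs_a = [0.88, 0.72, 0.74, 0.85, 0.70, 0.86, 0.78, 0.73, 0.84, 0.80]  # hypothetical Version A runs
mean = statistics.mean(runs_a)
sem = statistics.stdev(runs_a) / len(runs_a) ** 0.5
t_crit = 2.262  # two-sided 95% critical value for 9 degrees of freedom
print(f"{mean:.0%} ± {t_crit * sem:.0%}")  # → 79% ± 5%, roughly the [74%, 84%] interval above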
Three Sources of Non-Determinism
1. Temperature > 0 (obvious)
Temperature controls how much randomness is injected into token sampling. Even temperature=0.1 introduces substantial variance across many tokens. Most production evals run between 0.1 and 0.7 — all of them are affected.
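A back-of-envelope illustration (the per-token agreement rate is an assumption, not a measurement): even near-greedy sampling drifts once you compound it over a long completion.
p_token_match = 0.995        # assumed chance a sampled token equals the greedy token at low temperature
completion_length = 500      # tokens
print(p_token_match ** completion_length)  # ≈ 0.08: only ~8% of completions match the greedy output exactly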
2. Temperature = 0 (not obvious)
At temperature=0, greedy decoding should be deterministic. In practice it isn't. The NAACL paper confirmed this, and an Eval4NLP 2025 analysis traced the causes:
- GPU non-determinism: floating point addition is not associative, so the order in which parallel cores accumulate partial sums changes the result (illustrated in the sketch below)
- Batching effects: results change when the same prompt is in a batch of 1 vs a batch of 8
- API-level non-determinism: cloud providers route requests to different hardware configurations
OpenAI's own documentation acknowledges that identical parameters do not guarantee identical outputs across different calls.
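The first cause is easy to reproduce without a GPU. A minimal illustration of why summation order matters in floating point:
vals = [1e16, 1.0, -1e16, 1.0]
left_to_right = sum(vals)                              # (1e16 + 1.0) rounds the 1.0 away, result is 1.0
regrouped = (vals[0] + vals[2]) + (vals[1] + vals[3])  # cancel the large terms first, result is 2.0
print(left_to_right, regrouped)                        # 1.0 2.0: same numbers, different grouping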
3. LLM Judges (compounding)
If you use an LLM-as-judge evaluator — which most modern evals do — you've introduced a second source of non-determinism on top of your model's outputs: the judge itself varies. In practice, LLM judges using open-ended numeric scales (1–10) produce inconsistent scores across runs; our hallucination benchmark on HaluEval found that a simple 1–10 judge produced 31 false positives on 100 cases, flagging faithfully answered questions as hallucinations. Most eval frameworks don't detect or retry this kind of judge inconsistency.
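One quick way to quantify judge noise on your own data, sketched here with a hypothetical judge(answer, context) callable that returns "PASS" or "FAIL" (this is not a multivon-eval API):
from collections import Counter

def judge_stability(judge, answer, context, n=10):
    # Score the same (answer, context) pair n times and measure self-agreement
    verdicts = Counter(judge(answer, context) for _ in range(n))
    verdict, count = verdicts.most_common(1)[0]
    return verdict, count / n  # majority verdict and the fraction of runs that agree with it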
How to Detect It
The simplest test: run your eval suite 5 times. Compute the standard deviation of your pass rate across runs. If it's above 3 percentage points, your eval is too noisy to detect meaningful regressions.
from multivon_eval import EvalSuite, Faithfulness
import statistics

suite = EvalSuite("Faithfulness Check")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness())

# Run the identical suite 5 times and measure how much the pass rate moves
pass_rates = []
for _ in range(5):
    report = suite.run(my_pipeline)
    pass_rates.append(report.pass_rate)

print(f"Mean: {statistics.mean(pass_rates):.1%}")
print(f"Std dev: {statistics.stdev(pass_rates):.1%}")
print(f"Range: {min(pass_rates):.1%} – {max(pass_rates):.1%}")
If your range spans more than 6–8 percentage points across 5 runs, single-run comparisons between experiments are unreliable.
The Fix: Multi-Run Evaluation with Confidence Intervals
The NAACL paper's recommendation is to run evaluations multiple times and report interval estimates rather than single-run point estimates. Specifically, it recommends reporting confidence intervals for pass rates rather than raw percentages.
The right tool for this is the Wilson score interval — not the normal approximation you might remember from statistics class. The normal approximation breaks down when your pass rate is near 0% or 100%, or when your sample is small. The Wilson interval handles these cases correctly.
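To make that concrete, here's a minimal side-by-side in plain Python; both functions are written out for illustration (multivon-eval ships its own wilson_interval, shown afterwards):
import math

def normal_interval(k, n, z=1.96):
    # Textbook normal (Wald) approximation: p ± z·sqrt(p(1-p)/n)
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson(k, n, z=1.96):
    # Wilson score interval: recentered, never leaves [0, 1]
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(normal_interval(19, 20))  # ≈ (0.85, 1.05): upper bound above 100%
print(wilson(19, 20))           # ≈ (0.76, 0.99): stays inside [0, 1]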
In multivon-eval, the same math is exposed directly, along with a power calculation:
from multivon_eval import wilson_interval, runs_needed
# You ran 20 cases, 16 passed (80%)
lo, hi = wilson_interval(pass_count=16, n=20)
print(f"True pass rate is likely between {lo:.0%} and {hi:.0%}")
# → True pass rate is likely between 59% and 92%
# How many cases do you actually need to detect a 10-point regression?
n = runs_needed(delta=0.10, baseline=0.80)
print(f"You need {n} cases to detect a 10% drop with 80% power")
# → You need 194 cases
That 80% pass rate on 20 cases? The confidence interval spans 33 percentage points. You can't see anything useful at that sample size.
Applying It to Experiment Comparison
When you compare two experiments, use the Experiment class to record and compare runs with statistical rigor:
from multivon_eval import EvalSuite, Faithfulness, Experiment
suite = EvalSuite("RAG Pipeline")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness())
exp = Experiment("rag-pipeline")
# Run experiment A — 10 runs per case for statistical power
report_a = suite.run(pipeline_v1, runs=10)
run_id_a = exp.record(report_a, tags={"version": "v1"})
# Run experiment B
report_b = suite.run(pipeline_v2, runs=10)
run_id_b = exp.record(report_b, tags={"version": "v2"})
# Compare — uses two-proportion z-test with Wilson CIs
exp.compare(run_id_a, run_id_b)
Output:
Experiment comparison: v1 → v2
Metric Before After Δ
─────────────────────────────────────────────────────
Pass rate 78.0% [71–84%] 71.0% [64–78%] ↓ -7.0%
Avg score 0.7900 0.7200 ↓ -0.0700
Statistical significance: p=0.08 (not significant at α=0.05)
Verdict: INCONCLUSIVE — difference is not statistically significant
To detect a 7% delta reliably, you need ~320 cases (currently 85)
That "not significant" result is doing important work. You're not shipping a regression you didn't know you had. And you're not blocking a valid improvement on noise.
The Flakiness Problem for Agents
Non-determinism is especially acute for agentic systems, because agents make multiple sequential decisions. A single non-deterministic step can cascade through the whole trajectory.
When a test case "passes sometimes and fails others," you have a flaky case. Not a bug, a measurement problem: a single-run pass or fail verdict is the wrong instrument for behavior that is genuinely non-deterministic.
from multivon_eval import EvalSuite, ToolCallAccuracy, TaskCompletion
suite = EvalSuite("Coding Agent")
suite.add_cases(cases)
suite.add_evaluators(ToolCallAccuracy(), TaskCompletion())
# Run each case 5 times — flaky cases are flagged automatically
report = suite.run(my_agent, runs=5)
Output:
⚠ 3 flaky case(s) — passed inconsistently across 5 runs:
'Send a Slack message' (3/5 runs passed)
'Query the database…' (4/5 runs passed)
'Schedule a meeting…' (2/5 runs passed)
Stability: 40% Flaky: 3
"Schedule a meeting" passes 40% of the time. That's not a pass and it's not a fail — it's a signal that the agent's behavior for this case is non-deterministic, and you need to understand why before you can assess it.
What This Means for CI/CD
If you block CI on LLM eval pass rate thresholds and you're only running once, you will:
- Block valid improvements (noisy low score on a genuinely better pipeline)
- Ship regressions (noisy high score on a genuinely worse pipeline)
Both of these happen. The NAACL paper quantified it: single-run ranking disagreements are common enough to affect real deployment decisions.
A more robust CI pattern:
# eval/run_ci_eval.py
from multivon_eval import EvalSuite, Faithfulness, wilson_interval

suite = EvalSuite("Faithfulness Suite")
suite.add_cases(cases)
suite.add_evaluators(Faithfulness())

# 5 runs per case — majority result, flakiness flagged
report = suite.run(my_pipeline, runs=5, fail_threshold=0.75)

# Report the pass rate with its Wilson interval, not just the point estimate
lo, hi = wilson_interval(
    pass_count=round(report.pass_rate * len(cases)),
    n=len(cases),
)
print(f"Pass rate: {report.pass_rate:.0%} [{lo:.0%}–{hi:.0%}]")
The Numbers You Should Know
| What you want to detect | Cases needed | Runs per case |
|---|---|---|
| ±15% regression | ~75 | 3 |
| ±10% regression | ~170 | 5 |
| ±5% regression | ~675 | 10 |
Assumes 80% power, α=0.05, baseline pass rate 75%.
Most LLM eval suites have 20–50 cases. They can reliably detect only regressions larger than 15–20 percentage points. Prompt changes, retrieval tweaks, and model updates that cause 5–10 point shifts are invisible at this scale.
What NAACL Got Right (and What's Still Open)
The NAACL 2025 paper made three concrete recommendations:
- Report confidence intervals, not point estimates
- Use multi-run aggregation before computing pass rates
- Report statistical significance when comparing experiments
These are correct, and they're now standard in multivon-eval. What the paper doesn't address — and what remains an open problem — is LLM judge variance specifically. When your judge model is itself non-deterministic, you're measuring noise with a noisy ruler. Calibration research (aligning judge scores to human scores on your specific domain) and structured judge outputs (QAG binary questions rather than numeric 1–10 scales) both help. More on that in the next post.
Further Reading
- NAACL 2025 paper: The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism — Song et al., 2025. The core empirical study.
- Eval4NLP 2025 paper: Non-Determinism of "Deterministic" LLM Settings — Atil et al., documents GPU non-determinism even at temperature=0.
- Wilson score interval: Wikipedia derivation — worth reading once to understand why the normal approximation fails at the edges.
- multivon-eval: pip install multivon-eval — wilson_interval(), runs_needed(), runs=N in suite.run(), and Experiment.compare() with p-values are all built in as of v0.3.0.
Questions or corrections? Open an issue on GitHub or reach out at hello@multivon.ai.
Next in this series: The Structured Extraction Trap