Agents·9 min read·March 3, 2026

How to Evaluate AI Agents Without Getting Fooled

Agent evaluation has a non-determinism problem that's worse than text evaluation. Here are the three gaps that show up most often, and how to close them.

An agent that passes your test suite on Monday might fail your users on Tuesday. Not because something changed — because you ran the test once.

Agent evaluation has a non-determinism problem that's worse than text evaluation, for a simple reason: agents make multiple decisions in sequence. Each decision is non-deterministic. The errors compound. A single-run pass rate for an agentic task is nearly meaningless.

This post is about what you should measure instead, and the three evaluation gaps that show up most often in production agent pipelines.


Why Agents Are Harder to Evaluate

For a text generation task, non-determinism affects one output. For an agent, it affects every step of the trajectory:

  1. The agent decides whether to call a tool
  2. The agent decides which tool
  3. The agent decides what arguments to pass
  4. The tool returns a result (which may itself be non-deterministic)
  5. The agent decides what to do next based on that result
  6. ...repeat

Each step has variance. Steps depend on prior steps. A wrong tool call in step 2 can cause a cascade of reasonable-looking but incorrect subsequent steps.
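
To see how quickly per-step variance compounds, here is a back-of-the-envelope sketch in Python (the 95% per-step success rate is an illustrative assumption, not a measured number):

# Illustrative only: assumes every step succeeds independently with the same probability.
per_step_success = 0.95
for steps in (1, 3, 5, 8):
    trajectory_success = per_step_success ** steps
    print(f"{steps}-step trajectory: {trajectory_success:.0%} chance every step is correct")

# 1 step -> 95%, 3 steps -> 86%, 5 steps -> 77%, 8 steps -> 66%

Even a very reliable per-step agent loses roughly a third of its trajectories by step eight, which is part of why single-run pass rates are so noisy.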

The NAACL 2025 paper on non-determinism found that multi-step tasks have significantly higher variance than single-step tasks. An agent that "passes" a task 60% of the time isn't broken — it's non-deterministic, and you need to understand which steps are causing the variance before you can fix it.


The Three Evaluation Gaps

Gap 1: Did the agent do the right things?

ToolCallAccuracy checks that the agent called the expected tools. It doesn't check:

  • Whether the agent called unnecessary tools
  • Whether order mattered and, if it did, whether the order was correct
  • Whether the agent called the same tool twice when once would have sufficed

Example: an agent tasked with "summarize the last 7 days of logs" calls search_logs twice with slightly different queries, then calls summarize. It gets the right answer. ToolCallAccuracy scores 1.0. But it made a redundant call that added latency and cost.
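
Concretely, the trace in that example looks something like this (plain dicts for illustration; the tool names come from the example above, the argument values are made up):

trace = [
    {"tool": "search_logs", "args": {"query": "errors", "days": 7}},
    {"tool": "search_logs", "args": {"query": "error logs", "days": 7}},  # redundant second search
    {"tool": "summarize", "args": {}},
]
# Every expected tool appears, so ToolCallAccuracy scores 1.0,
# but the second search_logs call added a round trip for nothing.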

The fix: ToolCallNecessity

from multivon_eval import ToolCallNecessity

suite.add_evaluators(ToolCallNecessity())

For each tool call, the judge sees the full prior context and asks: was this strictly necessary? The score is the fraction of tool calls that were judged necessary.

A ToolCallNecessity score below 0.8 means the agent is making redundant calls. Common causes: the agent doesn't track what it already knows (no working memory), or the prompt doesn't encourage efficiency.
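
The scoring rule itself is simple. A minimal sketch, assuming a judge callable that returns True or False per call (the real evaluator's interface may differ):

def tool_call_necessity(trace, judge_necessary):
    # judge_necessary(call, prior_context) stands in for the LLM judge; its signature is an assumption.
    verdicts = []
    for i, call in enumerate(trace):
        prior_context = trace[:i]  # everything the agent had already seen and done
        verdicts.append(judge_necessary(call, prior_context))
    return sum(verdicts) / len(verdicts) if verdicts else 1.0

# Three calls, one judged redundant -> score 0.67, below the 0.8 bar discussed above.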


Gap 2: Was the trajectory efficient?

Even if every tool call was necessary, the order might be wrong. The agent might have:

  • Fetched data it didn't need yet
  • Made a tool call that failed, then tried the same approach again instead of recovering
  • Taken 8 steps for a task that should take 3

TrajectoryEfficiency evaluates the quality of the agent's path:

from multivon_eval import TrajectoryEfficiency

suite.add_evaluators(TrajectoryEfficiency())

It assesses:

  • Did the agent avoid unnecessary detours?
  • Is the step count proportionate to task complexity?
  • If a tool returned an error, did the agent recover correctly?

The recovery check matters. An agent that tries the same failed tool call twice without modification is a reliability problem in production. TrajectoryEfficiency applies a 0.2 penalty for poor error recovery — a signal that your agent lacks retry logic or error handling.
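
A rough mental model of the score, assuming a normalized path-length base (minimum steps needed divided by steps taken) plus the flat recovery penalty mentioned above; the actual evaluator is judge-based, so treat this as an approximation:

def trajectory_efficiency(steps_taken, min_steps_needed, blind_retry_of_failed_call):
    # Base score: how close the path was to the minimum-length path.
    score = min(min_steps_needed / steps_taken, 1.0)
    # Flat penalty for repeating a failed call without modification.
    if blind_retry_of_failed_call:
        score -= 0.2
    return max(score, 0.0)

# 8 steps for a 3-step task, clean recovery -> 0.375
# minimal path, but one blind retry of a failed call -> 0.8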


Gap 3: Does the agent remember what happened before?

Single-session evaluation doesn't test multi-session memory. If your agent is supposed to learn from prior sessions — "last time we talked, you asked me to prioritize the auth module" — a standard eval suite will never catch memory failures.

AMA-Bench (2026) is the first formal benchmark for multi-session agent memory. Its key finding: agents consistently fail three specific memory tasks:

  1. Retrieval accuracy: Does the agent correctly recall what was said in a prior session?
  2. Hallucination avoidance: Does the agent avoid inventing facts not in the prior session?
  3. Staleness handling: Does the agent ignore superseded information? ("I said to prioritize auth last week, but this week I changed that")

AgentMemoryEval tests all three:

from multivon_eval import AgentMemoryEval, EvalCase

case = EvalCase(
    input="What did I ask you to prioritize last session?",
    context="Prior session (2025-11-10): User asked to prioritize the auth module. Deadline is end of November.",
    expected_output="auth module",
)

suite.add_cases([case])
suite.add_evaluators(AgentMemoryEval())

If expected_output is provided, the judge checks that the response includes it. It also checks for hallucinated facts and inappropriate use of stale context.
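
The staleness case (item 3 above) is the one teams forget most often. A case for it might look like this; the context wording is illustrative:

# Staleness handling: the earlier instruction has been superseded,
# so the correct answer is the newer priority, not "auth module".
stale_case = EvalCase(
    input="What should I prioritize this week?",
    context=(
        "Prior session (2025-11-10): User asked to prioritize the auth module. "
        "Prior session (2025-11-17): User said auth is done and to prioritize the billing module instead."
    ),
    expected_output="billing module",
)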


Full Agent Eval Example

from multivon_eval import (
    EvalSuite, EvalCase, AgentStep, ToolCall,
    ToolCallAccuracy, ToolArgumentAccuracy,
    ToolCallNecessity, TrajectoryEfficiency,
    PlanQuality, TaskCompletion,
)

suite = EvalSuite("Coding Agent")
suite.add_cases(cases)
suite.add_evaluators(
    ToolCallAccuracy(require_order=False),
    ToolArgumentAccuracy(),
    ToolCallNecessity(),
    TrajectoryEfficiency(),
    PlanQuality(),
    TaskCompletion(threshold=0.85),
)

# Run 5 times — agent tasks are the most non-deterministic in your pipeline
report = suite.run(my_agent, runs=5)

Running each case 5 times is important. A single run will misclassify flaky behavior as pass or fail depending on luck.
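
To put numbers on that, take a case the agent truly passes 60% of the time. Assuming independent runs, a quick binomial sketch shows what one run versus five runs can tell you:

from math import comb

p, runs = 0.6, 5  # true pass rate 60%, five independent runs
for k in range(runs + 1):
    prob = comb(runs, k) * p**k * (1 - p)**(runs - k)
    print(f"{k}/{runs} runs pass: {prob:.1%}")

# A single run labels this case "pass" 60% of the time and "fail" 40% of the time.
# Five runs land in the flaky middle (1 to 4 passes) about 91% of the time,
# which is exactly the flakiness signal a single run throws away.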


Reading the Flakiness Report

⚠ 3 flaky case(s) — passed inconsistently across 5 runs:
  'Send a Slack message'   (3/5 runs passed)
  'Query the database…'    (4/5 runs passed)
  'Schedule a meeting…'    (2/5 runs passed)

Three cases to investigate, each with a different severity:

  • "Schedule a meeting" at 2/5 is a reliability problem — this task is essentially broken
  • "Query the database" at 4/5 is marginal — probably a deterministic bug triggered by specific inputs
  • "Send a Slack message" at 3/5 is coin-flip territory — likely a tool-level non-determinism (API timeout, rate limit, etc.)

Each flaky case tells you something different about where to focus debugging effort. This is information you completely miss with a single-run eval.
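
If you want to automate that triage, a simple bucketing over per-case pass counts is enough. The cutoffs below are judgment calls matching the buckets above, not thresholds from the library:

def triage(passed, runs):
    rate = passed / runs
    if rate <= 0.4:
        return "broken: fix the task or the agent before looking at anything else"
    if rate <= 0.6:
        return "coin-flip: suspect tool-level non-determinism (timeouts, rate limits)"
    if rate < 1.0:
        return "marginal: look for a deterministic bug triggered by specific inputs"
    return "stable"

print(triage(2, 5))  # Schedule a meeting   -> broken
print(triage(3, 5))  # Send a Slack message -> coin-flip
print(triage(4, 5))  # Query the database   -> marginal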


Agent Evals in CI

For agents, a robust CI pattern enforces two separate thresholds, one for overall pass rate and one for flakiness:

# eval/run_agent_eval.py
from multivon_eval import EvalSuite, TaskCompletion, ToolCallAccuracy

suite = EvalSuite("Agent Suite")
suite.add_cases(cases)  # cases: the list of EvalCase objects defined for your agent
suite.add_evaluators(ToolCallAccuracy(), TaskCompletion())

report = suite.run(my_agent, runs=5, fail_threshold=0.85)

# Flag if more than 20% of cases are flaky
flaky_rate = report.flaky_count / len(report.case_results)
if flaky_rate > 0.20:
    print(f"WARNING: {flaky_rate:.0%} of cases are flaky — agent is non-deterministic")
    raise SystemExit(1)

An agent that passes 95% of cases but where 30% of cases are flaky is not a production-ready agent — it just got lucky on the runs you saw.


Connecting to the Research

Tool call evaluation: The distinction between ToolCallAccuracy (did you call the right tools?) and ToolCallNecessity (did you call only the right tools?) comes from agent efficiency research in the tool-augmented LLM literature.

Trajectory evaluation: WebArena and AgentBench established trajectory-level evaluation for web agents. The efficiency metric in TrajectoryEfficiency is adapted from normalized task completion path length — how many steps did the agent take vs the minimum required.

Memory evaluation: AMA-Bench (2026) is the primary reference. The paper introduces a formal taxonomy of memory failure modes and the first reproducible benchmark for measuring them. If you're building a stateful agent, this paper is worth reading in full.


Further Reading

  • AMA-Bench: Multi-session agent memory benchmark — first formal memory eval benchmark. Essential reading if you're building stateful agents.
  • NAACL 2025: Non-determinism in LLM evaluation — Section 6 covers multi-step tasks specifically.
  • WebArena: webarena.dev — web agent benchmark, good reference for trajectory-level evaluation design.
  • AgentBench: THUDM/AgentBench on GitHub — comprehensive agent benchmark across 8 distinct environments.
  • ReAct: Yao et al., 2022 — the original paper on Reasoning + Acting in agents. The AgentStep trace format in multivon-eval is inspired by ReAct's thought/action/observation structure.
  • multivon-eval agent evaluators: evaldocs.multivon.ai/evaluators/agent — full reference for ToolCallAccuracy, ToolCallNecessity, TrajectoryEfficiency, AgentMemoryEval, PlanQuality, TaskCompletion.


Ready to put this into practice?

Multivon builds AI evaluation tooling for teams shipping models to production.

Get in touch