Part 3 · Evaluating Behavior

Verifying Agent Work · ~8 min

Testing What It Decides

The same task admits many valid paths. Assert on the exact output and you get false negatives on correct work and false positives on lucky runs. Test the decision and the end-state instead.

Why this matters to you: agents are non-deterministic. Identical inputs produce different tool calls, phrasings, and routes. This lesson shows how to build an eval that survives that variance, so a green suite means the agent is actually good, not that it got lucky on this run.

Traditional tests assert exact outputs. Agents produce different valid outputs for the same input. Behavioral testing replaces "did it produce output X?" with "did it make good decisions and reach a valid end-state?" — because one task has many valid paths, and end-state evaluation removes the path constraint.
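To make the contrast concrete, here is a minimal sketch in Python. The `TicketSystem` class, the ticket ID, and the refund amount are hypothetical stand-ins invented for illustration, not part of any real framework. Two runs reach the same valid end-state with different wording: the exact-output check fails one of them, while the end-state check passes both.

```python
from dataclasses import dataclass, field


@dataclass
class TicketSystem:
    """Hypothetical in-memory stand-in for the system the agent acts on."""
    status: dict = field(default_factory=dict)
    refunds: dict = field(default_factory=dict)


def end_state_ok(system: TicketSystem, ticket_id: str, expected_refund: float) -> bool:
    """Pass if the world ends up correct, regardless of the path taken."""
    return (
        system.status.get(ticket_id) == "resolved"
        and abs(system.refunds.get(ticket_id, 0.0) - expected_refund) < 0.01
    )


# Two runs of the same task that phrased their replies differently.
run_a = TicketSystem(status={"T-1": "resolved"}, refunds={"T-1": 25.0})
reply_a = "I've refunded $25 and closed your ticket."

run_b = TicketSystem(status={"T-1": "resolved"}, refunds={"T-1": 25.0})
reply_b = "Your ticket is closed; a $25.00 refund is on its way."

# Exact-output assertion: only one phrasing can match, so correct work fails.
exact_match = [reply == reply_a for reply in (reply_a, reply_b)]

# End-state assertion: both runs pass because both reached a valid end-state.
end_state = [end_state_ok(run, "T-1", 25.0) for run in (run_a, run_b)]

print(exact_match)  # [True, False]
print(end_state)    # [True, True]
```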

1 Split deterministic from agentic

Not every part needs behavioral testing. A capability matrix isolates what to test and how:

Component | Method | Example
Deterministic | Traditional unit/integration tests | Input parsing, output formatting, API call construction
Agentic | Behavioral evaluation | Decision-making, tool selection, multi-step reasoning

Mock tools to test reasoning without external dependencies — and remember tool output quality (concise, filtered, well-formatted) is itself worth evaluating, because it shapes the context the agent reasons over downstream.
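One way to mock tools is a small recorder that returns canned outputs and logs every call, so the test can assert on which tool the agent chose. The sketch below is an assumption-laden illustration: `MockTools`, `stand_in_agent`, the tool names, and the canned strings are all hypothetical, and `stand_in_agent` is a placeholder where a real agent loop would go.

```python
class MockTools:
    """Records tool calls and returns canned, concise outputs (no network access)."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict]] = []

    def call(self, name: str, **kwargs) -> str:
        self.calls.append((name, kwargs))
        canned = {
            "lookup_order": "order 42: delivered",
            "search_docs": "refund window: 30 days",
        }
        return canned.get(name, "unknown tool")


def stand_in_agent(query: str, tools: MockTools) -> str:
    """Placeholder for the real agent; replace with a call into your agent loop."""
    if "order" in query.lower():
        return tools.call("lookup_order", order_id=42)
    return tools.call("search_docs", query=query)


def used_tool(tools: MockTools, expected: str) -> bool:
    """Decision check: was the expected tool chosen at any point in the run?"""
    return any(name == expected for name, _ in tools.calls)


def output_is_concise(output: str, max_chars: int = 200) -> bool:
    """Crude tool-output quality check; the length limit is an assumed budget."""
    return len(output) <= max_chars


tools = MockTools()
answer = stand_in_agent("Where is my order?", tools)

print(used_tool(tools, "lookup_order"))  # True
print(len(tools.calls))                  # 1
print(output_is_concise(answer))         # True
```

The assertions target the decision (which tool, how many calls) rather than the wording of the answer, which keeps the test stable across runs.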

2 Three grading methods — use the lightest

Method | Best for | Trade-off
Code-based | Exact match, regex, test pass/fail | Fastest, most reliable; only verifiable outputs
LLM-as-judge | Open-ended outputs, style, completeness | Scalable; needs calibration
Human grading | Ambiguous edges, novel failures | Most flexible; slowest, so avoid when possible

For free-form output, a calibrated LLM judge with an explicit rubric (factual accuracy, completeness, tool efficiency) approximates human judgment — but track its precision and recall against human assessments, and avoid class-imbalanced sets that distort headline accuracy.
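The calibration check can be sketched in a few lines. This example assumes each case carries a boolean "flagged as failure" verdict from both the judge and a human, and treats failures as the positive class; the data is made up to show how an imbalanced set hides a useless judge behind a high accuracy number.

```python
def judge_agreement(judge_flags: list[bool], human_flags: list[bool]) -> dict:
    """Compare judge verdicts to human verdicts; True means 'flagged as failure'."""
    pairs = list(zip(judge_flags, human_flags))
    tp = sum(j and h for j, h in pairs)          # judge and human both flag
    fp = sum(j and not h for j, h in pairs)      # judge flags, human does not
    fn = sum(not j and h for j, h in pairs)      # judge misses a real failure
    tn = sum(not j and not h for j, h in pairs)  # both say the output is fine
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Imbalanced toy set: humans flagged only 2 of 20 outputs as failures.
human = [True] * 2 + [False] * 18

# A judge that never flags anything still looks 90% accurate.
lenient_judge = [False] * 20

print(judge_agreement(lenient_judge, human))
# {'accuracy': 0.9, 'precision': 0.0, 'recall': 0.0}
```

Accuracy alone would rate this judge highly, while recall shows it catches none of the real failures.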

3 The three-part foundation, and your variance budget

Start with ~20 representative queries. Small samples catch dramatic effect sizes — a prompt change moving pass rate 30% → 80% — without a large dataset upfront.
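A rough way to see why 20 queries suffice for large effects but not small ones is a two-proportion z-test. The sketch below uses the normal approximation, which is crude at this sample size (an exact test would be more appropriate for real decisions), and the 30% → 40% comparison is an added hypothetical contrast, not a figure from the lesson.

```python
import math


def two_proportion_z(passes_before: int, passes_after: int, n: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for two pass counts, each out of n queries."""
    pooled = (passes_before + passes_after) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        return 0.0, 1.0  # identical all-pass or all-fail runs: no evidence of change
    z = (passes_after - passes_before) / n / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return z, p_value


n = 20
z_big, p_big = two_proportion_z(6, 16, n)     # 30% -> 80%
z_small, p_small = two_proportion_z(6, 8, n)  # 30% -> 40% (hypothetical contrast)

print(f"30% -> 80%: z={z_big:.2f}, p={p_big:.4f}")      # z above 3, p well below 0.05
print(f"30% -> 40%: z={z_small:.2f}, p={p_small:.2f}")  # p far above 0.05
```

The large jump is clearly distinguishable from noise at n = 20; the small one is not, which is the point at which a bigger dataset becomes worth building.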

Every eval system has three parts: a representative dataset, a scorer library (reusable grading functions), and a feedback loop in which every model, prompt, or tool change runs against the same dataset and scorers. The pass-rate threshold is a product decision, not an engineering target; a security agent, for instance, might be held to 99.5%.
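The three parts fit together roughly as follows. This is a minimal sketch under stated assumptions: the dataset entries, the scorer names, `toy_agent`, and the `PASS_THRESHOLD` value are all hypothetical, and a real dataset would hold closer to the ~20 queries suggested above.

```python
from typing import Callable

# Part 1: a representative dataset (three toy cases here for brevity).
DATASET = [
    {"query": "What is the refund window?", "expected": "30 days"},
    {"query": "When is support open?", "expected": "9am"},
    {"query": "How long does shipping take?", "expected": "5 days"},
]


# Part 2: a scorer library of small, reusable grading functions.
def contains_expected(case: dict, output: str) -> bool:
    return case["expected"].lower() in output.lower()


def is_concise(case: dict, output: str) -> bool:
    return len(output) <= 200


SCORERS: list[Callable[[dict, str], bool]] = [contains_expected, is_concise]


def toy_agent(query: str) -> str:
    """Placeholder agent; swap in a call to the real system under test."""
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "When is support open?": "Support opens at 9am on weekdays.",
    }
    return answers.get(query, "I'm not sure.")


# Part 3: the loop that every model, prompt, or tool change re-runs unchanged.
def pass_rate(agent: Callable[[str], str]) -> float:
    passed = 0
    for case in DATASET:
        output = agent(case["query"])  # one agent call per case
        passed += all(scorer(case, output) for scorer in SCORERS)
    return passed / len(DATASET)


PASS_THRESHOLD = 0.90  # assumed value; set by product risk, not by engineering

rate = pass_rate(toy_agent)
print(f"{rate:.2f}", rate >= PASS_THRESHOLD)  # 0.67 False
```

Because the dataset and scorers stay fixed, a change in `rate` between runs can be attributed to the change under test rather than to a shifting yardstick.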

When behavioral testing is the wrong tool

It pays off only when outputs are genuinely non-deterministic. Structured JSON with a fixed schema needs equality checks; LLM grading there adds cost without signal. Running LLM-as-judge across thousands of cases in a high-volume regression suite is slow and expensive, so reserve it for the agentic layer and code-check structured output at scale. And an uncalibrated judge, or an uncalibrated threshold, introduces systematic bias that invalidates the whole pipeline.
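For the structured-output case, a code check can be as small as the sketch below. The function name and the sample payload are illustrative assumptions; the comparison is done on parsed values, so key order and whitespace do not cause false failures.

```python
import json


def json_matches(actual_text: str, expected: dict) -> bool:
    """Equality check on parsed JSON; malformed output counts as a failure."""
    try:
        return json.loads(actual_text) == expected
    except json.JSONDecodeError:
        return False


expected = {"status": "ok", "items": [1, 2, 3]}

print(json_matches('{"items": [1, 2, 3], "status": "ok"}', expected))  # True
print(json_matches('{"status": "ok", "items": [1, 2]}', expected))     # False
print(json_matches("status: ok", expected))                            # False
```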

↪ Your win: evals that survive non-determinism

Retrieval practice — recall, don't peek

Question 1: Asserting exact agent output gives false negatives because the agent…

Question 2: The lightest grading method you should reach for first is…

Question 3: A starter eval set of ~20 queries is enough to catch…

Question 4: A pass-rate threshold of 99.5% for a security agent reflects that the threshold is…

Question 5 · spaced recall from Lesson 7: Factored chain-of-verification works by answering each question…

Next: Golden Journeys, on naming the end-to-end paths and gating on a clean restart.