Testing What It Decides

The same task admits many valid paths. Assert on the exact output and you get false negatives on correct work and false positives on lucky runs. Test the decision and the end-state instead.

Why this, for you: agents are non-deterministic. Identical inputs produce different tool calls, phrasings, and routes. This lesson shows how to build an eval that survives that variance, so a green suite means the agent is actually good, not that it got lucky on this run.

Traditional tests assert exact outputs, but an agent can produce different valid outputs for the same input. Behavioral testing replaces "did it produce output X?" with "did it make good decisions and reach a valid end-state?" Because one task has many valid paths, evaluating the end-state removes the path constraint: any route that lands in a valid state passes.
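
To make the contrast concrete, here is a minimal sketch of an exact-path assertion next to an end-state check. The `RunResult` record, the tool names, and the `app.cfg` file are all hypothetical, invented for illustration; a real harness would capture whatever state your agent actually changes.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical record of one agent run (not from any real framework)."""
    tool_calls: list[str] = field(default_factory=list)
    final_files: dict[str, str] = field(default_factory=dict)


def end_state_ok(run: RunResult) -> bool:
    """Pass if the config ends up with the new port, whatever route was taken."""
    return "port = 8080" in run.final_files.get("app.cfg", "")


# Two different, equally valid paths to the same end-state.
run_a = RunResult(["read_file", "edit_file"], {"app.cfg": "port = 8080\n"})
run_b = RunResult(["search", "read_file", "write_file"],
                  {"app.cfg": "# updated\nport = 8080\n"})

# An exact-path assertion rejects run_b even though its work is correct.
exact_path = ["read_file", "edit_file"]
print(run_a.tool_calls == exact_path, run_b.tool_calls == exact_path)  # True False

# The end-state check accepts both runs.
print(end_state_ok(run_a), end_state_ok(run_b))  # True True
```

The path check produces a false negative on `run_b`; the end-state check does not care which tools were used.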

1 Split deterministic from agentic

Not every part needs behavioral testing. A capability matrix isolates what to test and how:

Component	Method	Example
Deterministic	Traditional unit/integration tests	Input parsing, output formatting, API call construction
Agentic	Behavioral evaluation	Decision-making, tool selection, multi-step reasoning

Mock tools to test reasoning without external dependencies — and remember tool output quality (concise, filtered, well-formatted) is itself worth evaluating, because it shapes the context the agent reasons over downstream.
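
One way to mock a tool is sketched below, assuming the agent receives its tools as a plain dictionary of callables. `toy_agent` is a stand-in for the real agent under test and `fake_search` is a hypothetical mock; neither comes from a specific framework.

```python
from typing import Callable


def fake_search(query: str) -> str:
    """Mock tool: canned, concise output instead of a real network call."""
    return "doc-17: refund window is 30 days"


def toy_agent(question: str, tools: dict[str, Callable[[str], str]]) -> dict:
    """Stand-in for the agent under test (hypothetical). Records its decisions."""
    calls: list[str] = []
    evidence = ""
    if "refund" in question.lower():
        calls.append("search")
        evidence = tools["search"](question)
    return {"calls": calls, "answer": evidence}


result = toy_agent("What is the refund window?", {"search": fake_search})

# Decision: the agent chose to search for a policy question.
assert result["calls"] == ["search"]
# End-state: the answer carries the evidence the tool returned.
assert "30 days" in result["answer"]
# Tool output quality: the mock stays short enough not to flood the context.
assert len(fake_search("anything")) < 200
```

Because the tool is canned, a failure here points at the agent's reasoning rather than at a flaky dependency.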

2 Three grading methods — use the lightest

Method	Best for	Trade-off
Code-based	Exact match, regex, test pass/fail	Fastest, most reliable; only verifiable outputs
LLM-as-judge	Open-ended outputs, style, completeness	Scalable; needs calibration
Human grading	Ambiguous edges, novel failures	Most flexible; slowest — avoid when possible
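
A rough sketch of "use the lightest" as a dispatch rule: try a code-based check first and fall back to a judge only when nothing is verifiable. The `stub_judge` function is a placeholder assumption standing in for a calibrated LLM judge; it is not a real model call.

```python
import re
from typing import Callable, Optional


def code_grade(output: str, expected_pattern: Optional[str]) -> Optional[bool]:
    """Code-based check. Returns None when there is no verifiable pattern."""
    if expected_pattern is None:
        return None
    return re.search(expected_pattern, output) is not None


def grade(output: str, expected_pattern: Optional[str],
          judge: Callable[[str], bool]) -> tuple[str, bool]:
    """Return (method_used, verdict), preferring the cheapest method that applies."""
    verdict = code_grade(output, expected_pattern)
    if verdict is not None:
        return ("code", verdict)
    return ("llm_judge", judge(output))


def stub_judge(output: str) -> bool:
    """Placeholder for a calibrated LLM judge (hypothetical heuristic)."""
    return len(output.split()) >= 5


print(grade("Total: 42", r"Total: \d+", stub_judge))
# ('code', True)
print(grade("The report covers all three incidents clearly.", None, stub_judge))
# ('llm_judge', True)
```

Human grading would sit behind both, reached only for cases the judge flags as ambiguous.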

For free-form output, a calibrated LLM judge with an explicit rubric (factual accuracy, completeness, tool efficiency) approximates human judgment — but track its precision and recall against human assessments, and avoid class-imbalanced sets that distort headline accuracy.
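
The point about class imbalance can be shown with a small calculation. This sketch treats "fail" as the positive class (the thing the judge must catch) and compares judge verdicts with human labels; the 18-pass, 2-fail label set is made up for illustration.

```python
def precision_recall(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Precision and recall for catching failures (False = fail = positive class)."""
    pairs = list(zip(judge, human))
    tp = sum(1 for j, h in pairs if not j and not h)  # judge fails, human fails
    fp = sum(1 for j, h in pairs if not j and h)      # judge fails, human passes
    fn = sum(1 for j, h in pairs if j and not h)      # judge passes, human fails
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Imbalanced set: humans passed 18 outputs and failed 2.
human = [True] * 18 + [False] * 2
# A judge that passes everything still looks accurate on this set.
lazy_judge = [True] * 20

accuracy = sum(j == h for j, h in zip(lazy_judge, human)) / len(human)
print(accuracy)                             # 0.9
print(precision_recall(lazy_judge, human))  # (0.0, 0.0)
```

Headline accuracy of 90% hides a judge that never catches a failure, which is why precision and recall are the numbers to track.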

3 The three-part foundation, and your variance budget

Start with ~20 representative queries. Small samples catch dramatic effect sizes — a prompt change moving pass rate 30% → 80% — without a large dataset upfront.
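
As a sanity check on that claim, the sketch below computes a one-sided Fisher exact test using only the standard library. It assumes the before and after runs each use the same number of queries; the choice of this particular test is an illustration, not something the lesson prescribes.

```python
from math import comb


def fisher_one_sided(pass_before: int, pass_after: int, n: int) -> float:
    """Probability of at least `pass_after` passes landing in the 'after' group
    if the change had no effect (hypergeometric tail, n queries per group)."""
    total_pass = pass_before + pass_after
    denom = comb(2 * n, total_pass)
    upper = min(n, total_pass)
    tail = sum(comb(n, k) * comb(n, total_pass - k)
               for k in range(pass_after, upper + 1))
    return tail / denom


# 30% -> 80% on 20 queries: 6 passes before, 16 after.
p_big = fisher_one_sided(6, 16, 20)
# 30% -> 35% on 20 queries: 6 passes before, 7 after.
p_small = fisher_one_sided(6, 7, 20)

print(f"large jump: p = {p_big:.4f}")
print(f"small jump: p = {p_small:.4f}")
```

The large jump is very unlikely to be chance even at 20 queries, while the small one is indistinguishable from noise; detecting small effects is what a bigger dataset is for.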

Every eval system rests on three parts: a representative dataset, a scorer library (reusable grading functions), and a feedback loop in which every model, prompt, or tool change runs against the same dataset and scorers. The pass-rate threshold is a product decision, not an engineering target:

File-editing agent — 95% acceptable (formatting differences tolerable).
Security-scanning agent — 99.5% minimum (missed vulnerabilities are not tolerable).
Research summarization — 85% acceptable (phrasing variance expected).
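
The three parts and the threshold gate can be sketched together as follows. The dataset, scorer names, toy agents, and the `gate` helper are all hypothetical; only the three threshold values come from the list above.

```python
from typing import Callable

# Part 1: representative dataset (two toy cases here; the lesson suggests ~20).
DATASET = [
    {"query": "capital of France?", "expected": "Paris"},
    {"query": "2 + 2?", "expected": "4"},
]


# Part 2: scorer library, reusable grading functions keyed by name.
def contains_expected(output: str, case: dict) -> bool:
    return case["expected"] in output


def is_concise(output: str, case: dict) -> bool:
    return len(output) <= 200


SCORERS: dict[str, Callable[[str, dict], bool]] = {
    "contains_expected": contains_expected,
    "is_concise": is_concise,
}

# Product-level thresholds, taken from the list above.
THRESHOLDS = {
    "file_editing": 0.95,
    "security_scanning": 0.995,
    "research_summarization": 0.85,
}


# Part 3: feedback loop. Every change reruns the same dataset and scorers.
def run_eval(agent: Callable[[str], str]) -> dict[str, float]:
    """Return the pass rate per scorer for one agent version."""
    outputs = [(case, agent(case["query"])) for case in DATASET]
    return {name: sum(score(out, case) for case, out in outputs) / len(outputs)
            for name, score in SCORERS.items()}


def gate(agent_kind: str, pass_rate: float) -> bool:
    """Ship only if the pass rate meets the product's threshold."""
    return pass_rate >= THRESHOLDS[agent_kind]


def agent_v1(query: str) -> str:   # toy agent before a prompt change
    return "I think it is Paris."


def agent_v2(query: str) -> str:   # toy agent after a prompt change
    return "Paris" if "France" in query else "4"


rates_v1 = run_eval(agent_v1)
rates_v2 = run_eval(agent_v2)
print(rates_v1)  # {'contains_expected': 0.5, 'is_concise': 1.0}
print(rates_v2)  # {'contains_expected': 1.0, 'is_concise': 1.0}
print(gate("research_summarization", rates_v1["contains_expected"]))  # False
print(gate("research_summarization", rates_v2["contains_expected"]))  # True
```

Keeping the thresholds in one table, separate from the scorers, makes it easier for a product owner to change them without touching grading code.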

When behavioral testing is the wrong tool

It pays off only when outputs are genuinely non-deterministic. Structured JSON with a fixed schema needs only equality checks; LLM grading adds cost without signal. In high-volume regression suites, running LLM-as-judge over thousands of cases is slow and expensive, so reserve it for the agentic layer and code-check structured output at scale. And an uncalibrated judge, or an uncalibrated threshold, introduces systematic bias that invalidates the whole pipeline.
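
For the fixed-schema case, a code check can be as small as the sketch below. The `json_equal` helper and the sample payload are illustrative assumptions; the comparison relies only on Python's standard `json` module.

```python
import json


def json_equal(actual: str, expected: dict) -> bool:
    """Parse and compare. Key order and whitespace are ignored; values are not."""
    try:
        return json.loads(actual) == expected
    except json.JSONDecodeError:
        return False


expected = {"status": "ok", "count": 2}

print(json_equal('{"count": 2, "status": "ok"}', expected))  # True
print(json_equal('{"status": "ok", "count": 3}', expected))  # False
print(json_equal("Sure! Here is the JSON you asked for", expected))  # False
```

A check like this runs in microseconds per case, so it scales to thousands of regression cases where a judge call would not.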

↪ Your win: evals that survive non-determinism

Test decisions and end-state, not exact paths — many valid routes reach one valid state.
Capability matrix — unit-test the deterministic parts, behaviorally test the agentic parts.
Lightest grader that covers it — code-based first, LLM-judge for open-ended, human last.
Start with ~20 queries — small sets catch big effect sizes before you scale the dataset.
Threshold is a product call — 99.5% for security, 85% for summaries; calibrate on real failures.

Retrieval practice — recall, don't peek

Question 1: Asserting exact agent output gives false negatives because the agent…

Question 2: The lightest grading method you should reach for first is…

Question 3: A starter eval set of ~20 queries is enough to catch…

Question 4: A pass-rate threshold of 99.5% for a security agent reflects that the threshold is…

Question 5 · spaced recall from Lesson 7: Factored chain-of-verification works by answering each question…

Ask me anything. Want a 20-query starter eval set plus a scorer library that mixes code checks and an LLM-judge rubric, or help picking a variance threshold for a specific agent? Next: Golden Journeys — naming the end-to-end paths and gating on a clean restart.