Part 3 · Evaluating Behavior

Verifying Agent Work · ~8 min

Grade the Outcome

Check that the agent called tool X before tool Y, and you penalize every valid solution its author didn't anticipate. Ask instead: is the system in the correct state?

Why this, for you: path-based grading makes capable agents look worse than they are — frontier models keep finding routes the eval author didn't expect. This lesson is how to grade what matters (the final state) and how to keep the LLM judge you reach for on subjective parts from quietly lying to you.

Path-based evals check sequence, for example that the agent edited file A before file B. That penalizes valid alternative solutions, and because the number of valid paths grows combinatorially with task complexity, path graders become more misleading as agents improve. Outcome grading asks a different question: is the system in the correct state?

1 Anchor correctness to state, not procedure

A passing test suite is a correct outcome whether the agent took two steps or twenty. For coding agents, deterministic tests are the most reliable outcome graders — objective, fast, and path-agnostic.

Outcome-based checks: unit tests pass after the change, the DB schema matches the expected state, the API response contains the expected fields, the file was created with the correct content. When correctness is subjective — readability, style — combine the outcome check with an LLM rubric grader that specifies the criteria, rather than prescribing the steps.

# path-based: rejects valid solutions
tool_calls == ["read_file", "write_file", "bash"]   # fails if it reads two files first

# outcome-based: accepts any correct implementation
pytest tests/ -q → returncode == 0   # green is green, however it got there

2 Over-specification is the bug

The common grading bugs are all over-specification causing false negatives: exact-string matching numeric output that can be formatted many ways; checking for a specific function name when any equivalent name is correct; asserting an import order or comment placement; requiring one file to change when the agent correctly refactored across several. Each embeds the author's implementation assumptions into the definition of "correct" and narrows the agent's solution space.
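As one illustration of removing over-specification, the sketch below replaces exact-string matching on numeric output with a parse-then-compare check. The regular expression, the helper names, and the tolerance are assumptions chosen for this example; a real grader would tune them to the task's output format.

```python
import math
import re

# Matches an optional sign, digits with optional thousands separators,
# an optional fractional part, and an optional exponent.
NUMBER = re.compile(r"-?\d[\d,]*\.?\d*(?:[eE][-+]?\d+)?")


def extract_number(text: str) -> float:
    """Pull the first number out of free-form output."""
    match = NUMBER.search(text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group().replace(",", ""))


def exact_match(output: str, expected: str) -> bool:
    """Over-specified check: only one formatting of the answer passes."""
    return output.strip() == expected


def numeric_match(output: str, expected: float, rel_tol: float = 1e-6) -> bool:
    """Looser check: any formatting of a numerically equal answer passes."""
    return math.isclose(extract_number(output), expected, rel_tol=rel_tol)


outputs = ["1234.5", "1234.50", "1,234.50", "Total: 1.2345e3"]
exact_results = [exact_match(o, "1234.5") for o in outputs]
numeric_results = [numeric_match(o, 1234.5) for o in outputs]

print(exact_results)    # [True, False, False, False]
print(numeric_results)  # [True, True, True, True]
```

All four outputs state the same value; only the exact-string grader treats three of them as failures.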

3 Keep the judge honest

An LLM judge used for the subjective dimension carries documented position, self-preference, and style biases. In one pre-registered study, the apparent quality gain from a code-generation "skill", as measured by an LLM judge, vanished once outputs were graded by passing tests instead of by a model. So: keep model graders off mechanical checks (use code-based assertions for valid JSON, row counts, file existence), reserve human spot-checks for genuinely subjective quality, and calibrate any model grader against human labels. When comparing two versions, present both outputs blind; sequential grading anchors the second to the first.
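The blind comparison and the calibration step can be sketched as follows. The judge here is a hypothetical stub, not a real model call: it always prefers whichever output it sees first, which makes the effect of randomizing presentation order easy to see. The function names and the "first"/"second" verdict format are assumptions for this example.

```python
import random
from typing import Callable

# A judge sees two unlabeled outputs and answers "first" or "second".
Judge = Callable[[str, str], str]


def blind_compare(judge: Judge, old: str, new: str, rng: random.Random) -> str:
    """Present both outputs unlabeled in random order; map the verdict back."""
    swapped = rng.random() < 0.5
    first, second = (new, old) if swapped else (old, new)
    verdict = judge(first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "first":
        return "new" if swapped else "old"
    return "old" if swapped else "new"


def agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Calibration: fraction of items where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def position_biased_judge(first: str, second: str) -> str:
    """Stub standing in for a model call; it always prefers the first slot."""
    return "first"


rng = random.Random(0)
verdicts = [
    blind_compare(position_biased_judge, "old output", "new output", rng)
    for _ in range(1000)
]
new_share = verdicts.count("new") / len(verdicts)

judge_vs_human = agreement(
    ["pass", "fail", "pass", "pass"],
    ["pass", "fail", "fail", "pass"],
)
print(judge_vs_human)  # 0.75
```

With order randomized, a purely position-biased judge yields a win share near one half for the new version instead of a spurious preference. The agreement figure is the simplest possible calibration signal; a chance-corrected statistic would be a reasonable next step.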

When outcome-only grading breaks

Side-effecting tasks: an agent that fires irreversible API calls or emails en route to a correct final state passes the grader while causing damage; add intermediate-step constraints.

Compliance paths: finance, healthcare, and security require specific procedural steps regardless of outcome; skipping an audit gate is non-compliant even if the end state is right.

Trace-as-deliverable: when the reasoning chain is the output, path quality is the correctness criterion.

The 2025–26 norm: outcome metrics decide correctness; trajectory data stays as a diagnostic signal, not a gate.
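One way to combine these exceptions with outcome grading is sketched below: the outcome still decides correctness, a small set of trajectory constraints can veto a pass, and everything else about the trajectory is recorded as diagnostics only. The tool names, the `Grade` structure, and the `required_steps` parameter are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field

# Hypothetical tool names for actions that cannot be undone.
IRREVERSIBLE_TOOLS = frozenset({"send_email", "charge_card", "delete_bucket"})


@dataclass
class Grade:
    passed: bool
    reasons: list[str] = field(default_factory=list)
    diagnostics: dict[str, int] = field(default_factory=dict)


def grade(
    outcome_ok: bool,
    tool_calls: list[str],
    required_steps: tuple[str, ...] = (),
) -> Grade:
    """Outcome decides correctness; only named constraints inspect the path."""
    reasons: list[str] = []
    if not outcome_ok:
        reasons.append("final state incorrect")

    # Side-effect constraint: irreversible actions fail the run outright.
    violations = sorted({t for t in tool_calls if t in IRREVERSIBLE_TOOLS})
    if violations:
        reasons.append(f"irreversible side effects: {violations}")

    # Compliance constraint: mandated steps must appear, in any order.
    missing = [s for s in required_steps if s not in tool_calls]
    if missing:
        reasons.append(f"missing required steps: {missing}")

    # Trajectory data is kept for diagnosis, not used as a gate.
    diagnostics = {"num_tool_calls": len(tool_calls)}
    return Grade(passed=not reasons, reasons=reasons, diagnostics=diagnostics)


clean = grade(True, ["read_file", "write_file", "bash"])
noisy = grade(True, ["read_file", "send_email", "write_file"])
skipped = grade(True, ["write_file"], required_steps=("audit_log",))

print(clean.passed, noisy.passed, skipped.passed)  # True False False
```

The constraints are deliberately narrow: they name the specific actions that must or must not occur, and leave ordering and step count unconstrained.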

↪ Your win: grade state, judge honestly

Retrieval practice — recall, don't peek

Question 1: Path-based grading gets more misleading as agents improve because…

Question 2: The most reliable outcome grader for a coding agent is…

Question 3: A pre-registered study found an LLM-judged "quality gain" vanished when outputs were graded by…

Question 4: Outcome-only grading is unsafe for a task that…

Question 5 · spaced recall from Lesson 9: A Golden Journey gates completion on the system…

Ask me anything. Want me to spot over-specification in an existing grader, or wire a hybrid scorer that uses tests for correctness and a calibrated, blind LLM rubric for the subjective dimension? Next, Part 3 closes with Evals at Scale — macro patterns across a corpus, and evaluating skills as units.