Verifying Agent Work · ~8 min
Check that the agent called tool X before tool Y, and you penalize every valid solution the grader's author didn't anticipate. Ask instead: is the system in the correct state?
Path-based evals check that the agent edited file A before file B. That penalizes valid alternative solutions — and the number of valid paths grows combinatorially with task complexity, so path graders get more misleading as agents improve. Outcome grading asks "is the system in the correct state?" instead.
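The contrast can be sketched in a few lines. This is a minimal illustration, not a real harness: the trace format, the `EXPECTED_PATH` list, and the `state` dictionary are all hypothetical, and the ordering count assumes the edits are independent and can be applied in any order.

```python
from math import factorial

# Hypothetical "golden" trace that a path grader compares against.
EXPECTED_PATH = ["edit:a.py", "edit:b.py", "run:tests"]


def path_grader(trace: list[str]) -> bool:
    """Pass only if the agent followed the exact anticipated sequence."""
    return trace == EXPECTED_PATH


def outcome_grader(state: dict) -> bool:
    """Pass if the final state is correct, however the agent got there."""
    return state.get("tests_pass") is True


def valid_orderings(independent_edits: int) -> int:
    """Number of orderings of n independent edits: n!."""
    return factorial(independent_edits)


# The agent edited b.py before a.py; the end state is identical.
reordered_trace = ["edit:b.py", "edit:a.py", "run:tests"]
final_state = {"tests_pass": True}

print(path_grader(reordered_trace))   # False: a false negative
print(outcome_grader(final_state))    # True
print(valid_orderings(5))             # 120
```

With only five independent edits there are already 120 acceptable orderings, and a path grader built around one of them rejects the other 119.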
Outcome-based checks: unit tests pass after the change, the DB schema matches the expected state, the API response contains the expected fields, the file was created with the correct content. When correctness is subjective — readability, style — combine the outcome check with an LLM rubric grader that specifies the criteria, rather than prescribing the steps.
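A state-only grader for two of the checks above (file created with the correct content, API response contains the expected fields) might look like the following sketch. The file name `config.json`, the `retries` key, and the required fields `id` and `status` are assumptions made up for illustration; substitute whatever the task actually specifies.

```python
import json
from pathlib import Path

# Hypothetical task spec: the agent must leave config.json with retries == 3,
# and the service response must contain these fields.
REQUIRED_FIELDS = {"id", "status"}


def grade_outcome(workdir: Path, response: dict) -> dict[str, bool]:
    """Inspect final state only; no tool-call order is examined."""
    config_path = workdir / "config.json"
    checks = {"file_exists": config_path.is_file()}

    try:
        config = json.loads(config_path.read_text())
        checks["file_content"] = (
            isinstance(config, dict) and config.get("retries") == 3
        )
    except (OSError, json.JSONDecodeError):
        # Missing or unparseable file fails the content check.
        checks["file_content"] = False

    # Extra fields in the response are fine; only missing ones fail.
    checks["api_fields"] = REQUIRED_FIELDS.issubset(response)
    return checks
```

Returning one boolean per check, rather than a single pass/fail, keeps the report useful for debugging; the task passes when `all(checks.values())` is true.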
The common grading bugs are all over-specification causing false negatives: exact-string matching numeric output that can be formatted many ways; checking for a specific function name when any equivalent name is correct; asserting an import order or comment placement; requiring one file to change when the agent correctly refactored across several. Each embeds the author's implementation assumptions into the definition of "correct" and narrows the agent's solution space.
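The first bug in that list, exact-string matching on numeric output, has a straightforward fix: parse the number and compare with a tolerance. The sketch below assumes commas are thousands separators and that the first number in the output is the answer; both are illustrative assumptions that a real grader would need to confirm against the task.

```python
import math
import re

# Matches integers, decimals, and scientific notation.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def numeric_match(output: str, expected: float, rel_tol: float = 1e-6) -> bool:
    """Accept any formatting of the expected value instead of one exact string."""
    cleaned = output.replace(",", "")  # assumption: commas are thousands separators
    match = NUMBER.search(cleaned)
    if match is None:
        return False
    return math.isclose(float(match.group()), expected, rel_tol=rel_tol)


# All of these are the same answer; an exact-string check accepts only one.
for text in ["1234.5", "1,234.50", "1.2345e3", "Total: 1234.5"]:
    print(text, numeric_match(text, 1234.5))
```

The same idea applies to the other bugs: check behavior (does any exported function produce the right result?) rather than a specific name, and check the set of required changes rather than the number of files touched.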
An LLM judge used for the subjective dimension carries documented position, self-preference, and stylistic biases. A pre-registered study found that the apparent quality gain from a code-generation "skill", as scored by an LLM judge, vanished once outputs were graded by passing tests instead of by a model. So: keep model graders off mechanical checks (use code-based assertions for valid JSON, row counts, file existence), reserve human spot-checks for genuinely subjective quality, and calibrate any model grader against human labels. When comparing two versions, present both outputs blind, because sequential grading anchors the second to the first.
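One way to present two versions blind is to randomize which output appears first for each pair and map the judge's pick back to the version afterwards. In this sketch the `judge` callable is a hypothetical stand-in for a model or human grader that returns 0 or 1 for the preferred position; it is not any particular library's API.

```python
import random
from typing import Callable

# Hypothetical judge signature: sees two unlabeled outputs, returns 0 or 1.
Judge = Callable[[str, str], int]


def blind_compare(
    pairs: list[tuple[str, str]], judge: Judge, seed: int = 0
) -> dict[str, int]:
    """Count wins for 'old' vs 'new' with presentation order randomized per pair."""
    rng = random.Random(seed)
    wins = {"old": 0, "new": 0}
    for old_output, new_output in pairs:
        labelled = [("old", old_output), ("new", new_output)]
        rng.shuffle(labelled)  # the judge never learns which version is which
        pick = judge(labelled[0][1], labelled[1][1])
        wins[labelled[pick][0]] += 1
    return wins


# A judge that always prefers the first position shows why shuffling matters:
# its bias gets spread across both versions instead of favoring one.
always_first: Judge = lambda first, second: 0
print(blind_compare([("old answer", "new answer")] * 200, always_first))
```

Shuffling does not remove position bias from the judge; it only stops that bias from systematically favoring one version, which is why calibration against human labels is still needed.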
Outcome-only grading is unsafe in three cases.
Side-effecting tasks: an agent that fires irreversible API calls or emails en route to a correct final state passes the grader while causing damage; add intermediate-step constraints.
Compliance paths: finance, healthcare, and security require specific procedural steps regardless of outcome; skipping an audit gate is non-compliant even if the end state is right.
Trace-as-deliverable: when the reasoning chain is the output, path quality is the correctness criterion.
Outside these cases, the 2025–26 norm holds: outcome metrics decide correctness, and trajectory data stays a diagnostic signal, not a gate.
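These exceptions can be layered on top of an outcome check without returning to full path grading. The sketch below assumes the trace is a flat list of tool names; the forbidden tools and the required gate are hypothetical names chosen for illustration. Only the listed constraints affect the verdict, while other trajectory data is recorded as diagnostics.

```python
from dataclasses import dataclass, field

# Hypothetical policy: calls that must never happen, and gates that must.
FORBIDDEN_TOOLS = {"send_email", "delete_records"}
REQUIRED_GATES = ["audit_review"]


@dataclass
class Verdict:
    passed: bool
    violations: list[str] = field(default_factory=list)
    diagnostics: dict = field(default_factory=dict)


def grade(trace: list[str], final_state_ok: bool) -> Verdict:
    """Outcome decides correctness; explicit constraints can veto it."""
    violations = [f"forbidden call: {t}" for t in trace if t in FORBIDDEN_TOOLS]
    violations += [f"missing gate: {g}" for g in REQUIRED_GATES if g not in trace]

    # Step count is reported for analysis but never gates the result.
    diagnostics = {"steps": len(trace)}

    return Verdict(
        passed=final_state_ok and not violations,
        violations=violations,
        diagnostics=diagnostics,
    )


# Correct end state, but an email was sent and the audit gate was skipped.
print(grade(["read_db", "send_email", "write_db"], final_state_ok=True))
```

The constraints name what must or must not occur, not the order of every step, so valid alternative solutions still pass.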
Retrieval practice — recall, don't peek
Question 1: Path-based grading gets more misleading as agents improve because…
Question 2: The most reliable outcome grader for a coding agent is…
Question 3: A pre-registered study found an LLM-judged "quality gain" vanished when outputs were graded by…
Question 4: Outcome-only grading is unsafe for a task that…
Question 5 · spaced recall from Lesson 9: A Golden Journey gates completion on the system…