Capstone

Verifying Agent Work · ~10 min

The Verification Decision Table

Eleven lessons, one reflex: never accept a self-reported "done" — reach for the check that fits the claim. This is the whole course as a lookup table, and a mixed review to prove it stuck.

Why this, for you: verification is a habit, not a checklist you run once. When an agent says it's finished, the instinct is to believe the polished summary. This table retrains that instinct: every claim has a matching check, and the right check is almost never "ask the agent if it's sure."

The through-line of the whole course: treat every claim as unverified until evidence external to the agent confirms it. Pick the check by the shape of the claim — deterministic where you can, scoped judgment where you must, population statistics where the failure is statistical.

1 Symptom → move

| Symptom | Move | From lesson |
| --- | --- | --- |
| The agent's polished summary says "done" | Verify against external ground truth: run it, fetch it, diff it; never re-read the prose | L1 |
| A property must always hold and can be checked | Promote it from a prompt to a deterministic guardrail (hook, CI gate, schema) | L2 |
| "Build passed, tests green", but did they? | A verification ledger: every check an INSERT (tool, command, exit code); evidence is a SELECT | L3 |
| A wrong assumption cascaded through 500 lines | Verify per unit: checkpoint between steps; save known-good and restore on fail | L4 |
| The agent declared done on partial, unverified work | A pre-completion checklist as a hook: block the completion signal until specific items pass | L5 |
| You need correctness as a deterministic exit condition | Red-green-refactor in separate invocations; "do not change the tests" | L6 |
| A claim no test or type checker reaches (API recall, citation) | Factored chain-of-verification: answer each question without the draft in context | L7 |
| Exact-match tests fail correct-but-different agent output | Behavioral testing: grade decisions and end-state; threshold as a product call | L8 |
| Unit tests pass but the running system is left broken | Golden Journeys with per-step failure signals; gate on a clean restart | L9 |
| Path-based grader marks valid alternative solutions as fails | Grade the outcome (state), not the path; keep the LLM judge off mechanical checks | L10 |
| Per-trace evals pass, yet a pattern across runs is broken | Macro evals over the corpus, if volume, judge precision, and structure clear the floor | L11 |
| A skill you keep editing has no signal that it still works | Skill evals: two axes (output, trigger), paired isolated runs, ship on the delta | L11 |
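
The ledger row (L3) is the easiest move to picture in code. What follows is a minimal sketch, not the course's implementation: it assumes a SQLite file as the ledger, and the table name `checks` and the helpers `open_ledger`, `run_check`, and `is_verified` are hypothetical names chosen for illustration. It records every check as an INSERT and answers "was this verified?" with a SELECT.

```python
import sqlite3
import subprocess
import sys


def open_ledger(path=":memory:"):
    """Open the ledger and create the (hypothetical) checks table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        "id INTEGER PRIMARY KEY, tool TEXT, command TEXT, exit_code INTEGER)"
    )
    return db


def run_check(db, tool, argv):
    """Run a command and INSERT what actually happened, pass or fail."""
    result = subprocess.run(argv, capture_output=True, text=True)
    db.execute(
        "INSERT INTO checks (tool, command, exit_code) VALUES (?, ?, ?)",
        (tool, " ".join(argv), result.returncode),
    )
    db.commit()
    return result.returncode


def is_verified(db, tool):
    """Evidence is a SELECT: the latest recorded run of this tool exited 0."""
    row = db.execute(
        "SELECT exit_code FROM checks WHERE tool = ? ORDER BY id DESC LIMIT 1",
        (tool,),
    ).fetchone()
    return row is not None and row[0] == 0


if __name__ == "__main__":
    db = open_ledger()
    # Stand-in for a real test command; it simply exits with status 0.
    run_check(db, "tests", [sys.executable, "-c", "raise SystemExit(0)"])
    print(is_verified(db, "tests"))  # True
    print(is_verified(db, "build"))  # False: no INSERT, no verification
```

The design choice worth noticing is that `is_verified` never reads the agent's summary. A claim with no matching row is simply unverified, whatever the prose says.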

2 The one rule under all of it

Evidence beats narration, and the strongest evidence is the cheapest reliable check the claim admits. Use deterministic checks where possible (tests, types, exit codes), a scoped human reviewer or calibrated judge where the property is subjective, and population statistics where the failure only exists across runs. The agent never grades its own work.
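
As a rough illustration of picking the check by the shape of the claim, the rule can be written as a three-way lookup. This is a sketch under stated assumptions: the `Check` enum, the `pick_check` function, and its two boolean inputs are hypothetical, and the precedence (deterministic first) is one reading of "deterministic where possible", not something the course specifies as code.

```python
from enum import Enum


class Check(Enum):
    DETERMINISTIC = "tests, types, exit codes"
    POPULATION_STATS = "statistics across runs"
    SCOPED_JUDGMENT = "scoped human or calibrated judge"


def pick_check(has_oracle: bool, only_visible_across_runs: bool) -> Check:
    """Map the shape of a claim to the kind of external check it admits."""
    if has_oracle:
        # A deterministic check can decide the claim, so prefer it.
        return Check.DETERMINISTIC
    if only_visible_across_runs:
        # No single run shows the failure; it needs a population of runs.
        return Check.POPULATION_STATS
    # What remains is subjective: use a scoped human or calibrated judge.
    return Check.SCOPED_JUDGMENT


if __name__ == "__main__":
    print(pick_check(True, False).name)   # DETERMINISTIC
    print(pick_check(False, True).name)   # POPULATION_STATS
    print(pick_check(False, False).name)  # SCOPED_JUDGMENT
```

Note that no branch returns "ask the agent": every outcome is a check that lives outside the agent's own report.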

The parts build the same reflex at different layers. The Trust Problem (L1–L3): polish isn't proof, so enforce properties with guardrails and record evidence as data, not prose. Verifying As You Build (L4–L7): catch errors at the seam, gate the "done," let tests be the spec, and self-verify only the claims no oracle reaches. Evaluating Behavior (L8–L11): test decisions not paths, gate on a clean restart, grade state with an honest judge, and add the population and unit layers when the workload supplies them. Every move is the same move — replace self-report with an external check — applied where it bites.
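
Gating the "done" can be sketched in a few lines. The example below assumes the checklist is a plain mapping from item name to a zero-argument callable; `gate_completion` and the item names in the demo are hypothetical, and a real hook would be wired into whatever completion signal your agent harness exposes.

```python
from typing import Callable, Dict, List, Tuple

Checklist = Dict[str, Callable[[], bool]]


def gate_completion(checklist: Checklist) -> Tuple[bool, List[str]]:
    """Return (allowed, failing_items); a check that raises counts as failing."""
    failing = []
    for name, check in checklist.items():
        try:
            ok = bool(check())
        except Exception:
            # A crashed check is not evidence of success.
            ok = False
        if not ok:
            failing.append(name)
    return (not failing, failing)


if __name__ == "__main__":
    # Hypothetical items; in practice each would run a real external check.
    checklist = {
        "tests_pass": lambda: True,
        "service_restarts_cleanly": lambda: False,
    }
    allowed, failing = gate_completion(checklist)
    print(allowed)  # False
    print(failing)  # ['service_restarts_cleanly']
```

Returning the failing item names, rather than a bare boolean, gives the agent something specific to fix. Keep every item satisfiable, or the gate turns into a deadlock.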

Don't over-verify

Every lesson had a backfire box, and they rhyme. Verification is an investment that pays off on recurring, high-stakes work, not on a throwaway script. Watch for verification theater (tests that don't cover the change), alert fatigue (noisy checks that train people to bypass them), deadlocks (unsatisfiable checklist items), and evals run below their floors (macro evals under ~1,000 traces, a judge under ~70% precision). Calibrate to stakes; automate the cheap checks; reserve the expensive ones for what's irreversible or security-critical.
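
The two numeric floors above are easy to encode as a pre-flight check before trusting a macro eval. This is a minimal sketch: the constants mirror the approximate figures in the text, while the function name `macro_eval_blockers` and its parameters are hypothetical. The structure requirement is left out because the text gives no number for it.

```python
from typing import List

MIN_TRACES = 1_000          # approximate floor from the text (~1,000 traces)
MIN_JUDGE_PRECISION = 0.70  # approximate floor from the text (~70% precision)


def macro_eval_blockers(trace_count: int, judge_precision: float) -> List[str]:
    """List the reasons a macro eval would be theater; empty means it clears."""
    blockers = []
    if trace_count < MIN_TRACES:
        blockers.append(f"only {trace_count} traces (floor ~{MIN_TRACES})")
    if judge_precision < MIN_JUDGE_PRECISION:
        blockers.append(
            f"judge precision {judge_precision:.0%} (floor ~{MIN_JUDGE_PRECISION:.0%})"
        )
    return blockers


if __name__ == "__main__":
    print(macro_eval_blockers(5_000, 0.90))  # []
    print(macro_eval_blockers(200, 0.50))
```

Treat both thresholds as rough guides, as the "~" in the text suggests, and tune them to your own workload.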

↪ Your win: a verification reflex

Mixed review — across all eleven lessons

Question 1 · from L1: The reframe the whole course rests on is that a self-reported "done" is…

Question 2 · from L3: In a verification ledger, the rule "no INSERT, no verification" means evidence must be…

Question 3 · from L6: "Do not change the tests" exists in the green phase to stop the agent from…

Question 4 · from L7: Naive intrinsic self-correction on code was measured to flip correct solutions to wrong about…

Question 5 · from L11: Macro evals are theater below their floor, roughly when trace volume is under…

You finished the course. Ask me to apply the decision table to a real repo — wire a pre-completion hook, a verification ledger, and a starter eval set — or to audit where your current workflow still trusts a self-reported "done." Or revisit any lesson; the checks compound when you stack them.