Verifying Agent Work · ~10 min
Eleven lessons, one reflex: never accept a self-reported "done" — reach for the check that fits the claim. This is the whole course as a lookup table, and a mixed review to prove it stuck.
The through-line: treat every claim as unverified until evidence external to the agent confirms it. Pick the check by the shape of the claim: deterministic where you can, scoped judgment where you must, population statistics where the failure is statistical.
| Symptom | Move | From |
|---|---|---|
| The agent's polished summary says "done" | Verify against external ground truth — run it, fetch it, diff it; never re-read the prose | L1 |
| A property must always hold and can be checked | Promote it from a prompt to a deterministic guardrail (hook, CI gate, schema) | L2 |
| "Build passed, tests green" — but did they? | A verification ledger: every check an INSERT (tool, command, exit code); evidence is a SELECT | L3 |
| A wrong assumption cascaded through 500 lines | Verify per unit — checkpoint between steps; save known-good and restore on fail | L4 |
| The agent declared done on partial, unverified work | A pre-completion checklist as a hook — block the completion signal until specific items pass | L5 |
| You need correctness as a deterministic exit condition | Red-green-refactor in separate invocations; "do not change the tests" | L6 |
| A claim no test or type checker reaches (API recall, citation) | Factored chain-of-verification — answer each question without the draft in context | L7 |
| Exact-match tests fail correct-but-different agent output | Behavioral testing: grade decisions and end-state; set the pass threshold as a product decision | L8 |
| Unit tests pass but the running system is left broken | Golden Journeys with per-step failure signals; gate on a clean restart | L9 |
| Path-based grader marks valid alternative solutions as fails | Grade the outcome (state), not the path; keep the LLM judge off mechanical checks | L10 |
| Per-trace evals pass, yet a pattern across runs is broken | Macro evals over the corpus — if volume, judge precision, and structure clear the floor | L11 |
| A skill you keep editing has no signal that it still works | Skill evals — two axes (output, trigger), paired isolated runs, ship on the delta | L11 |
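
To make one row of the table concrete, here is a minimal sketch of the L3 verification ledger using Python's built-in `sqlite3` and `subprocess` modules. The table name `checks`, its columns, and the helper names `open_ledger`, `run_check`, and `is_verified` are illustrative assumptions, not a schema the course prescribes. The point it shows is the rule itself: a check counts only if it produced a row, and "is this verified?" is answered by a query, not by prose.

```python
import sqlite3
import subprocess
import sys


def open_ledger(path=":memory:"):
    """Open (or create) the ledger. The `checks` schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        "id INTEGER PRIMARY KEY, "
        "tool TEXT NOT NULL, "
        "command TEXT NOT NULL, "
        "exit_code INTEGER NOT NULL)"
    )
    return conn


def run_check(conn, tool, argv):
    """Run a command and record it. The INSERT is the verification."""
    result = subprocess.run(argv, capture_output=True)
    conn.execute(
        "INSERT INTO checks (tool, command, exit_code) VALUES (?, ?, ?)",
        (tool, " ".join(argv), result.returncode),
    )
    conn.commit()
    return result.returncode


def is_verified(conn, tool):
    """Evidence is a SELECT: the most recent recorded run of `tool` exited 0."""
    row = conn.execute(
        "SELECT exit_code FROM checks WHERE tool = ? ORDER BY id DESC LIMIT 1",
        (tool,),
    ).fetchone()
    # No row means no INSERT, which means no verification.
    return row is not None and row[0] == 0


if __name__ == "__main__":
    ledger = open_ledger()
    # Stand-in commands; a real ledger would record the actual test and build runs.
    run_check(ledger, "tests", [sys.executable, "-c", "raise SystemExit(0)"])
    print(is_verified(ledger, "tests"), is_verified(ledger, "build"))
```

A claim such as "build passed" with no matching row is treated the same as a failed build, which is the behavior the ledger exists to enforce.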
The parts build the same reflex at different layers. The Trust Problem (L1–L3): polish isn't proof, so enforce properties with guardrails and record evidence as data, not prose. Verifying As You Build (L4–L7): catch errors at the seam, gate the "done," let tests be the spec, and self-verify only the claims no oracle reaches. Evaluating Behavior (L8–L11): test decisions not paths, gate on a clean restart, grade state with an honest judge, and add the population and unit layers when the workload supplies them. Every move is the same move — replace self-report with an external check — applied where it bites.
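
The "gate the done" move can be sketched in a few lines. This is a hedged illustration only: `ChecklistItem`, `failing_items`, and `gate_completion` are hypothetical names, and the sketch does not model any particular agent framework's hook API. It shows the shape of the idea: the completion signal is allowed only when every named external check passes, and a check that crashes counts as a failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ChecklistItem:
    """One pre-completion item: a name plus an external check (hypothetical type)."""
    name: str
    check: Callable[[], bool]


def failing_items(items: List[ChecklistItem]) -> List[str]:
    """Return the names of items that fail; an exception counts as a failure."""
    failures = []
    for item in items:
        try:
            passed = bool(item.check())
        except Exception:
            passed = False
        if not passed:
            failures.append(item.name)
    return failures


def gate_completion(items: List[ChecklistItem]) -> Tuple[bool, List[str]]:
    """Allow "done" only when nothing fails; otherwise report what blocks it."""
    failures = failing_items(items)
    return (not failures, failures)


if __name__ == "__main__":
    def broken_check() -> bool:
        raise RuntimeError("service did not restart")

    checklist = [
        ChecklistItem("tests pass", lambda: True),
        ChecklistItem("clean restart", broken_check),
    ]
    allowed, blockers = gate_completion(checklist)
    print(allowed, blockers)
```

Returning the list of blockers, rather than a bare boolean, gives the agent something specific to fix. Each item still has to be satisfiable, or the gate becomes a deadlock.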
Every lesson had a backfire box, and they rhyme. Verification is an investment that pays off on recurring, high-stakes work, not on a throwaway script. Watch for verification theater (tests that don't cover the change), alert fatigue (noisy checks that train people to bypass them), deadlocks (unsatisfiable checklist items), and evals run below their floors (macro evals under ~1,000 traces, a judge under ~70% precision). Calibrate to stakes; automate the cheap checks; reserve the expensive ones for what's irreversible or security-critical.
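
The floors mentioned above can be checked mechanically before a macro eval is trusted. The sketch below assumes the approximate thresholds from the text (about 1,000 traces and about 70% judge precision) as hard constants. The function names, and the choice to measure precision against a small human-labeled sample, are assumptions made for illustration. The third floor, structure, is not modeled here.

```python
from typing import List, Optional, Tuple

MIN_TRACES = 1_000           # approximate floor from the text, treated as exact here
MIN_JUDGE_PRECISION = 0.70   # approximate floor from the text, treated as exact here


def judge_precision(judge_flags: List[bool], human_labels: List[bool]) -> Optional[float]:
    """Precision of the judge: of the traces it flagged, the share humans agree with."""
    agreed = [human for judged, human in zip(judge_flags, human_labels) if judged]
    if not agreed:
        return None  # the judge flagged nothing, so precision is undefined
    return sum(agreed) / len(agreed)


def macro_eval_clears_floor(trace_count: int, precision: Optional[float]) -> Tuple[bool, List[str]]:
    """Return whether the macro eval is worth running, plus the reasons if it is not."""
    reasons = []
    if trace_count < MIN_TRACES:
        reasons.append(f"only {trace_count} traces; floor is {MIN_TRACES}")
    if precision is None:
        reasons.append("judge precision is unmeasured")
    elif precision < MIN_JUDGE_PRECISION:
        reasons.append(f"judge precision {precision:.2f} is below {MIN_JUDGE_PRECISION:.2f}")
    return (not reasons, reasons)


if __name__ == "__main__":
    # Four human-labeled traces: the judge flagged three, humans agreed on two.
    p = judge_precision([True, True, True, False], [True, True, False, True])
    print(macro_eval_clears_floor(400, p))
```

When either floor is missed, the honest result is "not enough signal yet", not a pass or a fail.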
Mixed review — across all eleven lessons
Question 1 · from L1
The reframe the whole course rests on is that a self-reported "done" is…
Question 2 · from L3
In a verification ledger, the rule "no INSERT, no verification" means evidence must be…
Question 3 · from L6
"Do not change the tests" exists in the green phase to stop the agent from…
Question 4 · from L7
Naive intrinsic self-correction on code was measured to flip correct solutions to wrong about…
Question 5 · from L11
Macro evals are theater below their floor, roughly when trace volume is under…