The Verification Decision Table

Eleven lessons, one reflex: never accept a self-reported "done" — reach for the check that fits the claim. This is the whole course as a lookup table, and a mixed review to prove it stuck.

Why this, for you: verification is a habit, not a checklist you run once. When an agent says it's finished, the instinct is to believe the polished summary. This table retrains that instinct: every claim has a matching check, and the right check is almost never "ask the agent if it's sure."

The through-line of the whole course: treat every claim as unverified until evidence external to the agent confirms it. Pick the check by the shape of the claim — deterministic where you can, scoped judgment where you must, population statistics where the failure is statistical.

1 Symptom → move

Symptom	Move	From
The agent's polished summary says "done"	Verify against external ground truth — run it, fetch it, diff it; never re-read the prose	L1
A property must always hold and can be checked	Promote it from a prompt to a deterministic guardrail (hook, CI gate, schema)	L2
"Build passed, tests green" — but did they?	A verification ledger: every check an INSERT (tool, command, exit code); evidence is a SELECT	L3
A wrong assumption cascaded through 500 lines	Verify per unit — checkpoint between steps; save known-good and restore on fail	L4
The agent declared done on partial, unverified work	A pre-completion checklist as a hook — block the completion signal until specific items pass	L5
You need correctness as a deterministic exit condition	Red-green-refactor in separate invocations; "do not change the tests"	L6
A claim no test or type checker reaches (API recall, citation)	Factored chain-of-verification — answer each question without the draft in context	L7
Exact-match tests fail correct-but-different agent output	Behavioral testing — grade decisions and end-state; threshold as a product call	L8
Unit tests pass but the running system is left broken	Golden Journeys with per-step failure signals; gate on a clean restart	L9
Path-based grader marks valid alternative solutions as fails	Grade the outcome (state), not the path; keep the LLM judge off mechanical checks	L10
Per-trace evals pass, yet a pattern across runs is broken	Macro evals over the corpus — if volume, judge precision, and structure clear the floor	L11
A skill you keep editing has no signal that it still works	Skill evals — two axes (output, trigger), paired isolated runs, ship on the delta	L11
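
The ledger row above (every check an INSERT, evidence a SELECT) can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the table name `checks`, its columns, and the helpers `open_ledger`, `record_check`, and `evidence` are assumptions made for the example, and the two commands are stand-ins for a real test run and build.

```python
import sqlite3
import subprocess
import sys


def open_ledger(path=":memory:"):
    """Open (or create) the ledger. Table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        "id INTEGER PRIMARY KEY, tool TEXT NOT NULL, "
        "command TEXT NOT NULL, exit_code INTEGER NOT NULL)"
    )
    return conn


def record_check(conn, tool, argv):
    """Run the command and INSERT what actually happened, not what was claimed."""
    result = subprocess.run(argv, capture_output=True)
    conn.execute(
        "INSERT INTO checks (tool, command, exit_code) VALUES (?, ?, ?)",
        (tool, " ".join(argv), result.returncode),
    )
    conn.commit()
    return result.returncode


def evidence(conn, tool):
    """Evidence is a SELECT: the latest recorded exit code, or None if no row."""
    row = conn.execute(
        "SELECT exit_code FROM checks WHERE tool = ? ORDER BY id DESC LIMIT 1",
        (tool,),
    ).fetchone()
    return None if row is None else row[0]


# Stand-in commands: one check that exits 0, one that exits 3.
ledger = open_ledger()
record_check(ledger, "tests", [sys.executable, "-c", "raise SystemExit(0)"])
record_check(ledger, "build", [sys.executable, "-c", "raise SystemExit(3)"])
```

Asking `evidence(ledger, "lint")` returns `None` here, which is the point of "no INSERT, no verification": a check that was never recorded cannot be cited as passed.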

2 The one rule under all of it

Evidence beats narration, and the strongest evidence is the cheapest reliable check the claim admits. Deterministic where possible (tests, types, exit codes); scoped human or calibrated-judge where the property is subjective; population statistics where the failure only exists across runs. The agent never grades its own work.
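
One way to read "the agent never grades its own work" is as a gate that looks only at recorded exit codes and ignores the agent's summary entirely. The sketch below assumes a hypothetical `CheckResult` record and a hypothetical `may_signal_done` helper; neither name comes from the course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical record of one external check."""
    name: str
    exit_code: int  # taken from a real run, never from the agent's summary


def may_signal_done(required, results):
    """Return (allowed, blockers). Missing evidence blocks just like a failure."""
    by_name = {r.name: r for r in results}
    blockers = []
    for name in required:
        result = by_name.get(name)
        if result is None:
            blockers.append(f"{name}: no evidence recorded")
        elif result.exit_code != 0:
            blockers.append(f"{name}: exit code {result.exit_code}")
    return (not blockers, blockers)


# The tests ran and passed, but the type check was never run.
allowed, blockers = may_signal_done(
    ["tests", "typecheck"],
    [CheckResult("tests", 0)],
)
```

In this example `allowed` is `False` and `blockers` names the missing type check, so the completion signal stays blocked however confident the summary sounds.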

The parts build the same reflex at different layers. The Trust Problem (L1–L3): polish isn't proof, so enforce properties with guardrails and record evidence as data, not prose. Verifying As You Build (L4–L7): catch errors at the seam, gate the "done," let tests be the spec, and self-verify only the claims no oracle reaches. Evaluating Behavior (L8–L11): test decisions not paths, gate on a clean restart, grade state with an honest judge, and add the population and unit layers when the workload supplies them. Every move is the same move — replace self-report with an external check — applied where it bites.
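
"Test decisions not paths" and "grade state" can be illustrated with a grader that never inspects the steps an agent took. This is a sketch under assumptions: the run structure, the state keys, and the function name `grade_outcome` are all hypothetical.

```python
def grade_outcome(final_state, expected):
    """Pass if every expected key holds the expected value in the final state."""
    mismatches = {
        key: (final_state.get(key), want)
        for key, want in expected.items()
        if final_state.get(key) != want
    }
    return (not mismatches, mismatches)


expected = {"config.timeout": 30, "service.running": True}

# Two hypothetical runs that took different paths to the same end state.
run_a = {
    "steps": ["edit", "restart"],
    "state": {"config.timeout": 30, "service.running": True},
}
run_b = {
    "steps": ["restart", "edit", "restart"],
    "state": {"config.timeout": 30, "service.running": True},
}
# A run that followed the "expected" path but left the service down.
run_c = {
    "steps": ["edit", "restart"],
    "state": {"config.timeout": 30, "service.running": False},
}

passed_a, _ = grade_outcome(run_a["state"], expected)
passed_b, _ = grade_outcome(run_b["state"], expected)
passed_c, diff_c = grade_outcome(run_c["state"], expected)
```

Runs A and B both pass although their step lists differ, while run C fails on `service.running` despite matching A's path. A path-based grader would get both of those cases backwards.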

Don't over-verify

Every lesson had a backfire box, and they rhyme. Verification is an investment that pays off on recurring, high-stakes work, not on a throwaway script. Watch for verification theater (tests that don't cover the change), alert fatigue (noisy checks that train people to bypass them), deadlocks (unsatisfiable checklist items), and floors (macro evals under ~1,000 traces, a judge under ~70% precision). Calibrate to stakes; automate the cheap checks; reserve the expensive ones for what's irreversible or security-critical.
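
The floors mentioned above can be turned into an explicit precondition before any macro eval is trusted. The sketch treats the approximate figures from the text (~1,000 traces, ~70% judge precision) as hard constants purely for illustration; the names `MIN_TRACES`, `MIN_JUDGE_PRECISION`, and `macro_eval_clears_floor` are assumptions, and the "structure" condition from the table is not modeled.

```python
# Approximate floors from the text, fixed here as constants for illustration.
MIN_TRACES = 1_000
MIN_JUDGE_PRECISION = 0.70


def macro_eval_clears_floor(trace_count, judge_precision):
    """Return (ok, reasons). Below either floor, the macro eval is not trusted."""
    reasons = []
    if trace_count < MIN_TRACES:
        reasons.append(f"only {trace_count} traces (floor {MIN_TRACES})")
    if judge_precision < MIN_JUDGE_PRECISION:
        reasons.append(
            f"judge precision {judge_precision:.2f} (floor {MIN_JUDGE_PRECISION:.2f})"
        )
    return (not reasons, reasons)


ok_small, why_small = macro_eval_clears_floor(400, 0.85)
ok_large, why_large = macro_eval_clears_floor(5_000, 0.82)
```

With 400 traces the check fails on volume alone even though the judge is precise enough; with 5,000 traces and 0.82 precision it clears both floors.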

↪ Your win: a verification reflex

Read the symptom, reach for the check — the table above is the whole course in one glance.
Evidence over narration — run it, diff it, query it; the agent never grades itself.
Cheapest reliable check first — deterministic where you can, judgment only where you must.
Match the layer to the failure — per-unit, per-trace, per-corpus, per-skill.
Calibrate to stakes — full verification for the irreversible; spot-checks for the cheap.

Mixed review — across all eleven lessons

Question 1 · from L1: The reframe the whole course rests on is that a self-reported "done" is…

Question 2 · from L3: In a verification ledger, the rule "no INSERT, no verification" means evidence must be…

Question 3 · from L6: "Do not change the tests" exists in the green phase to stop the agent from…

Question 4 · from L7: Naive intrinsic self-correction on code was measured to flip correct solutions to wrong about…

Question 5 · from L11: Macro evals are theater below their floor, roughly when trace volume is under…

You finished the course. Ask me to apply the decision table to a real repo — wire a pre-completion hook, a verification ledger, and a starter eval set — or to audit where your current workflow still trusts a self-reported "done." Or revisit any lesson; the checks compound when you stack them.
