Verifying Agent Work · ~7 min
Agents optimize for completion, not correctness. Left alone, one declares "done" after a partial build and a test run it chose not to investigate. So block the completion signal until a fixed sequence has passed.
Without an explicit gate, an agent declares success after partial implementation, a failing test it didn't investigate, or code that compiles but doesn't satisfy the requirement. A pre-completion checklist intercepts the completion signal and forces a verification sequence first.
Each phase must complete before the next begins. The checklist is not a suggestion — it is a gate. Forcing the agent to re-engage with the original requirement after output is generated creates a second pass that catches drift between intent and implementation (the same backward-checking that Weng et al., 2022, found improves accuracy across arithmetic, commonsense, and logic).
Implement the gate three ways, strongest last: as a mandatory final step in the system prompt; as a
PostToolUse hook that detects completion signals and injects the checklist as a continuation prompt
(exit 2 to block); or as dedicated middleware that wraps the completion path and blocks until the checklist returns
PASS. The combination — hook plus explicit prompt instruction — is what makes it run even on a long session.
Effective items are specific and verifiable. "Run the test suite and confirm all tests pass" — not "check your work." "Confirm each acceptance criterion is met" — not "review the requirement." A vague item nominally passes without verifying anything; the agent satisfies the surface form of the instruction, not its intent.
Unsatisfiable items deadlock. If the agent can't make a failing test pass — flawed test, contradictory requirement, missing capability — the checklist becomes an infinite loop; add a retry cap and an escalation path. Vague items give false confidence — "check your work" passes without checking. And latency compounds: each pass is a full round-trip, so a gate at every stage of a long pipeline can cost more than just running the end-to-end tests directly.
Retrieval practice — recall, don't peek
Question 1Agents declare "done" prematurely because they optimize for…
Question 2A checklist enforced as a hook beats one in the prompt because it…
Question 3A checklist item like "check your work" fails because it is…
Question 4An unsatisfiable checklist item risks creating…
Question 5 · spaced recall from Lesson 1The danger zone for agent errors is when the answer is…
PostToolUse pre-completion hook tuned to your stack, or help
turning a vague "looks good" gate into specific, observable items? Next: Red-Green for Agents — tests as
the spec, run as separate invocations.