Verifying Agent Work · ~7 min
Prose claims are unfalsifiable. Turn every check into a row — tool, command, exit code — and the evidence becomes a query, not a sentence you have to trust.
pytest exited 0" — and how you detect a regression the agent introduced versus one already there.Agent workflows lean on prose assertions: "Build passed. Tests green. No issues found." These claims are unfalsifiable within the conversation. The agent may hallucinate a pass, skip a step silently, or assert a result without running the tool at all.
A verification ledger records each check as structured data — never prose. The core rule: every verification step is an INSERT with the tool name, command, exit code, and output snippet. The evidence bundle is a SELECT, not agent-written narration.
Before making any change, the agent captures current state — IDE diagnostics, build exit code, test results — and
INSERTs them with phase = 'baseline'. After changes, it re-runs the same checks as phase =
'after'. Any check that was passed=1 before and passed=0 after reveals a
regression the agent introduced — not a pre-existing failure you'd otherwise have blamed it for.
Gates prevent the agent from skipping ahead, enforced through queries: "Do NOT proceed to implementation until
baseline INSERTs are complete." "Do NOT present evidence until SELECT COUNT(*) ... WHERE phase='after'
returns sufficient rows." If a check is skipped, the missing after row makes the gap visible and the
gate blocks completion. Confidence is derived: "High" means all tiers passed and reviewers found zero
issues; "Low" means a check failed or a concern is unresolved.
You don't need SQL. The same principle runs on a JSON file (append a record per check, read it back for the bundle), inline structured output (a fixed schema per check, parseable by gates), or CI integration (pipe records into existing reporting). The constraint is the same: evidence must be produced by tool calls, not asserted by the agent.
If the same agent runs the tool and writes the row, it can fake exit codes or skip the INSERT when a check fails. The ledger only holds when execution and recording are separated — CI, a harness, or a hook does the writing. Two more traps: flaky checks faithfully record noise (green rows mislead), and complete rows for the wrong surface — unit tests when the change breaks an integration contract — produce a clean bundle for a broken change. Ledger completeness is not verification completeness.
Retrieval practice — recall, don't peek
Question 1In a verification ledger, the evidence bundle is produced by…
Question 2Capturing a baseline phase before changes mainly enables…
Question 3The ledger can be gamed when the same agent…
Question 4"Ledger completeness is not verification completeness" warns that…
Question 5 · spaced recall from Lesson 2A guardrail differs from a prompt in that it…