Part 1 · The Trust Problem

Verifying Agent Work · ~7 min

The Verification Ledger

Prose claims are unfalsifiable. Turn every check into a row — tool, command, exit code — and the evidence becomes a query, not a sentence you have to trust.

Why this, for you: guardrails (Lesson 2) enforce properties; the ledger makes the agent's account of what it checked auditable. It's how you get from "the agent said the build passed" to "row 14 shows pytest exited 0" — and how you detect a regression the agent introduced versus one already there.

Agent workflows lean on prose assertions: "Build passed. Tests green. No issues found." These claims are unfalsifiable within the conversation. The agent may hallucinate a pass, skip a step silently, or assert a result without running the tool at all.

1 Every check is a row

A verification ledger records each check as structured data — never prose. The core rule: every verification step is an INSERT with the tool name, command, exit code, and output snippet. The evidence bundle is a SELECT, not agent-written narration.

If the INSERT did not happen, the verification did not happen. The bundle is generated from a query, which eliminates hallucinated results — the agent cannot present a passing row that does not exist.
# Burke Holland's Anvil agent — the ledger as a SQL table CREATE TABLE anvil_checks ( task_id TEXT, phase TEXT CHECK(phase IN ('baseline','after','review')), check_name TEXT, tool TEXT, command TEXT, exit_code INTEGER, output_snippet TEXT, passed INTEGER CHECK(passed IN (0,1)), ts DATETIME );

2 Baseline first — so you can spot the regression

Before making any change, the agent captures current state — IDE diagnostics, build exit code, test results — and INSERTs them with phase = 'baseline'. After changes, it re-runs the same checks as phase = 'after'. Any check that was passed=1 before and passed=0 after reveals a regression the agent introduced — not a pre-existing failure you'd otherwise have blamed it for.

# evidence bundle is a SELECT, not a sentence | Phase | Check | Tool | Exit | Passed | | baseline | unit-tests | pytest | 0 | yes | | after | unit-tests | pytest | 0 | yes | Regressions: 0. Confidence: High # derived from data, not judgment

3 Gate on the data, not on trust

Gates prevent the agent from skipping ahead, enforced through queries: "Do NOT proceed to implementation until baseline INSERTs are complete." "Do NOT present evidence until SELECT COUNT(*) ... WHERE phase='after' returns sufficient rows." If a check is skipped, the missing after row makes the gap visible and the gate blocks completion. Confidence is derived: "High" means all tiers passed and reviewers found zero issues; "Low" means a check failed or a concern is unresolved.

You don't need SQL. The same principle runs on a JSON file (append a record per check, read it back for the bundle), inline structured output (a fixed schema per check, parseable by gates), or CI integration (pipe records into existing reporting). The constraint is the same: evidence must be produced by tool calls, not asserted by the agent.

The ledger's fatal flaw: who writes the rows

If the same agent runs the tool and writes the row, it can fake exit codes or skip the INSERT when a check fails. The ledger only holds when execution and recording are separated — CI, a harness, or a hook does the writing. Two more traps: flaky checks faithfully record noise (green rows mislead), and complete rows for the wrong surface — unit tests when the change breaks an integration contract — produce a clean bundle for a broken change. Ledger completeness is not verification completeness.

↪ Your win: evidence as data, not narration

Retrieval practice — recall, don't peek

Question 1In a verification ledger, the evidence bundle is produced by…

Question 2Capturing a baseline phase before changes mainly enables…

Question 3The ledger can be gamed when the same agent…

Question 4"Ledger completeness is not verification completeness" warns that…

Question 5 · spaced recall from Lesson 2A guardrail differs from a prompt in that it…

Ask me anything. Want a lightweight JSON-file ledger with a baseline/after gate you can drop into a coding agent, or a hook that writes rows so the agent can't fake them? Next, Part 2 opens with Check at Each Step — verifying as you build instead of all at the end.
✎ Feedback