Reference · Canonical Language

Verifying Agent Work — Glossary

The working vocabulary for this course. Once a term lives here, every lesson uses this word for it. Grows as we go.

The Trust Problem

Trust without verify · the trust trap: Accepting agent output as correct because it looks polished — fluent prose, inline citations, code that compiles. None of those surface signals correlate reliably with correctness; fluency is a separate objective from accuracy, and an agent is most dangerous when almost right.; Avoid: "it looks right" as a verification method — looking right is the failure mode, not the check.; Source: trust-without-verify · Steyvers et al., 2025
Self-reported verification: An agent's own prose claim that a check passed ("Build passed. Tests green."). Unfalsifiable within the conversation — the agent may hallucinate a pass, skip a step silently, or assert a result without running the tool. A checkpoint that reads the agent's narration is not a checkpoint.; Source: verification-ledger · trust-without-verify
Calibrated verification: Matching verification effort to stakes rather than applying universal paranoia: always verify high-stakes, irreversible, or security-critical output; spot-check proven configurations; automate the cheap checks. The two failure modes it avoids are skipping checks entirely and verification theater.; Avoid: "verify everything equally" — that destroys the productivity benefit on throwaway work.; Source: trust-without-verify
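
Calibrated verification can be read as a small decision rule. The sketch below is one illustrative way to write it down, not a prescribed implementation: the `Change` fields, the `Effort` tiers, and the fallback tier for everything else are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Effort(Enum):
    # Hypothetical tier names; the glossary only names the three behaviors.
    FULL_REVIEW = "always verify"
    SPOT_CHECK = "spot-check"
    AUTOMATED_ONLY = "automated cheap checks only"


@dataclass(frozen=True)
class Change:
    # Hypothetical descriptors of a piece of agent output.
    high_stakes: bool = False
    irreversible: bool = False
    security_critical: bool = False
    proven_configuration: bool = False


def verification_effort(change: Change) -> Effort:
    """Match verification effort to stakes instead of checking everything equally."""
    # Stakes win over track record: a proven setup still gets full review
    # when the output is high-stakes, irreversible, or security-critical.
    if change.high_stakes or change.irreversible or change.security_critical:
        return Effort.FULL_REVIEW
    if change.proven_configuration:
        return Effort.SPOT_CHECK
    # Assumption: everything else (e.g. throwaway work) relies on the
    # automated cheap checks alone.
    return Effort.AUTOMATED_ONLY


print(verification_effort(Change(irreversible=True, proven_configuration=True)))
print(verification_effort(Change(proven_configuration=True)))
print(verification_effort(Change()))
```

The ordering of the branches is the design choice: stakes are tested first, so a good track record can never downgrade an irreversible change to a spot-check.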

Guardrails & Ledgers

Deterministic guardrail · hard check: A check that passes or fails for every output, runs regardless of the model's choices, and cannot be reasoned around — a linter, schema validator, CI gate, or hook. Prompts guide behavior; guardrails enforce output properties. They check properties, not intent, so make them specific.; Avoid: putting a checkable, must-hold property in a prompt — that's a suggestion, not a guarantee.; Source: deterministic-guardrails
Mechanical enforcement: Making a rule violation impossible or immediately visible, rather than asking the model to avoid it. The harness, not the model, runs the check at a fixed point (pre-commit, CI, PostToolUse). The model gets no vote.; Source: deterministic-guardrails
Verification ledger · verification log, evidence log: A structured record where every verification step is an INSERT (tool, command, exit code, output) and the evidence bundle is a SELECT, never agent prose. If the INSERT did not happen, the verification did not happen. This guarantee holds only when execution and recording are separated; otherwise the agent can fake the rows.; Avoid: letting the same agent run the check and write its own row. Separate the two (CI, harness, hook).; Source: verification-ledger
Baseline capture: Recording the system's check state before any change (phase = 'baseline') so a check that passed before and fails after is attributable as a regression the agent introduced — not a pre-existing failure.; Source: verification-ledger
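
The verification ledger and baseline capture can be sketched together with an in-memory SQLite table. This is a minimal sketch under stated assumptions: the table name `verification_steps`, the `'after'` phase label, the helper names, and the sample commands and outputs are hypothetical. Only the recorded fields (tool, command, exit code, output) and `phase = 'baseline'` come from the entries above.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger; a real one would live outside the agent's reach."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE verification_steps (
            id        INTEGER PRIMARY KEY,
            phase     TEXT NOT NULL,     -- 'baseline' or (assumed label) 'after'
            tool      TEXT NOT NULL,
            command   TEXT NOT NULL,
            exit_code INTEGER NOT NULL,
            output    TEXT NOT NULL
        )
        """
    )
    return conn


def record_step(conn, phase, tool, command, exit_code, output):
    """The INSERT. Meant to be called by the harness or CI, never by the agent."""
    conn.execute(
        "INSERT INTO verification_steps (phase, tool, command, exit_code, output) "
        "VALUES (?, ?, ?, ?, ?)",
        (phase, tool, command, exit_code, output),
    )
    conn.commit()


def evidence_bundle(conn, phase):
    """The SELECT. Evidence is whatever rows exist, not what the agent says."""
    return conn.execute(
        "SELECT tool, command, exit_code, output FROM verification_steps "
        "WHERE phase = ? ORDER BY id",
        (phase,),
    ).fetchall()


def regressions(conn):
    """Commands that passed at baseline and fail after the change."""
    rows = conn.execute(
        """
        SELECT a.command
        FROM verification_steps AS b
        JOIN verification_steps AS a ON a.command = b.command
        WHERE b.phase = 'baseline' AND b.exit_code = 0
          AND a.phase = 'after'    AND a.exit_code != 0
        ORDER BY a.id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Made-up sample rows; the script names and outputs are placeholders.
ledger = open_ledger()
record_step(ledger, "baseline", "tests", "./run_tests.sh", 0, "placeholder output")
record_step(ledger, "baseline", "lint", "./run_lint.sh", 1, "placeholder output")
record_step(ledger, "after", "tests", "./run_tests.sh", 1, "placeholder output")
record_step(ledger, "after", "lint", "./run_lint.sh", 1, "placeholder output")

print(regressions(ledger))  # ['./run_tests.sh']
```

The lint check fails in both phases, so it is reported as pre-existing rather than as a regression. Only the test command, which passed at baseline, is attributed to the change.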

Verifying As You Build

Incremental verification · check at each step: Verifying after each meaningful unit of work, not once at the end. Error cost grows with distance from the error: a wrong assumption at line 10 is a one-line fix if caught early and a cascade audit if caught late. The verifier must be more reliable than the generator (tests and compilers qualify; a flaky judge does not).; Avoid: checking after every line, because too small a unit suppresses exploration.; Source: incremental-verification
Pre-completion checklist · completion gate: A mandatory verification sequence that intercepts the agent's completion signal and blocks "done" until specific, observable items pass (planning, building, verification, fixing). Implemented as a hook (near-100% compliance) rather than a prompt (~70–90%), so context pressure can't cause it to be skipped. Cap retries; otherwise an unsatisfiable item deadlocks the gate.; Avoid: vague items like "check your work", which pass by surface form without verifying anything.; Source: pre-completion-checklists
Red-green-refactor with agents · tests as the spec: The TDD cycle run as separate agent invocations: write failing tests (red), pass them with minimal code (green), improve without changing behavior (refactor). Exit conditions are deterministic. "Do not change the tests" is the load-bearing constraint against reward-hacking; separate invocations prevent the implementation from bleeding into the tests (context pollution).; Avoid: "write tests and implement" in one session, because mixed-phase instructions produce tautological tests.; Source: red-green-refactor-agents
Chain-of-verification · CoVe, factored variant: A self-correction loop — draft, plan verification questions, answer each independently, revise — that helps coding agents only in the factored variant (each question answered in its own prompt, without the draft), and only over claims no external oracle covers. The mechanism is anti-anchoring. Naive self-correction flips 22–28% of correct code to wrong.; Avoid: joint or two-step variants on code — the verifier sees the draft and re-emits its hallucination.; Source: chain-of-verification-coding-agents
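
The retry cap on a pre-completion checklist can be sketched as a small loop. This is an illustrative sketch only: `completion_gate`, `fake_fix`, and the simulated `workspace` are hypothetical names, and a real gate would run inside a hook with checks that shell out to actual tools.

```python
from typing import Callable

Check = Callable[[], bool]


def completion_gate(
    checks: dict[str, Check],
    attempt_fix: Callable[[list[str]], None],
    max_retries: int = 3,
) -> tuple[bool, list[str]]:
    """Return (True, []) once every item passes, else (False, failing items)."""
    failing: list[str] = []
    for attempt in range(max_retries + 1):
        # Re-run every item each round; a fix may break something else.
        failing = [name for name, check in checks.items() if not check()]
        if not failing:
            return True, []
        if attempt < max_retries:
            attempt_fix(failing)
    # Retries exhausted: stop and report instead of looping forever.
    return False, failing


# Simulated workspace: one item is fixable, one can never pass.
workspace = {"tests_green": False}
fix_calls: list[list[str]] = []


def fake_fix(failing: list[str]) -> None:
    """Stand-in for handing the failing items back to the agent."""
    fix_calls.append(failing)
    workspace["tests_green"] = True


fixable = {"unit tests exit 0": lambda: workspace["tests_green"]}
done, remaining = completion_gate(fixable, fake_fix, max_retries=2)
print(done, remaining)  # True []

stuck = {**fixable, "impossible item": lambda: False}
blocked, still_failing = completion_gate(stuck, fake_fix, max_retries=2)
print(blocked, still_failing)  # False ['impossible item']
```

Each item is a named, observable predicate rather than a vague instruction, and the gate returns the failing names so an unsatisfiable item surfaces for a human instead of deadlocking.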

Evaluating Behavior

Behavioral testing: Testing an agent's decision quality and end-state rather than its exact execution path, because identical inputs produce different valid outputs. Uses a capability matrix (unit-test deterministic parts, behaviorally test agentic parts), three grading methods, and a pass-rate threshold set as a product decision.; Avoid: equality checks on non-deterministic output — false negatives on correct work, false positives on lucky runs.; Source: behavioral-testing-agents
LLM-as-judge: Using a calibrated model with a structured rubric to grade open-ended output. Scalable, but carries documented positional, self-preference, and stylistic biases — keep it off mechanical checks (use code-based assertions), calibrate against human labels, and present version comparisons blind.; Avoid: an uncalibrated judge on mechanical properties — apparent quality gains vanish under test-based grading.; Source: behavioral-testing-agents · Scaffold, Not Vocabulary?
Golden Journey · restart-clean gate: A named, repeatable path through the running system with a per-step failure signal (a grep-able log line or exit code, not "test fails"), gating completion on the system restarting cleanly afterward. Maintain 3–7 per surface — representative, not exhaustive.; Avoid: conflating with "happy path" (nominal flow) or "golden path" (paved dev workflow) — neither enforces restart-clean.; Source: golden-journeys
Outcome grading · state-based grading: Evaluating an agent by the final state it produces ("is the system correct?"), not the sequence of steps it took. Deterministic tests are the most reliable outcome graders. Over-specification (pinning names, orderings, formatting) causes false negatives by embedding the author's implementation assumptions.; Avoid: outcome-only grading on side-effecting, compliance, or trace-as-deliverable tasks — the path matters there.; Source: grade-agent-outcomes
Macro evals: The population-level layer above per-trace evals: aggregate per-trace findings across a corpus to surface recurring behavior patterns no single run shows. Requires thousands of traces, judge precision above ~70%, and cross-trace structure; below that, a frequency table does the same job. Clusters are hypotheses, not diagnoses.; Avoid: reading clusters as full-system behavior — the analysis pool is selection-biased toward flagged traces.; Source: macro-evals-agentic-systems
Skill evals: Evaluating each skill as a dataset-graded unit on two axes — output quality and trigger precision — with paired with-skill/baseline runs in isolated contexts, shipping on the measured delta (pass-rate, time, tokens). Store evals/evals.json beside SKILL.md; split skills into capability uplift (retire if the model catches up) and encoded preference (check workflow fidelity).; Avoid: evals for single-user, highly subjective, or mid-rewrite skills — the harness cost exceeds the signal.; Source: skill-evals
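
Behavioral testing and outcome grading can be illustrated with a stub agent whose step order varies between runs while its end state stays correct. Everything here is a hypothetical stand-in: `stub_agent`, the step names, the state keys, and the 0.9 threshold are assumptions for illustration, not values from the course.

```python
import random
from typing import Callable


def pass_rate(
    run_agent: Callable[[random.Random], dict],
    grade: Callable[[dict], bool],
    trials: int,
    seed: int = 0,
) -> float:
    """Run the agent `trials` times and return the fraction of graded passes."""
    rng = random.Random(seed)  # seeded so the evaluation itself is repeatable
    passes = sum(grade(run_agent(rng)) for _ in range(trials))
    return passes / trials


def stub_agent(rng: random.Random) -> dict:
    """Stand-in for an agent: two independent steps done in either order."""
    steps = ["write_config", "update_readme"]
    rng.shuffle(steps)
    final_state = {"config_written": True, "readme_updated": True}
    return {"steps": steps, "final_state": final_state}


def grade_outcome(result: dict) -> bool:
    """Outcome grading: is the final state correct, whatever the path?"""
    return result["final_state"] == {"config_written": True, "readme_updated": True}


def grade_exact_path(result: dict) -> bool:
    """Over-specified grader: pins one ordering the task never required."""
    return result["steps"] == ["write_config", "update_readme"]


THRESHOLD = 0.9  # assumed value; the glossary treats this as a product decision

outcome_rate = pass_rate(stub_agent, grade_outcome, trials=50)
path_rate = pass_rate(stub_agent, grade_exact_path, trials=50)

print(f"outcome grading: {outcome_rate:.2f}, meets threshold: {outcome_rate >= THRESHOLD}")
print(f"exact-path grading: {path_rate:.2f}, meets threshold: {path_rate >= THRESHOLD}")
```

The exact-path grader fails runs that did correct work in a different order, which is the false-negative pattern the Behavioral testing and Outcome grading entries warn about.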