The working vocabulary for this course. Once a term lives here, every lesson uses this word for it. Grows as we go.
The Trust Problem
Trust without verify · the trust trap
Accepting agent output as correct because it looks polished — fluent prose, inline citations, code that compiles. None of those surface signals correlate reliably with correctness; fluency is a separate objective from accuracy, and an agent is most dangerous when almost right.
Avoid: "it looks right" as a verification method — looking right is the failure mode, not the check.
An agent's own prose claim that a check passed ("Build passed. Tests green."). Unfalsifiable within the conversation — the agent may hallucinate a pass, skip a step silently, or assert a result without running the tool. A checkpoint that reads the agent's narration is not a checkpoint.
Matching verification effort to stakes rather than applying universal paranoia: always verify high-stakes, irreversible, or security-critical output; spot-check proven configurations; automate the cheap checks. The two failure modes it avoids are skipping checks entirely and verification theater.
Avoid: "verify everything equally" — that destroys the productivity benefit on throwaway work.
A check that passes or fails for every output, runs regardless of the model's choices, and cannot be reasoned around — a linter, schema validator, CI gate, or hook. Prompts guide behavior; guardrails enforce output properties. They check properties, not intent, so make them specific.
Avoid: putting a checkable, must-hold property in a prompt — that's a suggestion, not a guarantee.
Making a rule violation impossible or immediately visible rather than asked-for. The harness — not the model — runs the check, at a fixed point (pre-commit, CI, PostToolUse). The model gets no vote.
A structured record where every verification step is an INSERT (tool, command, exit code, output) and the evidence bundle is a SELECT — never agent prose. If the INSERT did not happen, the verification did not happen. Holds only when execution and recording are separated, or the agent can fake the rows.
Avoid: letting the same agent run the check and write its own row — separate the two (CI, harness, hook).
Recording the system's check state before any change (phase = 'baseline') so a check that passed before and fails after is attributable as a regression the agent introduced — not a pre-existing failure.
Verifying after each meaningful unit of work, not once at the end. Error cost grows with distance from the error — a wrong assumption at line 10 is a one-line fix early, a cascade audit late. The verifier must be more reliable than the generator (tests, compilers qualify; a flaky judge does not).
Avoid: checking after every line — too small a unit suppresses exploration.
A mandatory verification sequence that intercepts the agent's completion signal and blocks "done" until specific, observable items pass — planning, building, verification, fixing. Implemented as a hook (~near-100% compliance) rather than a prompt (~70–90%), so context pressure can't skip it. Cap retries or an unsatisfiable item deadlocks.
Avoid: vague items like "check your work" — they pass by surface form without verifying anything.
Red-green-refactor with agents · tests as the spec
The TDD cycle run as separate agent invocations: write failing tests (red), pass them with minimal code (green), improve without changing behavior (refactor). Exit conditions are deterministic. "Do not change the tests" is the load-bearing constraint against reward-hacking; separate invocations prevent the implementation bleeding into the tests (context pollution).
Avoid: "write tests and implement" in one session — mixed-phase instructions produce tautological tests.
A self-correction loop — draft, plan verification questions, answer each independently, revise — that helps coding agents only in the factored variant (each question answered in its own prompt, without the draft), and only over claims no external oracle covers. The mechanism is anti-anchoring. Naive self-correction flips 22–28% of correct code to wrong.
Avoid: joint or two-step variants on code — the verifier sees the draft and re-emits its hallucination.
Testing an agent's decision quality and end-state rather than its exact execution path, because identical inputs produce different valid outputs. Uses a capability matrix (unit-test deterministic parts, behaviorally test agentic parts), three grading methods, and a pass-rate threshold set as a product decision.
Avoid: equality checks on non-deterministic output — false negatives on correct work, false positives on lucky runs.
Using a calibrated model with a structured rubric to grade open-ended output. Scalable, but carries documented positional, self-preference, and stylistic biases — keep it off mechanical checks (use code-based assertions), calibrate against human labels, and present version comparisons blind.
Avoid: an uncalibrated judge on mechanical properties — apparent quality gains vanish under test-based grading.
A named, repeatable path through the running system with a per-step failure signal (a grep-able log line or exit code, not "test fails"), gating completion on the system restarting cleanly afterward. Maintain 3–7 per surface — representative, not exhaustive.
Avoid: conflating with "happy path" (nominal flow) or "golden path" (paved dev workflow) — neither enforces restart-clean.
Evaluating an agent by the final state it produces ("is the system correct?"), not the sequence of steps it took. Deterministic tests are the most reliable outcome graders. Over-specification (pinning names, orderings, formatting) causes false negatives by embedding the author's implementation assumptions.
Avoid: outcome-only grading on side-effecting, compliance, or trace-as-deliverable tasks — the path matters there.
The population-level layer above per-trace evals: aggregate per-trace findings across a corpus to surface recurring behavior patterns no single run shows. Requires thousands of traces, judge precision above ~70%, and cross-trace structure; below that, a frequency table does the same job. Clusters are hypotheses, not diagnoses.
Avoid: reading clusters as full-system behavior — the analysis pool is selection-biased toward flagged traces.
Evaluating each skill as a dataset-graded unit on two axes — output quality and trigger precision — with paired with-skill/baseline runs in isolated contexts, shipping on the measured delta (pass-rate, time, tokens). Store evals/evals.json beside SKILL.md; split skills into capability uplift (retire if the model catches up) and encoded preference (check workflow fidelity).
Avoid: evals for single-user, highly subjective, or mid-rewrite skills — the harness cost exceeds the signal.