The working vocabulary for this course. Once a term lives here, every lesson uses this word for it. Grows as we go.
The Discipline
Harness · aliases: agent harness, scaffold
The non-model code around the model — the loop, tools, context management, delegation, safety, and orchestration. The model decides; the harness decides what the model can decide.
Avoid: "the agent" (that conflates model and harness) and "the framework" when you mean a specific permission mode.
The discipline of designing agent environments — legibility, mechanical enforcement, constrained solution spaces — so agents succeed by default. It subsumes prompt engineering; environment quality outweighs model choice.
Making a rule violation impossible or immediately visible, rather than asked-for. Linters, structural tests, CI gates, and hooks run regardless of the model's choices.
Avoid: "validation" alone — the point is that the harness, not the model, runs it.
A ~100-line AGENTS.md that indexes into a versioned docs/ directory rather than inlining knowledge — the fix for context crowding, attention dilution, and instant rot.
Avoid: "the instruction file" as if size doesn't matter — it's a context-budget decision.
The level a rule is written at. The right altitude says how to reason, not what to decide per case — strong heuristics that generalize. Too brittle enumerates; too vague constrains nothing.
Avoid: "be more specific" — the brittle failure is over-specific; the lever is principle vs. case.
Aggregate instruction load — not file count — drives degradation: more simultaneous rules make the agent less likely to follow any one. Accuracy reaches only ~68% at a density of 500 instructions.
Deterministic automation the harness runs at a fixed lifecycle point (PreToolUse, PostToolUse, Stop…). Input arrives as JSON on stdin; exit code 2 blocks the action for blocking events.
Avoid: "a hook is like an instruction" — an instruction is a should-do, a hook is a must-do.
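An added example (not from the course) of a PreToolUse hook that does what the entry describes: it reads the JSON payload from stdin, runs deterministic checks, and exits 2 to block. It also shows mechanical enforcement and blast-radius containment in one script: a shell-command denylist plus a write-path scope. The payload field names (tool_name, tool_input.command, tool_input.file_path), the /workspace/ scope and the denylist are assumptions for this example.

```python
#!/usr/bin/env python3
"""PreToolUse hook: block destructive or out-of-scope actions mechanically.

Added example. Assumes the harness passes JSON on stdin shaped like
{"hook_event_name": "PreToolUse", "tool_name": "Bash",
 "tool_input": {"command": "..."}} (field names are an assumption) and treats
exit code 2 as "block the action" and exit code 0 as "allow".
"""
import json
import re
import sys

ALLOW, BLOCK = 0, 2

# Patterns the harness refuses regardless of what the model argues for.
# Constructed for this example; a real deployment would version these in docs/.
DENY_PATTERNS = [
    (re.compile(r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b"), "recursive force delete"),
    (re.compile(r"\bgit\s+push\b.*\s(--force|-f)\b"), "force push"),
    (re.compile(r"\bgit\s+reset\s+--hard\b"), "hard reset discards work"),
    (re.compile(r"\bcurl\b.*\|\s*(ba)?sh\b"), "pipe remote script to shell"),
]
# Only paths under the task's scope may be written (blast-radius containment).
WRITE_SCOPE = "/workspace/"  # assumed scope for this example


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, reason). Deterministic: no model in the loop."""
    tool = payload.get("tool_name", "")
    args = payload.get("tool_input", {}) or {}
    if tool == "Bash":
        cmd = args.get("command", "")
        for pat, why in DENY_PATTERNS:
            if pat.search(cmd):
                return BLOCK, f"blocked: {why}"
    elif tool in ("Write", "Edit"):
        path = args.get("file_path", "")
        if not path.startswith(WRITE_SCOPE):
            return BLOCK, f"blocked: write outside scope {WRITE_SCOPE}"
    return ALLOW, "allowed"


def main() -> None:
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        # Fail closed: a malformed payload is blocked, not waved through.
        print("blocked: unreadable hook payload", file=sys.stderr)
        sys.exit(BLOCK)
    code, reason = decide(payload)
    if code != ALLOW:
        print(reason, file=sys.stderr)  # the harness surfaces stderr to the agent
    sys.exit(code)


if __name__ == "__main__":
    main()
```

The decision lives in decide() so the policy can be unit-tested without a live session; the script's only job at runtime is to map that decision to the exit code the harness understands.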
An ephemeral, isolated agent that runs a focused task in a fresh context window and returns only its final result. Isolation is structural — the parent never sees its intermediate reasoning.
Avoid: using sub-agents where coordination is needed — they can't talk to each other; that's an agent team.
An operation the agent takes outside the authorized scope on a benign task — deleting a file you never mentioned. An authorization failure, not a capability failure. Driven more by permission framework than by model.
Avoid: "the model went rogue" — it's pattern-matching on consent text, not malice.
A permission framework that interposes a deterministic consent checkpoint between proposal and execution. It doesn't improve the model's judgment — it denies a misjudgment the chance to act.
A deterministic check that decides whether "done" is true — typically tests passing — wired into a Stop hook or CI. Gate on outcome evidence (diffs, exit codes, test output), never on the agent's self-report.
Avoid: "the agent says it's done" — a checkpoint that reads the agent's narration is not a checkpoint.
Verifying after each meaningful unit of work, not once at the end. Error cost grows with distance from the error — a wrong assumption at line 10 is a one-line fix early, a cascade audit late.
An agent whose run survives session boundaries, sandbox crashes, and human pauses by moving state out of the context window into durable artifacts. Forced by three walls: finite context, no persistent state, unreliable self-grading.
Stateless harness, disposable sandbox · cattle, not pets
The harness holds no run state and the per-session sandbox is destroyed after use, so crash recovery becomes architectural — any instance resumes from the durable session log.
Authoring scaffolding assuming a future model will subsume the capability — architect for cheap removal, not elegance. Structural mechanisms (sandboxing, permissions, gates) are exempt; they stay valuable as capability rises.
Structuring an agent definition in two layers — a small always-loaded definition (identity, scope, quality bar, skill references) plus skills loaded on demand — so irrelevant knowledge never enters the context window. Cuts per-task tokens and compounds across sub-agent fan-out.
Avoid: "just trim the prompt" — the lever is where knowledge lives, not how terse it reads.
A self-contained, on-demand unit of how-to knowledge — procedures, checklists, templates — loaded only when a task needs it, via a portable SKILL.md entrypoint. The detailed layer beneath a minimal agent definition.
Avoid: a skill that implicitly depends on another being loaded first — that breaks self-containment.
Two separable concerns: a command owns the workflow (what steps run, in what order, to whom they delegate); an agent owns the expertise (role, quality bar, skills). Separated, one agent serves many commands and either side changes without touching the other.
Avoid: a monolithic command that inlines every agent's instructions — the anti-pattern this split removes.
A read-only permission mode that blocks all writes during exploration, forcing the agent to explore and propose a reviewable plan before it modifies anything. Fixing a plan costs minutes; fixing a bad implementation costs context, tokens, and reverts.
Allocating extra-high reasoning compute to planning and verification and reduced compute to execution, rather than one uniform level throughout. Scored highest on Terminal-Bench 2.0 (66.5%), beating uniform-high (63.6%) and continuous-max (53.9%, timeout-penalized).
Avoid: "more thinking is always better" — uniform max compute on execution causes timeouts.
An orchestrator decomposes a task into independent subtasks, dispatches them to parallel workers with scoped tools, and synthesizes results. Workers don't coordinate. Pays off on genuinely independent work; costs ~15× the tokens of chat.
Avoid: fanning out a sequentially-dependent task — that needs chaining, not parallelism.
A generator produces output and a separate evaluator returns a structured verdict, looping until PASS or a round cap. Effective when the bar is machine-checkable and the generator is weak; on a near-perfect baseline the critic invents flaws (the self-critique paradox).
Treating recovery cost as a first-class constraint — choose the one-command undo before choosing the action, and keep work on reversible primitives (branches, draft PRs, comments). External side effects (email, payments) can't be undone; gate them instead.
An operation whose second run produces the same end state as the first — no duplicate branches, comments, or compounded errors. Built from check-before-act, upsert-over-create, and unique keys; guard each artifact, not the whole workflow.
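An added example (not from the course) of check-before-act and upsert-over-create, guarding each artifact with its own unique key. The in-memory FakeForge stands in for a Git hosting API; the review marker and branch naming are assumptions for this example.

```python
"""Idempotent 'ensure' operations for agent side effects.

Added example. FakeForge is a constructed stand-in for a Git hosting API; the
point is that every artifact is guarded by a unique key, so re-running the
whole workflow after a crash creates nothing twice and compounds no errors.
"""


class FakeForge:
    def __init__(self):
        self.branches = set()
        self.comments = {}   # (pr, marker) -> body; the unique key per comment
        self.calls = []      # audit log of mutating calls actually issued

    def ensure_branch(self, name):
        """Check-before-act: create only if absent. Returns True if created."""
        if name in self.branches:
            return False
        self.calls.append(("create_branch", name))
        self.branches.add(name)
        return True

    def upsert_comment(self, pr, marker, body):
        """Upsert-over-create keyed on (pr, marker); never posts a duplicate."""
        key = (pr, marker)
        if self.comments.get(key) == body:
            return "unchanged"
        action = "update" if key in self.comments else "create"
        self.calls.append((f"{action}_comment", pr, marker))
        self.comments[key] = body
        return action


def publish_review(forge, pr, findings):
    """The workflow an agent runs; safe to re-run after a crash part-way through.

    Returns the total number of mutating calls made so far, so a second run
    with the same input can be checked to have added none.
    """
    forge.ensure_branch(f"agent/review-{pr}")           # assumed naming scheme
    body = "\n".join(f"- {f}" for f in findings) or "- no findings"
    forge.upsert_comment(pr, "<!-- agent-review -->", body)  # assumed marker
    return len(forge.calls)
```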
A mid-run message that redirects a live agent without discarding the context it has built — distinct from a restart, which throws that context away. Use for recoverable drift; a queued follow-up adjusts the next step instead of interrupting.
Avoid: steering a fundamentally-wrong run repeatedly — restart with a cleaner prompt is cheaper.
Blast-radius containment · least privilege, permission scoping
Granting an agent only the permissions its task requires, so the damage a mistake or injection can cause is bounded by construction. Frames risk as risk = likelihood × damage: permission rules push down likelihood, the sandbox pushes down damage. Tool restrictions are runtime-enforced — the model cannot invoke a tool that was never wired in.
Avoid: "the model is safe now" — scoping bounds per-action damage, not the model's judgment or time-integrated harm.
The runtime boundary limiting what an agent process can reach — filesystem, network, kernel. Three families trade isolation against startup cost: containers (kernel-shared, fast, weakest), microVMs (hypervisor-isolated, ~125 ms boot, strong), OS-level isolators (no daemon, fastest, weak on escape). Necessary but not sufficient — a capable agent can reason around it.
Avoid: treating the sandbox as a complete defense — it is the outermost layer, not the only one.
Layering multiple independent safety mechanisms so no single failure compromises behavior — prompt guardrails, schema restrictions, runtime approvals, tool validation, lifecycle hooks. Each layer assumes the others will fail and catches what they miss.
Replacing accumulated conversation history with a dense summary to free the context window while preserving task intent and state. Done manually at phase seams and before hard reasoning — earlier than the ~95% auto-trigger, which fires after the agent has spent most of the session in the dumb zone. Offload large payloads to disk, then summarize, to keep it recoverable rather than lossy.
Avoid: "compaction is just cleanup" — it is reasoning-quality preservation, not memory hygiene.
The region of context fill where output quality degrades — a gradient, not a cliff, appearing across all models. Onset is closer to an absolute token threshold (~32K–100K) than a fixed percentage; reasoning tasks effectively use only 10–20% of a long window.
Routing each task to the cheapest model tier that meets its complexity — fast for exploration, balanced for implementation, powerful for architecture — and escalating only when a cheap deterministic gate (tests, linter, type check) fails. Cascade routing approximates FrugalGPT savings without native tooling.
A stop triggered when an agent loop stalls — iteration limit, repeated failure, repetition, context budget, or cost threshold. Runtime enforcement (maxTurns, cost budgets) cannot be overridden by the model; instruction-level checks can. On trip, degrade gracefully: return partial results and explain the stop.
Avoid: setting it so aggressively it trips on legitimate multi-step work — the signal is cost without progress, not cost alone.
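An added example (not from the course) of a circuit breaker whose trip signal is cost without progress: it caps turns and spend, but also stops on repeated identical observations or consecutive failures, and returns partial results with a reason instead of a bare abort. The agent loop is a scripted stand-in; the thresholds are assumptions for this example.

```python
"""Circuit breaker for an agent loop: trip on stall, not on cost alone.

Added example. Each loop step reports (cost_usd, observation, failed); the
scripted steps passed to run() stand in for a real model loop. The breaker is
runtime-enforced: the loop consults it after every step and the model cannot
reach its counters.
"""
from dataclasses import dataclass, field


@dataclass
class Breaker:
    max_turns: int = 50          # assumed limits for this example
    max_cost_usd: float = 5.0
    max_stall: int = 3           # identical observations in a row before tripping
    turns: int = 0
    cost_usd: float = 0.0
    seen: list = field(default_factory=list)
    consecutive_fail: int = 0

    def record(self, cost_usd, observation, failed=False):
        """Update counters; return None to continue or a stop-reason string."""
        self.turns += 1
        self.cost_usd += cost_usd
        self.consecutive_fail = self.consecutive_fail + 1 if failed else 0
        self.seen.append(hash(observation))
        if self.turns >= self.max_turns:
            return f"iteration limit {self.max_turns} reached"
        if self.cost_usd >= self.max_cost_usd:
            return f"cost budget ${self.max_cost_usd:.2f} exhausted"
        recent = self.seen[-self.max_stall:]
        if len(recent) == self.max_stall and len(set(recent)) == 1:
            return f"no progress: same observation {self.max_stall} turns in a row"
        if self.consecutive_fail >= self.max_stall:
            return f"{self.max_stall} consecutive failures"
        return None


def run(steps, breaker=None):
    """Drive a scripted loop; return (status, partial_results, reason).

    Long legitimate work that keeps producing new observations runs to 'done';
    a stalled loop is stopped early and its partial results are kept.
    """
    breaker = breaker or Breaker()
    results = []
    for cost, obs, failed in steps:
        results.append(obs)
        reason = breaker.record(cost, obs, failed)
        if reason:
            return "stopped", results, reason   # degrade gracefully
    return "done", results, None
```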
Pinning the model, removing one harness subsystem at a time (instructions, tools, environment, state, feedback), rerunning the benchmark, and recording the drop. The per-subsystem drop table ranks investment priority; near-zero drops mark simplification candidates. The same-model constraint converts the score delta into a measure of environmental marginal product.
Avoid: reading a drop as a precise quantity — components interact non-additively, so ablation ranks rather than measures.
Local search over harness configuration: run a baseline eval, change one variable, re-score, keep the change if the score improves, repeat — the eval score as the gradient, no model change. One change per iteration keeps the delta attributable and rollback unambiguous. Tune on one set, validate on a held-out set, treat production as ground truth.
Avoid: tuning against the validation set — that measures the fixture, not real capability, and overfits the harness.