Harness Engineering · ~10 min
Nineteen lessons, one reflex: read the symptom, reach for the harness move. This is the whole course as a lookup table — and a mixed review to prove it stuck.
The through-line of the whole course: agent failure is a signal about the environment. Diagnose the gap, make the smallest durable change, and bank it in the repo. Here's the diagnostic map.
| Symptom | Move | From |
|---|---|---|
| Agent can't find a convention it should know | Add a pointer in AGENTS.md to a docs/ file — don't inline the whole thing | L2 |
| Your instruction file keeps growing, one rule per mistake | Tag rules with source/applicability/expiry; raise altitude; prune on audit | L2 |
| A rule is "usually followed" but must always hold | Promote it to a PreToolUse hook — exit 2 to block | L3 |
| A long sub-task is flooding the main thread with noise | Delegate to a scoped sub-agent; only the result comes back | L4 |
| The agent edited/deleted something outside its scope | Set the permission framework: allow/deny rules or a hard deny floor | L5 |
| The agent declared "done" on broken code | Add a Stop-hook completion gate keyed to tests passing | L6 |
| A wrong assumption cascaded through 500 lines | Verify per unit — fast PostToolUse typecheck between steps | L6 |
| The task can't finish in one session | External done-condition + durable log + stateless harness; resume, don't restart | L7 |
| A bloated agent definition loads on every run, mostly irrelevant | Progressive disclosure — tiny definition, detailed skills loaded on demand | L8 |
| Pipeline edits keep forcing edits to the expert agent | Split the workflow into a command; keep expertise in the agent | L9 |
| The agent dives into a multi-file change on a wrong assumption | Plan Mode — read-only explore + reviewable plan before any write | L10 |
| Uniform max reasoning times out; uniform low misses risks | Reasoning sandwich — extra-high at plan/verify, high at execution | L11 |
| One agent is slow on an independently-decomposable task | Orchestrator-worker — parallel workers, then synthesize; budget ~15× tokens | L12 |
| Output needs iterative refinement against a checkable bar | Evaluator-optimizer loop with a round cap — but skip it on a strong baseline | L13 |
| An agent mistake is expensive to undo, or a re-run duplicates state | Rollback-first + idempotency — one-command undo, check-before-act | L14 |
| A live run has drifted onto the wrong file | Steer if recoverable; restart with a cleaner prompt if fundamentally wrong | L15 |
| A permission rule fails, an injection lands, or the model misbehaves | Sandbox + scoped grant — bound the damage, not just the likelihood; microVM for untrusted code | L16 |
| A long session keeps reasoning in the context dumb zone | Compact at the seams with a focus directive; lower the auto-trigger to 50–60% | L17 |
| Spend runs up, or a stuck agent loops without progress | Route by complexity; trip a runtime circuit breaker (maxTurns, cost budget) | L18 |
| You changed the harness but can't tell whether it helped | Ablate to rank subsystems, then hill-climb one variable against a held-out eval | L19 |
The early lessons turn on the difference between mechanical enforcement and hope. Hooks, permission rules, and verification gates are run by the harness at fixed points, so the model gets no vote. Instruction files and altitude shape what the model tends to do; the deterministic layer decides what it can do.
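As a minimal sketch of that deterministic layer, here is a PreToolUse-style hook decision in Python. It assumes the harness hands the hook a JSON event on stdin and treats exit code 2 as "block", as the table row for L3 describes. The field names (`tool_name`, `tool_input.file_path`), the tool names, and the protected prefixes are illustrative assumptions, not a confirmed schema.

```python
import json
import sys

# Hypothetical deny floor: path prefixes the agent must never write to.
PROTECTED_PREFIXES = ("migrations/", ".env", "infra/prod/")

# Assumed names of the tools that write files.
WRITE_TOOLS = {"Write", "Edit"}


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message); exit code 2 means "block this tool call"."""
    if event.get("tool_name") not in WRITE_TOOLS:
        return 0, ""  # reads and other tools pass through untouched
    path = str(event.get("tool_input", {}).get("file_path", ""))
    normalized = path.removeprefix("./")
    for prefix in PROTECTED_PREFIXES:
        if normalized.startswith(prefix):
            return 2, f"Blocked: {path} is under protected prefix {prefix!r}"
    return 0, ""


def main(stdin=sys.stdin, stderr=sys.stderr) -> int:
    """Read one event, print the reason for a block to stderr, return the exit code."""
    code, message = decide(json.load(stdin))
    if message:
        print(message, file=stderr)
    return code
```

To use it as a hook script, end the file with `sys.exit(main())`. The point of the design is that the rule is evaluated by code on every matching tool call, so it holds whether or not the model remembers the instruction.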
The later parts add the same reflex at new layers. Composition (L8–L9): put each piece of knowledge where it's loaded only when needed — skills under an agent, workflow split from expertise. Reasoning & planning (L10–L11): spend exploration and compute where ambiguity is highest, before execution locks in. Multi-agent loops (L12–L13) and reversibility & control (L14–L15): scale out only when the task decomposes or the baseline is weak, and keep every action undoable, re-runnable, and steerable. Containment & limits (L16–L18): three boundaries the harness enforces by construction — a sandbox caps the blast radius, compaction caps context before it rots, a circuit breaker caps the spend. Measuring the harness (L19): pin the model, ablate to rank subsystems, hill-climb one variable against an eval — the loop that turns "environment beats model" from a claim into a number. Each is the same move — shape the environment, not just the prompt — applied at a different layer.
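One of those boundaries, the circuit breaker, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any particular harness's implementation: `CircuitBreaker`, `BreakerTripped`, `run`, and the `stall_limit` default are hypothetical names and values, and `step` stands in for one agent turn that reports its cost, whether it made progress, and whether it is done.

```python
from dataclasses import dataclass


class BreakerTripped(RuntimeError):
    """Raised when the harness refuses to start another turn."""


@dataclass
class CircuitBreaker:
    max_turns: int           # hard cap on turns (cf. maxTurns in the table)
    cost_budget_usd: float   # hard cap on spend
    stall_limit: int = 3     # hypothetical: consecutive no-progress turns allowed
    turns: int = 0
    spent_usd: float = 0.0
    stalled: int = 0

    def before_turn(self) -> None:
        """Check every limit before the next turn is allowed to start."""
        if self.turns >= self.max_turns:
            raise BreakerTripped(f"turn limit reached ({self.max_turns})")
        if self.spent_usd >= self.cost_budget_usd:
            raise BreakerTripped(f"cost budget exhausted (${self.spent_usd:.2f})")
        if self.stalled >= self.stall_limit:
            raise BreakerTripped(f"no progress for {self.stalled} turns")

    def after_turn(self, cost_usd: float, made_progress: bool) -> None:
        """Record what the finished turn cost and whether it moved the task forward."""
        self.turns += 1
        self.spent_usd += cost_usd
        self.stalled = 0 if made_progress else self.stalled + 1


def run(step, breaker: CircuitBreaker) -> str:
    """Drive `step` until it reports done or the breaker trips."""
    try:
        while True:
            breaker.before_turn()
            cost_usd, made_progress, done = step()
            breaker.after_turn(cost_usd, made_progress)
            if done:
                return "done"
    except BreakerTripped as reason:
        return f"stopped: {reason}"
```

Because the checks run in the harness loop and not in the prompt, a stuck agent cannot argue its way past them. For example, a step that never makes progress is cut off after three turns even with a generous turn and cost budget.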
Every lesson had a backfire box, and they rhyme: the harness is investment that pays off across many sessions, not one. For prototypes, short-lived tools, and throwaway code, custom linters, layered docs, and resume machinery cost more than they return. Build the guardrail when the failure is recurring and the codebase is durable — not on reflex.
Mixed review — across all nineteen lessons
Question 1 · from L1: The core reframe of harness engineering is that agent failure is…
Question 2 · from L16: Sandboxing contains an agent by bounding the…
Question 3 · from L17: Manual compaction beats waiting for the 95% auto-trigger because by then the agent has…
Question 4 · from L18: For a safety-critical stop, the most reliable circuit breaker is enforced by the…
Question 5 · from L19: To measure a harness change cleanly, isometric ablation requires that you hold fixed the…
settings.json (hooks + permission rules), or wire up a long-running harness with an initializer
and a coding agent. Or revisit any lesson — the mechanics compound when you use them together.