Capstone

Harness Engineering · ~10 min

The Harness Decision Table

Nineteen lessons, one reflex: read the symptom, reach for the harness move. This is the whole course as a lookup table — and a mixed review to prove it stuck.

Why this, for you: harness engineering is a habit, not a checklist. When an agent misbehaves, the instinct is to rewrite the prompt. This table retrains that instinct: most agent symptoms map to a specific, durable change in the environment — one that fixes the problem for every future session, not just this one.

The through-line of the whole course: agent failure is a signal about the environment. Diagnose the gap, make the smallest durable change, and bank it in the repo. Here's the diagnostic map.

1 Symptom → move

| Symptom | Move | From |
| --- | --- | --- |
| Agent can't find a convention it should know | Add a pointer in AGENTS.md to a docs/ file; don't inline the whole thing | L2 |
| Your instruction file keeps growing, one rule per mistake | Tag rules with source/applicability/expiry; raise altitude; prune on audit | L2 |
| A rule is "usually followed" but must always hold | Promote it to a PreToolUse hook that exits 2 to block | L3 |
| A long sub-task is flooding the main thread with noise | Delegate to a scoped sub-agent; only the result comes back | L4 |
| The agent edited/deleted something outside its scope | Set the permission framework: allow/deny rules or a hard deny floor | L5 |
| The agent declared "done" on broken code | Add a Stop-hook completion gate keyed to tests passing | L6 |
| A wrong assumption cascaded through 500 lines | Verify per unit: fast PostToolUse typecheck between steps | L6 |
| The task can't finish in one session | External done-condition + durable log + stateless harness; resume, don't restart | L7 |
| A bloated agent definition loads on every run, mostly irrelevant | Progressive disclosure: tiny definition, detailed skills loaded on demand | L8 |
| Pipeline edits keep forcing edits to the expert agent | Split the workflow into a command; keep expertise in the agent | L9 |
| The agent dives into a multi-file change on a wrong assumption | Plan Mode: read-only explore + reviewable plan before any write | L10 |
| Uniform max reasoning times out; uniform low misses risks | Reasoning sandwich: extra-high at plan/verify, high at execution | L11 |
| One agent is slow on an independently-decomposable task | Orchestrator-worker: parallel workers, then synthesize; budget ~15× tokens | L12 |
| Output needs iterative refinement against a checkable bar | Evaluator-optimizer loop with a round cap, but skip it on a strong baseline | L13 |
| An agent mistake is expensive to undo, or a re-run duplicates state | Rollback-first + idempotency: one-command undo, check-before-act | L14 |
| A live run has drifted onto the wrong file | Steer if recoverable; restart with a cleaner prompt if fundamentally wrong | L15 |
| A permission rule fails, an injection lands, or the model misbehaves | Sandbox + scoped grant: bound the damage, not just the likelihood; microVM for untrusted code | L16 |
| A long session keeps reasoning in the context dumb zone | Compact at the seams with a focus directive; lower the auto-trigger to 50–60% | L17 |
| Spend runs up, or a stuck agent loops without progress | Route by complexity; trip a runtime circuit breaker (maxTurns, cost budget) | L18 |
| You changed the harness but can't tell whether it helped | Ablate to rank subsystems, then hill-climb one variable against a held-out eval | L19 |
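
To make the L3 row concrete, here is a minimal sketch of a blocking PreToolUse hook. It assumes the harness passes the pending tool call as JSON on stdin and treats exit code 2 as "block", with stderr shown to the agent. The payload field names (`tool_name`, `tool_input`, `file_path`), the tool names, and the protected paths are illustrative assumptions, not something the course text specifies.

```python
import json
import sys

# Hypothetical protected prefixes; replace with the paths your rule covers.
PROTECTED_PREFIXES = ("migrations/", ".env")

# Assumed names of the tools that write to disk.
WRITE_TOOLS = {"Edit", "Write"}


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one pending tool call.

    Assumed payload shape:
    {"tool_name": str, "tool_input": {"file_path": str}}
    """
    if event.get("tool_name") not in WRITE_TOOLS:
        return 0, ""  # not a write: nothing to enforce
    path = event.get("tool_input", {}).get("file_path", "")
    if path.startswith(PROTECTED_PREFIXES):
        return 2, f"Blocked: {path} is protected; propose the change instead."
    return 0, ""


def main() -> int:
    """Read one event from stdin and report the decision via stderr + exit code."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code
```

Saved as a script, the entry point would be `sys.exit(main())`. Keeping the decision in a pure function (`decide`) means the rule itself can be unit-tested without a running agent.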

2 The one rule under all of it

Reach for the environment before the prompt — and pick the mechanism by how non-negotiable the rule is. Bendable guidance → an instruction. Must-hold, checkable → a deterministic hook, allow/deny rule, or gate. A prompt fix helps one session; a committed harness fix helps every session after it.
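
The rule above can be read as a two-question triage. The sketch below is one hypothetical way to encode it; the `Rule` fields and the returned labels are invented for illustration, and the "must hold but not machine-checkable" branch is an assumption, since the rule as stated only covers the other two cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    text: str
    must_hold: bool  # True if a single violation is unacceptable
    checkable: bool  # True if a script can decide pass/fail


def pick_mechanism(rule: Rule) -> str:
    """Map a rule to the layer that should carry it."""
    if rule.must_hold and rule.checkable:
        # Non-negotiable and decidable by code: take it out of the model's hands.
        return "deterministic: hook, allow/deny rule, or gate"
    if rule.must_hold:
        # Assumption: not covered by the rule above; fall back to guidance plus review.
        return "instruction + human review"
    # Bendable guidance: shape the tendency, don't enforce it.
    return "instruction"


rules = [
    Rule("Prefer small functions", must_hold=False, checkable=False),
    Rule("Never edit files under migrations/", must_hold=True, checkable=True),
]
for r in rules:
    print(f"{r.text} -> {pick_mechanism(r)}")
```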

That's the difference between mechanical enforcement and hope. Hooks, permission rules, and verification gates are run by the harness at fixed points — the model gets no vote. Instruction files and altitude shape what the model tends to do; the deterministic layer decides what it can do.
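
A completion gate is a small example of "the model gets no vote". The sketch below assumes a Stop-style hook that blocks with exit code 2 (the same convention the table gives for PreToolUse hooks) and that the project's tests run from a single command. The function name, timeout, and output truncation are illustrative choices.

```python
import subprocess
import sys
from typing import Sequence


def completion_gate(test_cmd: Sequence[str], timeout_s: float = 600.0) -> tuple[int, str]:
    """Run the test command; return (0, "") to allow stopping, (2, reason) to block."""
    try:
        result = subprocess.run(
            list(test_cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return 2, f"Tests did not finish within {timeout_s:.0f}s; not done yet."
    if result.returncode != 0:
        # Hand back only the tail of the output so the agent sees the failure, not the noise.
        tail = (result.stdout + result.stderr)[-2000:]
        return 2, f"Tests are failing; keep working.\n{tail}"
    return 0, ""


def main(test_cmd: Sequence[str]) -> int:
    """Report the gate decision via stderr and return the exit code."""
    code, message = completion_gate(test_cmd)
    if message:
        print(message, file=sys.stderr)
    return code
```

A hook script might end with something like `sys.exit(main(["pytest", "-q"]))`, where the test command is whatever the repo already uses. The agent's own claim of "done" never enters the decision; only the test exit status does.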

The later parts add the same reflex at new layers.

Composition (L8–L9): put each piece of knowledge where it's loaded only when needed: skills under an agent, workflow split from expertise.

Reasoning & planning (L10–L11): spend exploration and compute where ambiguity is highest, before execution locks in.

Multi-agent loops (L12–L13) and reversibility & control (L14–L15): scale out only when the task decomposes or the baseline is weak, and keep every action undoable, re-runnable, and steerable.

Containment & limits (L16–L18): three boundaries the harness enforces by construction. A sandbox caps the blast radius, compaction caps context before it rots, and a circuit breaker caps the spend.

Measuring the harness (L19): pin the model, ablate to rank subsystems, and hill-climb one variable against an eval. This is the loop that turns "environment beats model" from a claim into a number.

Each is the same move (shape the environment, not just the prompt) applied at a different layer.
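
As one illustration of a boundary enforced by construction, here is a minimal sketch of a runtime circuit breaker that caps both turns and spend. The class and method names and the dollar-based accounting are assumptions for illustration; a real harness may expose this as configuration (for example a `maxTurns` setting) rather than code.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised by the harness to stop a run; the model cannot override it."""


@dataclass
class CircuitBreaker:
    max_turns: int
    max_cost_usd: float
    turns: int = 0
    cost_usd: float = 0.0

    def allow_next_turn(self) -> None:
        """Call before every turn; raises once either cap has been reached."""
        if self.turns >= self.max_turns:
            raise BudgetExceeded(f"turn cap reached ({self.max_turns})")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded(f"cost cap reached (${self.max_cost_usd:.2f})")

    def record_turn(self, cost_usd: float) -> None:
        """Call after every turn with what that turn cost."""
        self.turns += 1
        self.cost_usd += cost_usd
```

In the agent loop, the harness would call `allow_next_turn()` before each model call and `record_turn(cost)` after it, so a stuck agent stops at the cap no matter what it outputs.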

Don't over-build the harness

Every lesson had a backfire box, and they rhyme: the harness is an investment that pays off across many sessions, not one. For prototypes, short-lived tools, and throwaway code, the machinery (custom linters, layered docs, resume logic) costs more than it returns. Build the guardrail when the failure is recurring and the codebase is durable, not on reflex.

↪ Your win: a harness-engineering reflex

Mixed review — across all nineteen lessons

Question 1 · from L1: The core reframe of harness engineering is that agent failure is…

Question 2 · from L16: Sandboxing contains an agent by bounding the…

Question 3 · from L17: Manual compaction beats waiting for the 95% auto-trigger because by then the agent has…

Question 4 · from L18: For a safety-critical stop, the most reliable circuit breaker is enforced by the…

Question 5 · from L19: To measure a harness change cleanly, isometric ablation requires that you hold fixed the…

You finished the course. Ask me to apply the decision table to a real repo of yours, draft a starter settings.json (hooks + permission rules), or wire up a long-running harness with an initializer and a coding agent. Or revisit any lesson — the mechanics compound when you use them together.