Harness Engineering · ~8 min
The whole course rests on one claim: the harness moves output quality more than the model. This is the lesson that makes the claim falsifiable — pin the model, change the environment, and read the score.
Two paired methods. Isometric ablation pins the model and removes one subsystem at a time to rank where the harness's value lives. Harness hill-climbing then takes the top-ranked subsystem and tunes it: change one variable, re-score, keep it if the score improved. The eval score is the gradient — no model change, no retraining.
Before you tune anything, find out what's doing the work. Isometric ablation keeps the model fixed, removes one of five harness subsystems, reruns the benchmark, and records the drop. The "isometric" qualifier is load-bearing: changing model and harness together confounds the delta, so you can't attribute it to either. Pin the model and the delta measures environmental marginal product alone.
| Removed subsystem | Score | Drop |
|---|---|---|
| (none — baseline) | 80% | — |
| Tools (shell, edit, test runner) | 0% | 80 pp |
| Instructions (AGENTS.md) | 35% | 45 pp |
| Feedback (verify, lint, tests) | 50% | 30 pp |
| Environment (lockfiles, services) | 60% | 20 pp |
| State (PROGRESS.md, commits) | 75% | 5 pp |
The drop table is the deliverable. Upgrade the highest-drop subsystem first; a near-zero drop marks a simplification candidate — scaffolding the model already subsumes, consuming maintenance budget without earning its place. That's the leading indicator for the harness-impermanence reflex of Part 3: build to delete.
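The ranking step can be sketched in a few lines of Python. The scores are the ones from the drop table; the function names and the 5 pp simplification threshold are illustrative assumptions, not part of the lesson.

```python
# Scores from the drop table, in percent, all measured with the same pinned model.
BASELINE = 80
ABLATED_SCORES = {
    "Tools": 0,
    "Instructions": 35,
    "Feedback": 50,
    "Environment": 60,
    "State": 75,
}


def rank_by_drop(baseline, ablated):
    """Return (subsystem, drop in pp) pairs, largest drop first."""
    drops = {name: baseline - score for name, score in ablated.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


def simplification_candidates(ranked, threshold_pp):
    """Subsystems whose removal costs at most threshold_pp (hypothetical cutoff)."""
    return [name for name, drop in ranked if drop <= threshold_pp]


ranking = rank_by_drop(BASELINE, ABLATED_SCORES)
for name, drop in ranking:
    print(f"{name:<13} {drop:>3} pp")

print("Upgrade first:", ranking[0][0])
print("Simplification candidates:", simplification_candidates(ranking, threshold_pp=5))
```

Choose the threshold from your own run-to-run variance; a fixed 5 pp is only a placeholder here.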
Ablation tells you which subsystem to invest in; hill-climbing finds the best configuration of it. Run the baseline, generate one candidate change, score it on the task suite, keep it if the score improves, repeat.
The cardinal rule is one change at a time. Change system-prompt wording and tool descriptions in the same iteration and you conflate two signals — you can't attribute the delta to either, and a regression takes untangling instead of a clean revert. Same principle as the incremental verification of Lesson 6: small, checkpointed, each reversible.
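A minimal sketch of that loop, assuming the harness configuration can be represented as a flat dictionary and that a `score` function runs the task suite and returns a number. `hill_climb`, `toy_score`, and the config keys are hypothetical names for illustration; the toy scorer is a deterministic stand-in for a real eval run.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Config = Dict[str, str]


def hill_climb(
    baseline: Config,
    candidates: Iterable[Tuple[str, str]],
    score: Callable[[Config], float],
) -> Tuple[Config, float, List[tuple]]:
    """Greedy hill-climbing: try one single-variable change per iteration."""
    best = dict(baseline)
    best_score = score(best)
    log = []
    for key, value in candidates:
        trial = {**best, key: value}  # exactly one variable differs from best
        trial_score = score(trial)
        kept = trial_score > best_score
        log.append((key, value, trial_score, kept))
        if kept:
            best, best_score = trial, trial_score
        # A rejected change is simply dropped: best is untouched, so the
        # revert is free.
    return best, best_score, log


def toy_score(config: Config) -> int:
    """Stand-in for a real suite run; returns a score in percent."""
    points = 50
    if config["prompt_style"] == "checklist":
        points += 10
    if config["tool_docs"] == "verbose":
        points -= 5
    return points


baseline = {"prompt_style": "prose", "tool_docs": "terse"}
candidates = [("tool_docs", "verbose"), ("prompt_style", "checklist")]
best, best_score, log = hill_climb(baseline, candidates, toy_score)
print(best, best_score)
```

Because each log entry records a single key, every delta in the log is attributable to exactly one change.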
Both methods need one thing: a representative, graded task suite held out from production. Get it wrong and you measure the fixture, not the capability.
| Discipline | Why |
|---|---|
| Isolation | Tune on one set, validate on a second held-out set — never tune against validation |
| Breadth | Include cases where the behavior should trigger and where it shouldn't, or it over-triggers |
| Grading | Prefer deterministic outcome graders over LLM-as-judge — cheaper to rerun, no evaluator variance |
Treat the eval score as a leading indicator; production outcomes are ground truth. Rotate fresh tasks drawn from real traces into the tuning set, and run a final check on a set that never touched the tuning loop before promoting any change.
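One way to keep the sets apart is to assign each task to a split by hashing its ID, so a task never drifts between sets across runs. The sketch below assumes tasks have stable string IDs; `assign_split`, `should_promote`, and the 20% holdout fraction are hypothetical choices, and the promotion rule (improve on tuning, do not regress on holdout) is one reasonable gate rather than the only one.

```python
import hashlib


def assign_split(task_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically map a task ID to 'tuning' or 'holdout'."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return "holdout" if bucket < holdout_fraction else "tuning"


def should_promote(
    tuning_before: float,
    tuning_after: float,
    holdout_before: float,
    holdout_after: float,
) -> bool:
    """Promote only if tuning improved and the untouched holdout did not regress."""
    return tuning_after > tuning_before and holdout_after >= holdout_before


task_ids = [f"task-{i}" for i in range(10)]  # hypothetical task IDs
splits = {task_id: assign_split(task_id) for task_id in task_ids}
print(splits)

# A rising tuning score with a falling holdout score is the overfitting tell.
print(should_promote(0.60, 0.70, 0.55, 0.55))
print(should_promote(0.60, 0.90, 0.55, 0.45))
```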
Hill-climbing finds a local optimum, not a global one: starting from a poor baseline, it converges to the nearest local peak. Overfitting is real: a harness tuned to the suite scores high there while degrading on real workloads; the tell is a rising tuning score with flat or worsening production error. And ablation ranks but doesn't quantify: the Agentic Harness Engineering paper found "harness components interact non-additively, so stacking effective edits caps the aggregate gain." A near-zero drop isn't proof a subsystem is useless; another can compensate when it's removed, masking its true contribution. On a mature harness, score each configuration over multiple trials (pass^k) before trusting any drop that sits near or below your noise floor.
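The lesson does not spell out how pass^k is computed. A common reading, assumed here, is the probability that all k independent trials of a task succeed, estimated from c successes in n recorded trials as C(c, k) / C(n, k). The function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from the recorded runs all pass.

    Assumes the combinatorial estimator C(successes, k) / C(trials, k).
    """
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must satisfy 0 <= successes <= trials")
    # math.comb returns 0 when k > successes, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


# A task that passes 6 of 8 runs looks fine at k=1 but fragile at k=4.
for k in (1, 2, 4):
    print(k, round(pass_hat_k(6, 8, k), 3))
```

The estimate falls quickly as k grows, which is why a subsystem that looks harmless on a single run can still matter for reliability.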
Retrieval practice — recall, don't peek
Question 1: The "isometric" constraint in ablation means you hold fixed the…
Question 2: In an ablation drop table, the subsystem to upgrade first is the one with the…
Question 3: Hill-climbing changes how many harness variables per iteration?
Question 4: A near-zero ablation drop flags a subsystem as a candidate to…
Question 5 · spaced recall from Lesson 1: LangChain's Terminal-Bench jump from 52.8% to 66.5% came from…