Part 9 · Measuring the Harness

Harness Engineering · ~8 min

Eval-Driven Harness Improvement

The whole course rests on one claim: the harness moves output quality more than the model. This is the lesson that makes the claim falsifiable — pin the model, change the environment, and read the score.

Why this, for you: eighteen lessons told you to add a hook, scope a sandbox, lower a compaction threshold. But which change actually helped? Without a measurement loop, "harness beats model" is faith and every edit is a guess. This lesson closes the loop the very first lesson opened: it turns the feedback signal from a vibe into a number, so you can rank where to invest and prove a change earned its place.

Two paired methods. Isometric ablation pins the model and removes one subsystem at a time to rank where the harness's value lives. Harness hill-climbing then takes the top-ranked subsystem and tunes it: change one variable, re-score, keep it if the score improved. The eval score is the gradient — no model change, no retraining.

1 Ablate to find the load-bearing subsystem

Before you tune anything, find out what's doing the work. Isometric ablation keeps the model fixed, removes one of five harness subsystems, reruns the benchmark, and records the drop. The "isometric" qualifier is load-bearing: changing model and harness together confounds the delta, so you can't attribute it to either. Pin the model and the delta measures environmental marginal product alone.

Removed subsystem                 | Score | Drop
(none, baseline)                  | 80%   | -
Tools (shell, edit, test runner)  | 0%    | 80 pp
Instructions (AGENTS.md)          | 35%   | 45 pp
Feedback (verify, lint, tests)    | 50%   | 30 pp
Environment (lockfiles, services) | 60%   | 20 pp
State (PROGRESS.md, commits)      | 75%   | 5 pp

The drop table is the deliverable. Upgrade the highest-drop subsystem first; a near-zero drop marks a simplification candidate — scaffolding the model already subsumes, consuming maintenance budget without earning its place. That's the leading indicator for the harness-impermanence reflex of Part 3: build to delete.
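A minimal sketch of the ranking step, using the scores from the table above as plain numbers. The names `ablation_table` and `NEAR_ZERO_PP` are hypothetical, and the 5 pp cutoff is an assumption for illustration; in practice you would set it from your own run-to-run noise.

```python
from typing import NamedTuple


class AblationRow(NamedTuple):
    subsystem: str
    score: float    # benchmark score with this subsystem removed, in percent
    drop_pp: float  # baseline minus score, in percentage points


def ablation_table(baseline: float, ablated: dict[str, float]) -> list[AblationRow]:
    """Rank subsystems by how far the score falls when each one is removed."""
    rows = [AblationRow(name, score, baseline - score) for name, score in ablated.items()]
    return sorted(rows, key=lambda r: r.drop_pp, reverse=True)


# Scores copied from the table above; the model is pinned for every run.
BASELINE = 80.0
ABLATED = {
    "State": 75.0,
    "Tools": 0.0,
    "Feedback": 50.0,
    "Instructions": 35.0,
    "Environment": 60.0,
}
NEAR_ZERO_PP = 5.0  # assumed cutoff for "near-zero"; not a value from the lesson

table = ablation_table(BASELINE, ABLATED)
upgrade_first = table[0].subsystem
simplify_candidates = [r.subsystem for r in table if r.drop_pp <= NEAR_ZERO_PP]

for row in table:
    print(f"{row.subsystem:<14}{row.score:>6.0f}%{row.drop_pp:>6.0f} pp")
print("upgrade first:", upgrade_first)
print("simplification candidates:", simplify_candidates)
```

The sort puts the investment target at the top and the build-to-delete candidates at the bottom, so one table answers both questions.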

2 Hill-climb the dimension the table picked

Ablation tells you which subsystem to invest in; hill-climbing finds the best configuration of it. Run the baseline, generate one candidate change, score it on the task suite, keep it if the score improves, repeat.

This is how LangChain moved Terminal-Bench 2.0 from 52.8% to 66.5%, the result Lesson 1 opened with: a 13.7-point gain from harness-only changes, one variable per iteration. The reasoning sandwich was one such change: maximum reasoning at planning and verification with moderate reasoning at execution scored 63.6%, versus 53.9% for uniform maximum. That is a 9.7-point delta from a single configuration change.

The cardinal rule is one change at a time. Change system-prompt wording and tool descriptions in the same iteration and you conflate two signals — you can't attribute the delta to either, and a regression takes untangling instead of a clean revert. Same principle as the incremental verification of Lesson 6: small, checkpointed, each reversible.
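One way to make the rule mechanical is to reject any candidate that differs from the baseline in more than one setting. This is a sketch under the assumption that your harness configuration is a flat dictionary; the keys shown (`system_prompt`, `tool_descriptions`, `compaction_threshold`) are hypothetical placeholders, not a real schema.

```python
def changed_keys(baseline: dict, candidate: dict) -> set[str]:
    """Return the config keys whose values differ between two harness configs."""
    missing = object()  # sentinel so an added or removed key counts as a change
    keys = baseline.keys() | candidate.keys()
    return {k for k in keys if baseline.get(k, missing) != candidate.get(k, missing)}


def assert_single_change(baseline: dict, candidate: dict) -> str:
    """Raise unless the candidate differs from the baseline in exactly one key."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, got {sorted(diff)}")
    return next(iter(diff))


# Hypothetical harness config for illustration only.
baseline = {"system_prompt": "v1", "tool_descriptions": "v1", "compaction_threshold": 0.8}
ok_candidate = {**baseline, "compaction_threshold": 0.6}
bad_candidate = {**baseline, "system_prompt": "v2", "tool_descriptions": "v2"}

print(assert_single_change(baseline, ok_candidate))  # the one variable under test
print(sorted(changed_keys(baseline, bad_candidate)))  # two signals, so this one is rejected
```

Running the guard before every eval run means a conflated candidate fails fast, before it costs a benchmark pass.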

    # the loop: the eval score is the gradient signal
    baseline eval run
    change ONE variable
    re-score on the held-out task suite
    improved? adopt as new baseline : discard
    re-ablate to confirm the rank shifted, then repeat
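The loop above could look like this in Python. It is a sketch, not LangChain's implementation: `hill_climb`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the toy scorer stands in for a real eval run against your task suite with the model pinned.

```python
import itertools
from typing import Callable

Config = dict[str, float]


def hill_climb(
    baseline: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    iterations: int,
) -> tuple[Config, float, list[tuple[Config, float, bool]]]:
    """Greedy loop: keep a candidate only if it beats the current baseline score."""
    best, best_score = dict(baseline), evaluate(baseline)
    history = []
    for _ in range(iterations):
        candidate = propose(best)    # should differ from `best` in ONE variable
        score = evaluate(candidate)  # same pinned model, same task suite
        adopted = score > best_score
        history.append((candidate, score, adopted))
        if adopted:
            best, best_score = candidate, score  # adopt as the new baseline
        # otherwise discard: `best` is untouched, so the revert is free
    return best, best_score, history


def toy_evaluate(cfg: Config) -> float:
    # Stand-in for a real eval run: the score peaks when the threshold is 60.
    return 80.0 - abs(cfg["compaction_threshold"] - 60)


_steps = itertools.cycle([-5, 5])


def toy_propose(cfg: Config) -> Config:
    # Nudge a single variable; alternates down and up so the run is deterministic.
    return {**cfg, "compaction_threshold": cfg["compaction_threshold"] + next(_steps)}


best, best_score, history = hill_climb(
    {"compaction_threshold": 80}, toy_propose, toy_evaluate, iterations=10
)
# Ends at threshold 60 with a score of 80.0 after four adopted changes.
print(best, best_score, sum(adopted for _, _, adopted in history))
```

The re-ablation step is deliberately left outside the function: it is a separate, more expensive run that you would trigger after an adopted change, not on every iteration.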

3 The eval suite is the whole game

Both methods need one thing: a representative, graded task suite held out from production. Get it wrong and you measure the fixture, not the capability.

Discipline | Why
Isolation  | Tune on one set, validate on a second held-out set; never tune against validation
Breadth    | Include cases where the behavior should trigger and where it shouldn't, or it over-triggers
Grading    | Prefer deterministic outcome graders over LLM-as-judge: cheaper to rerun, no evaluator variance

Treat the eval score as a leading indicator; production outcomes are ground truth. Rotate fresh tasks drawn from real traces into the tuning set, and before promoting any change, run a final check on a set that never touched the tuning loop.
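A sketch of the isolation discipline: shuffle the task IDs once with a fixed seed, then cut them into a tuning set, a validation set, and a final set that the loop never sees. The function name `split_suite`, the 60/20/20 proportions, and the `task-N` IDs are assumptions for illustration.

```python
import random


def split_suite(task_ids, seed=0, tune_pct=60, validate_pct=20):
    """Shuffle once with a fixed seed, then cut into tune / validate / final sets."""
    ids = sorted(set(task_ids))  # de-duplicate and fix the starting order
    random.Random(seed).shuffle(ids)
    n_tune = len(ids) * tune_pct // 100
    n_validate = len(ids) * validate_pct // 100
    return {
        "tune": ids[:n_tune],
        "validate": ids[n_tune:n_tune + n_validate],
        "final": ids[n_tune + n_validate:],  # never touched by the tuning loop
    }


splits = split_suite([f"task-{i}" for i in range(50)])
print({name: len(ids) for name, ids in splits.items()})
```

Fixing the seed keeps the split stable across reruns, so a task cannot drift from the final set into the tuning set between iterations. When you rotate in fresh tasks from real traces, add them to the tuning pool only and leave the final set alone.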

Where the score lies to you

Hill-climbing finds a local optimum, not a global one: starting from a poor baseline, it converges to the nearest local peak. Overfitting is real: a harness tuned to the suite scores high there while degrading on real workloads; the tell is a rising tuning score with flat or worsening production error. And ablation ranks; it doesn't quantify. The Agentic Harness Engineering paper found "harness components interact non-additively, so stacking effective edits caps the aggregate gain." A near-zero drop isn't proof a subsystem is useless; another subsystem can compensate when it's removed, masking its true contribution. On a mature harness, use multi-trial scoring (pass^k, the rate at which all k repeated attempts at a task succeed) before trusting any drop below your noise floor.
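A sketch of multi-trial scoring and a noise floor. It uses one common estimator for pass^k, C(c, k) / C(n, k) for c successes in n trials, and treats the noise floor as two standard deviations of repeated, unchanged baseline runs. The function names and the factor of two are assumptions for illustration, not values from the lesson.

```python
from math import comb
from statistics import mean, stdev


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent attempts at a task ALL succeed."""
    if not 0 < k <= trials:
        raise ValueError("need 0 < k <= trials")
    return comb(successes, k) / comb(trials, k)


def suite_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average pass^k over tasks; each task maps to its per-trial pass/fail list."""
    return mean(pass_hat_k(sum(r), len(r), k) for r in results.values())


def noise_floor_pp(baseline_scores: list[float], z: float = 2.0) -> float:
    """z standard deviations of repeated, unchanged baseline runs, in pp."""
    return z * stdev(baseline_scores)


# Made-up trial results for three tasks, four trials each.
results = {
    "always-passes": [True, True, True, True],
    "flaky": [True, True, False, False],
    "always-fails": [False, False, False, False],
}
print(suite_pass_hat_k(results, k=1))  # plain pass rate
print(suite_pass_hat_k(results, k=2))  # stricter: both of two attempts must pass

floor = noise_floor_pp([79.0, 81.0, 80.0, 80.0])  # made-up repeated baseline scores
print(f"drops below {floor:.2f} pp are inside the noise")
```

The flaky task is where the two metrics diverge: it counts as half a pass at k=1 but only a sixth at k=2, which is why pass^k exposes instability that a single-run score hides.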

↪ Your win: measure the change, don't just make it

  • Ablate to rank — pin the model, remove one subsystem, record the drop; the table picks your target.
  • Hill-climb to tune — one variable per iteration, keep it only if the eval score improves.
  • Build the eval suite first — held out, both-directional, deterministically graded.
  • Watch for overfitting — rotate tasks, validate on an untouched set, treat production as ground truth.
  • Retire near-zero-drop scaffolds — they're the build-to-delete candidates of harness impermanence.

Retrieval practice — recall, don't peek

Question 1: The "isometric" constraint in ablation means you hold fixed the…

Question 2: In an ablation drop table, the subsystem to upgrade first is the one with the…

Question 3: Hill-climbing changes how many harness variables per iteration?

Question 4: A near-zero ablation drop flags a subsystem as a candidate to…

Question 5 · spaced recall from Lesson 1: LangChain's Terminal-Bench jump from 52.8% to 66.5% came from…

Ask me anything. Want a starter ablation table for your own agent's subsystems, or a held-out eval split to hill-climb against without overfitting? Next, the Capstone: a symptom→move table that folds all nineteen lessons into one decision tool.