Part 5 · Reasoning & Planning

Harness Engineering · ~7 min

Reasoning Budget — The Sandwich

Not every step needs the same depth of thought. Planning and verification are high-stakes; execution is mostly mechanical. Spend the compute where the ambiguity actually is.

Why this, for you: Lesson 1's headline result (LangChain taking Terminal-Bench from 52.8% to 66.5% on pure harness changes) was partly this lever. Reasoning budget is a harness control: where you allocate thinking compute across the loop moves the score while the model stays the same. This lesson is how that knob turns.

The reasoning sandwich allocates maximum reasoning compute to planning and verification, reduced compute to execution — rather than one fixed level throughout. Concentrate thinking where ambiguity is highest; don't burn it on mechanical steps.

1 The benchmark that names the shape

LangChain's deep-agent experiments tested an extra-high → high → extra-high budget (xhigh at planning, high at execution, xhigh at verification). It scored highest on Terminal-Bench 2.0 at 66.5% — beating continuous maximum reasoning (53.9%, penalized by timeouts) and uniform high reasoning (63.6%).

The loser is instructive: continuous maximum reasoning scored worst, not because thinking hurts but because uniform max compute on mechanical execution causes timeouts that drag completion rates down. More thinking everywhere is not the goal; thinking where it counts is.

# the sandwich: budget follows the ambiguity
PLAN     xhigh   # explore the space; errors here propagate everywhere
EXECUTE  high    # follow the decided plan; largely mechanical
VERIFY   xhigh   # compare to requirements; a miss = false completion
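As a minimal sketch of how a harness might encode this schedule, assume a hypothetical `Phase` enum and effort labels that mirror the levels named above. The names are illustrative and not any specific tool's API.

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    EXECUTE = "execute"
    VERIFY = "verify"


# Hypothetical effort labels mirroring the levels named in the lesson.
SANDWICH = {
    Phase.PLAN: "xhigh",     # ambiguity is highest; errors propagate
    Phase.EXECUTE: "high",   # plan is decided; work is largely mechanical
    Phase.VERIFY: "xhigh",   # a miss here becomes a false completion
}


def effort_for(phase: Phase, policy: dict = SANDWICH) -> str:
    """Return the reasoning effort the harness would request for a phase."""
    return policy[phase]


schedule = [effort_for(p) for p in (Phase.PLAN, Phase.EXECUTE, Phase.VERIFY)]
print(schedule)  # ['xhigh', 'high', 'xhigh']
```

Keeping the policy as plain data means a uniform-high baseline is just another dictionary, which makes the two configurations easy to compare in an evaluation run.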

2 Why the phases differ

The three phases impose structurally different cognitive demands:

| Phase | Demand | Cost of underthinking |
| --- | --- | --- |
| Planning | Explore requirements, approaches, edge cases, risks | Errors propagate through every later step |
| Execution | Follow a decided plan; write code, run commands | Low: the hard decisions are already made |
| Verification | Compare output to requirements precisely | A missed failure becomes false completion |

Uniform max compute wastes budget on execution and, per the benchmark, times out. Concentrating compute where ambiguity is highest balances cost against quality — the same logic as Lesson 6's verification gates, applied to how hard the model thinks rather than whether you trust the result.
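To make the waste concrete, here is a rough cost sketch. Every number in it is a made-up assumption for illustration (thinking tokens per step at each level, steps per phase); none of it comes from the benchmark. Only the shape of the comparison matters.

```python
# Hypothetical numbers for illustration only: thinking tokens per step at each
# effort level, and how many steps a task spends in each phase.
THINKING_TOKENS = {"high": 4_000, "xhigh": 16_000}
STEPS_PER_PHASE = {"plan": 2, "execute": 20, "verify": 3}


def total_thinking_tokens(policy: dict) -> int:
    """Sum thinking tokens over all steps, given a phase -> effort policy."""
    return sum(
        steps * THINKING_TOKENS[policy[phase]]
        for phase, steps in STEPS_PER_PHASE.items()
    )


uniform_xhigh = {"plan": "xhigh", "execute": "xhigh", "verify": "xhigh"}
sandwich = {"plan": "xhigh", "execute": "high", "verify": "xhigh"}

print(total_thinking_tokens(uniform_xhigh))  # 400000
print(total_thinking_tokens(sandwich))       # 160000
```

Under these assumed numbers, the execution steps account for most of the uniform-max budget, which is the spend the sandwich trims.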

3 How to apply the knob

The lever has different names per tool, same effect:

Dual-mode operation enforces the sandwich architecturally: the OPENDEV paper runs a Plan Mode whose Planner sub-agent has only read-only tools, then a Normal Mode with full access — which is exactly Lesson 10's read-only planning phase doubling as the extra-high-compute slice. One useful corollary: maxing thinking on a balanced model can cost less than upgrading to a higher tier — worth evaluating before you pay for a bigger model on reasoning-heavy work.
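The dual-mode idea can be sketched as a tool gate keyed on the current mode. The tool names, the mode labels, and the effort assigned to Normal Mode are assumptions for illustration; this is not the OPENDEV paper's actual interface.

```python
# Hypothetical tool names and mode labels; not the OPENDEV paper's actual API.
READ_ONLY_TOOLS = frozenset({"read_file", "search", "list_dir"})
ALL_TOOLS = READ_ONLY_TOOLS | {"write_file", "run_command"}

MODES = {
    "plan": {"tools": READ_ONLY_TOOLS, "effort": "xhigh"},
    "normal": {"tools": ALL_TOOLS, "effort": "high"},  # effort level assumed
}


def call_tool(mode: str, tool: str) -> str:
    """Allow a tool call only if the current mode exposes that tool."""
    if tool not in MODES[mode]["tools"]:
        raise PermissionError(f"{tool!r} is not available in {mode} mode")
    return f"{tool} ok"


print(call_tool("plan", "read_file"))  # read_file ok
try:
    call_tool("plan", "write_file")
except PermissionError as err:
    print(err)  # 'write_file' is not available in plan mode
```

Because the restriction lives in the harness rather than in the prompt, the planner cannot mutate anything even if it tries, and the expensive reasoning tier is tied to the same mode switch.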

When the sandwich is just routing overhead

The gap between the sandwich (66.5%) and uniform high (63.6%) is only 2.9 percentage points, which is not always worth the complexity. The sandwich loses in four cases: when phases aren't cleanly separable (exploratory debugging interleaves planning and execution, so routing misclassifies and degrades to noisy uniform compute); when mode-switching adds more bugs than it prevents (teams without reliable planner/executor/verifier routing do better at a single uniform-high tier); when verification is already cheap (tests and types check correctness, so extra-high model-based verification just duplicates the harness); and when execution dominates (bulk migrations spend most tokens executing, so cutting compute there saves little).
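The size of that gap is easy to check from the scores quoted in this lesson. The snippet below reports it both in percentage points and as a relative improvement; the dictionary keys are just illustrative labels.

```python
# Terminal-Bench 2.0 scores quoted in this lesson (percent of tasks passed).
scores = {"sandwich": 66.5, "uniform_high": 63.6, "continuous_max": 53.9}

gap_pts = scores["sandwich"] - scores["uniform_high"]
relative = gap_pts / scores["uniform_high"] * 100
print(f"{gap_pts:.1f} points, {relative:.1f}% relative")
# 2.9 points, 4.6% relative

gap_vs_max = scores["sandwich"] - scores["continuous_max"]
print(f"{gap_vs_max:.1f} points over continuous max")
# 12.6 points over continuous max
```

The contrast between the two gaps is the practical takeaway: avoiding uniform maximum reasoning accounts for most of the difference, while the sandwich's edge over uniform high is comparatively small.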

↪ Your win: spend compute where ambiguity lives

Retrieval practice — recall, don't peek

Question 1: The reasoning sandwich allocates the most compute to…

Question 2: Continuous maximum reasoning scored worst on the benchmark because…

Question 3: An underthought planning phase is costly because its errors…

Question 4: In Claude Code, you raise a skill's reasoning budget by…

Question 5 · spaced recall from Lesson 10: Plan Mode catches a wrong approach early by restricting the agent to…

Ask me anything. Want a phase-routed implementation sketched for your stack, or the cost math on maxing thinking vs. a tier upgrade? Next, Part 6 opens with Orchestrator-Worker — decomposing one task across parallel workers and synthesizing the results.