Harness Engineering · ~7 min
Not every step needs the same depth of thought. Planning and verification are high-stakes; execution is mostly mechanical. Spend the compute where the ambiguity actually is.
The reasoning sandwich allocates maximum reasoning compute to planning and verification, reduced compute to execution — rather than one fixed level throughout. Concentrate thinking where ambiguity is highest; don't burn it on mechanical steps.
The loser is instructive: continuous maximum reasoning scored worst, not because thinking hurts but because uniform max compute on mechanical execution causes timeouts that drag completion rates down. More thinking everywhere is not the goal; thinking where it counts is.
The three phases impose structurally different cognitive demands:
| Phase | Demand | Cost of underthinking |
|---|---|---|
| Planning | Explore requirements, approaches, edge cases, risks | Errors propagate through every later step |
| Execution | Follow a decided plan; write code, run commands | Low — the hard decisions are already made |
| Verification | Compare output to requirements precisely | A missed failure becomes false completion |
Uniform max compute wastes budget on execution and, per the benchmark, times out. Concentrating compute where ambiguity is highest balances cost against quality — the same logic as Lesson 6's verification gates, applied to how hard the model thinks rather than whether you trust the result.
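The allocation in the table above can be sketched as a simple lookup from phase to reasoning budget. This is a minimal illustration, not a tuned configuration: the `Phase` enum, the `thinking_budget` helper, and the token numbers are all hypothetical, chosen only to show the high/low/high shape.

```python
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    EXECUTION = "execution"
    VERIFICATION = "verification"


# Hypothetical budgets (in thinking tokens). The values are illustrative
# assumptions; only the shape matters: high, low, high.
THINKING_BUDGET = {
    Phase.PLANNING: 16_000,
    Phase.EXECUTION: 2_000,
    Phase.VERIFICATION: 16_000,
}


def thinking_budget(phase: Phase) -> int:
    """Return the reasoning budget assigned to a phase."""
    return THINKING_BUDGET[phase]


def sandwich_schedule(phases):
    """Pair each phase of a task with its budget, in order."""
    return [(p.value, thinking_budget(p)) for p in phases]


if __name__ == "__main__":
    task = [Phase.PLANNING, Phase.EXECUTION, Phase.VERIFICATION]
    for name, budget in sandwich_schedule(task):
        print(f"{name}: {budget} thinking tokens")
```

Keeping the mapping in one table makes the policy easy to audit and to flatten back to a uniform level if the sandwich turns out not to pay for itself.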
The lever has different names per tool, same effect:
ultrathink in a planning or verification skill to enable extended thinking; omit it in execution skills.

Thinking budget per call: high for planning/verification, standard for execution.

Dual-mode operation enforces the sandwich architecturally: the OPENDEV paper runs a Plan Mode whose Planner sub-agent has only read-only tools, then a Normal Mode with full access. That is exactly Lesson 10's read-only planning phase doubling as the extra-high-compute slice. One useful corollary: maxing thinking on a balanced model can cost less than upgrading to a higher tier, which is worth evaluating before you pay for a bigger model on reasoning-heavy work.
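The dual-mode split described above can be approximated by filtering the tool list by mode. The sketch below is an assumption-laden illustration, not the OPENDEV implementation: the `Tool` dataclass, the tool names, and the `tools_for_mode` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tool:
    name: str
    read_only: bool


# Hypothetical tool registry; the names are illustrative only.
TOOLS = [
    Tool("read_file", read_only=True),
    Tool("search", read_only=True),
    Tool("write_file", read_only=False),
    Tool("run_shell", read_only=False),
]


def tools_for_mode(mode: str) -> List[str]:
    """Plan mode exposes read-only tools; normal mode exposes everything."""
    if mode == "plan":
        return [t.name for t in TOOLS if t.read_only]
    if mode == "normal":
        return [t.name for t in TOOLS]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print("plan:", tools_for_mode("plan"))
    print("normal:", tools_for_mode("normal"))
```

Because the planner never receives a write-capable tool in this sketch, it cannot drift into execution, whatever the prompt says.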
The gap between the sandwich (66.5%) and uniform high (63.6%) is only 2.9 percentage points, which is not always worth the complexity. It loses when phases aren't cleanly separable (exploratory debugging interleaves planning and execution, so routing misclassifies and degrades to noisy uniform compute); when mode-switching adds more bugs than it prevents (teams without reliable planner/executor/verifier routing do better at a single uniform-high tier); when verification is already cheap (tests and types check correctness, so extra-high model-based verification just duplicates the harness); and when execution dominates (bulk migrations spend most tokens executing, so cutting compute there saves little).
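The four failure conditions above can be written down as a checklist, alongside the arithmetic for the gap. This is a rough sketch: `gap_points` and `sandwich_worth_it` are hypothetical helper names, and the decision rule is a simplification that treats any one failure condition as disqualifying.

```python
def gap_points(sandwich_pct: float, uniform_pct: float) -> float:
    """Difference between two completion rates, in percentage points."""
    return round(sandwich_pct - uniform_pct, 1)


def sandwich_worth_it(
    phases_separable: bool,
    routing_reliable: bool,
    verification_already_cheap: bool,
    execution_dominates: bool,
) -> bool:
    """Return False if any of the four failure conditions applies."""
    if not phases_separable:
        return False  # routing would misclassify interleaved work
    if not routing_reliable:
        return False  # mode-switching bugs outweigh the gain
    if verification_already_cheap:
        return False  # tests and types already check correctness
    if execution_dominates:
        return False  # little budget left to save outside execution
    return True


if __name__ == "__main__":
    # Figures quoted in the text: 66.5% for the sandwich, 63.6% for uniform high.
    print(gap_points(66.5, 63.6), "percentage points")
    print(sandwich_worth_it(True, True, False, False))
```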
ultrathink in plan/verify skills, not in execution.

Retrieval practice: recall, don't peek
Question 1: The reasoning sandwich allocates the most compute to…
Question 2: Continuous maximum reasoning scored worst on the benchmark because…
Question 3: An underthought planning phase is costly because its errors…
Question 4: In Claude Code, you raise a skill's reasoning budget by…
Question 5 (spaced recall from Lesson 10): Plan Mode catches a wrong approach early by restricting the agent to…