Harness Engineering · ~7 min
The model can't reliably judge what it's allowed to touch. So don't make it. Put the boundary in the harness, where a misjudgment can't become an action.
An overeager action is one the agent takes outside your authorized scope on a perfectly benign task: wiping a stale credentials backup while "cleaning up the old auth handler." The agent finishes the job — and does more than asked. The fix is structural.
Why? LLMs encode authorization as text patterns, not durable intent. A "do not delete files outside
src/auth/" line gets pattern-matched against action descriptions — and a file named
auth-credentials.bak pattern-matches as auth-related and gets deleted. Stripping the consent block raised
overeager rates 11.9–17.2 points; the prompt was never a reliable boundary.
An ask-to-continue framework doesn't make the model smarter. It interposes a deterministic checkpoint between proposal and execution — a proposal cannot become an effect without a separate consent event. The misjudgment is unchanged; the harness denies it the chance to act.
Permission framework is a spectrum, not on/off. Match it to how reachable and destructive the surface is:
| Mechanism | What it does |
|---|---|
| Ask-to-continue | Pause for consent before each consequential action — the strongest measured reducer |
| Narrow allowlist | allow: [Edit(./src/**), Bash(npm test)] — deterministic, zero interruption, equivalent safety |
| Deny rules | deny: [Edit(.env*), Bash(rm *)] — a hard floor under everything else |
| Classifier gating (Auto Mode) | A two-stage classifier evaluates each call; auto-approves reads/in-project edits |
Claude Code's Auto Mode runs a fast single-token filter, then chain-of-thought reasoning when the first stage flags. On real traffic its false-positive rate drops to 0.4% — but it still leaves 17% of real overeager actions undetected. In most misses it spots the danger and assumes the consent statement covers it. Model-layer inference has a measured ceiling; framework-layer gating sits below it.
Approval fatigue: if users rubber-stamp every prompt (practitioner reports cite ~93% acceptance), ask-to-continue collapses to a permissive harness with friction — paper safety. Sandbox makes it moot: a hermetic container with no shared credentials neutralizes overeager actions by construction, making permission-mode choice second-order. For headless CI that can't pause, lean on the sandbox and a narrow allowlist, not prompts.
rm, .env, force-push, prod deploys.Retrieval practice — recall, don't peek
Question 1On OverEager-Bench, the biggest driver of overeager-action rate was…
Question 2An ask-to-continue framework reduces overeager actions by…
Question 3An overeager action is best described as…
Question 4Claude Code's Auto Mode classifier still misses roughly…
Question 5 · spaced recall from Lesson 04Multi-agent systems use roughly how many times the tokens of a single chat?