Part 2 · Making Rules Stick

Prompt Engineering · ~7 min

Guardrails Beat Guidance

For coding agents, the rule that helps is the one that closes a door. Positive advice — "follow the style," "write good tests" — measurably hurts.

Why this, for you: Lesson 2 told you to prefer positive framing. This lesson covers the one place where that default flips, and it flips only for coding agents editing real repos. If you write a CLAUDE.md or .cursorrules, the rules that earn their place are guardrails, not encouragement. Knowing where the default inverts is the whole skill.

The first large-scale evaluation of rule files put the question to a benchmark. It scraped 679 rule files — 25,532 rules — from GitHub and ran 5,000+ agent runs on SWE-bench Verified (Zhang et al., 2026). What it found unsettles the polarity advice from Lesson 2.

1 What the benchmark found

Four results matter for how you write rules:

"Follow the existing code style" makes the agent worse on SWE-bench when added alone. "Do not introduce new formatting conventions" makes it better. Same intent — opposite sign on the outcome.

2 Why the asymmetry

Zhang et al. read the effect through potential-based reward shaping: rules don't teach new behaviour, they reshape the agent's search landscape. A negative constraint removes an infeasible branch — a discrete, binary cut. A positive directive adds a soft preference that competes with the model's training priors, and that objective conflict shows up as degraded benchmark performance.
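
For reference, this is the textbook form of potential-based reward shaping from the reinforcement learning literature. It is the general definition, not a formula taken from Zhang et al., and how rule text maps onto the potential function is their interpretation rather than something shown here. The symbols are the standard ones: R is the original reward, F the shaping term, Φ a potential over states, and γ the discount factor.

```latex
R'(s, a, s') = R(s, a, s') + F(s, a, s'),
\qquad
F(s, a, s') = \gamma \, \Phi(s') - \Phi(s)
```

In that literature, shaping of this form changes how the search is guided without changing which policies are optimal, which fits the claim that rules reshape the landscape rather than teach new behaviour.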

The priming half is independent: any domain-relevant text activates the coding subspace of the model's representations regardless of content — which is exactly why random rules match hand-written ones. Rule presence primes; rule content shapes the search. The two effects stack.

3 Rewriting guidance into guardrails

Each rewrite turns a preference the agent must rank into a boundary it either crosses or doesn't:

Before (positive directive) | After (negative guardrail)
Follow the existing code style | Do not introduce new formatting conventions
Write clear commit messages | Do not squash unrelated changes into one commit
Keep changes focused | Do not refactor code outside the task scope
Write thorough tests | Do not delete or skip existing tests
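
A first pass over your own rule file can be automated. The sketch below is a minimal heuristic, not a method from the paper: it labels a rule a guardrail if it opens with a prohibition and flags everything else for review. The function names and the list of openers are assumptions made for illustration, and a real rule file will need a human to decide which flagged directives are non-inferable and should stay.

```python
import re

# Hypothetical heuristic: phrases that usually open a negative guardrail.
NEGATIVE_OPENERS = re.compile(r"^(do not|don't|never)\b", re.IGNORECASE)


def classify_rule(rule: str) -> str:
    """Label a rule 'guardrail' if it opens with a prohibition, else 'directive'."""
    return "guardrail" if NEGATIVE_OPENERS.match(rule.strip()) else "directive"


def directives_to_review(rules: list[str]) -> list[str]:
    """Return the rules phrased as positive directives, in their original order."""
    return [r for r in rules if classify_rule(r) == "directive"]


rules = [
    "Follow the existing code style",
    "Do not introduce new formatting conventions",
    "Write thorough tests",
    "Do not delete or skip existing tests",
]
flagged = directives_to_review(rules)
print(flagged)
```

Each flagged rule is a candidate for one of two outcomes: rewrite it as a boundary, or keep it because it states something the agent cannot find in the repo.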

The one positive directive worth keeping is the kind the agent cannot infer from the codebase — a non-obvious build command or project-specific invocation:

# keep this: it's not discoverable from the repo
Run tests with pnpm test:integration --filter $PKG
# the default `pnpm test` skips integration tests in this monorepo

Three boundaries on this result

Coding agents on SWE-bench only. General prompting still favours positive directives; the Lesson 2 default holds everywhere else.

Rule sets under ~50 rules. Past that, the compliance ceiling (Lesson 4) dominates regardless of polarity: frontier models hit only 68% accuracy at 500 instructions (IFScale, 2025).

Complementary, not superseding. AGENTS.md benchmarks found that tool-specific commands and non-inferable constraints produce the largest behaviour change (Gloaguen et al., 2026). These are positive directives that still win where they supply information the agent can't reach otherwise.
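
The rule-count boundary is easy to check mechanically. This is a rough sketch under two assumptions: rules are written as bullet lines starting with "- " or "* ", and 50 is treated as an approximate budget rather than a hard limit. The names `RULE_BUDGET`, `count_rules`, and `over_budget` are hypothetical.

```python
RULE_BUDGET = 50  # the lesson's approximate threshold, not an exact cutoff


def count_rules(rule_file_text: str) -> int:
    """Count lines that look like bullet-style rules ('- ' or '* ')."""
    return sum(
        1
        for line in rule_file_text.splitlines()
        if line.lstrip().startswith(("- ", "* "))
    )


def over_budget(rule_file_text: str, budget: int = RULE_BUDGET) -> bool:
    """True when the file holds more bullet rules than the budget allows."""
    return count_rules(rule_file_text) > budget


sample = """# Project rules
- Do not introduce new formatting conventions
- Do not refactor code outside the task scope
* Do not delete or skip existing tests
"""
print(count_rules(sample), over_budget(sample))
```

A file that trips this check is a candidate for pruning before any polarity rewrite, since past the ceiling the count matters more than the phrasing.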

↪ Your win: guardrails for what the agent already knows, directives for what it can't infer

Retrieval practice — recall, don't peek

Question 1: On SWE-bench, the only individually beneficial rule type was…

Question 2: Random rules helped about as much as curated ones because…

Question 3: Under reward shaping, a positive directive tends to hurt because it…

Question 4: The positive directive worth keeping states…

Question 5 · spaced recall from Lesson 4: Adding rules past the compliance ceiling produces…

Ask me anything. Want a pass over your CLAUDE.md converting guidance into guardrails, or help spotting which positive directives are actually non-inferable and should stay? Next in Part 3: Rules or Examples — choosing the vehicle that constrains interpretation.