Part 4 · Layering the Defense

Security · ~7 min

No Single Layer Holds

Every control you've learned can be bypassed under a determined attack. The fix isn't a better layer — it's enough independent layers that the survivors catch what the failures missed.

Why this, for you: Parts 1–3 gave you single controls — remove a leg, sandbox, egress policy, plan-then-execute. Each one fails eventually. Defense-in-depth is the meta-pattern that makes a stack survivable: assume every layer breaks, and design so no single break is catastrophic.

Perplexity's response to NIST's agent-security RFI puts the premise bluntly: "the non-deterministic nature of LLM reasoning ensures that any individual defense can be circumvented under sufficiently adaptive attack strategies." If every single layer eventually fails, the only durable answer is layering — multiple independent mechanisms, each catching what the others miss.

1 Five layers, each at a different level

The OPENDEV agent ships five independent safety layers, each operating at a different point in the stack — so the failure of one does not compromise the others:

| Layer | Where it acts | What it catches |
| --- | --- | --- |
| Prompt guardrails | System prompt | Naive misuse; bypassed by injection |
| Schema restrictions | Tool registry | Calls to tools the subagent can't see |
| Runtime approvals | Before dangerous ops | What automation should not auto-run |
| Tool validation | Input checking | Valid tool, malformed/hostile inputs |
| Lifecycle hooks | Pre-tool execution | Anything the prompt missed, deterministically |

No single layer is sufficient. The combination produces safety properties that no individual mechanism can achieve alone, because each layer assumes the others will eventually fail.
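To make "independent" concrete, here is a minimal sketch of four of the five layers as separate check functions that never call each other. It is not OPENDEV's implementation: the names `ToolCall`, `evaluate`, `VISIBLE_TOOLS`, and `NEEDS_APPROVAL`, and the matching rules themselves, are assumptions made up for illustration. Prompt guardrails are left out because they live in the system prompt rather than in code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict


# A layer returns a reason string when it blocks, or None when it passes.
Layer = Callable[[ToolCall], Optional[str]]

VISIBLE_TOOLS = {"Bash", "Read"}       # hypothetical schema for this agent
NEEDS_APPROVAL = ("rm ", "git push")   # hypothetical "dangerous" command prefixes


def schema_layer(call: ToolCall) -> Optional[str]:
    if call.tool not in VISIBLE_TOOLS:
        return f"tool {call.tool!r} is not in the schema"
    return None


def validation_layer(call: ToolCall) -> Optional[str]:
    if call.tool == "Bash" and not isinstance(call.args.get("command"), str):
        return "Bash requires a string 'command'"
    if call.tool == "Read" and ".." in str(call.args.get("path", "")):
        return "path traversal in Read path"
    return None


def approval_layer(call: ToolCall) -> Optional[str]:
    command = str(call.args.get("command", ""))
    if any(command.startswith(prefix) for prefix in NEEDS_APPROVAL):
        return "needs human approval"
    return None


def hook_layer(call: ToolCall) -> Optional[str]:
    command = str(call.args.get("command", ""))
    return "production target" if "prod" in command else None


LAYERS: Dict[str, Layer] = {
    "schema": schema_layer,
    "validation": validation_layer,
    "approval": approval_layer,
    "hook": hook_layer,
}


def evaluate(call: ToolCall, disabled: frozenset = frozenset()) -> List[str]:
    """Run every enabled layer and return the names of the layers that blocked."""
    return [
        name
        for name, layer in LAYERS.items()
        if name not in disabled and layer(call) is not None
    ]


risky = ToolCall("Bash", {"command": "git push prod main"})
print(evaluate(risky))                         # ['approval', 'hook']
print(evaluate(risky, disabled={"approval"}))  # ['hook']
```

Passing a layer's name in `disabled` simulates that layer being bypassed. The risky call is still blocked by the survivor, which is the property the table describes.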

2 Schema filtering beats runtime rejection

The layers are not interchangeable. The strongest tool restriction prevents the model from knowing a tool exists. A runtime check denies a forbidden call after the model forms the intent; schema filtering means the model never sees the tool, so it cannot form the intent in the first place. The attack surface shrinks before inference.

    # Three layers in one agent definition, independent of each other
    tools: [Bash, Read]                          # schema: Write/Edit not even visible
    disallowed_tools: [Write, Edit, GitCommit]
    hooks:
      pre_tool: .claude/hooks/block-prod-commands.sh   # deterministic, below the prompt

Even if an injection bypasses the prompt guardrail, the hook still blocks production-targeted commands; even if both the prompt and the hook are somehow circumvented, schema filtering means the agent cannot commit — the tool was never on its menu.
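The difference between the two restrictions can be sketched in a few lines. This is an illustration under assumptions, not any framework's real API: `ALL_TOOLS`, `visible_schema`, `runtime_reject`, and `dispatch` are hypothetical names, and the tool descriptions are placeholders.

```python
from typing import Dict, Set

# Hypothetical full tool registry: name -> description shown to the model.
ALL_TOOLS: Dict[str, str] = {
    "Read": "Read a file",
    "Bash": "Run a shell command",
    "Write": "Write a file",
    "GitCommit": "Create a commit",
}


def visible_schema(allowed: Set[str]) -> Dict[str, str]:
    """Schema filtering: only allowed tools are ever described to the model."""
    return {name: desc for name, desc in ALL_TOOLS.items() if name in allowed}


def runtime_reject(tool: str, disallowed: Set[str]) -> bool:
    """Runtime rejection: the model already asked for the tool; deny it afterwards."""
    return tool in disallowed


def dispatch(tool: str, schema: Dict[str, str]) -> str:
    """Only tools present in the filtered schema can be dispatched at all."""
    if tool not in schema:
        raise LookupError(f"unknown tool: {tool}")
    return f"ran {tool}"


schema = visible_schema({"Bash", "Read"})
print(sorted(schema))                                              # ['Bash', 'Read']
print(runtime_reject("GitCommit", {"Write", "Edit", "GitCommit"}))  # True
```

With filtering, `GitCommit` never reaches the prompt, so there is nothing for an injection to steer the model toward. With rejection alone, the model can still plan around the tool and the deny check has to be right every time.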

3 Layers cost — apply them in proportion

Depth is not free. Each layer adds configuration, testing, and latency, and a misconfigured layer can block legitimate work or create false confidence while doing nothing.

Approval fatigue compounds across layers

If every layer raises its own prompts, users learn to approve everything just to keep moving, and the stack becomes security theater that undermines the very layers meant to protect them. The fix is approval persistence: classify safe patterns once, grant blanket permission for them, and reserve prompts for genuinely dangerous operations.
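One way approval persistence might look in code is a small store that remembers approved patterns and keeps a separate list that always prompts. This is a sketch under assumptions: the `ApprovalStore` class, its method names, and the glob patterns are hypothetical, and glob matching via the standard library's `fnmatch` is just one possible way to express a "pattern".

```python
from fnmatch import fnmatchcase
from typing import Iterable, Set, Tuple


class ApprovalStore:
    """Remembers approved command patterns so safe work is only prompted for once."""

    def __init__(self, always_prompt: Iterable[str]) -> None:
        # Patterns for genuinely dangerous operations: never covered by blanket approval.
        self._always_prompt: Tuple[str, ...] = tuple(always_prompt)
        self._approved: Set[str] = set()

    def remember(self, pattern: str) -> None:
        """Persist a blanket approval for every command matching the glob pattern."""
        self._approved.add(pattern)

    def needs_prompt(self, command: str) -> bool:
        if any(fnmatchcase(command, p) for p in self._always_prompt):
            return True  # dangerous patterns win over any stored approval
        return not any(fnmatchcase(command, p) for p in self._approved)


store = ApprovalStore(always_prompt=["rm -rf *", "git push*"])
print(store.needs_prompt("git status"))          # True: first time, ask
store.remember("git status*")
print(store.needs_prompt("git status --short"))  # False: covered by the stored pattern
print(store.needs_prompt("git push origin main"))  # True: always asks
```

Checking the always-prompt list first is the design choice that matters: even an overly broad stored approval cannot silence the prompts reserved for dangerous operations.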

Apply the full five-layer stack to production agents with write access, external integrations, or multi-agent pipelines. For short-lived, read-only, or sandboxed internal tools, one or two targeted layers — schema restrictions plus a lifecycle hook — often deliver sufficient protection at far lower cost.
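As a rough rule of thumb, that guidance can be written down as a function of an agent's risk profile. The sketch below is only one coarse reading of it: `AgentProfile`, `layers_for`, and the layer identifiers are hypothetical names, and a real decision would weigh more factors than four booleans.

```python
from dataclasses import dataclass
from typing import Tuple

FULL_STACK: Tuple[str, ...] = (
    "prompt_guardrails",
    "schema_restrictions",
    "runtime_approvals",
    "tool_validation",
    "lifecycle_hooks",
)
# The lighter pairing suggested for short-lived, read-only, or sandboxed tools.
MINIMAL_STACK: Tuple[str, ...] = ("schema_restrictions", "lifecycle_hooks")


@dataclass(frozen=True)
class AgentProfile:
    production: bool
    write_access: bool = False
    external_integrations: bool = False
    multi_agent: bool = False


def layers_for(profile: AgentProfile) -> Tuple[str, ...]:
    """Full stack for production agents with any high-risk capability, else minimal."""
    high_risk = profile.production and (
        profile.write_access or profile.external_integrations or profile.multi_agent
    )
    return FULL_STACK if high_risk else MINIMAL_STACK


print(len(layers_for(AgentProfile(production=True, write_access=True))))  # 5
print(layers_for(AgentProfile(production=False)))
# ('schema_restrictions', 'lifecycle_hooks')
```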

↪ Your win: build a stack, not a wall

Retrieval practice — recall, don't peek

Question 1: The premise of defense-in-depth is that…

Question 2: Schema-level tool filtering is stronger than runtime rejection because…

Question 3: For independent layers to work, they should sit at…

Question 4: Approval fatigue across layers is best mitigated by…

Question 5 (spaced recall from Lesson 6): Under plan-then-execute, untrusted page content can…

Next in Part 4: The Blast Radius, on how least privilege bounds what any one layer failure can damage.