Security · ~7 min
Every control you've learned can be bypassed under a determined attack. The fix isn't a better layer — it's enough independent layers that the survivors catch what the failures missed.
Perplexity's response to NIST's agent-security RFI puts the premise bluntly: "the non-deterministic nature of LLM reasoning ensures that any individual defense can be circumvented under sufficiently adaptive attack strategies." If any single layer can eventually be defeated, the only durable answer is layering: multiple independent mechanisms, each catching what the others miss.
The OPENDEV agent ships five independent safety layers, each operating at a different point in the stack — so the failure of one does not compromise the others:
| Layer | Where it acts | What it catches |
|---|---|---|
| Prompt guardrails | System prompt | Naive misuse — bypassed by injection |
| Schema restrictions | Tool registry | Calls to tools the subagent can't see |
| Runtime approvals | Before dangerous ops | What automation should not auto-run |
| Tool validation | Input checking | Valid tool, malformed/hostile inputs |
| Lifecycle hooks | Pre-tool execution | Anything the prompt missed, deterministically |
The layers are not interchangeable. The strongest tool restriction prevents the model from knowing a tool exists. A runtime check denies a forbidden call after the model forms the intent; schema filtering means the model never sees the tool, so it cannot form the intent in the first place. The attack surface shrinks before inference.
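The difference between the two mechanisms can be sketched in a few lines. This is a minimal illustration, not OPENDEV's implementation: the registry, the tool names (`read_file`, `run_tests`, `git_commit`), and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Hypothetical registry; the tool names are illustrative only.
REGISTRY = {
    "read_file": Tool("read_file", "Read a file from the workspace"),
    "run_tests": Tool("run_tests", "Run the test suite"),
    "git_commit": Tool("git_commit", "Commit staged changes"),
}


def visible_schema(allowed: set) -> list:
    """Schema filtering: build the tool list the model is actually shown."""
    return [
        {"name": tool.name, "description": tool.description}
        for tool in REGISTRY.values()
        if tool.name in allowed
    ]


def runtime_check(tool_name: str, allowed: set) -> bool:
    """Runtime rejection: the call already exists; allow or deny it after the fact."""
    return tool_name in allowed


# A read-only subagent never sees git_commit in its schema at all.
reviewer_allowed = {"read_file", "run_tests"}
schema = visible_schema(reviewer_allowed)
```

With `schema` passed to the model, `git_commit` is absent from its menu, so the runtime check becomes a backstop rather than the first line of defense.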
Even if an injection bypasses the prompt guardrail, the hook still blocks production-targeted commands; even if both the prompt and the hook are somehow circumvented, the agent still cannot commit, because schema filtering kept the commit tool off its menu in the first place.
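The fall-through behavior described here can be sketched as a chain of independent checks. Everything in this example is an assumption made for illustration: the tool names, the `prod` pattern, and the layer functions are hypothetical, and the schema layer is modeled as a dispatcher that simply has no entry for a hidden tool.

```python
import re
from typing import Callable, Optional

# A layer returns a reason string to block the call, or None to let it pass.
Layer = Callable[[dict], Optional[str]]

VISIBLE_TOOLS = {"read_file", "run_shell"}  # git_commit is deliberately absent
PROD_PATTERN = re.compile(r"\bprod(uction)?\b", re.IGNORECASE)


def schema_layer(call: dict) -> Optional[str]:
    """Reject calls to tools that were never exposed to this agent."""
    if call["tool"] not in VISIBLE_TOOLS:
        return f"schema: unknown tool {call['tool']!r}"
    return None


def hook_layer(call: dict) -> Optional[str]:
    """Deterministic pre-tool hook: block production-targeted commands."""
    command = str(call["args"].get("command", ""))
    if PROD_PATTERN.search(command):
        return "hook: production-targeted command"
    return None


def evaluate(call: dict, layers: list) -> Optional[str]:
    """Run the call through each layer; the first block wins."""
    for layer in layers:
        reason = layer(call)
        if reason is not None:
            return reason
    return None


# Case 1: the prompt guardrail is assumed bypassed; the hook still fires.
injected = {"tool": "run_shell", "args": {"command": "deploy --env production"}}
blocked_by_hook = evaluate(injected, [schema_layer, hook_layer])

# Case 2: the hook is assumed disabled too; the schema layer still holds.
commit = {"tool": "git_commit", "args": {"message": "injected change"}}
blocked_by_schema = evaluate(commit, [schema_layer])
```

Because each layer inspects the call on its own terms, removing one from the list does not weaken the others.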
Depth is not free. Each layer adds configuration, testing, and latency, and a misconfigured layer can block legitimate work or create false confidence while doing nothing.
If every layer raises its own prompts, users learn to approve everything just to keep moving, and the stack becomes security theater. The fix is approval persistence: classify safe patterns once, grant standing permission for them, and reserve prompts for genuinely dangerous operations, so that a prompt remains a signal worth reading rather than a reflex to dismiss.
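One way approval persistence might look in code is sketched below. The `ApprovalStore` class, its method names, and the example command patterns are all hypothetical; the only library call used is `fnmatch.fnmatchcase` from the Python standard library.

```python
import fnmatch


class ApprovalStore:
    """Remembers approved command patterns so safe operations stop prompting."""

    def __init__(self, always_ask=()):
        self._approved = set()
        # Patterns that must prompt every time, whatever was approved earlier.
        self._always_ask = tuple(always_ask)

    def remember(self, pattern: str) -> None:
        """Persist a blanket approval for commands matching the pattern."""
        self._approved.add(pattern)

    def needs_prompt(self, command: str) -> bool:
        """Return True when the user should be asked before running the command."""
        if any(fnmatch.fnmatchcase(command, p) for p in self._always_ask):
            return True
        return not any(fnmatch.fnmatchcase(command, p) for p in self._approved)


store = ApprovalStore(always_ask=("git push*", "rm -rf *"))
store.remember("git status*")  # approved once, silent afterwards
store.remember("git *")        # even a broad approval cannot silence always_ask
```

Checking the `always_ask` list first is the design choice that matters: a careless blanket approval cannot quietly cover a dangerous operation.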
Apply the full five-layer stack to production agents with write access, external integrations, or multi-agent pipelines. For short-lived, read-only, or sandboxed internal tools, one or two targeted layers — schema restrictions plus a lifecycle hook — often deliver sufficient protection at far lower cost.
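As a rough sketch, this guidance could be encoded as a configuration helper. The `AgentProfile` fields and the `select_layers` rule are assumptions that simplify the criteria above; a real deployment would weigh more factors than four booleans.

```python
from dataclasses import dataclass

ALL_LAYERS = (
    "prompt_guardrails",
    "schema_restrictions",
    "runtime_approvals",
    "tool_validation",
    "lifecycle_hooks",
)

# The lighter pairing suggested for low-risk tools.
MINIMAL_LAYERS = ("schema_restrictions", "lifecycle_hooks")


@dataclass(frozen=True)
class AgentProfile:
    production: bool = False
    write_access: bool = False
    external_integrations: bool = False
    multi_agent: bool = False


def select_layers(profile: AgentProfile) -> tuple:
    """Pick the full stack for high-risk production agents, else the minimal pair."""
    high_risk = profile.production and (
        profile.write_access
        or profile.external_integrations
        or profile.multi_agent
    )
    return ALL_LAYERS if high_risk else MINIMAL_LAYERS


deploy_bot = AgentProfile(production=True, write_access=True)
log_viewer = AgentProfile()  # short-lived, read-only internal tool
```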
Retrieval practice — recall, don't peek
Question 1: The premise of defense-in-depth is that…
Question 2: Schema-level tool filtering is stronger than runtime rejection because…
Question 3: For independent layers to work, they should sit at…
Question 4: Approval fatigue across layers is best mitigated by…
Question 5 · spaced recall from Lesson 6: Under plan-then-execute, untrusted page content can…