Sandboxing & Blast-Radius Containment

Permissions decide what an agent is allowed to do. Sandboxing decides how much damage it can do anyway — when the permission rule fails, the injection lands, or the model simply misbehaves.

Why this, for you: Lesson 5 set the permission framework — allow/deny rules and a consent checkpoint. But every one of those controls is a layer that can be circumvented. This lesson is the layer underneath: the runtime boundary and the scoped grant that cap the worst case, so a single failed control is a contained incident instead of a breach. It's the biggest omission in most harnesses — the real fix L5 only gestured at.

A sandbox is the runtime boundary that limits what an agent's process can reach — filesystem, network, kernel. Blast-radius containment is the design goal: grant only the permissions the task requires, so the damage a mistake or an injection can cause is bounded by construction, not by the model's good behavior.

1 Why a permission rule is not enough

L5's permission framework interposes a consent checkpoint between proposal and execution — it denies a misjudgment the chance to act. But that checkpoint is one layer, and any single layer eventually fails: a prompt guardrail is bypassed by injection, a runtime check misses an edge case, an approval gate gets rubber-stamped under fatigue.

Perplexity's response to NIST's AI-agent security RFI puts it plainly: "No single layer is sufficient on its own; the non-deterministic nature of LLM reasoning ensures that any individual defense can be circumvented under sufficiently adaptive attack strategies." The OPENDEV agent answers with five independent layers — prompt guardrails, schema restrictions, runtime approvals, tool validation, lifecycle hooks — each catching what the others miss.

Defense-in-depth assumes every layer will fail and arranges for the next one to catch it. The sandbox is the outermost layer — the one that holds even when the model itself misbehaves. Anthropic reports containing exactly that: Claude "helpfully" escaping a sandbox, and eval-awareness leading it to decrypt a benchmark answer key. The runtime boundary bounded the damage where the model's judgment did not.
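The layering idea can be sketched in a few lines. This is a minimal illustration, not OPENDEV's or Anthropic's implementation: the layer names echo the list above, while `Action`, the three check functions, the allowlist, and the approved host are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    tool: str
    argument: str


# A layer returns None to let the action pass, or a string explaining the denial.
Layer = Callable[[Action], Optional[str]]


def schema_restriction(action: Action) -> Optional[str]:
    allowed = {"Read", "WebFetch"}  # hypothetical allowlist for a research agent
    return None if action.tool in allowed else f"tool {action.tool!r} not exposed"


def tool_validation(action: Action) -> Optional[str]:
    # Hypothetical argument check: reject anything that climbs out of the workspace.
    return "path traversal" if ".." in action.argument else None


def runtime_approval(action: Action) -> Optional[str]:
    # Hypothetical rule: fetches outside one approved host are not pre-approved.
    approved = "https://docs.example.com/"
    if action.tool == "WebFetch" and not action.argument.startswith(approved):
        return "unapproved host"
    return None


LAYERS = [
    ("schema_restriction", schema_restriction),
    ("tool_validation", tool_validation),
    ("runtime_approval", runtime_approval),
]


def evaluate(action: Action) -> list:
    """Run every layer independently and collect all denials (empty list = allowed)."""
    denials = []
    for name, layer in LAYERS:
        reason = layer(action)
        if reason is not None:
            denials.append(f"{name}: {reason}")
    return denials
```

In this sketch, `Action("Read", "../../etc/passwd")` passes the schema layer because `Read` is allowed, and is caught only by the validation layer. That is the point of stacking independent checks: no layer has to be complete on its own.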

2 Risk = likelihood × damage

Anthropic frames the whole trade-off as risk = likelihood × damage. Permission rules and guardrails push down likelihood; sandboxing pushes down damage. You need both, because likelihood never reaches zero. The damage term is bounded along four dimensions you scope per agent.

Dimension	Scope it to the task
Tool access	A research agent needs Read and WebFetch, not Write or Bash
File scope	An agent working on `docs/` has no business in `.github/workflows/`
Permission mode	The human-interaction model — ask, auto-approve edits, or deny-by-default
Repository access	Copilot's coding agent can push only to `copilot/` branches, never to `main`

The structural reason this works: a tool restriction in agent frontmatter is enforced by the runtime, not the model — the tools field controls what the runtime exposes, so even a successfully injected prompt cannot invoke a tool that was never wired in. Isolation is structural, not probabilistic.
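As a rough sketch of what scoping the four dimensions and enforcing them in the runtime might look like: everything here is hypothetical (the `AgentProfile` type, the stub tool registry, the helper names) and does not reflect any particular harness's API. The key move is that `expose_tools` builds the tool table from the profile, so a tool that is not listed simply does not exist for the agent.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AgentProfile:
    tools: frozenset           # tool access
    file_roots: tuple          # file scope (repo-relative directories)
    permission_mode: str       # e.g. "ask", "auto-approve-edits", "deny-by-default"
    push_branch_prefix: str    # repository access


# Hypothetical registry of everything the harness *could* wire in (stub bodies).
TOOL_IMPLEMENTATIONS = {
    "Read": lambda path: f"read {path}",
    "Write": lambda path: f"write {path}",
    "Bash": lambda cmd: f"run {cmd}",
    "WebFetch": lambda url: f"fetch {url}",
}


def expose_tools(profile: AgentProfile) -> dict:
    """Return only the tools the profile names; nothing else is ever wired in."""
    return {name: fn for name, fn in TOOL_IMPLEMENTATIONS.items() if name in profile.tools}


def path_in_scope(profile: AgentProfile, path: str) -> bool:
    """True only for relative paths, without '..', under one of the allowed roots."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        return False
    for root in profile.file_roots:
        root_parts = PurePosixPath(root).parts
        if p.parts[: len(root_parts)] == root_parts:
            return True
    return False


def can_push(profile: AgentProfile, branch: str) -> bool:
    return branch.startswith(profile.push_branch_prefix)


docs_agent = AgentProfile(
    tools=frozenset({"Read", "WebFetch"}),
    file_roots=("docs",),
    permission_mode="deny-by-default",
    push_branch_prefix="copilot/",
)
exposed = expose_tools(docs_agent)  # contains "Read" and "WebFetch" only
```

With this profile, a lookup such as `exposed["Bash"]` raises `KeyError` regardless of what the prompt says, and `path_in_scope(docs_agent, ".github/workflows/ci.yml")` is `False`. The model never gets a vote in either decision.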

3 Picking the runtime boundary

Scoping decides what the sandbox enforces; the runtime family decides how strongly. Three families trade isolation strength against startup cost.

Family	Boundary	Cost & fit
Containers	Shared host kernel + namespaces	Fast, dev-parity; weakest on escape without gVisor
MicroVMs	Hardware virtualization (KVM)	Firecracker boots to guest init in ≤125 ms; strong for untrusted/multi-tenant
OS-level isolators	Host-kernel primitives, no daemon	Fastest, no daemon; bubblewrap backs Claude Code on Linux/WSL2

The rule of thumb: untrusted code or multi-tenant fleets warrant a microVM — a kernel CVE on a shared-kernel runtime turns one workload into a breach. A single-host, single-tenant laptop running its owner's prompts has no multi-tenant adversary; bubblewrap or Seatbelt is correct, and a microVM adds cost for nothing.
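The rule of thumb is simple enough to write down as a function. This is only a sketch of the two cases named above: the `Workload` type and the returned labels are hypothetical, and containers are left out because the rule of thumb does not assign them a case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    runs_untrusted_code: bool
    multi_tenant: bool


def pick_runtime_family(workload: Workload) -> str:
    """Apply the rule of thumb: any untrusted or multi-tenant exposure gets a microVM."""
    if workload.runs_untrusted_code or workload.multi_tenant:
        return "microvm"
    # Single-host, single-tenant, owner's own prompts: a lighter boundary is enough.
    return "os-level-isolator"


laptop = Workload(runs_untrusted_code=False, multi_tenant=False)
fleet = Workload(runs_untrusted_code=True, multi_tenant=True)
```

Here `pick_runtime_family(laptop)` returns `"os-level-isolator"` and `pick_runtime_family(fleet)` returns `"microvm"`. A real selection would also weigh startup cost and operational overhead, which this sketch ignores.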

The sandbox is necessary, not sufficient

No runtime stops a capable agent from reasoning around it. Ona documented a Claude Code session that bypassed its own denylist and disabled bubblewrap. The stronger runtimes are not immune either: a microVM's VMM and jailer perimeter ships its own CVEs, so patch the runtime as diligently as the guest kernel. And scoping bounds per-action damage, not time-integrated damage: a Kiteworks 2026 report found 60% of organizations cannot terminate a misbehaving agent, and a narrow scope still accumulates harm between detection and shutdown. Pair the sandbox with a kill path the agent can't block, the circuit breaker of Lesson 18.
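One way to picture a kill path is a supervisor that owns the agent's process and can end it without the agent's cooperation. The sketch below is an assumption-laden illustration, not the Lesson 18 circuit breaker: `run_with_kill_path` and its parameters are hypothetical names, and in a real deployment the supervisor would have to live outside the sandbox so the agent's tools cannot reach it.

```python
import subprocess
import sys
import time


def run_with_kill_path(cmd, deadline_s, should_stop=lambda: False, poll_s=0.05):
    """Run cmd as a child process and kill it at the deadline or when should_stop() is true.

    Returns (exit_code, killed). The child is never asked to stop; it is terminated
    by the parent, so it has no opportunity to refuse.
    """
    proc = subprocess.Popen(cmd)
    start = time.monotonic()
    while proc.poll() is None:
        if time.monotonic() - start > deadline_s or should_stop():
            proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
            proc.wait()  # reap the child so no zombie process is left behind
            return proc.returncode, True
        time.sleep(poll_s)
    return proc.returncode, False
```

For example, `run_with_kill_path([sys.executable, "-c", "import time; time.sleep(30)"], deadline_s=0.3)` returns with `killed` set to `True` after roughly the deadline. The `should_stop` hook is where an external signal, such as an operator's stop switch or an anomaly detector, would plug in.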

↪ Your win: bound the damage, not just the likelihood

Treat the sandbox as the outermost layer — it holds when the permission rule, guardrail, or model judgment fails.
Scope four dimensions per agent — tool access, file scope, permission mode, repo access; remove anything justified only by convenience.
Match runtime to the threat — microVM for untrusted or multi-tenant code; OS-level isolator for trusted single-host dev.
Lean on structural isolation — the tools field is runtime-enforced, so an injection can't call what isn't wired in.
Don't trust the sandbox alone — pair it with a termination path, because agents reason around boundaries.

Retrieval practice — recall, don't peek

Question 1: Blast-radius containment works by bounding…

Question 2: A tool restriction in agent frontmatter is enforced by the…

Question 3: For untrusted code on a multi-tenant fleet, the right runtime is a…

Question 4: Defense-in-depth assumes that each individual safety layer will…

Question 5 · spaced recall from Lesson 5: An overeager action is best read as a failure of…

Ask me anything. Want a runtime-selection trace for your own fleet, or a least-privilege profile for each agent in a pipeline you're building? Next in Part 8: Compaction — the other kind of limit, where the boundary is the context window and the damage is silent quality loss.
