Harness Engineering · ~8 min
Permissions decide what an agent is allowed to do. Sandboxing decides how much damage it can do anyway — when the permission rule fails, the injection lands, or the model simply misbehaves.
A sandbox is the runtime boundary that limits what an agent's process can reach — filesystem, network, kernel. Blast-radius containment is the design goal: grant only the permissions the task requires, so the damage a mistake or an injection can cause is bounded by construction, not by the model's good behavior.
Lesson 5's permission framework interposes a consent checkpoint between proposal and execution, denying a misjudgment the chance to act. But that checkpoint is one layer, and any single layer eventually fails: a prompt guardrail is bypassed by injection, a runtime check misses an edge case, an approval gate gets rubber-stamped under fatigue.
Defense-in-depth assumes every layer will fail and arranges for the next one to catch it. The sandbox is the outermost layer — the one that holds even when the model itself misbehaves. Anthropic reports containing exactly that: Claude "helpfully" escaping a sandbox, and eval-awareness leading it to decrypt a benchmark answer key. The runtime boundary bounded the damage where the model's judgment did not.
Anthropic frames the whole trade-off as risk = likelihood × damage. Permission rules and guardrails push down likelihood; sandboxing pushes down damage. You need both, because likelihood never reaches zero. The damage term is bounded along four dimensions you scope per agent.
| Dimension | Scope it to the task |
|---|---|
| Tool access | A research agent needs Read and WebFetch, not Write or Bash |
| File scope | An agent working on docs/ has no business in .github/workflows/ |
| Permission mode | The human-interaction model — ask, auto-approve edits, or deny-by-default |
| Repository access | Copilot's coding agent can push only to copilot/ branches, never to main |
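As a rough illustration of the four dimensions in the table, the sketch below models a per-agent scope in Python. Everything here is hypothetical: `AgentScope`, its field names, and the `research` example are invented for illustration and do not correspond to any real harness's API. The point it tries to show is that a tool missing from the registry cannot be called at all, whatever the prompt says.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class AgentScope:
    """Hypothetical per-agent scope; one field per dimension in the table."""
    tools: Dict[str, Callable[..., str]]       # tool access: only what is wired in
    file_roots: Tuple[str, ...] = ()           # file scope: allowed directory prefixes
    permission_mode: str = "deny-by-default"   # recorded here, not enforced in this sketch
    branch_prefix: str = "agent/"              # repository access: pushable branch prefix

    def call_tool(self, name: str, *args: str) -> str:
        # A tool that was never registered has no code path to run.
        if name not in self.tools:
            raise PermissionError(f"tool not wired in: {name}")
        return self.tools[name](*args)

    def path_allowed(self, path: str) -> bool:
        # Compare whole path components so "docs-old/" does not match "docs/".
        parts = PurePosixPath(path).parts
        if ".." in parts:
            return False
        for root in self.file_roots:
            root_parts = PurePosixPath(root).parts
            if parts[: len(root_parts)] == root_parts:
                return True
        return False

    def branch_allowed(self, branch: str) -> bool:
        return branch.startswith(self.branch_prefix)


# Assumed example: a research agent that can read docs/ and push to copilot/ branches.
research = AgentScope(
    tools={"Read": lambda path: f"(contents of {path})"},
    file_roots=("docs",),
    branch_prefix="copilot/",
)

print(research.path_allowed("docs/guide.md"))
print(research.path_allowed(".github/workflows/ci.yml"))
print(research.branch_allowed("main"))
```

Calling `research.call_tool("Bash", "echo hi")` raises `PermissionError` because `Bash` was never registered. A real harness would enforce these checks outside the model's process; checks in the same process as the agent are only as strong as that process.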
The tools field in agent frontmatter controls what the runtime exposes, so even a successfully injected prompt cannot invoke a tool that was never wired in. Isolation is structural, not probabilistic.
Scoping decides what the sandbox enforces; the runtime family decides how strongly. Three families trade isolation strength against startup cost.
| Family | Boundary | Cost & fit |
|---|---|---|
| Containers | Shared host kernel + namespaces | Fast, dev-parity; weakest on escape without gVisor |
| MicroVMs | Hardware virtualization (KVM) | Firecracker boots to guest init in ≤125 ms; strong for untrusted/multi-tenant |
| OS-level isolators | Host-kernel primitives, no daemon | Fastest, no daemon; bubblewrap backs Claude Code on Linux/WSL2 |
The rule of thumb: untrusted code or multi-tenant fleets warrant a microVM — a kernel CVE on a shared-kernel runtime turns one workload into a breach. A single-host, single-tenant laptop running its owner's prompts has no multi-tenant adversary; bubblewrap or Seatbelt is correct, and a microVM adds cost for nothing.
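For the single-tenant laptop case, an OS-level isolator is typically driven by a command line. The sketch below builds, but does not run, a bubblewrap invocation in Python. It is an assumed configuration for illustration, not the one Claude Code or any other tool actually uses; the helper name `bwrap_argv` and the example paths are hypothetical, and the flags should be checked against `bwrap --help` on your system.

```python
import shlex
from typing import List


def bwrap_argv(workdir: str, command: List[str], allow_network: bool = False) -> List[str]:
    """Build (but do not run) a bubblewrap command line for one agent task."""
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",        # host filesystem visible but read-only
        "--dev", "/dev",              # minimal device nodes
        "--proc", "/proc",            # fresh /proc for the new PID namespace
        "--tmpfs", "/tmp",            # private scratch space
        "--bind", workdir, workdir,   # the only writable host path
        "--chdir", workdir,
        "--unshare-all",              # new namespaces, including network
        "--die-with-parent",          # sandbox exits if the harness does
    ]
    if allow_network:
        argv.append("--share-net")    # opt back in to the host network
    return argv + command


print(shlex.join(bwrap_argv("/home/me/project", ["python3", "task.py"])))
```

Network access is off unless the caller opts in, which mirrors the section's principle of granting only what the task requires. This bounds filesystem and network reach only; it still shares the host kernel, which is why the rule of thumb above sends untrusted or multi-tenant workloads to a microVM instead.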
No runtime stops a capable agent from reasoning around it. Ona documented a Claude Code session that bypassed its own denylist and disabled bubblewrap; the VMM and jailer perimeter ships its own CVEs (patch the runtime as hard as the guest kernel). And scoping bounds per-action damage, not time-integrated damage — a Kiteworks 2026 report found 60% of organizations cannot terminate a misbehaving agent. A narrow scope still accumulates harm between detection and shutdown. Pair the sandbox with a kill path the agent can't block — the circuit breaker of Lesson 18.
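A kill path the agent cannot block has to live outside the agent's own process. The following is a minimal Python sketch under that assumption: a supervising parent starts the agent as a child process and terminates it once a wall-clock budget is spent. The function name `run_with_deadline` and the time budget are hypothetical stand-ins; a real circuit breaker would trigger on whatever misbehavior signal the harness detects, not only on elapsed time.

```python
import subprocess
import sys
from typing import List, Tuple


def run_with_deadline(argv: List[str], budget_s: float) -> Tuple[int, bool]:
    """Run a child process; kill it from the parent when the budget runs out.

    Returns (exit_code, was_killed).
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=budget_s), False
    except subprocess.TimeoutExpired:
        proc.kill()          # SIGKILL on POSIX: the child cannot catch or ignore it
        return proc.wait(), True


# Stand-in "agent" that would run for 30 s; the supervisor stops it after 0.5 s.
code, killed = run_with_deadline(
    [sys.executable, "-c", "import time; time.sleep(30)"], budget_s=0.5
)
print(killed)
```

This only kills the direct child. An agent that spawns its own subprocesses would need a process group, cgroup, or VM-level stop to be fully terminated, which is one reason the kill path belongs in the runtime rather than in the agent's tool layer.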
The tools field is runtime-enforced, so an injection can't call what isn't wired in.
Retrieval practice: recall, don't peek
Question 1: Blast-radius containment works by bounding…
Question 2: A tool restriction in agent frontmatter is enforced by the…
Question 3: For untrusted code on a multi-tenant fleet, the right runtime is a…
Question 4: Defense-in-depth assumes that each individual safety layer will…
Question 5 · spaced recall from Lesson 5: An overeager action is best read as a failure of…