The working vocabulary for the Security course. Once a term lives here, every lesson uses this word for it. Grows as we go.
The Threat Model
Lethal trifecta
Three capabilities that are harmless alone but exploitable together on one execution path: private data,
untrusted input, and external egress. Remove any one leg and the exfiltration path closes.
Avoid: "the agent is risky" — name which legs are present on which path.
The unit of a trifecta audit. One agent runs several paths; only a path holding all three legs is unsafe. Audit
per path, not per agent — a three-"Yes" path demands an architectural fix, not a prompt revision.
Avoid: auditing "the agent" as one undifferentiated unit.
An attack where malicious instructions in content the agent reads are followed as if they came from the user or
system prompt. Direct = the user types it; indirect = it rides in on retrieved content.
Avoid: "jailbreak" — injection is about provenance confusion, not safety-training bypass.
Injection where the payload arrives through content the agent retrieves itself — a page, repo file, MCP response,
PDF, or dependency metadata. The surface most developers forget, because clean-environment testing never exercises it.
The reason injection works: transformer attention processes every token uniformly, with no architectural channel
marking origin. Injected instructions share the same token space as legitimate ones and carry no origin metadata.
Avoid: "the model got confused" — the model has no information that could distinguish the sources.
A repository-based vector: malicious instructions in auto-processed config files (CLAUDE.md,
.cursorrules, .github/copilot-instructions.md) that bypass user review when a repo opens.
Supplying secrets as environment variables set before the agent starts — inherited by child processes but never
transmitted as text through a tool call. The default secrets pattern; never paste a key into a prompt.
Avoid: "store the key safely" — the point is the key never enters context at all.
A script that consumes a credential internally and returns only its output. The agent calls it by name with
intent; the raw secret never appears in the tool input or context window.
Enforcing filesystem and network isolation simultaneously, at the OS level. Filesystem-only allows
network exfiltration; network-only allows filesystem-based privilege escalation. Neither boundary alone contains an agent.
Avoid: "sandboxed" when only one boundary is enforced — that's a leak waiting to happen.
The failure mode of granular per-action prompts: users click "approve" without reading — the illusion of
oversight with none of the substance. A dual-boundary sandbox replaces most prompts with a hard safe zone.
Avoid: treating an approval gate as oversight when users habituate to it.
Blast radius · least privilege, permission scoping
The bound on damage a compromised agent can do, set by the permissions you grant. Every unneeded permission is
attack surface. Scope tools, files, mode, and repo access; decompose broad agents into narrow chains.
A harness-enforced domain allow/deny list that rejects a connection before it leaves the process, regardless of
what the model produced. Moves egress out of the model's trust boundary. Denies must override allow wildcards.
Avoid: letting the model decide which URLs to reach — injection defeats that immediately.
Once the egress check lives in the harness, the matcher is the boundary — one parser bug bypasses every
policy (e.g. a SOCKS5 null-byte that passes endsWith() but truncates in getaddrinfo()).
Keep a lower-layer enforcement point that doesn't trust the parser.
Committing to a typed, task-specific program before any untrusted page is observed. Page content can
populate values inside the fixed graph but cannot synthesize new actions or redefine the task. The default for web agents.
Avoid: ReAct for web agents taking consequential actions over multi-party content.
The structural family behind plan-then-execute: a privileged channel carries control flow from the
trusted user task; a quarantined channel handles untrusted content with no authority to alter what runs.
The anti-pattern of an AI reviewer in CI/CD that ingests PR/issue text (untrusted) while holding repo-write
tokens and secrets (private data + egress) in one runtime — the lethal trifecta on every run. Fix: two-runtime separation.
Multiple independent safety mechanisms, each at a different level of the stack, so the failure of any one does
not compromise the others. Assumes every single layer eventually gets bypassed under adaptive attack.
Avoid: trusting one "strong" control — any individual defense can be circumvented.
Removing a tool from the model's schema so it never sees the tool exists — stronger than runtime rejection,
because the model cannot form the intent to call a tool it cannot see. The attack surface shrinks before inference.
An operation outside the user's authorised scope on a benign task — not injection, not escape, an authorisation
failure. Driven more by the permission framework than the base model: identical weights span 1.1%–27.7%.
Avoid: "tune the model" — pick the framework (ask-to-continue or deny-list) first.
The runtime that enforces a sandbox, traded along isolation-vs-startup-cost. Containers: fast,
kernel-shared. MicroVMs: hardware-isolated, ~125 ms boot — the pick for untrusted/multi-tenant code. OS-level
isolators (bubblewrap, Seatbelt): no daemon, fastest, weakest against escape.
A single policy evaluation point between agent and tool that intercepts every call — identity, tool name,
arguments, rate limits — and forwards or denies deterministically. ~27% prompt-only violations vs 0% at the app
layer. Sees only traffic that traverses it; off-protocol actions bypass it.
Avoid: tool-name policies without argument inspection — a pre-approved tool with hostile args is still RCE.
Replacing a long-lived API key with a short-lived token minted from a signed OIDC JWT the runtime already holds.
The federation rule's claim-match block becomes the security boundary; token lifetime is capped so it can't outlive
the upstream identity. Does not close the workload-attestation gap.
Avoid: leaving ANTHROPIC_API_KEY="" — empty still shadows federation; unset it.
A supply-chain attack: an LLM recommends a nonexistent package, an attacker pre-registers the name, the agent
installs malware. 43% of hallucinated names recur across re-runs, making them enumerable; 48.6% sit far from any
real name, so typosquat detectors miss them. Defense is install authority, not model behavior.
Leaking private data in the query string of a fetched URL — the request itself is the channel, before any
response is read. Domain allow-lists ask the wrong question; a public-web index gate asks whether the URL
could encode user-specific data. Covers query strings only, not DNS/timing/header channels.
Agent output executed, rendered, or interpreted by a downstream sink — shell, SQL, HTML renderer, file path,
package manager — without per-sink validation. Trust does not transfer through a string boundary; treat the model
as any other user and validate its output at each sink. The controls are old; the applicability surface is new.
Avoid: conflating it with LLM06 Excessive Agency — that bounds the agent's actions; LLM05 validates its output.
A persistence attack: one untrusted read plants a dormant instruction in long-term memory that activates sessions
later when the user raises a sensitive topic, exfiltrating data. Composes the lethal trifecta across sessions,
so a per-session audit passes each half and misses the pivot. Fix: user-only memory writes; deny tool-return sources.
Avoid: trusting single-session injection resistance to transfer — write-time review runs in a context lacking the trigger.
Relevance-authorization gap · multitenant RAG leak, ABAC-gated retrieval
Retrieval ranks by relevance, which carries no signal about who is asking; in a shared index the top-scoring chunk
for one tenant can belong to another. Fix the search space, not the output: pre-filter candidates by ABAC at
the index, then post-filter the top-K to catch ANN bypass. Authorize the record, not just the tool call.
Avoid: application-layer post-filtering alone — the vector DB already spent its top-K budget on forbidden chunks.
Resource and cost exhaustion as a threat. An LLM call's cost is variable and attacker-influenceable, priced
linearly, so requests-per-second does not bind dollars-per-second. One control surface serves two owners — DoS
(availability) and denial-of-wallet (finance) — and the bill drains while availability metrics stay healthy.
Avoid: a single fixed RPS limit — it misses low-and-slow wallet attacks and breaks legitimate bursty workflows.
The complementary controls for unbounded consumption, each keying on a different unit of cost: per-call token
cap (output size), per-task iteration cap (loop depth), fan-out concurrency cap (parallel
breadth), cost-velocity breaker (rolling $/min), and per-day dollar budget (absolute ceiling).
Tuple-key on (user, repo, model); keep deterministic counters in the enforcement path, classifiers in detection only.