The named failure modes of agent systems. Once a term lives here, every lesson uses this word for it — because the name carries the fix. Grouped by the course's four parts.
Loading as much as possible into the prompt on the theory that more information can't hurt. It can: attention is finite even when the window is not, so unfocused context dilutes signal and degrades output. Fix by loading the smallest high-signal set on demand.
Mixing unrelated tasks in one long-running session. Each finished task leaves residue that fills the window and steers the current decision — the self-inflicted version of Infinite Context. Fix: one objective per session.
Avoid: "just /clear it" — for unrelated work, start a new session, not a clear.
Including instructions that are accurate and on-topic but inapplicable to the current task. Proximity in meaning creates interference, not safety: adjacent rules compete for attention and lower compliance with the rule that applies. Fix by loading task-scoped and pruning the rest.
Avoid: "comprehensive is safer" — relevance is not the inclusion test; "does it help this task?" is.
The six recurring defects in always-loaded context files — Lint Leakage, Context Bloat, Skill Leakage, Conflicting Instructions, Init Fossilization, Blind References — found in 91 of 100 popular repos. Five cut the signal-to-token ratio; one cuts resolvability.
An agent without verification and pushback instructions that executes every request without flagging problems — because flagging was never in scope. Fix with pre-task checks, in-task validation, and explicit stop conditions; reinforce with a separate reviewer.
Avoid: over-correcting into the cry-wolf agent that flags everything and gets ignored.
After context compression, the agent keeps working productively on a subtly wrong goal — a once-stated constraint dropped in summarisation, or initial instructions faded as history grows. Silent: it "completes" the wrong thing. Fix with a structured session_intent field, re-read before each action.
Avoid: "the agent would notice" — drift produces no internal signal.
A vague resource instruction ("preserve tokens", "be efficient") installs a competing objective that a long-horizon agent resolves by doing less work — skipping exploration, refusing ambitious tasks, stopping early. System-level constraints outrank the user task. Fix by reframing as quality targets ("be thorough").
Avoid: conflating with a quantified budget — a bounded TALE-style budget is safe; vague minimisation is not.
Accepting agent output as correct because it looks polished. Fluency is independent of accuracy; the agent is most dangerous when almost right. Fix by checking external ground truth — fetch URLs, run code, cross-reference docs — and automating what can be checked.
Avoid: over-verifying into verification theater (tests that miss the change) or alert fatigue.
Single-Layer Injection Defence · aliases: no defence-in-depth
Adding one mitigation (URL allow-listing, instruction hardening, or output filtering) and treating injection as solved. Each covers only its own vector; an attacker targets the gap. URL validation is not content validation. Fix with three independent layers: model-level resistance, infrastructure controls, product-level confirmation.
Avoid: "instruction hardening is enough" — it lowers rates but is not a hard boundary.
The default shape of an AI reviewer in CI/CD: it ingests attacker-writable PR/issue text while the same runtime holds repo-write tokens and pipeline secrets — the lethal trifecta on every run. One malicious PR title exfiltrates secrets (a vendor-confirmed CVSS 9.4). Fix by splitting a read-only reviewer from a separately-credentialed actor.
Avoid: "a better system prompt fixes it" — the model is not the gate; the fix is architectural.
Layering independent controls so no single bypass compromises the agent: model-level injection resistance, infrastructure-level egress controls, and product-level confirmation flows. The strongest controls are architectural — constraining what the model can do after reading untrusted input, not what it's told to do.
The three conditions that, present together in one runtime, make an agent exploitable: access to untrusted content, access to private data or secrets, and the ability to externally communicate. Breaking any one leg closes the attack.
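A static check of a runtime's capability surface against the three legs, applied to the CI/CD reviewer shape described above and to the read-only-reviewer / separately-credentialed-actor split that fixes it. This is an added example, not code from the course; the runtime model, capability names and the assumption that the reviewer returns only a structured verdict to the pipeline are constructed for illustration.

```python
# Added example: check one agent runtime for the lethal trifecta (untrusted
# content + private data/secrets + external communication) and break it by
# splitting the runtime. Runtime descriptions and capability names are constructed.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentRuntime:
    """One process/credential boundary in which an agent runs."""
    name: str
    untrusted_inputs: frozenset   # attacker-writable text the agent reads
    secrets: frozenset            # credentials or private data held in the same runtime
    egress: frozenset             # ways the agent can communicate outward


def trifecta_legs(rt: AgentRuntime) -> dict:
    """Which of the three legs this runtime has."""
    return {
        "untrusted_content": bool(rt.untrusted_inputs),
        "private_data": bool(rt.secrets),
        "external_comms": bool(rt.egress),
    }


def is_lethal(rt: AgentRuntime) -> bool:
    """All three legs in one runtime: exploitable by design, whatever the prompt says."""
    return all(trifecta_legs(rt).values())


def split_reviewer(rt: AgentRuntime) -> tuple:
    """Architectural fix: the reviewer keeps the untrusted content but holds no
    secrets and cannot talk out (its verdict is returned to the pipeline, assumed
    to be a structured approve/reject value); the actor holds the credentials and
    egress but never reads attacker-writable text."""
    reviewer = replace(rt, name=f"{rt.name}-reviewer",
                       secrets=frozenset(), egress=frozenset())
    actor = replace(rt, name=f"{rt.name}-actor", untrusted_inputs=frozenset())
    return reviewer, actor


# Constructed example: the default AI-reviewer-in-CI shape, all three legs in one runtime.
CI_REVIEWER = AgentRuntime(
    name="pr-review-bot",
    untrusted_inputs=frozenset({"pr_title", "pr_body", "issue_comment"}),
    secrets=frozenset({"repo_write_token", "pipeline_secrets"}),
    egress=frozenset({"http_fetch", "git_push"}),
)

if __name__ == "__main__":
    print(CI_REVIEWER.name, trifecta_legs(CI_REVIEWER), "lethal:", is_lethal(CI_REVIEWER))
    for part in split_reviewer(CI_REVIEWER):
        print(part.name, "lethal:", is_lethal(part))
```

Run the check over each runtime in a pipeline before deployment; any runtime that reports all three legs true needs the split, not a better system prompt.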
Transformer attention has no channel separating a system-prompt instruction from a PR title, web page, or tool result that just entered context. The agent treats all tokens uniformly — the mechanism behind every prompt-injection failure.
Avoid: "the model knows what's trusted" — it has no origin metadata to reason over.
The core competence of the course: reading an observed symptom (an agent that "finished" the wrong thing, a vague session, a green check that hid a failure) and naming the failure mode behind it — because the name carries the first fix. Diagnosis precedes treatment.