The discipline of designing what enters a model's context window, how it is structured, and what is excluded, to maximise output quality and reliability.
Avoid: "prompt engineering" (that is a sub-part — designing individual instructions within the context).
The figure of merit. Anthropic's framing: "the smallest set of high-signal tokens that maximize the likelihood of your desired outcome." Optimise for density, not volume.
Avoid: "more context is better", "fill the window".
Every token included is a token excluded. Context space is finite, so each inclusion displaces reasoning, instructions, or task-relevant content. Inclusion is exclusion.
Irrelevant context that accumulates and competes with relevant content for attention. Diagnostic: "Does this improve output on this specific task?" If no, it is pollution.
Avoid: "noise", "clutter" — pollution is the precise term.
One tier of the context stack, each with its own persistence: system prompt → project instructions → skill definitions → conversation history → tool outputs. Persistence and cost differ per layer.
Attention is strongest at the start and end of the window and weakest in the middle, regardless of content importance. Rules belong at the edges; reference material can sit in the middle.
Avoid: "the model ignores stuff" — the bias is positional and predictable.
The token length up to which a model actually performs the task well — often far below the advertised window (RULER: 16–50% of claimed). Degradation onset is closer to an absolute token threshold (~32K–100K) than a fixed percentage, and varies by task type.
Avoid: "the context window" when you mean the usable part.
Lossy summarisation of older turns to reclaim space. Auto-compaction fires at ~95% fill (Claude Code default); manual compaction (/compact) is triggered deliberately, before degradation sets in.
Avoid: "clearing" (that is /clear — discard, not summarise) and "truncation".
Information the agent cannot reach with its read/grep/glob tools — rationale ("why X over Y"),
constraints not encoded in code, domain rules, conventions present-but-unstated, out-of-band integrations.
The only kind that earns a place in an always-loaded instruction file.
Avoid: "documentation" — instruction files are a resource-allocation decision, not docs.
Anything the agent can obtain itself — directory trees, API signatures, dependency versions, config,
test patterns, visible conventions. Including it taxes every turn and creates a second source of truth that
goes stale. Belongs in the codebase, not the instruction file.
Replacing duplicated discoverable content with a path, not a copy — "use the repository pattern in
src/repos/." Gives the agent direction without a stale second copy.
Aggregate instruction load — not file count — drives degradation: doubling the rules in scope makes the
agent less likely to follow any one of them. The mechanism that makes discoverable bloat actively harmful.
The four sources instructions arrive from, outermost to most specific: system prompt → project
instructions (AGENTS.md/CLAUDE.md) → skill content → user message.
On a conflict, the layer closest to the task wins (user > skill > project > system). A behavioral
tendency, not a rule the model enforces — so contradictions across layers yield unpredictable output.
Avoid: "the model obeys the most important rule" — it's about position/specificity, not importance.
Claude Code's @path syntax, expanded verbatim at session start — equivalent to
concatenation. Position-bearing (an @file on line 1 lands in primacy) and silently broken if the
target moves.
A sub-agent starts fresh: it inherits none of the parent's project instructions, skills, or history unless
they are explicitly passed at invocation. Layering without an injection protocol = no project layer for sub-agents.
The cache-efficient context layout: static content (system prompt, tool definitions, project
instructions) first and unchanging, variable content (history, latest message) last. Determines
whether each turn pays ~10% or 100%.
Any change to the cached prefix that forces a full-price re-write: modifying tool definitions,
switching models, non-deterministic tool ordering, or injecting volatile state (timestamps, cwd)
into the prefix. Misses are silent — no error, just full billing.
Assemble static sections before any variable content so the byte-identical prefix matches the
cache. The same assembly-order lever as attention positioning, optimised for cost instead.
Avoid: conflating with attention order — same lever, different objective.
The ratio of task-relevant tokens to total tokens. Maximise it by cutting zero-density ceremony
(filler, boilerplate) while protecting high-density tokens (names, rationale, error messages) the
agent would otherwise reconstruct in reasoning.
Avoid: "make it shorter" — the goal is signal per token, not minimum length.
Applied to every line: "Can I remove a word — or this whole sentence — without losing a
constraint?" If yes, cut it. Convert prose to tables/bullets/rules.
Constraint violations peak at medium compression — best at the verbose and the crisp
ends, worst half-trimmed. Compress decisively to an unambiguous rule, or leave it verbose.
Replacing a processed tool output (file read, search, test log) with a one-line summary before
the next inference call — surgically removing single-use bulk while keeping the agent's decisions
and reasoning. Finer-grained than compaction.
Moving a large payload (file, API response) out of context to disk, leaving a reference + brief
summary the agent can re-read on demand. Recoverable, non-lossy — tier 1 of compression.
Avoid: conflating with summarisation — offloading preserves the content; summarisation discards it.
Replacing conversation history with a summary of objective, state, constraints, and next steps.
Lossy — tier 2, used after offloading and masking. Preserve "what's next," not just "what happened."
Pulling content into context via tool calls at the moment a step needs it, rather than preloading at
session start. Startup holds only instructions + tool descriptions. Preserves budget — but only when
retrieval is accurate (a noisy retriever distracts).
Avoid: assuming on-demand is always better — it trades preload cost for latency + retrieval-quality risk.
The preload counterpart: loading relevant files before a task so the agent pattern-matches against
real project conventions. Use for repetitive access and grounding; broad-to-narrow, critical context first.
Rewriting the objective + task list after each step so it lands in the high-attention recency tail, countering
drift in long sessions. Strong elicitation (imperative restatement of the core goal) cuts drift more than a bare list.
Keeping failed actions and error traces in context as negative examples that steer the model off dead
ends. Removing reasoning traces dropped performance ~30%. Preserve during recovery; compact after success.
Avoid: "cleaning up" errors mid-task — that's deleting the guardrail.
A coordinator delegates subtasks to isolated worker sub-agents that each run in their own window and
return a condensed summary — keeping the coordinator's context clean. The sub-agents inherit nothing
unless passed (see [[sub-agent context isolation]]).
Structural symbols extracted with tree-sitter, ranked by graph importance and fit to a token budget —
gives the agent codebase topology before it reads any file. Navigation over bulk-reading.
Feeding dependency versions, lock files, and runtime constraints into context so the agent codes against
the real environment, not stale training data — closing the version gap that drives environment-blind errors.
The finite token pool. Every token preloaded displaces one available for reasoning, tool results, and
implementation — the opportunity-cost framing behind [[token economics]]. Allocate by task type.
When a model upgrade ships a new tokenizer, the same prompt maps to a different token count — shifting
effective cost, window headroom, and rate limits before you change a line of code.
Avoid: "shorter is always cheaper" — measure the new tokenizer, don't assume.
An early hallucination enters context as a "fact"; every later step builds on the false premise while output
stays coherent and confident. The reliable fix is a clean, re-anchored session — corrective prompts patch the
symptom but leave the poison in context.
Avoid: "the agent will catch its own mistake" — it doesn't hedge.
Malicious instructions hidden in external content an agent consumes — web pages, repo files, MCP responses —
followed as if from the user. Works because the model is provenance-blind: attention treats all tokens
uniformly with no origin metadata. Severity scales with agent capability.
Layering independent controls so no single bypass compromises the agent: model-level injection resistance,
infrastructure-level egress controls, and product-level confirmation flows. The strongest are architectural —
constrain what the model can do after reading untrusted input, not what it's told to do.
Avoid: "URL allow-listing is enough" — allowed pages still carry injections.
Commands that attribute token consumption to the specific tool calls, memory files, and outputs responsible
(Claude Code's /context) — so you shrink the real culprit instead of pruning blindly. Diagnose
before you compress.
Deliberately loading relevant files into context before a task, broad→narrow, so the agent produces
project-specific output instead of generic boilerplate — the preload counterpart to just-in-time retrieval.
Avoid: "one-shot context dump" — order within the load still matters.
Carrying a long loop's record in a typed object outside the prompt, read by tool, instead of replaying the
full transcript each turn — converts O(n²) loop token cost to O(n). For short loops, prompt caching hits the
same curve for less work.
A behavioural shift — not quality decay — where a model rushes to finish, abbreviates reasoning, or summarises
early as it perceives the limit approaching, even with capacity remaining. Countered by buffer allocation,
counter-prompting, and token-budget transparency.
Avoid: conflating with the [[lost in the middle|dumb zone]] — distinct mechanism and fix.