Context Engineering · ~6 min
Most of what fills a long session is tool output you read once and never need again. Strip it, keep a breadcrumb.
In Lesson 07 you split context into a stable prefix and a growing tail. The tail's biggest tenant isn't your reasoning — it's tool output: file reads, search results, test logs.
The one-line breadcrumb left in place of a masked output preserves traceability (the agent still sees what it consulted) at a fraction of the tokens. In this lesson's example, that is ~1,100 tokens saved on every subsequent call.
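To make the breadcrumb idea concrete, here is a minimal sketch of a masking step. The `Message` shape, the breadcrumb wording, and the four-characters-per-token estimate are all assumptions for illustration, not the format of any particular agent framework.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    """Hypothetical message shape; real frameworks will differ."""
    role: str            # "user", "assistant", or "tool"
    content: str
    tool_name: str = ""  # which tool produced this output
    target: str = ""     # what it was called on, e.g. a file path or query


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # Swap in a real tokenizer for accurate counts.
    return max(1, len(text) // 4)


def mask(msg: Message) -> Message:
    """Replace a processed tool output with a one-line breadcrumb."""
    if msg.role != "tool":
        return msg  # reasoning and decisions are left untouched
    size = estimate_tokens(msg.content)
    crumb = (
        f"[masked: {msg.tool_name}({msg.target}) output, "
        f"~{size} tokens, already processed]"
    )
    return replace(msg, content=crumb)
```

The breadcrumb keeps the tool name and target so the agent can re-run the call if it turns out to need the raw output again.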
| Tool output | Decision |
|---|---|
| File content (read, then edited) | Mask after the edit |
| Search results (synthesised into a plan) | Mask after synthesis |
| Test output (failure identified) | Mask after the fix is applied |
| Schema / API contract (queried throughout) | Retain |
| Reference docs (checked repeatedly) | Retain |
The heuristic: once the agent has extracted what it needs and expressed it as a decision or artifact, the raw output is a distractor — a since-edited file read still pulls attention toward stale state. Masking is finer-grained than /compact: it surgically drops single-use bulk while leaving your reasoning and decisions fully intact.
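The table and heuristic above can be sketched as a small policy function. The category names and the `consumed` flag are hypothetical labels chosen for this example; how an agent actually detects that an output has been consumed is left open.

```python
from enum import Enum


class Use(Enum):
    SINGLE = "single-use"    # read once, then superseded by a decision or artifact
    REFERENCE = "reference"  # consulted repeatedly across the session


# Hypothetical category names mirroring the rows of the table.
USE_BY_KIND = {
    "file_read": Use.SINGLE,
    "search_results": Use.SINGLE,
    "test_output": Use.SINGLE,
    "schema": Use.REFERENCE,
    "reference_docs": Use.REFERENCE,
}


def should_mask(kind: str, consumed: bool) -> bool:
    """Mask only single-use output that has already been turned into
    a decision or artifact (an edit, a plan, an applied fix)."""
    # Unknown kinds default to REFERENCE, i.e. retain: the safer failure mode.
    use = USE_BY_KIND.get(kind, Use.REFERENCE)
    return use is Use.SINGLE and consumed
```

Defaulting unknown output kinds to "retain" is a deliberate choice in this sketch: wrongly keeping bulk costs tokens, while wrongly masking can remove information the agent still needs.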
Reasoning models benefit from inspecting their full observation history mid-chain-of-thought — benchmarks show hard masking drops solve rate ~10% for them. Prefer LLM-based summarisation over hard removal in those configs. And never mask before synthesis is confirmed — masking a test failure before the fix removes the ground truth.
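One way to encode both caveats is a single function that picks between leaving the output alone, summarising it, and hard masking it. This is a sketch under assumptions: `summarise` stands in for whatever LLM call you would use, and the two boolean flags are hypothetical inputs your harness would have to supply.

```python
from typing import Callable


def compress_observation(
    output: str,
    label: str,
    *,
    synthesis_confirmed: bool,
    reasoning_model: bool,
    summarise: Callable[[str], str],
) -> str:
    """Choose how to shrink one tool output.

    `summarise` is a placeholder for an LLM-based summariser (assumption).
    """
    if not synthesis_confirmed:
        # Still the ground truth (e.g. a test failure before the fix lands).
        return output
    if reasoning_model:
        # Softer option: keep a condensed view of the observation.
        return f"[summary of {label}] {summarise(output)}"
    # Hard mask: breadcrumb only.
    return f"[masked: {label}, already processed]"
```

Passing the summariser in as a callable keeps the decision logic testable without a model in the loop.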
Rewriting an old observation changes history mid-stream, so the prompt cache busts from the mask point forward. Masking trades a cache rewrite for attention savings. Resolution: mask recent single-use outputs promptly (before they sink deep into cached history), or batch masks — don't continuously rewrite deep history.
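The batching option can be sketched as a small queue that holds mask requests and releases them together, so history is rewritten once per batch rather than once per mask. The class name, the default batch size, and the assumption that the cache is keyed on an unchanged message prefix are all illustrative.

```python
def cached_prefix_after_mask(mask_indices: list[int], history_len: int) -> int:
    """Number of leading messages whose cache survives the rewrite,
    assuming the cache is valid up to the first modified message."""
    return min(mask_indices, default=history_len)


class MaskBatcher:
    """Queue mask requests and release them in batches (hypothetical helper)."""

    def __init__(self, batch_size: int = 3):
        self.batch_size = batch_size
        self.pending: list[int] = []

    def request(self, index: int) -> list[int]:
        """Queue the message at `index` for masking.

        Returns the indices to mask now: empty until the batch is full,
        then every pending index in ascending order.
        """
        self.pending.append(index)
        if len(self.pending) < self.batch_size:
            return []
        due, self.pending = sorted(self.pending), []
        return due
```

With a batch of indices in hand, `cached_prefix_after_mask` shows the cost of applying them: everything from the earliest masked message onward is re-cached once, instead of once per individual mask.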
Retrieval practice — recall, don't peek
Question 1: In SE benchmarks, tool outputs are roughly what share of trajectory content?
Question 2: Masking replaces a processed tool output with…
Question 3: Which should you retain, not mask?
Question 4: Hard masking hurts which models most?
Question 5 (spaced recall from Lesson 08): Constraint violations peak at which compression level?
/compact, and offloading (writing big payloads to disk and keeping a handle)? Or why this is runtime, not an instruction-file concern, so it adds no CE check to the audit skill?
Next up: Context Compression Strategies, the tiered system of which masking is one tier.