Part 2 · Taming the Tail

Context Engineering · ~6 min

Masking the Tail

Most of what fills a long session is tool output you read once and never need again. Strip it, keep a breadcrumb.

Why this, for you: Lessons 01–08 shaped the static prefix. Now the growing tail. This is the highest-frequency daily-coding win in long Claude Code sessions — the thing that keeps you out of the dumb zone without losing your work.

In Lesson 07 you split context into a stable prefix and a growing tail. The tail's biggest tenant isn't your reasoning — it's tool output: file reads, search results, test logs.

~84% of trajectory content in software-engineering agent benchmarks is observation tokens (tool outputs), and most are consumed once during synthesis and never referenced again (arXiv:2508.21433).
The useful artifact of a tool call is what the agent produced from it — the edit, the decision, the plan — not the raw output. Observation masking replaces a processed tool output with a one-line summary before the next inference call.
    # after: read → edit → test
    [masked: read_file session.ts - 312 lines, found validateSession return type]
    [masked: edit_file session.ts - applied refactor, 14 lines changed]
    [tool: run_tests] → 847 lines, 1 failure   ← RETAINED: agent still needs it

The one-liner preserves traceability (the agent still sees what it consulted) at a fraction of the tokens. In this example, roughly 1,100 tokens are saved on every subsequent call.
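
To make the mechanics concrete, here is a minimal sketch of observation masking in Python. The `Observation` type, its fields, and the `mask` helper are hypothetical names invented for illustration; they are not part of Claude Code or any real agent framework, and the breadcrumb format only loosely mirrors the example in this lesson.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Observation:
    """One tool result in the tail (hypothetical shape, for illustration only)."""
    tool: str                # e.g. "read_file"
    target: str              # e.g. "session.ts"
    output: str              # raw tool output
    note: str                # what the agent took away from the output
    processed: bool          # True once the agent has acted on the output
    masked: bool = False


def mask(obs: Observation) -> Observation:
    """Swap a processed observation's raw output for a one-line breadcrumb.

    Unprocessed or already-masked observations are returned unchanged.
    """
    if not obs.processed or obs.masked:
        return obs
    line_count = len(obs.output.splitlines())
    breadcrumb = f"[masked: {obs.tool} {obs.target} - {line_count} lines, {obs.note}]"
    return replace(obs, output=breadcrumb, masked=True)


# Usage: a 312-line file read that has already led to an edit.
read = Observation(
    tool="read_file",
    target="session.ts",
    output="\n".join(["// source line"] * 312),
    note="found validateSession return type",
    processed=True,
)
print(mask(read).output)
```

The sketch keeps the breadcrumb deterministic (tool, target, size, takeaway) so the agent can still trace what it consulted; a real harness would likely decide `processed` from the trajectory rather than take it as a flag.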

What to mask, what to keep

Tool output | Decision
File content (read, then edited) | Mask after the edit
Search results (synthesised into a plan) | Mask after synthesis
Test output (failure identified) | Mask after the fix is applied
Schema / API contract (queried throughout) | Retain
Reference docs (checked repeatedly) | Retain

The heuristic: once the agent has extracted what it needs and expressed it as a decision or artifact, the raw output is a distractor — a since-edited file read still pulls attention toward stale state. Masking is finer-grained than /compact: it surgically drops single-use bulk while leaving your reasoning and decisions fully intact.
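
The heuristic and the table above can be read as a two-question rule. The sketch below encodes it in Python; the `Decision` enum, the `decide` function, and both parameter names are assumptions made for illustration, not an existing API.

```python
from enum import Enum


class Decision(Enum):
    MASK = "mask"
    RETAIN = "retain"


def decide(turned_into_artifact: bool, consulted_repeatedly: bool) -> Decision:
    """Apply the lesson's heuristic to one tool output.

    turned_into_artifact: the agent has already expressed what it needed
        as an edit, a plan, or a fix.
    consulted_repeatedly: the output is reference material the agent keeps
        coming back to (schemas, API contracts, docs).
    """
    if consulted_repeatedly:
        return Decision.RETAIN      # still live reference material
    if not turned_into_artifact:
        return Decision.RETAIN      # ground truth the agent has not used yet
    return Decision.MASK            # single-use bulk, now a distractor


# The five rows of the table, expressed as calls (inputs are illustrative).
rows = {
    "file content, read then edited": decide(True, False),
    "search results, synthesised into a plan": decide(True, False),
    "test output, fix applied": decide(True, False),
    "schema / API contract": decide(False, True),
    "reference docs": decide(False, True),
}
for name, decision in rows.items():
    print(f"{name}: {decision.value}")
```

Note that a test log whose failure has been identified but not yet fixed maps to `decide(False, False)`, which retains it.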

Two edges that bite

① Extended-thinking models lose ~10% from hard masking

Reasoning models benefit from inspecting their full observation history mid-chain-of-thought — benchmarks show hard masking drops solve rate ~10% for them. Prefer LLM-based summarisation over hard removal in those configs. And never mask before synthesis is confirmed — masking a test failure before the fix removes the ground truth.
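
One way to encode this edge is a small gate in front of the masking step. The sketch below is an assumption-laden illustration: `reduce_observation` and its parameters are hypothetical, and `summarise` stands in for whatever LLM summarisation call your harness provides.

```python
from typing import Callable


def reduce_observation(
    raw: str,
    *,
    breadcrumb: str,
    extended_thinking: bool,
    synthesis_confirmed: bool,
    summarise: Callable[[str], str],
) -> str:
    """Pick how much of a tool output survives into the next call."""
    if not synthesis_confirmed:
        # Never mask before the agent has acted on the output:
        # the raw text is still the ground truth.
        return raw
    if extended_thinking:
        # Soft masking: keep a summary the model can still reason over.
        return summarise(raw)
    # Hard masking: a one-line breadcrumb is enough.
    return breadcrumb


# Usage with a stand-in summariser (a real one would call a model).
def first_line(text: str) -> str:
    return "summary: " + text.splitlines()[0]


log = "FAIL test_session_expiry\nAssertionError: expected 401, got 200"
print(reduce_observation(
    log,
    breadcrumb="[masked: run_tests - 1 failure, fixed]",
    extended_thinking=True,
    synthesis_confirmed=True,
    summarise=first_line,
))
```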

② Masking mutates the tail — which fights the cache (Lesson 07)

Rewriting an old observation changes history mid-stream, so the prompt cache busts from the mask point forward. Masking trades a cache rewrite for attention savings. Resolution: mask recent single-use outputs promptly (before they sink deep into cached history), or batch masks — don't continuously rewrite deep history.
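
A toy model shows why the position of a mask matters. The sketch below treats history as a list of messages and the cache as "everything up to the first message that changed"; real prompt caches work at token or block granularity, so treat this as a simplification. `apply_masks` and `cached_prefix_len` are hypothetical helpers.

```python
def apply_masks(history: list[str], masks: dict[int, str]) -> list[str]:
    """Return a copy of history with the given indices replaced by breadcrumbs."""
    return [masks.get(i, message) for i, message in enumerate(history)]


def cached_prefix_len(before: list[str], after: list[str]) -> int:
    """Count leading messages that are identical, i.e. still cache-reusable."""
    count = 0
    for old, new in zip(before, after):
        if old != new:
            break
        count += 1
    return count


history = ["system", "read A", "edit A", "read B", "edit B", "run tests"]

# Masking a recent output keeps most of the cached prefix.
recent = apply_masks(history, {3: "[masked: read B]"})
print(cached_prefix_len(history, recent))   # 3 messages still reusable

# Masking deep history invalidates almost everything after it.
deep = apply_masks(history, {1: "[masked: read A]"})
print(cached_prefix_len(history, deep))     # 1 message still reusable

# Batching: two masks in one pass cost a single rewrite from index 1,
# instead of one rewrite per mask on separate turns.
batched = apply_masks(history, {1: "[masked: read A]", 3: "[masked: read B]"})
print(cached_prefix_len(history, batched))  # 1 message still reusable
```

In this model the cost of a mask is set by the earliest index it touches, which is why prompt masking of recent outputs and batched masking of older ones both limit how often the deep prefix is rewritten.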

↪ Your win: drop single-use output, keep a breadcrumb

Retrieval practice — recall, don't peek

Question 1: In SE benchmarks, tool outputs are roughly what share of trajectory content?

Question 2: Masking replaces a processed tool output with…

Question 3: Which should you retain, not mask?

Question 4: Hard masking hurts which models most?

Question 5 · spaced recall from Lesson 08: Constraint violations peak at which compression level?

Ask me anything. Want the difference between masking, /compact, and offloading (writing big payloads to disk and keeping a handle)? Or why this is runtime, not an instruction-file concern — so it adds no CE check to the audit skill? Next up: Context Compression Strategies — the tiered system masking is one tier of.