Context Engineering · ~5 min
Your agent gets dumber long before it runs out of room — and the safety net fires too late to help.
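Here's the claim most people get wrong. Ask an experienced engineer "when does a long context start hurting quality?" and they'll say something proportional: "past halfway," "when it's nearly full." The benchmarks say otherwise: quality drops well before the window fills, and how early depends on the model and the task.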
RULER tested 17 models: Yi-34B claims 200K but has ~32K of effective context (16%); GPT-4 claims 128K and reaches ~64K (50%). The Chroma study confirmed all 18 frontier models tested — Opus 4, GPT-4.1, Gemini 2.5 Pro included — degrade with input length. Anthropic calls it "context rot". In this workspace we call it the dumb zone.
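The percentages quoted above are just effective context divided by claimed context. A minimal sketch of that arithmetic, using only the two RULER figures cited in the text (the dictionary and function names are illustrative, not from any library):

```python
# Claimed vs. effective context in thousands of tokens, as quoted in the text.
CLAIMS = {
    "Yi-34B": (200, 32),
    "GPT-4": (128, 64),
}


def effective_share(claimed_k: int, effective_k: int) -> float:
    """Fraction of the advertised window that stays usable."""
    return effective_k / claimed_k


for model, (claimed_k, effective_k) in CLAIMS.items():
    share = effective_share(claimed_k, effective_k)
    print(f"{model}: {effective_k}K of {claimed_k}K = {share:.0%}")
```

The spread between the two models (16% versus 50%) is the reason a single proportional rule of thumb does not transfer from one model to another.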
The single most useful refinement: reasoning degrades fastest, retrieval is most resilient. Budget by task, not by one percentage rule.
| Task type | Effective context | So… |
|---|---|---|
| Reasoning (planning, architecture) | 10–20% of window | Keep under ~32K where you can |
| Multi-hop / semantic retrieval | 16–50% of window | Prefer similarity over stuffing |
| Simple lookup (needle-in-haystack) | >99% recall, very deep | Tolerates large loads, but high recall overstates how well reasoning holds up |
| Code bug-fixing | Collapses fast* | Test at your real context length |

*Claude 3.5 Sonnet on LongCodeBench: 29% at 32K → 3% at 256K.

Total context counts everything (system prompt, instructions, skill defs, history), not just your task tokens.
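One way to apply the table is to turn the percentage ranges into token budgets and compare them against everything in the window. The sketch below assumes a 200K window and invented component sizes; only the percentage ranges come from the table, and the function names are hypothetical.

```python
# Effective-context ranges from the table above, as whole percentages of the window.
EFFECTIVE_PCT = {
    "reasoning": (10, 20),
    "multi_hop_retrieval": (16, 50),
}


def budget_tokens(task_type: str, window_tokens: int) -> tuple[int, int]:
    """Return the (low, high) token budget for a task type."""
    low_pct, high_pct = EFFECTIVE_PCT[task_type]
    return window_tokens * low_pct // 100, window_tokens * high_pct // 100


def total_context(parts: dict[str, int]) -> int:
    """Sum everything in the window, not just the task tokens."""
    return sum(parts.values())


# Hypothetical session: every number below is made up for illustration.
WINDOW_TOKENS = 200_000
parts = {
    "system_prompt": 3_000,
    "instructions": 2_000,
    "skill_defs": 5_000,
    "history": 30_000,
    "task": 8_000,
}

low, high = budget_tokens("reasoning", WINDOW_TOKENS)
used = total_context(parts)
verdict = "over" if used > high else "within"
print(f"reasoning budget: {low:,}-{high:,} tokens; in use: {used:,} ({verdict} budget)")
```

In this made-up session the task itself is only 8,000 tokens, yet the total lands above the reasoning range because history dominates. That is the point of counting everything.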
Claude Code's auto-compaction fires at ~95% fill, but reasoning quality has been eroding since ~10–20%. By the time the safety net triggers, you've spent most of the session in the dumb zone, and nothing in that stretch warns you that quality is slipping.
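To put a rough number on "most of the session": if the window fills at a roughly steady rate (an assumption made here for illustration, not a measured property), the share of the pre-compaction session spent past the onset point is (trigger - onset) / trigger. A small sketch with a hypothetical helper:

```python
def share_past_onset(onset_pct: float, trigger_pct: float = 95.0) -> float:
    """Share of the session (up to the compaction trigger) spent past the
    degradation onset, assuming the window fills at a steady rate."""
    if not 0 <= onset_pct <= trigger_pct:
        raise ValueError("onset must lie between 0 and the trigger")
    return (trigger_pct - onset_pct) / trigger_pct


# Onset range quoted in the text: reasoning erodes from roughly 10-20% fill.
for onset in (10, 20):
    print(f"onset at {onset}% fill: {share_past_onset(onset):.0%} of the session is past it")
```

Under that assumption, roughly four fifths or more of the session runs past the onset before the 95% trigger fires.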
Compact manually at these transitions: before reasoning-heavy work, after big file reads you've extracted from, at task-type switches, and the moment the agent repeats itself.
Direct what survives: `/compact Focus on the failing assertions in X and the Y method; drop CI logs.`
For a reasoning-heavy session, move the trigger earlier:
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=55 claude` (use 50–60% for architecture/debugging; leave the default 95% for pure retrieval).
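If sessions are launched from a script, the same choice can be made per task type. This is a sketch only: the environment variable name is taken from the command above, while the task-type labels and the `claude_env` helper are hypothetical and not part of any Claude Code API.

```python
import os
from typing import Optional

OVERRIDE_VAR = "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE"  # name as given in the text above
DEFAULT_PCT = 95

# Thresholds follow the guidance above; the task-type labels are illustrative.
AUTOCOMPACT_PCT = {
    "architecture": 55,
    "debugging": 55,
    "retrieval": DEFAULT_PCT,
}


def claude_env(task_type: str, base_env: Optional[dict] = None) -> dict:
    """Build an environment mapping for launching a session of the given type."""
    env = dict(os.environ if base_env is None else base_env)
    pct = AUTOCOMPACT_PCT[task_type]
    if pct == DEFAULT_PCT:
        # Pure retrieval: leave the default trigger alone.
        env.pop(OVERRIDE_VAR, None)
    else:
        env[OVERRIDE_VAR] = str(pct)
    return env


print(claude_env("debugging", base_env={}))
```

The resulting mapping could be passed as the `env` argument to `subprocess.run(["claude"], env=...)`. The sketch stops short of launching anything so that it runs without the CLI installed.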
Retrieval practice — recall, don't peek
Question 1: Degradation onset is best modeled as…
Question 2: Reasoning tasks effectively use roughly what share of the window?
Question 3: Auto-compaction at 95% helps too little because by then…