Context Engineering · ~7 min
Every token-saving trick looks free on a token counter. On a real task, most of them quietly cost you accuracy — measure that, not just the savings.
Three popular optimizations — swapping tool-call notation, minifying source code, and upgrading models — all save tokens. All three can regress accuracy or cost, and a token counter shows none of it. The fix is the same every time: A/B on a real task.
Token-optimized notations like TOON and TRON re-encode JSON to strip repeated property names and structural punctuation. Isolated single-turn benchmarks report 30–60% savings. Inside real agentic loops, the savings shrink and the accuracy bill arrives.
The cost has a clean mechanism: models were trained predominantly on JSON, so unfamiliar notation burns reasoning capacity on parsing instead of the task. Worse, TOON shows cascading parse failures across turns and collapses parallel tool-call output on most tested models — failure modes a single-turn test can never surface.
Reading an unfamiliar format degrades gracefully; generating it regresses sharply. Compressing the schema the model reads (input-only) is far safer than asking it to emit the compact format (bidirectional). Measure the two separately — input-only may pass where the full swap fails.
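A minimal sketch of what "measure the two separately" could look like: the same task set is run under each variant and only the success tally is compared. The runner functions here are hypothetical placeholders with hard-coded outcomes, not measurements; in practice each would drive a real agent loop and check the task's own pass criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VariantResult:
    name: str
    solved: int
    total: int

    @property
    def success_rate(self) -> float:
        return self.solved / self.total if self.total else 0.0


def ab_compare(
    tasks: List[str],
    variants: Dict[str, Callable[[str], bool]],
) -> List[VariantResult]:
    """Run every variant on the same tasks and tally how many each solves."""
    results = []
    for name, run_task in variants.items():
        solved = sum(1 for task in tasks if run_task(task))
        results.append(VariantResult(name, solved, len(tasks)))
    return results


# Hypothetical stand-ins for real agent runs. The outcomes are invented
# purely so the sketch executes; they say nothing about any real format.
def run_json_baseline(task: str) -> bool:
    return True


def run_compact_input_only(task: str) -> bool:
    return not task.endswith("-hard")


def run_compact_bidirectional(task: str) -> bool:
    return task.endswith("-easy")


tasks = ["t1-easy", "t2-easy", "t3-medium", "t4-hard"]
results = ab_compare(
    tasks,
    {
        "json_baseline": run_json_baseline,
        "compact_input_only": run_compact_input_only,
        "compact_bidirectional": run_compact_bidirectional,
    },
)
for r in results:
    print(f"{r.name}: {r.solved}/{r.total} ({r.success_rate:.0%})")
```

Keeping input-only and bidirectional as separate variants is the point of the design: a single "compact format" arm would hide which direction caused a regression.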
Stripping comments, whitespace, and docstrings from source fed to a coding agent, and shortening its identifiers, is the same trap with a sharper edge. The token savings are real; so is the lost capability.
On SWE-bench Verified with GPT-5-mini, cumulative minification cut input tokens 42% (~90,500 → ~52,800 per task) but dropped pass@1 resolution from 50.0% to 38.0% — a 12-point absolute regression, roughly one in four previously-solved tasks now failing.
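The arithmetic behind those figures can be checked in a few lines. Only the four input numbers come from the text above (the token counts are the approximate per-task values quoted); the variable names are illustrative.

```python
# Figures quoted above (token counts are approximate per-task values).
tokens_before = 90_500
tokens_after = 52_800
pass_before = 0.50
pass_after = 0.38

# Share of input tokens removed by minification.
token_reduction = (tokens_before - tokens_after) / tokens_before

# Absolute regression in percentage points.
absolute_drop_points = (pass_before - pass_after) * 100

# Share of previously solved tasks that now fail.
relative_loss = (pass_before - pass_after) / pass_before

print(f"input tokens cut: {token_reduction:.1%}")
print(f"absolute regression: {absolute_drop_points:.0f} points")
print(f"previously solved tasks lost: {relative_loss:.0%}")
```

The relative figure (0.12 / 0.50 = 0.24) is where "roughly one in four" comes from.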
The reason: LLMs use identifier names as a primary semantic channel, not redundant gloss on the AST. Strip the names and you remove load-bearing input. The lost capacity doesn't vanish — it resurfaces as failed tasks, or (in a log-compression study) as a 17% token cut that raised total session cost 67% because the model spent reasoning tokens reconstructing what was removed.
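One way to put token savings and lost accuracy on the same scale is to divide total spend by the number of tasks actually solved. The sketch below assumes you can read total cost and solved-task counts from your own traces; the dollar amounts and task counts are invented for illustration and are not taken from any study.

```python
def cost_per_successful_task(total_cost_usd: float, tasks_solved: int) -> float:
    """Total spend divided by solved tasks; infinite if nothing was solved."""
    if tasks_solved <= 0:
        return float("inf")
    return total_cost_usd / tasks_solved


# Hypothetical runs over the same 100 tasks (illustrative numbers only).
baseline = cost_per_successful_task(total_cost_usd=20.00, tasks_solved=50)
compressed = cost_per_successful_task(total_cost_usd=17.00, tasks_solved=38)

print(f"baseline:   ${baseline:.3f} per solved task")
print(f"compressed: ${compressed:.3f} per solved task")
```

In this made-up case the compressed run is cheaper in total yet more expensive per solved task, which is the pattern a raw token counter cannot show.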
You don't have to change a line of code to get hit. A model upgrade can ship a new tokenizer, and the same prompt suddenly maps to more tokens — silently shifting cost, window headroom, and rate-limit consumption.
The multiplier is workload-shape dependent: a text-heavy system prompt measured 1.46×, a PDF 1.08×, matched-resolution images ~1.0×. Forecast from a measured per-content-type multiplier on your own traffic, not the vendor's worst case. Before cutting production traffic over, re-check max_tokens, compaction triggers, and any client-side token estimator.
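A rough sketch of forecasting from per-content-type multipliers. The three multipliers are the ones quoted above; the traffic mix is a hypothetical example, and in practice both would come from your own measurements.

```python
# Multipliers quoted above; replace with values measured on your own traffic.
MULTIPLIERS = {"text": 1.46, "pdf": 1.08, "image": 1.0}


def forecast_tokens(current_tokens: dict, multipliers: dict) -> dict:
    """Project token counts per content type under the new tokenizer.

    Raises KeyError for a content type with no measured multiplier,
    so an unmeasured category cannot slip through silently.
    """
    return {
        kind: round(count * multipliers[kind])
        for kind, count in current_tokens.items()
    }


# Hypothetical per-request token mix under the old tokenizer.
current = {"text": 6_000, "pdf": 3_000, "image": 1_000}

projected = forecast_tokens(current, MULTIPLIERS)
blended = sum(projected.values()) / sum(current.values())

print(projected)
print(f"blended multiplier for this mix: {blended:.2f}x")
```

The blended figure depends entirely on the mix, which is why a single vendor-quoted number is a poor basis for a forecast.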
The escape hatch is to cut tokens without removing meaning. AST-preserving idiomatic transforms (comprehensions over loops, +=, f-strings) achieve 18–38% token reduction with no correctness loss on HumanEval: the change is structural, not lossy. Even here you can push too hard: reductions past 30% correlated with an 18.7% drop in unit-test pass rate, so validate every transform against a test suite.
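As an illustration of the kind of transform meant here, the sketch below rewrites a loop as a comprehension and checks that behavior is unchanged on a small set of cases. The function names and test cases are made up for the example, and a real check would run the project's own test suite; measuring the actual token saving would additionally need the target model's tokenizer.

```python
def squares_of_evens_loop(values):
    """Verbose form: explicit loop and append."""
    result = []
    for v in values:
        if v % 2 == 0:
            result.append(v * v)
    return result


def squares_of_evens_compact(values):
    """Idiomatic form: one comprehension, same names, same meaning."""
    return [v * v for v in values if v % 2 == 0]


def equivalent(f, g, cases):
    """True if both functions agree on every case."""
    return all(f(case) == g(case) for case in cases)


# Small illustrative case set; a real suite should be far broader.
CASES = [[], [1, 3, 5], [2, 4], [0, -2, 7, 10]]

assert equivalent(squares_of_evens_loop, squares_of_evens_compact, CASES)
print("compact form matches the loop on all cases")
```

Note that the compact version keeps the descriptive identifiers; the saving comes from structure, not from deleting the names the model relies on.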
cost-per-successful-task captures the trade in one number.
Retrieval practice: recall, don't peek
Question 1: Swapping JSON tool schemas to TRON inside agentic loops tends to…
Question 2: Cumulative minification on SWE-bench Verified moved resolution rate…
Question 3: A model upgrade shipping a new tokenizer silently changes…
Question 4: The lower-risk way to cut generated-code tokens is to…
Question 5 · spaced recall from Lesson 17: Building a prompt from modular, priority-ordered sections is…
cost-per-successful-task from your traces? Next: the Capstone — the whole discipline as one symptom→move decision table.