Measure Before You Optimize

Every token-saving trick looks free on a token counter. On a real task, most of them quietly cost you accuracy — measure that, not just the savings.

Why this, for you: token optimizations are tempting because the savings are loud and the cost is silent. A counter shows the win instantly; the accuracy hit only shows up in failed tasks weeks later. This lesson is cost/perf engineering for harnesses (#2) plus the skepticism that protects your daily output quality (#1).

Three popular optimizations — swapping tool-call notation, minifying source code, and upgrading models — all save tokens. All three can regress accuracy or cost, and a token counter shows none of it. The fix is the same every time: A/B on a real task.

1 Token-saving formats lose accuracy

Token-optimized notations like TOON and TRON re-encode JSON to strip repeated property names and structural punctuation. Isolated single-turn benchmarks report 30–60% savings. Inside real agentic loops, the savings shrink and the accuracy bill arrives.

Across four agentic suites on five open-weight models, TRON cut input tokens up to 27% but regressed accuracy within 14 percentage points of JSON; TOON saved up to 18% and lost ~9pp. Plain JSON had the best one-shot accuracy.

The cost has a clean mechanism: models were trained predominantly on JSON, so unfamiliar notation burns reasoning capacity on parsing instead of the task. Worse, TOON shows cascading parse failures across turns and collapses parallel tool-call output on most tested models — failure modes a single-turn test can never surface.

Input and output compression are not one decision

Reading an unfamiliar format degrades gracefully; generating it regresses sharply. Compressing the schema the model reads (input-only) is far safer than asking it to emit the compact format (bidirectional). Measure the two separately — input-only may pass where the full swap fails.

2 Minification cuts tokens, tanks solve rate

Stripping comments, whitespace, identifier length, and docstrings from source fed to a coding agent is the same trap with a sharper edge. The token savings are real; so is the lost capability.

42% fewer tokens, 50% → 38% solved

On SWE-bench Verified with GPT-5-mini, cumulative minification cut input tokens 42% (~90,500 → ~52,800 per task) but dropped pass@1 resolution from 50.0% to 38.0% — a 12-point absolute regression, roughly one in four previously-solved tasks now failing.

The reason: LLMs use identifier names as a primary semantic channel, not redundant gloss on the AST. Strip the names and you remove load-bearing input. The lost capacity doesn't vanish — it resurfaces as failed tasks, or (in a log-compression study) as a 17% token cut that raised total session cost 67% because the model spent reasoning tokens reconstructing what was removed.

3 A new tokenizer moves your numbers

You don't have to change a line of code to get hit. A model upgrade can ship a new tokenizer, and the same prompt suddenly maps to more tokens — silently shifting cost, window headroom, and rate-limit consumption.

Claude Opus 4.7's tokenizer produces ~1.0–1.35× as many tokens as prior models (up to ~35% more). Per-token price and the 1M window are unchanged — yet a measured text-heavy mix at 1.34× lifts a $600 spend forecast to ~$804, and 1M nominal context holds only ~746K "old-equivalent" tokens.

The multiplier is workload-shape dependent: a text-heavy system prompt measured 1.46×, a PDF 1.08×, matched-resolution images ~1.0×. Forecast from a measured per-content-type multiplier on your own traffic, not the vendor's worst case — and re-check max_tokens, compaction triggers, and any client-side token estimator before cutting production traffic.

4 Structural beats lossy compression

The escape hatch is to cut tokens without removing meaning. AST-preserving idiomatic transforms (comprehensions over loops, +=, f-strings) achieve 18–38% token reduction with no correctness loss on HumanEval — structural, not lossy. But even here, push too hard: reductions past 30% correlated with an 18.7% drop in unit-test pass rate. Validate against a test suite.

Prefer structural efficiency over lossy compression, and never optimize via the prompt: "be concise" is a competing objective. Cursor watched a model refuse a task — "I'm not supposed to waste tokens… not worth continuing!" Show idiomatic examples; don't moralize tokens.

↪ Your win: A/B every token optimization on a real task

A/B before adopting — replay real sessions, measure end-to-end success, not just token count.
Price the accuracy cost — cost-per-successful-task captures the trade in one number.
Compress input and output separately — input-only is safer; bidirectional often regresses.
Budget for tokenizer changes on any model migration; re-measure the multiplier on your mix.
Prefer structural efficiency over lossy compression; never optimize tokens via the prompt.

Retrieval practice — recall, don't peek

Question 1Swapping JSON tool schemas to TRON inside agentic loops tends to…

Question 2Cumulative minification on SWE-bench Verified moved resolution rate…

Question 3A model upgrade shipping a new tokenizer silently changes…

Question 4The lower-risk way to cut generated-code tokens is to…

Question 5 · spaced recall from Lesson 17Building a prompt from modular, priority-ordered sections is…

Ask me anything. Want the three-config replay harness that decouples input-only from bidirectional compression, or how to compute cost-per-successful-task from your traces? Next: the Capstone — the whole discipline as one symptom→move decision table.