The same assembly order that shapes attention also sets your bill — and one careless byte in the prefix silently charges you full price on every turn.
Why this, for you: caching is where harness design (priority #2) meets the cost line. It's not a
toggle you flip afterward — it's a structural constraint on how you compose context. And it reframes a
decision you already know (assembly order) around a second objective: money and latency.
Prompt caching reuses the model's computation for an exact, byte-identical prefix. On Anthropic,
a cached read costs ~10% of base price; a cache write costs 125–200%.
Manus calls cache hit rate "the single most important metric for a production agent" — a 10× differential.
Design context as an immutable prefix + a growing tail: static content (system prompt, tool
definitions, project instructions) first and unchanging; variable content (history, latest message) last.
The layout — not a config flag — decides whether you pay 10% or 100% every turn.
Switching models. Model-specific instructions live in the prefix; a swap busts the whole session. Treat it as a context boundary.
Mutating the prefix to carry state. A timestamp, cwd, or config value in an early section re-writes the cache every call. Volatile data belongs in the tail.
~10%
cached read vs base
125–200%
cache write premium
10×
hit-vs-miss differential
0 errors
a miss is charged silently
A real Claude Code SDK bug busted the cache on every call — 12× cost, undetected until someone watched cache_read_input_tokens vs cache_creation_input_tokens. Misses never throw; they just bill.
The tension with everything you've learned
Attention says "rules at both edges." Caching says "static first, variable last." Do they fight?
At the front, they agree — your stable rules in primacy are good for attention and sit in the cached prefix.
At the back, they don't actually collide — because they work at different scopes. The whole instruction block is static, so its internal tail (your "critical rules, read last" restatement) is still inside the cached prefix. The truly variable content — history, the new message — comes after the entire instruction block.
The resolution is one rule: keep the instruction block static and put its critical rules at its own edges; push anything volatile (timestamps, cwd, per-session data) out of the prefix entirely. Volatile-in-prefix is the one move that loses on both axes — it busts the cache and adds noise.
↪ Your win: build a stable prefix and watch the meter
Immutable prefix, dynamic tail. System prompt + tool defs + instructions never mutate mid-session.
No volatile content in the prefix — no timestamps, cwd, or per-call personalization. Push it to the tail.
Sort tool definitions deterministically — non-deterministic order is a silent cache miss every call.
Monitorcache_read vs cache_creation — a mid-session creation spike means something mutated the prefix.
Compact by forking: keep the prefix, append the summary as new tail content — don't rebuild from scratch.
Question 5 · spaced recall from Lesson 06Adding more instruction layers past the ceiling tends to…
Ask me anything. Want the break-even math for your session shape (the 62.5-minute
1-hour-TTL rule), or to check whether any always-loaded file in content/ smuggles volatile content into
the prefix? That last one is exactly what the skill's new CE-9 check now hunts for.