Tool Engineering · ~7 min
Every tool call is a deposit into the context window. A tool that returns 10,000 tokens when 200 would do has spent 10% of a 100K window on a single call.
Reframe the tool. It isn't a data-access function; it's a context injection. Whatever it returns enters the window and competes for attention with everything else there. The design question is no longer "what can I return?" but "what does the agent need to know to decide what to do next?"
"3 checks passed, 1 failed: lint" rather than the 40-field GitHub Actions run object with timestamps, URLs, and raw logs the agent will never read.
It's not only opportunity cost. Transformer self-attention computes pairwise relationships across every token, so irrelevant tokens compete with relevant ones for focus. Worse, oversized output strands the useful fields in the low-attention middle (the lost-in-the-middle effect). The agent's ability to act correctly on a field degrades the more noise surrounds it.
A CI tool returning the raw run object costs ~400 tokens per call. Compute the summary at the tool layer (how many checks passed, plus the names of any failures) and it drops to ~20. The agent reads "1 failed: lint" and runs the lint fixer immediately: no parsing, nothing discarded.
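A minimal sketch of that tool-layer summary follows. The shape of `raw_run` (a `checks` list with `name` and `conclusion` fields) is a simplified, hypothetical stand-in rather than the actual GitHub Actions schema, and `summarise_checks` is an illustrative name.

```python
from typing import Any


def summarise_checks(run: dict[str, Any]) -> str:
    """Collapse a raw CI run object into the one line the agent acts on."""
    checks = run.get("checks", [])
    passed = [c["name"] for c in checks if c.get("conclusion") == "success"]
    failed = [c["name"] for c in checks if c.get("conclusion") != "success"]
    summary = f"{len(passed)} checks passed, {len(failed)} failed"
    if failed:
        # Name the failures so the agent can pick the right fixer directly.
        summary += ": " + ", ".join(failed)
    return summary


# Hypothetical raw object: most fields are noise for the next decision.
raw_run = {
    "id": 1234,
    "html_url": "https://example.invalid/runs/1234",
    "checks": [
        {"name": "build", "conclusion": "success", "started_at": "..."},
        {"name": "test", "conclusion": "success", "started_at": "..."},
        {"name": "typecheck", "conclusion": "success", "started_at": "..."},
        {"name": "lint", "conclusion": "failure", "started_at": "..."},
    ],
}

print(summarise_checks(raw_run))  # 3 checks passed, 1 failed: lint
```

The summariser treats anything other than `success` as a failure, which is a deliberate simplification; a real tool would likely distinguish skipped or cancelled checks.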
A useful rule of thumb: tool output should fit in a paragraph. If it doesn't, one of three things is true — and each has a different fix.
| If the output is large because… | Then… |
|---|---|
| The tool returns more than the decision needs | Add filtering or summarisation at the tool layer |
| The task genuinely needs all of it | Load it once and structure it carefully (Lesson 4) |
| It's bulk the agent re-reads on demand | Write it to a file and return a reference + preview |
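The third row (write to a file, return a reference plus a preview) could look roughly like this. The function name `offload_output`, the returned field names, and the five-line preview are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def offload_output(text: str, preview_lines: int = 5) -> dict:
    """Write bulky tool output to a file; return a reference plus a short preview."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(text)
        path = Path(fh.name)
    lines = text.splitlines()
    return {
        "path": str(path),            # reference the agent can re-read on demand
        "total_lines": len(lines),    # tells the agent how much it is not seeing
        "preview": "\n".join(lines[:preview_lines]),
    }


# Hypothetical bulky output: a 1,000-line log.
log = "\n".join(f"line {i}" for i in range(1, 1001))
ref = offload_output(log)
print(ref["total_lines"])
print(ref["preview"])
```

Only the small dictionary enters the context window; the full log stays on disk until the agent decides it needs a specific part of it.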
Prefer IDs and summaries over full objects. Structured output — JSON with named fields, or concise prose — is easier for the agent to process than a raw API dump.
Output isn't the only injection. Every tool definition sits in context on every turn. A typical multi-server MCP setup can consume ~55,000 tokens in tool definitions alone — before any task work begins. So token efficiency has two fronts: shrink what each call returns, and shrink the standing cost of the toolset itself. Precise descriptions help here too — an ambiguous one forces the agent to spend tokens resolving it before invoking (Lesson 2); a vague toolset taxes every decision (Lesson 6).
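To see where the standing cost sits, one rough approach is to estimate the size of each tool definition and sort by it. This sketch assumes about four characters per token, which is a crude heuristic and not a real tokenizer; the tool definitions shown are hypothetical.

```python
import json


def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def standing_cost(tool_definitions: list[dict]) -> list[tuple[str, int]]:
    """Return (tool name, estimated tokens) pairs, largest first."""
    costs = [(t["name"], estimate_tokens(json.dumps(t))) for t in tool_definitions]
    return sorted(costs, key=lambda pair: pair[1], reverse=True)


# Hypothetical definitions: one concise, one bloated.
tools = [
    {
        "name": "ci_status",
        "description": "Return pass/fail counts and failing check names for a run.",
        "parameters": {"run_id": "string"},
    },
    {
        "name": "search_everything",
        "description": "Searches things. " * 100,
        "parameters": {"query": "string", "options": "object"},
    },
]

costs = standing_cost(tools)
for name, tokens in costs:
    print(f"{name}: ~{tokens} tokens on every turn")
print("total:", sum(tokens for _, tokens in costs))
```

For real measurements, swap `estimate_tokens` for the tokenizer that matches the model in use.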
Over-filtering has its own failure modes. A summary that drops "unimportant" fields will eventually drop one a rare-but-valid path needs — and the agent can't ask for data it doesn't know exists. A bespoke summariser also breaks on every upstream schema change, and for a tool called once a session can cost more than it saves. Apply this where output is consistently large or the tool runs in a loop — measure context pressure before you build a summarisation layer.
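One way to measure context pressure before building anything is to log how many tokens each tool contributes over a session. The sketch below is illustrative only: the class name, the 2,000-token threshold, and the four-characters-per-token estimate are all assumptions.

```python
from collections import defaultdict


class ContextPressureLog:
    """Track how many (estimated) tokens each tool injects across a session."""

    def __init__(self) -> None:
        self._calls: dict[str, int] = defaultdict(int)
        self._tokens: dict[str, int] = defaultdict(int)

    def record(self, tool: str, output: str) -> None:
        self._calls[tool] += 1
        # Rough estimate: ~4 characters per token (not a real tokenizer).
        self._tokens[tool] += max(1, len(output) // 4)

    def candidates(self, min_total_tokens: int = 2000) -> list[str]:
        """Tools whose cumulative output is large enough to justify a summariser."""
        return sorted(
            tool for tool, total in self._tokens.items() if total >= min_total_tokens
        )


pressure = ContextPressureLog()
for _ in range(10):
    pressure.record("ci_status", "x" * 1600)  # ~400 tokens, called in a loop
pressure.record("get_config", "y" * 1600)     # ~400 tokens, called once

print(pressure.candidates())  # ['ci_status']
```

Both tools return the same amount per call, but only the one running in a loop crosses the threshold, which matches the guidance above: summarise where output is consistently large or repeated, and leave once-a-session tools alone.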
Retrieval practice — recall, don't peek
Question 1: The right design question for tool output is…
Question 2: Beyond cost, oversized output hurts because irrelevant tokens…
Question 3: The rough sizing heuristic is that tool output should fit in…
Question 4: Aggressively stripping fields backfires mainly because the agent…
Question 5 · spaced recall from Lesson 02: The most common tool-description failure states what the tool does but not…