Tool Engineering · ~7 min
More tools feel like more capability. To the agent, each one adds reasoning tax and, past a point, measurably lowers accuracy. The fix is fewer, well-scoped tools.
The default mistake is to mirror the API: one tool per endpoint, one per operation. That produces a large set where the agent must chain calls for a single logical action — and must reason, at each step, about which tool comes next. Every one of those decisions is an opportunity for the wrong choice.
Each tool should map to one distinct, human-understandable sub-task. The test is mechanical:
Now the agent selects between "find" and "book" instead of reasoning about a five-step pipeline.
The real cost driver is overlapping functionality. When two tools can do the same job, two failure modes follow:
redundant calls (the agent calls both when one would do) and wrong selection (it picks the less appropriate one
because the distinction is unclear). OpenAI's data-agent team found that exposing their full tool set was
"confusing to agents"; consolidating and restricting, even removing valid options, improved end-to-end
reliability. When multiple related tools are genuinely necessary, group them under a namespace prefix
(asana_search, asana_task_create) so the relationship is explicit.
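One possible way to combine namespacing with restriction is to filter the exposed tool set by prefix. This is a sketch, not a prescribed mechanism: the registry, the jira_* entries, and the helper names are hypothetical, and only asana_search and asana_task_create appear in the lesson.

```python
# Hypothetical registry: tool name -> one-line description shown to the agent.
REGISTRY = {
    "asana_search": "Search Asana tasks and projects by keyword.",
    "asana_task_create": "Create a new Asana task.",
    "jira_search": "Search Jira issues by keyword.",
    "jira_issue_create": "Create a new Jira issue.",
}

def namespace_of(tool_name: str) -> str:
    """The namespace is everything before the first underscore."""
    return tool_name.split("_", 1)[0]

def select_tools(registry: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Expose only the tools whose namespace is relevant to the current task."""
    return {name: desc for name, desc in registry.items()
            if namespace_of(name) in allowed}

# For an Asana-only task, the agent never sees the overlapping jira_search.
exposed = select_tools(REGISTRY, {"asana"})
```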
Collapsing tools only helps if the agent can still choose reliably. A unified search with
mode={text,semantic,symbol} wins only if the model picks the right mode consistently;
otherwise you've moved the ambiguity from tool choice into parameter choice and gained nothing. The same goes
for a generic do_action(system, verb, payload): it pushes selection into parameter space and discards the
per-tool schemas that made each call legible.
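A sketch of what a unified search might look like when the mode choice is made as legible as possible. The schema is a generic JSON-Schema-style dictionary rather than any particular vendor's format, and the backends and mode descriptions are hypothetical placeholders.

```python
# Hypothetical backends; real ones would query an index.
def _text_search(query: str) -> list[str]:
    return [f"text:{query}"]

def _semantic_search(query: str) -> list[str]:
    return [f"semantic:{query}"]

def _symbol_search(query: str) -> list[str]:
    return [f"symbol:{query}"]

SEARCH_MODES = {
    "text": _text_search,
    "semantic": _semantic_search,
    "symbol": _symbol_search,
}

# The enum and the per-mode descriptions are what let the model pick a mode.
# Without them, the ambiguity just moves from tool choice to parameter choice.
SEARCH_TOOL_SCHEMA = {
    "name": "search",
    "description": "Search the codebase.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "mode": {
                "type": "string",
                "enum": list(SEARCH_MODES),
                "description": (
                    "text: exact substring match. "
                    "semantic: match by meaning. "
                    "symbol: look up a function or class name."
                ),
            },
        },
        "required": ["query", "mode"],
    },
}

def search(query: str, mode: str) -> list[str]:
    """Dispatch to one backend; reject unknown modes with an explicit message."""
    if mode not in SEARCH_MODES:
        raise ValueError(
            f"Unknown mode {mode!r}; expected one of {sorted(SEARCH_MODES)}"
        )
    return SEARCH_MODES[mode](query)
```

By contrast, a do_action(system, verb, payload) tool has no enum to constrain verb and no schema for payload, so nothing in the tool definition tells the model what a valid call looks like.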
Over-consolidation has its own failures. A merged tool that does too much becomes a black box: when
find_and_book_flight silently fails at the hold step, it looks identical to a failure at confirmation,
and the agent can't reason about which step broke. Keep tools separate when they:
| Don't merge if the tools… | Because… |
|---|---|
| Serve distinct sub-tasks not always done together | Forcing a merged call wastes tokens and obscures intent |
| Have different permission requirements | Merging grants excess access to every caller |
| Have wildly different output schemas | The merged response becomes incoherent to pattern-match |
The test: does the merged tool still map to one clear human action? If describing it takes a paragraph, it's over-consolidated. If two sub-tasks are sometimes called together but not always, keep them separate and let the agent compose.
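The permissions row of the table can be made concrete with a small sketch. The scope names and the tool-to-scope mapping are hypothetical, chosen only to illustrate why a merged tool demands the union of its parts' permissions.

```python
# Hypothetical scope names and tool-to-scope mapping, for illustration only.
TOOL_SCOPES = {
    "find_flights": {"flights:read"},
    "book_flight": {"bookings:write", "payments:charge"},
}

def required_scopes(*tool_names: str) -> set[str]:
    """A merged tool needs the union of the scopes of everything it wraps."""
    scopes: set[str] = set()
    for name in tool_names:
        scopes |= TOOL_SCOPES[name]
    return scopes

def can_call(caller_scopes: set[str], needed: set[str]) -> bool:
    """A caller may invoke a tool only if it holds every required scope."""
    return needed <= caller_scopes

read_only_agent = {"flights:read"}

# Kept separate: a read-only agent can still search.
separate_ok = can_call(read_only_agent, required_scopes("find_flights"))

# Merged: the same agent is locked out unless it is also granted
# write and payment scopes it does not otherwise need.
merged_ok = can_call(read_only_agent,
                     required_scopes("find_flights", "book_flight"))
```

Here separate_ok is True and merged_ok is False, so the only way to let the read-only agent use a merged tool is to over-grant it.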
Minimal tools pair with goal-oriented prompting. Prescriptive step-by-step instructions anchor the agent to one procedure that breaks when the task varies — "rigid instructions often pushed the agent down incorrect paths." Define the outcome and constraints, not the steps; the model has information about what it found that the prompt author didn't. (Weaker models are the exception — they benefit more from procedural scaffolding.)
Retrieval practice — recall, don't peek
Question 1: Two tools that are always called together should be…
Question 2: The real driver of selection errors in a large toolset is…
Question 3: Over-consolidating into one black-box tool mainly costs you…
Question 4: A unified search with a mode parameter only helps if…
Question 5 · spaced recall from Lesson 05: An agent-facing error message should, above all,…