Tool Discoverability at Scale

Lesson 6 thinned one toolset. But some agents legitimately need dozens of tools — and past a point the model can't hold them all in view. The fix isn't fewer tools; it's loading them at the right time.

Why this, for you: the lever for when consolidation has already run and you still have too many tools to show at once. Discoverability decides which capabilities sit in context every turn and which get fetched on demand — a choice that moves both selection accuracy and token cost at the same time.

Consolidation cuts overlap, but it can't shrink a genuinely large surface area: a coding agent with search, source control, CI, observability, and ticketing servers can easily exceed what fits comfortably in context. Showing every tool on every turn has two costs that compound: prefix tokens spent whether or not a tool is used, and selection accuracy that degrades once too many tools are visible at once.

Tool selection degrades significantly past roughly 30–50 visible tools, and a typical multi-server setup can consume on the order of tens of thousands of tokens in definitions before the conversation even starts. More visible tools do not mean more capability: past that threshold, the result is measurably worse selection and a standing token tax.
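The token tax can be made concrete with a little arithmetic. The sketch below is illustrative only: `ToolDef`, `prefix_tokens`, and `session_cost` are hypothetical names, and the 40-tool, 500-token catalog is an assumed example rather than a measurement. It also ignores caching discounts, so treat the session figure as an upper bound on definition tokens processed.

```python
from dataclasses import dataclass


@dataclass
class ToolDef:
    name: str
    definition_tokens: int  # assumed size of the tool's schema plus description


def prefix_tokens(tools: list[ToolDef]) -> int:
    """Definition tokens that sit in context on every turn when all tools are eager."""
    return sum(t.definition_tokens for t in tools)


def session_cost(tools: list[ToolDef], turns: int) -> int:
    """Definition tokens processed across a session, ignoring any cache discount."""
    return prefix_tokens(tools) * turns


# Assumed example: 40 tools at 500 tokens each, over a 25-turn session.
catalog = [ToolDef(f"tool_{i}", 500) for i in range(40)]
print(prefix_tokens(catalog))     # 20000
print(session_cost(catalog, 25))  # 500000
```

Under these assumed numbers the definitions alone occupy 20,000 tokens of context before the first user message, which is the scale the paragraph above describes.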

1 Eager vs. just-in-time loading

The modern answer is to not load everything. Hosts now support deferred tools: instead of sitting in the system-prompt prefix, a deferred tool waits behind a tool search step. The model issues a search, gets back a few matching tool references, and those expand inline only when needed.

# Eager: in the prefix from turn one
project-search   alwaysLoad: true   # used every turn
github           alwaysLoad: true   # used most sessions

# Deferred: loaded only when a tool search matches
linear    # "ticket" → fetched on the rare turn it's needed
sentry    # "error" → incident-only
grafana   # "metric" → incident-only

Eager tools pay their token cost every turn but answer instantly. Deferred tools cost a search round-trip on first reference but stay out of the prefix the rest of the time. Crucially, deferral preserves the prompt cache: the definition appends inline as a reference block rather than mutating the cached prefix.
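The eager/deferred split can be sketched as a toy registry. This is a minimal model, not any host's actual implementation: `Tool`, `Registry`, and the keyword-overlap search are assumptions made for illustration, and the tool descriptions are invented. What it shows is that a search hit is appended after the prefix, so the prefix list itself never changes.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    description: str
    always_load: bool = False


@dataclass
class Registry:
    tools: list[Tool]
    loaded: list[str] = field(default_factory=list)  # deferred tools expanded so far

    def prefix(self) -> list[str]:
        """Eager tools, visible from turn one. Searching never alters this list."""
        return [t.name for t in self.tools if t.always_load]

    def search(self, query: str, limit: int = 3) -> list[str]:
        """Naive keyword overlap against deferred tools only (a stand-in for real tool search)."""
        words = set(query.lower().split())
        hits = [
            t.name
            for t in self.tools
            if not t.always_load and words & set(t.description.lower().split())
        ][:limit]
        for name in hits:
            if name not in self.loaded:
                self.loaded.append(name)  # appended after the prefix, not inserted into it
        return hits

    def visible(self) -> list[str]:
        return self.prefix() + self.loaded


registry = Registry([
    Tool("project-search", "search files in the project", always_load=True),
    Tool("github", "pull requests and commits", always_load=True),
    Tool("linear", "create or update a ticket"),
    Tool("sentry", "look up an error report"),
    Tool("grafana", "query a metric dashboard"),
])
before = registry.prefix()
hits = registry.search("find the error behind this crash")
print(hits)                # ['sentry']
print(registry.visible())  # ['project-search', 'github', 'sentry']
```

Because `prefix()` returns the same list before and after the search, anything cached against it stays valid; only the tail of the visible set grows.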

2 The classification rubric

Decide per server — or per tool — using four signals. The published floor: enabling tool search is worth it at roughly 10+ tools or >10K tokens of definitions; below that, eager-load everything.

Signal	Eager-load	Defer (just-in-time)
Hit rate	Used in most sessions	Used in a small minority
Definition size	Small against the budget	Large enough to dominate
Latency tolerance	Cold round-trip is felt	A search step is acceptable
Description quality	Keywords don't match task language	Trustworthy under search

The guidance operationalises the first two as a rule of thumb: keep your 3–5 most frequently used tools eager for best performance, and defer the long tail.
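One way to turn the rubric into a first-pass classifier is sketched below. The names (`ToolStats`, `tool_search_worthwhile`, `classify`) and the ranking rule are assumptions, not a published algorithm: the floor values come from the text above, hit rate is assumed to be measured from your own session logs, and a tool that is latency-sensitive or not reliably findable by search is kept eager, since a deferred tool has to be found before it can be used.

```python
from dataclasses import dataclass


@dataclass
class ToolStats:
    name: str
    hit_rate: float          # fraction of sessions that call the tool (assumed measured)
    definition_tokens: int
    latency_sensitive: bool  # True if a cold search round-trip would be felt
    searchable: bool         # True if the description matches task language


def tool_search_worthwhile(tools, min_tools=10, min_tokens=10_000):
    """Floor from the text: roughly 10+ tools or more than 10K tokens of definitions."""
    return len(tools) >= min_tools or sum(t.definition_tokens for t in tools) > min_tokens


def classify(tools, max_eager=5):
    """Return (eager_names, deferred_names)."""
    if not tool_search_worthwhile(tools):
        return [t.name for t in tools], []  # below the floor: eager-load everything
    # Tools that cannot safely sit behind search stay eager regardless of rank.
    forced = [t for t in tools if t.latency_sensitive or not t.searchable]
    forced_names = {t.name for t in forced}
    # Rank the rest by hit rate, preferring smaller definitions on ties.
    ranked = sorted(
        (t for t in tools if t.name not in forced_names),
        key=lambda t: (t.hit_rate, -t.definition_tokens),
        reverse=True,
    )
    slots = max(0, max_eager - len(forced))
    eager = forced + ranked[:slots]
    return [t.name for t in eager], [t.name for t in ranked[slots:]]


# Assumed example: 12 tools with rising hit rates; t0 has a jargon-heavy description.
tools = [ToolStats(f"t{i}", i / 12, 400, False, True) for i in range(12)]
tools[0].searchable = False
eager, deferred = classify(tools)
print(eager)          # ['t0', 't11', 't10', 't9', 't8']
print(len(deferred))  # 7
```

The cap of five eager tools mirrors the 3–5 rule of thumb; whether forced-eager tools should count against that cap is a design choice this sketch makes for simplicity.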

3 The silent failure: search-miss

Deferral introduces a failure mode consolidation doesn't have. A deferred tool is only findable if the model can phrase a search query that matches its description. If the description uses internal jargon the model won't search for, the tool is invisible — the agent reports "no tool available" when the tool is right there.

Description craft becomes a discoverability gate

Before you defer a tool, audit its description against the task language a model would actually use — not the domain's internal vocabulary. Deferral turns the description-quality lesson into a hard dependency: a vague description that merely cost you a wrong selection when eager now costs you a tool that can't be found at all.
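A rough pre-deferral audit might look like the sketch below. It assumes search behaves like simple keyword overlap, which real tool search may not; `terms`, `search_misses`, the stopword list, and both example descriptions are invented for illustration. The idea is to list the task phrasings a model might plausibly use and flag any that share no term with the description.

```python
import re

# Small assumed stopword list so that filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "to", "for", "of", "in", "on", "and", "or", "this", "is", "my", "why"}


def terms(text: str) -> set[str]:
    """Lowercased word set with punctuation and stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def search_misses(description: str, task_phrases: list[str]) -> list[str]:
    """Task phrases that share no term with the description: likely search misses."""
    desc = terms(description)
    return [p for p in task_phrases if not (terms(p) & desc)]


phrases = [
    "why is checkout throwing an exception",
    "find the crash from last night",
]
jargon = "Queries the INC-DB triage ledger for P1 artefacts"
plain = "Look up an error report, exception, or crash stack trace for an incident"

print(search_misses(jargon, phrases))  # both phrases miss
print(search_misses(plain, phrases))   # []
```

A non-empty result is a signal to rewrite the description in task language, or to keep the tool eager until that is done.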

At the far end of scale — catalogs in the hundreds or thousands of tools — in-context enumeration stops working entirely, and retrieval-based selection (embedding search over a tool index) replaces it. The principle is the same one Lesson 6 started: the agent should see the right tools, not all of them.
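Retrieval-based selection can be sketched in a few lines. To stay self-contained, the example below swaps a real embedding model for a word-count vector with cosine similarity, which is a deliberate simplification; `vectorize`, `build_index`, `top_k`, and the three-tool catalog are all hypothetical. The shape is what matters: index every description once, then surface only the top matches for each request.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Word-count vector; a toy stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(catalog: dict[str, str]) -> dict[str, Counter]:
    """Vectorize each tool description once, ahead of time."""
    return {name: vectorize(desc) for name, desc in catalog.items()}


def top_k(index: dict[str, Counter], query: str, k: int = 3) -> list[str]:
    """Names of the k most similar tools, skipping zero-similarity entries."""
    q = vectorize(query)
    scored = sorted(((cosine(q, vec), name) for name, vec in index.items()), reverse=True)
    return [name for score, name in scored[:k] if score > 0]


catalog = {
    "sentry": "look up an error report or exception",
    "grafana": "query a metric dashboard",
    "linear": "create or update a ticket",
}
index = build_index(catalog)
print(top_k(index, "open a ticket for this bug", k=1))  # ['linear']
```

With a real embedding model the index would hold dense vectors and the query could match on meaning rather than shared words, but the agent-facing contract stays the same: a short list of relevant tools instead of the full catalog.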

↪ Your win: load tools at the right time, not all at once

Selection degrades past ~30–50 visible tools and definitions tax every turn — visibility isn't free.
Eager-load the 3–5 hottest tools; defer the long tail behind tool search.
Classify per server on hit rate, size, latency tolerance, and description quality.
Deferral preserves the prompt cache: definitions append inline rather than mutating the prefix.
Audit descriptions before deferring — search-miss is the silent "tool not available" failure.

Retrieval practice — recall, don't peek

Question 1: Tool selection accuracy degrades significantly once visible tools exceed about…

Question 2: A deferred tool, versus an eager one, trades prefix tokens for…

Question 3: The guidance says to keep eager the tools that are your…

Question 4: "Tool not available" when the tool exists is the failure caused by…

Question 5 · spaced recall from Lesson 08: A harness that parallelises read-only tools is only safe once the annotations have been…

Next: MCP Tool Exposure, the server-side decisions (from picking tool-vs-resource-vs-prompt to structured output) that make a server findable and legible in the first place.