MCP Server Design · ~7 min
A great server nobody can find is dead weight. So is one that floods the window. Discoverability and token cost are two sides of the same budget — and they shape how your server evolves.
Tool schemas are injected into the model's context at startup. The GitHub MCP server alone has been measured at roughly 55,000 tokens across its 93 tool definitions; stacking servers can consume a third or more of a 200K window before any user input. Discovery is the mechanism that makes a large catalog affordable.
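The budget arithmetic is easy to check by hand. The sketch below uses the two figures quoted above (55,000 tokens, 200K window); the second server and its token count are hypothetical placeholders, not measurements.

```python
CONTEXT_WINDOW = 200_000  # window size quoted above


def schema_share(server_tokens: dict[str, int], window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the window consumed by tool definitions before any user input."""
    return sum(server_tokens.values()) / window


servers = {
    "github": 55_000,            # figure quoted above (93 tool definitions)
    "hypothetical_crm": 15_000,  # placeholder, not a measurement
}

print(f"{schema_share({'github': 55_000}):.1%}")  # 27.5%
print(f"{schema_share(servers):.1%}")             # 35.0%
print(f"{55_000 / 93:.0f} tokens per tool")       # 591 tokens per tool
```

At roughly 600 tokens per definition, a second mid-sized server is enough to cross the one-third mark.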
Modern hosts let each server choose whether its tools live in context from turn one (eager) or sit behind a tool-search step that loads them on first reference (just-in-time). Claude Code exposes this as alwaysLoad; the API has a per-tool defer_loading. Four signals decide:
| Signal | Eager (alwaysLoad: true) | JIT (default) |
|---|---|---|
| Hit rate | Used most sessions | Used in <20% of sessions |
| Definition size | Small vs. budget | Large enough to dominate |
| Latency tolerance | Round-trip is felt | Search step is acceptable |
| Description quality | Keywords don't match tasks | Trustworthy under search |
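One way to make the four signals operational is a simple vote. This is a rough sketch, not a host's actual policy: `ToolSignals`, `choose_loading`, the 50% reading of "most sessions", the 5% size cutoff, and the three-vote rule are all assumptions chosen for illustration. A description that search is unlikely to match counts toward eager, since deferring it would hide the tool.

```python
from dataclasses import dataclass


@dataclass
class ToolSignals:
    hit_rate: float          # fraction of sessions that call the tool
    definition_tokens: int   # size of the schema plus description
    latency_sensitive: bool  # is an extra search round-trip felt?
    search_friendly: bool    # do description keywords match how tasks are phrased?


def choose_loading(s: ToolSignals, budget_tokens: int = 10_000) -> str:
    """Return 'eager' or 'jit'; JIT is the default unless most signals say otherwise."""
    eager_votes = sum([
        s.hit_rate > 0.5,                            # used most sessions (assumed cutoff)
        s.definition_tokens / budget_tokens < 0.05,  # small relative to budget (assumed cutoff)
        s.latency_sensitive,
        not s.search_friendly,                       # search would likely miss it
    ])
    return "eager" if eager_votes >= 3 else "jit"


print(choose_loading(ToolSignals(0.9, 300, True, True)))     # eager
print(choose_loading(ToolSignals(0.1, 4_000, False, True)))  # jit
```

Treating JIT as the fallback mirrors the table, where just-in-time loading is the default column.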
Deferred loading preserves the prompt cache: deferred tool definitions append inline when discovered, so the cacheable prefix is untouched — unlike naively rebuilding the tool list per step, which busts the cache.
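A toy model shows why appending is cache-friendly. It assumes the cache matches on an exact prefix of the request, represented here as a list of opaque blocks; the block labels are invented for illustration and real providers differ in detail.

```python
def shared_prefix_len(before: list[str], after: list[str]) -> int:
    """Count leading blocks that are identical in both requests."""
    n = 0
    for a, b in zip(before, after):
        if a != b:
            break
        n += 1
    return n


turn_1 = ["system", "tool:search", "user:fix the build"]

# Deferred loading: the discovered definition is appended after existing content.
appended = turn_1 + ["tool:ci_logs", "assistant:..."]

# Naive rebuild: the tool list near the front changes, shifting everything after it.
rebuilt = ["system", "tool:ci_logs", "tool:search", "user:fix the build", "assistant:..."]

print(shared_prefix_len(turn_1, appended))  # 3, the whole earlier request is reusable
print(shared_prefix_len(turn_1, rebuilt))   # 1, only the system block survives
```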
JIT's failure mode is invisible. If your tool descriptions use jargon the model won't search for, the agent reports “no tool available” when one is right there.
One benchmark found retrieval errors account for nearly half of all failures in MCP agent tasks across diverse tool sets. Selection accuracy also degrades significantly past 30–50 visible tools. Audit description craft before you defer — and keep the catalog small.
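A description audit can start with something as crude as keyword overlap. The sketch below is a stand-in for whatever search a host really runs (which may be regex, BM25, or embedding based); the tool name, both descriptions, the query, and the stopword list are hypothetical.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "of", "on", "the", "to"}


def keywords(text: str) -> set[str]:
    """Lowercased words minus a tiny stopword list."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def search_tools(query: str, tools: dict[str, str], min_overlap: int = 1) -> list[str]:
    """Return tool names whose description shares enough keywords with the query."""
    q = keywords(query)
    scored = [(len(q & keywords(desc)), name) for name, desc in tools.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= min_overlap]


query = "find failed builds on the main branch"
jargon = {"wf_runs": "Enumerate GHA workflow run artifacts"}
plain = {"wf_runs": "List failed CI builds and test logs for a branch"}

print(search_tools(query, jargon))  # []
print(search_tools(query, plain))   # ['wf_runs']
```

The same tool is invisible under the first description and found under the second, which is the failure the audit is meant to catch before deferring.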
Server definitions can live at user, project/workspace, and local scope. When the same name appears at multiple scopes, hosts apply most-specific-wins — the closest scope defining a name silently disables the outer ones. Two consequences for a server author:
- If your github points at a private fork and the team's at the public registry, dedup silently disables one. Shadowing is for overrides of the same intent; parallel intents need distinct names.
- A strict enum on a fast-churning field breaks until redeploy; prefer thin string types on those fields so your surface stays stable across the agent's training gap.
The remote-vs-local choice from Lesson 1 also locks your auth path: remote forces OAuth, which forces the client-registration decision (CIMD vs. DCR), which shapes whether you need a credential vault. Sequence those decisions deliberately, because each forecloses the next.
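The most-specific-wins rule for scopes can be sketched as a first-match merge. The precedence order (local, then project, then user), the function name, and the example URLs are assumptions for illustration; check your host's documentation for its actual ordering.

```python
SCOPE_PRECEDENCE = ["local", "project", "user"]  # most specific first (assumed order)


def resolve_servers(scopes: dict[str, dict[str, str]]) -> dict[str, tuple[str, str]]:
    """Map each server name to the (scope, target) that wins; outer duplicates are dropped."""
    resolved: dict[str, tuple[str, str]] = {}
    for scope in SCOPE_PRECEDENCE:
        for name, target in scopes.get(scope, {}).items():
            resolved.setdefault(name, (scope, target))  # first (closest) definition wins
    return resolved


config = {
    "user": {
        "github": "https://example.com/private-fork/mcp",  # hypothetical URL
        "github-fork": "https://example.com/private-fork/mcp",
    },
    "project": {"github": "https://example.com/public-registry/mcp"},  # hypothetical URL
}

winners = resolve_servers(config)
print(winners["github"])       # ('project', 'https://example.com/public-registry/mcp')
print(winners["github-fork"])  # ('user', 'https://example.com/private-fork/mcp')
```

The user-level `github` entry disappears without a warning, while the distinctly named `github-fork` survives alongside the project's server.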
Retrieval practice — recall, don't peek
Question 1: Anthropic's floor for enabling tool search is roughly…
Question 2: The silent failure mode of just-in-time loading is…
Question 3: When one server name appears at several scopes, hosts apply…
Question 4: Deferred tool definitions preserve the prompt cache because they…
Question 5 · spaced recall from Lesson 4: For coding agents, the leg of the lethal trifecta to remove first is…