The working vocabulary for the Tool Engineering course. Once a term lives here, every lesson uses this word for it. Grows as we go.
The Discipline
Tool engineering
Designing the tools agents use to act on the world — name, schema, description, output, errors, and the set as a whole — so an agent can select, invoke, and recover from them reliably. Agent quality is bounded by tool quality.
Avoid: "writing API wrappers" — a boilerplate wrapper is exactly what this discipline is not.
The literal bytes an agent reads to use a tool: name, schema, description, output, error. The agent reasons over the surface — not the implementation — so the surface is where reliability is won or lost.
The discipline of designing an external surface (SDK, CLI, API, docs) so an AI agent consumer can discover, invoke, and recover from it. The agent reads literally and has no intuition to repair ambiguity.
Avoid: conflating with harness engineering — AX is the surface you ship, not the agent's own scaffold.
A tool that mirrors an API endpoint one-to-one and returns its raw response. Shaped for the API, not the agent — the default mistake tool engineering corrects.
The "use this when X, prefer over other_tool when Y" guidance in a description that lets the agent choose between plausible tools. An instruction to the agent, not documentation of the interface.
Avoid: a bare capability statement — accurate about what the tool does but silent on when to prefer it.
A tool description written as if training a competent new hire: explicit about implicit context — domain terms, ID formats, valid filter values, resource relationships — that a terse API reference omits.
Redesigning the schema so the wrong call cannot be made: enumerated values over free text, bounded ranges with defaults, prerequisite gates (read-before-write), unambiguous names (user_id, not user).
Avoid: "validation" — validation rejects bad input at runtime; poka-yoke removes it from the interface entirely.
How much of the call sequence a description prescribes. Too low (hard-coded steps) is brittle when the task varies; too high (vague) burns tokens resolving ambiguity. State requirements and constraints; leave sequencing to the agent.
The framing that every tool result enters the context window and competes for attention. The design question becomes "what does the agent need to decide next?", not "what can I return?"
The sizing rule of thumb: tool output should fit in a paragraph. If it doesn't, the tool returns too much (filter it), the task truly needs it (structure it), or it's re-readable bulk (offload it to a file).
Returning natural-language fields the agent can reason over — names and formatted values — instead of opaque identifiers (UUIDs, MIME types). Reduces hallucination in retrieval tasks by removing transcription risk.
Avoid: developer-convenience output — the raw record with internal IDs and debug fields; gate that behind a debug mode.
The graceful-truncation shape for output that overflows the budget: a useful prefix + a structurally distinct marker (a leading banner, not trailing prose) + a continuation handle (offset, cursor, page token). All three, or it collapses to a hard error.
Avoid: a trailing [PARTIAL] line — the model reads the prefix as complete and the line as commentary.
A parameter (concise / detailed) that lets the agent pick output depth against its current context budget, instead of the tool always returning full or minimal.
An error that diagnoses the specific problem and names the fix ("end_date must be after start_date. Received…"), so the agent self-corrects on the next call. The error is part of the tool description in practice.
Avoid: opaque codes and tracebacks (400 Bad Request) — they tell the agent nothing it can act on.
The application/problem+json standard for machine-readable HTTP errors — five base fields plus extensions (retryable, retry_after, error_category) that map directly to agent control-flow branches: retry, escalate, or fail fast.
Keeping a failed call and its error in context as a negative example that steers the model off the dead end. Preserve during recovery; compact after success. Stripping reasoning traces measured a ~30% performance drop.
Avoid: "cleaning up" errors mid-task — that's deleting the guardrail.
Folding tools that are always called together — or where one's output always feeds another — into a single tool that maps to one human-understandable action. Reduces selection ambiguity and context footprint.
Avoid: over-consolidation into a black box that hides which step failed, costing failure granularity.
Two tools whose functions you can't distinguish in one sentence. The real driver of selection errors — producing redundant calls and wrong-tool choices — more than raw tool count.
A shared prefix (asana_search, asana_task_create) that signals related tools operate on the same system, reducing confusion when several genuinely distinct tools must coexist.
Defining the outcome and constraints rather than a step-by-step procedure. Prescriptive sequences are brittle when the task varies; the model has information the prompt author didn't. The complement to a minimal toolset.
Avoid: prescriptive scaffolding for capable models — reserve it for weaker models or compliance-driven, audit-trail work.
The property that running an operation twice leaves the same end state as running it once. Agents fail mid-task and get re-run with no memory of the first run, so idempotency is what keeps a retry from duplicating a branch, a comment, or a charge.
The foundational idempotency technique: one read to test for existing state before any write. Generalises to upsert over create and keying every artifact on a unique identifier (issue number, SHA) so it is findable rather than re-created.
Avoid: a single workflow-level guard — it skips unfinished work on a partial run; guard each artifact instead.
For effects that can't be probed for existence — charges, emails, webhooks — a record written against a unique key before executing and checked before re-executing. The log is the dedup record the resource itself doesn't provide; it must outlive the worst-case retry horizon.
An advisory flag on an MCP tool — readOnlyHint, destructiveHint, idempotentHint, openWorldHint — that the harness may read to make scheduling or gating decisions. Metadata, not a guarantee.
Avoid: trusting it unaudited — the spec treats annotations as untrusted unless the server is; a mis-marked mutating tool races under parallel dispatch.
A harness running honest readOnlyHint tools in parallel, collapsing the wall-clock cost of N independent reads from sum(latency) toward max(latency). Near-zero to wire (a static annotation lookup), but load-bearing on annotation honesty.
The per-server choice between keeping a tool's definitions in the prompt prefix every turn (eager) and deferring them behind a tool-search step (JIT). Classify on hit rate, definition size, latency tolerance, and description quality; keep the 3–5 hottest tools eager.
The point — roughly 30–50 visible tools — past which tool selection accuracy degrades significantly, as the correct tool gets harder to find among distractors. The reason discoverability, not just consolidation, matters at scale.
The silent failure of deferred loading: a tool whose description doesn't match the model's natural search terms becomes invisible, and the agent reports "no tool available" when the tool exists. Audit description craft before deferring.
One of the three things an MCP server exposes: a tool (model-invoked action), a resource (application-attached read-only context), or a prompt (user-triggered workflow). Picking the wrong one creates friction before naming or schema matter.
Avoid: exposing static read-only context as a tool — that forces a selection decision the agent shouldn't have to make.
Validated, schema-typed output a server returns alongside serialized JSON, so the agent relies on a known shape rather than parsing free text. Pairs with the two MCP error channels: protocol errors for the client, isError tool errors for the agent.
The standing token cost of tool definitions, paid every turn before any call because the definitions live in context. A large catalog can run to tens of thousands of tokens; deferring the long tail and trimming schemas is how you stop paying it.
The cost-budgeting rule: the locally cheapest call can be the most expensive across a task. An over-trimmed output forces a second round-trip; parallel dispatch against a rate-limited backend burns recovery turns. Price the loop, not the single call.