Tool Engineering · ~10 min
Fourteen lessons, one habit: read the symptom, find the surface, apply the move. Here it is as a single routing table — now spanning reliability, scale, cost, and the surfaces beyond the catalog — plus a mixed review of the whole course.
The through-line of the whole course: agent quality is bounded by tool quality, and the bound is set by the surface, not the implementation. When an agent misuses a tool, you don't reach for the prompt — you read the symptom off the surface and apply the matching move.
Match the symptom to the surface, then to the fix. Each row is one lesson.
| Symptom you observe | Surface | The move |
|---|---|---|
| Agent picks the wrong tool, or doesn't find the right one | Name & description | Add selection signals: "use when X, prefer over Y when Z" (L2) |
| Agent passes malformed or out-of-range parameters | Schema | Poka-yoke: enums, bounded ranges + defaults, prerequisite gates (L2) |
| Context fills fast; quality drops mid-session | Output volume | Return only the next decision's inputs; paragraph heuristic (L3) |
| Agent hallucinates field names or miscopies IDs | Output shape | Semantic names over UUIDs; only decision-relevant fields (L4) |
| Big results overflow or get silently truncated | Output overflow | PARTIAL: prefix + leading marker + continuation handle (L4) |
| Agent retries a failing call or gives up | Errors | Diagnose + direct; RFC 9457 fields; preserve the trace (L5) |
| Agent hesitates between tools or chains many calls | The set | Consolidate overlap; namespace; prompt to outcomes (L6) |
| A re-run after failure duplicates branches, comments, or charges | Effects | Check-before-act, upsert, unique keys; idempotency log for side effects (L7) |
| Parallel reads race, or a "read" tool quietly mutates state | Annotations | Honest readOnlyHint/idempotentHint; audit before trusting (L8) |
| Selection degrades as the catalog grows; "tool not available" | Discoverability | Eager-load the 3–5 hot tools; defer the rest behind tool search (L9) |
| Wrong primitive, opaque names, unstructured server output | MCP exposure | Tool vs resource vs prompt; verb_noun; outputSchema (L10) |
| Calls are correct but the toolset is slow or expensive | Cost & latency | Budget the catalog tax; overlap independent reads so latency is the max, not the sum; clear results (L11) |
| Agent misuses a correct tool; the gap is usage knowledge a schema can't carry | Skill packaging | Package knowledge (not behavior); description gate + Gotchas; fork heavy work (L12) |
| A rule must hold whatever the model decides — never push main, use pnpm | Lifecycle enforcement | Hook it, don't prompt it: PreToolUse + exit 2; cover substitution paths (L13) |
| The typed catalog is large but the model already knows the CLI | Tool interface | Collapse toward one run(); split execution from presentation; guard with a hook (L14) |
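As an illustration of two rows working together (Schema and Errors), here is a minimal sketch of a tool that validates its parameters up front and answers a bad call with an error that diagnoses and directs. The tool name `list_runs`, the status values, the limits, the `type` URI, and the `next_action` and `allowed_values` extension members are all hypothetical; only `type`, `title`, `status`, and `detail` come from RFC 9457, which also permits extension members.

```python
from typing import Any

# Hypothetical values for illustration; a real tool would define its own.
ALLOWED_STATES = ("queued", "running", "failed", "succeeded")
DEFAULT_LIMIT = 10
MAX_LIMIT = 50


def problem(status: int, title: str, detail: str, **extensions: Any) -> dict[str, Any]:
    """Build an RFC 9457 style problem object; extra keys are extension members."""
    body: dict[str, Any] = {
        "type": "https://example.com/problems/invalid-parameter",  # hypothetical URI
        "title": title,
        "status": status,
        "detail": detail,
    }
    body.update(extensions)
    return body


def list_runs(state: str, limit: int = DEFAULT_LIMIT) -> dict[str, Any]:
    """Hypothetical tool: list runs in one state, at most `limit` of them."""
    # Enum check: reject anything outside the closed set and name the valid values.
    if state not in ALLOWED_STATES:
        return problem(
            400,
            "Invalid state",
            f"state={state!r} is not one of {list(ALLOWED_STATES)}.",
            allowed_values=list(ALLOWED_STATES),
            next_action="Retry with one of allowed_values.",
        )
    # Bounded range with a default: the agent can omit limit and still be correct.
    if not 1 <= limit <= MAX_LIMIT:
        return problem(
            400,
            "Limit out of range",
            f"limit={limit} must be between 1 and {MAX_LIMIT}.",
            next_action=f"Retry with limit <= {MAX_LIMIT}, or omit it to use {DEFAULT_LIMIT}.",
        )
    # Stand-in for the real lookup.
    return {"state": state, "limit": limit, "runs": []}
```

Calling `list_runs("done")` returns a problem object whose `detail` names the bad value and whose `next_action` tells the agent what to try instead, so it can correct the call rather than retry it unchanged or give up.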
The course isn't a list of "always do X" rules — its levers pull against each other, and engineering is picking the right point on each axis.
For almost every decision in this course, one question routes it: does this make the agent's correct next action easier to take? If a field, a tool merge, an annotation, or an error rewrite makes the next step clearer or safer, keep it. If it adds tokens, ambiguity, or a race without making the next action easier, cut it. "Fix the interface, not the prompt" terminates — the prompt-patch loop doesn't.
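Applied to a single output field, the question might look like the sketch below: keep a field only if the agent needs it to choose its next step. The field names, the sample run object, and the assumption that follow-up tools accept the semantic `name` in place of a UUID are all hypothetical.

```python
from typing import Any

# Hypothetical: the fields the agent needs to pick its next action.
DECISION_FIELDS = ("name", "state", "failed_step", "log_hint")


def shape_run(raw: dict[str, Any]) -> dict[str, Any]:
    """Keep only decision-relevant fields; drop everything else the backend returned."""
    return {key: raw[key] for key in DECISION_FIELDS if key in raw}


# A trimmed stand-in for a much larger backend object.
raw_run = {
    "id": "3f2b6c1e-9d4a-4e1b-8a77-5c0d2e9b1f10",
    "name": "nightly-build-main",
    "state": "failed",
    "failed_step": "integration-tests",
    "log_hint": "timeout after 600s in test_checkout",
    "created_by_id": "a81d0c55-2f6e-4c3b-9b1a-7e4f0d6c2a99",
    "internal_shard": 7,
}

shaped = shape_run(raw_run)
```

Here `shaped` carries four fields instead of seven, and none of them is a raw UUID the agent could miscopy. Which fields count as decision-relevant depends on the tools that come next, so the tuple above is a judgment call rather than a rule.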
The whole discipline assumes a stable, shared tool called across many sessions. That's where engineering the surface pays back. It's overhead — sometimes net-negative — for one-off exploratory scripts, tools wrapping a well-documented API the model already has strong priors for, or interfaces still changing every sprint, where heavy docs drift from reality and mislead more than a terse stub. The reliability and scaling moves carry the same caveat: idempotency guards, honest annotations, and defer-vs-eager tuning earn their keep on durable, high-traffic tools and are ceremony on a throwaway one. Engineer durable surfaces; keep throwaway ones thin.
Mixed review: the whole course. Recall, don't peek.
Question 1 · from Lesson 01: "Agent quality is bounded by tool quality" tells you to fix a misuse by changing the…
Question 2 · from Lessons 04 & 05: A tool returning a 40-field run object with raw UUIDs, then a bare 500 on failure, breaks two rules: shape the output and…
Question 3 · from Lesson 07: The first technique that makes a re-run after failure safe is to…
Question 4 · from Lesson 08: A harness that dispatches readOnlyHint: true tools in parallel is only safe once you have…
Question 5 · from Lessons 09 & 11: Past roughly thirty to fifty visible tools, eager-loading the whole catalog mainly costs you…
Question 6 · from Lessons 12, 13 & 14: Match the surface beyond the catalog: usage knowledge a schema can't carry, a rule that must hold whatever the model decides, and a model that already knows the CLI go to…