Tool Engineering · ~10 min

Capstone — The Tool Engineer's Decision Table

Fourteen lessons, one habit: read the symptom, find the surface, apply the move. Here it is as a single routing table — now spanning reliability, scale, cost, and the surfaces beyond the catalog — plus a mixed review of the whole course.

Why this, for you: the page to keep open while you design. Every recurring agent-tool failure maps to a surface and a move you now know. This capstone collapses the course into a diagnostic you can run in seconds against a real tool — and then tests that you've internalised it.

The through-line of the whole course: agent quality is bounded by tool quality, and the bound is set by the surface, not the implementation. When an agent misuses a tool, you don't reach for the prompt — you read the symptom off the surface and apply the matching move.

1 The decision table

Match the symptom to the surface, then to the fix. Each row points back to the lesson that covers it, shown in parentheses (L2 = Lesson 2).

Symptom you observe | Surface | The move
Agent picks the wrong tool, or doesn't find the right one | Name & description | Add selection signals: "use when X, prefer over Y when Z" (L2)
Agent passes malformed or out-of-range parameters | Schema | Poka-yoke: enums, bounded ranges + defaults, prerequisite gates (L2)
Context fills fast; quality drops mid-session | Output volume | Return only the next decision's inputs; paragraph heuristic (L3)
Agent hallucinates field names or miscopies IDs | Output shape | Semantic names over UUIDs; only decision-relevant fields (L4)
Big results overflow or get silently truncated | Output overflow | PARTIAL: prefix + leading marker + continuation handle (L4)
Agent retries a failing call or gives up | Errors | Diagnose + direct; RFC 9457 fields; preserve the trace (L5)
Agent hesitates between tools or chains many calls | The set | Consolidate overlap; namespace; prompt to outcomes (L6)
A re-run after failure duplicates branches, comments, or charges | Effects | Check-before-act, upsert, unique keys; idempotency log for side effects (L7)
Parallel reads race, or a "read" tool quietly mutates state | Annotations | Honest readOnlyHint/idempotentHint; audit before trusting (L8)
Selection degrades as the catalog grows; "tool not available" | Discoverability | Eager-load the 3–5 hot tools; defer the rest behind tool search (L9)
Wrong primitive, opaque names, unstructured server output | MCP exposure | Tool vs resource vs prompt; verb_noun; outputSchema (L10)
Calls are correct but the toolset is slow or expensive | Cost & latency | Budget the catalog tax; overlap reads to max; clear results (L11)
Agent misuses a correct tool; the gap is usage knowledge a schema can't carry | Skill packaging | Package knowledge (not behavior); description gate + Gotchas; fork heavy work (L12)
A rule must hold whatever the model decides (never push main, use pnpm) | Lifecycle enforcement | Hook it, don't prompt it: PreToolUse + exit 2; cover substitution paths (L13)
The typed catalog is large but the model already knows the CLI | Tool interface | Collapse toward one run(); split execution from presentation; guard with a hook (L14)
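
Two of the rows above, Schema and Errors, can be sketched together. This is a minimal illustration, not the course's own code: it assumes a hypothetical `search_tickets` tool with a `status` enum and a bounded `limit`. The `type`, `title`, `status`, and `detail` members follow RFC 9457 problem details; the `suggested_fix` extension member and the placeholder URI are invented for this sketch.

```python
from typing import Any

# Hypothetical parameter spec for an illustrative `search_tickets` tool.
ALLOWED_STATUS = ("open", "closed", "pending")   # enum: no free-text status
LIMIT_MIN, LIMIT_MAX, LIMIT_DEFAULT = 1, 50, 10  # bounded range + default


def problem(title: str, detail: str, fix: str) -> dict[str, Any]:
    """Build an error in the shape of RFC 9457 problem details.

    `suggested_fix` is an extension member made up for this sketch.
    """
    return {
        "type": "https://example.com/problems/invalid-parameter",  # placeholder
        "title": title,
        "status": 400,
        "detail": detail,      # diagnose: what was wrong with the call
        "suggested_fix": fix,  # direct: what the agent should try next
    }


def validate_search_args(args: dict[str, Any]) -> dict[str, Any]:
    """Return {'ok': True, 'args': ...} or {'ok': False, 'error': ...}."""
    status = args.get("status", "open")
    if status not in ALLOWED_STATUS:
        return {"ok": False, "error": problem(
            "Invalid status",
            f"status={status!r} is not one of {list(ALLOWED_STATUS)}.",
            f"Retry with status set to one of {list(ALLOWED_STATUS)}.",
        )}

    limit = args.get("limit", LIMIT_DEFAULT)
    # bool is a subclass of int in Python, so it is rejected explicitly.
    limit_ok = (
        isinstance(limit, int)
        and not isinstance(limit, bool)
        and LIMIT_MIN <= limit <= LIMIT_MAX
    )
    if not limit_ok:
        return {"ok": False, "error": problem(
            "Limit out of range",
            f"limit={limit!r} must be an integer from {LIMIT_MIN} to {LIMIT_MAX}.",
            f"Retry with limit <= {LIMIT_MAX}, or omit it to use {LIMIT_DEFAULT}.",
        )}

    return {"ok": True, "args": {"status": status, "limit": limit}}
```

The design choice worth noting is that the error names the accepted values and the next call to make, so the agent does not have to guess between retrying and giving up.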

2 Four tensions to hold

The course isn't a list of "always do X" rules — its levers pull against each other, and engineering is picking the right point on each axis.

Completeness vs. economy. Richer descriptions and outputs prevent misuse, but every token is paid on every call.

Consolidation vs. granularity. Fewer tools cut selection ambiguity, but a merged black box hides which step failed.

Eager vs. deferred. Keeping a tool in-context saves a discovery round-trip, but past ~30–50 visible tools it dilutes selection.

Enforce vs. guide. A hook makes a rule absolute, but it sees parameters, not intent, so anything contextual belongs in the prompt.

There is no universal setting; each depends on whether the tool is durable, shared, and frequently hit.
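
The "enforce" end of that last axis can be sketched as a small pre-call hook. This is an illustration under stated assumptions, not a definitive implementation: it assumes a harness that passes the pending call to the hook as JSON on stdin, with a `tool_name` field and the shell command under `tool_input.command`, and that treats exit code 2 as "block the call and show stderr to the model". The tool name `Bash` and both rules are examples.

```python
import json
import re
import sys

# Rules are matched against the command string only: the hook sees
# parameters, not intent, so it can only enforce what is visible there.
BLOCK_RULES = [
    (re.compile(r"\bgit\s+push\b.*\bmain\b"),
     "Pushing to main is blocked; push a feature branch instead."),
    # Covers one substitution path too: yarn as a stand-in for npm.
    (re.compile(r"\b(npm|yarn)\s+(install|add|i)\b"),
     "Use pnpm instead of npm or yarn in this repo."),
]


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one pending tool call."""
    if event.get("tool_name") != "Bash":  # assumed name of the shell tool
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    for pattern, message in BLOCK_RULES:
        if pattern.search(command):
            return 2, message  # 2 = block, under the assumption above
    return 0, ""


def main() -> None:
    """Entry point for the hook script: read the event, report, exit."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)
```

A hook script would call `main()`. Note what the patterns cannot do: they are deliberately crude and would also block a branch named `main-menu`, and they cannot tell an emergency hotfix from a mistake. That contextual judgment is the part that belongs in the prompt.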

The unifying test

For almost every decision in this course, one question routes it: does this make the agent's correct next action easier to take? If a field, a tool merge, an annotation, or an error rewrite makes the next step clearer or safer, keep it. If it adds tokens, ambiguity, or a race without making the next action easier, cut it. "Fix the interface, not the prompt" terminates — the prompt-patch loop doesn't.
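
Applied to a tool's output, the test might look like the sketch below: keep a field only if the agent needs it for its next action, and give it a semantic name. The run record and every field name here are hypothetical, chosen only to illustrate the filter.

```python
from typing import Any

# Hypothetical mapping: raw backend field -> semantic name the agent sees.
# Anything not listed here is dropped before the result reaches the model.
DECISION_FIELDS = {
    "pipeline_name": "pipeline",
    "state": "status",
    "failed_step": "failed_step",
    "log_url": "log_url",
}


def shape_for_agent(raw: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields that feed the agent's next decision."""
    return {
        new_name: raw[old_name]
        for old_name, new_name in DECISION_FIELDS.items()
        if raw.get(old_name) is not None  # skip absent or null fields
    }


raw_run = {
    "id": "7f3c2a9e-5b1d-4c8e-9a60-2d4f8b1e6c33",  # raw UUID: dropped
    "pipeline_name": "deploy-staging",
    "state": "failed",
    "failed_step": "integration-tests",
    "log_url": "https://example.com/logs/deploy-staging",
    "internal_shard": 7,                            # no decision value: dropped
    "created_by_id": "0b9d1f44",                    # no decision value: dropped
}

shaped = shape_for_agent(raw_run)
```

Here `shaped` carries four fields instead of seven, and none of them is an opaque identifier the agent could miscopy.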

3 Where this is overhead, not investment

The whole discipline assumes a stable, shared tool called across many sessions. That's where engineering the surface pays back. It's overhead — sometimes net-negative — for one-off exploratory scripts, tools wrapping a well-documented API the model already has strong priors for, or interfaces still changing every sprint, where heavy docs drift from reality and mislead more than a terse stub. The reliability and scaling moves carry the same caveat: idempotency guards, honest annotations, and defer-vs-eager tuning earn their keep on durable, high-traffic tools and are ceremony on a throwaway one. Engineer durable surfaces; keep throwaway ones thin.
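
For scale, the idempotency guard mentioned above can be quite small on a durable tool. A minimal check-before-act sketch, assuming an in-memory dictionary as a stand-in for a persistent idempotency log; the `run_once` helper, the key format, and the `replayed` flag are all hypothetical.

```python
from typing import Callable

# In-memory stand-in for a durable idempotency log. A real tool would
# persist this so that a re-run in a later session still sees the entry.
_idempotency_log: dict[str, dict] = {}


def run_once(key: str, effect: Callable[[], dict]) -> dict:
    """Check-before-act: run `effect` only if `key` has no recorded result."""
    if key in _idempotency_log:
        # Replay the stored result instead of repeating the side effect.
        return {**_idempotency_log[key], "replayed": True}
    result = effect()
    _idempotency_log[key] = result
    return {**result, "replayed": False}
```

On a throwaway script even this much is ceremony; on a tool that charges cards or opens branches across many sessions, it is what makes a retry safe.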

↪ Your win: the whole discipline, in one habit

Mixed review — the whole course, recall don't peek

Question 1 · from Lesson 01
"Agent quality is bounded by tool quality" tells you to fix a misuse by changing the…

Question 2 · from Lessons 04 & 05
A tool returning a 40-field run object with raw UUIDs, then a bare 500 on failure, breaks two rules: shape the output and…

Question 3 · from Lesson 07
The first technique that makes a re-run after failure safe is to…

Question 4 · from Lesson 08
A harness that dispatches readOnlyHint: true tools in parallel is only safe once you have…

Question 5 · from Lessons 09 & 11
Past roughly thirty to fifty visible tools, eager-loading the whole catalog mainly costs you…

Question 6 · from Lessons 12, 13 & 14
Match the surface beyond the catalog: usage knowledge a schema can't carry, a rule that must hold whatever the model decides, and a model that already knows the CLI go to…

Ask me anything. Bring a real tool — paste its name, schema, a sample output, and an error — and we'll run the decision table against it together, surface by surface. Or ask which single move would most improve a toolset you're already shipping, from selection all the way to its cost budget.