Part 6 · Beyond the Tool Catalog

Tool Engineering · ~7 min

The Unix CLI as a Tool Interface

The course so far has assumed a catalog of typed tools. This lesson considers the opposite extreme: hand the agent one run(command) tool and let Unix supply the rest.

Why this, for you: the design axis the catalog hides. Every prior lesson tuned which typed tools to ship and how to shape them. This one asks whether you need typed tools at all — and shows that the answer is a point on a spectrum, not a yes/no. Knowing where on that spectrum your task sits is the last design call.

Most frameworks register many typed tools — read_file, search_code, list_directory — each with a schema, an error format, and a slot in the catalog. The alternative collapses all of them into one execution primitive and lets the agent compose shell commands directly.

A single run(command) tool can replace a large typed catalog by exploiting the model's dense pretraining on shell usage. It's the extreme end of consolidation: where Lesson 6 merged overlapping tools, the single-tool design eliminates tool selection entirely — there is only ever one tool to pick.

1 Three things Unix supplies for free

The catalog gives you discovery, error handling, and composition as bespoke per-tool work. Unix already has all three, and the model already knows them from pretraining:

| What the catalog builds per tool | What Unix supplies for free |
| --- | --- |
| Tool descriptions for discovery | --help, man: lazy discovery on demand, no upfront schema |
| Custom error format per tool | stderr + exit codes: "command not found" routes the next action |
| Pre-built consolidation | pipes, &&, \|\|: search, filter, transform in one call |

The agent runs gh --help to learn capabilities instead of loading a schema, reads "pull request not found" on stderr and adjusts, and chains a query with jq in a single invocation. None of this needs a bespoke tool definition, which is how the catalog token tax from Lesson 11 drops to near zero.
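The three mechanisms can be seen with ordinary utilities. The following is a minimal sketch, assuming a POSIX shell; the command name no-such-tool-xyz is deliberately made up to force a failure, and the sample log lines are invented for illustration.

```shell
#!/bin/sh
# 1. Discovery: ask the tool itself, on demand, instead of loading a schema.
#    Shown as a comment because help flags vary by tool, e.g.: gh --help, man sort

# 2. Error handling: stderr text plus the exit code tell the agent what to try next.
code=0
err=$(sh -c 'no-such-tool-xyz' 2>&1) || code=$?
echo "stderr: $err"
echo "exit code: $code"   # POSIX shells report 127 when a command is not found

# 3. Composition: search, filter, and count in one invocation.
count=$(printf 'ok\nerror: disk\nok\nerror: net\n' | grep '^error' | wc -l | tr -d ' ')
echo "error lines: $count"
```

The exact stderr wording differs between shells, which is why an agent is better served by routing on the exit code first and reading the message second.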

2 Separate execution from presentation

Raw shell access is a loaded gun pointed at the context window. A single kubectl get pods can return a page of output; printing a PNG returns binary garbage that fills the window with uninterpretable bytes. The fix is a two-layer architecture: the agent works in raw CLI, and a presentation layer filters what reaches the context.

| Presentation guard | What it prevents |
| --- | --- |
| Binary guard | Non-text output (a PNG) poisoning context: return a placeholder |
| Overflow truncation | A huge result overflowing: preserve head and tail (the L4 PARTIAL shape) |
| Stderr attachment | Silent failure: surface stderr alongside stdout so the agent can route on it |
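The three guards can be sketched as one shell function that sits between execution and the agent. This is an illustration under stated assumptions, not any framework's implementation: the name present, the MAX_LINES and KEEP budgets, and the bracketed marker format are all hypothetical, and NUL-byte detection is only one simple way to recognize binary output.

```shell
#!/bin/sh
# Sketch of a presentation layer for a run() tool. All names are illustrative.
MAX_LINES=${MAX_LINES:-40}   # assumed budget: stdout lines shown before truncating
KEEP=${KEEP:-10}             # lines preserved from the head and from the tail

present() {
  out_file=$(mktemp) || return 1
  err_file=$(mktemp) || { rm -f "$out_file"; return 1; }

  # Execution layer: run the raw command string and capture both streams.
  code=0
  sh -c "$1" >"$out_file" 2>"$err_file" || code=$?

  # Binary guard: treat any NUL byte as non-text and return a placeholder.
  if ! tr -d '\000' <"$out_file" | cmp -s - "$out_file"; then
    size=$(wc -c <"$out_file" | tr -d ' ')
    printf '[binary output, %s bytes, not shown]\n' "$size"
  else
    total=$(wc -l <"$out_file" | tr -d ' ')
    if [ "$total" -gt "$MAX_LINES" ]; then
      # Overflow truncation: keep head and tail, and say how much was dropped.
      head -n "$KEEP" "$out_file"
      printf '[... %s lines omitted ...]\n' "$((total - 2 * KEEP))"
      tail -n "$KEEP" "$out_file"
    else
      cat "$out_file"
    fi
  fi

  # Stderr attachment: surface stderr and the exit code with every result.
  if [ -s "$err_file" ]; then
    printf '[stderr]\n'
    cat "$err_file"
  fi
  printf '[exit:%s]\n' "$code"
  rm -f "$out_file" "$err_file"
}

present 'echo hello; echo oops >&2; exit 3'
```

The agent still writes plain shell; only the result passes through present, so the guards apply uniformly to every command without a per-tool schema.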

A second move pulls in the same direction: wrapper scripts that pre-filter at the source. Instead of returning the full pod table, a check-pods.sh returns only the non-running pods as JSON. This is Lesson 3's paragraph heuristic and Lesson 4's result-shaping, applied to CLI output — "return only what the next decision needs", enforced by the script rather than the schema. Anthropic measured a related filter-before-return pattern cutting a workload from 150,000 to 2,000 tokens.

# Raw: returns hundreds of lines the agent must parse
kubectl get pods -n production

# Wrapper: returns only the non-running pods, decision-ready
not_running=$(kubectl get pods -n production --no-headers \
  | awk '$3 != "Running" {print $1, $3, $4}' \
  | head -20)
# head exits 0 even on empty input, so test the captured text rather than chaining ||
echo "${not_running:-All pods running}"
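The check-pods.sh wrapper described above returns JSON rather than columns. One way its core might look is sketched here, assuming the default column order of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE) and pod names that contain no characters needing JSON escaping. The function name non_running_as_json and the canned rows are hypothetical, used so the sketch runs without a cluster.

```shell
#!/bin/sh
# Hypothetical core of check-pods.sh: read `kubectl get pods --no-headers` rows
# on stdin and emit a JSON array of only the pods whose STATUS is not "Running".
non_running_as_json() {
  awk '
    BEGIN { printf "[" }
    $3 != "Running" {
      printf "%s{\"name\":\"%s\",\"status\":\"%s\",\"restarts\":%d}", sep, $1, $3, $4
      sep = ","
    }
    END { print "]" }
  '
}

# Intended use (not executed here, since it needs a cluster):
#   kubectl get pods -n production --no-headers | non_running_as_json

# Demo on canned rows so the sketch runs anywhere.
printf '%s\n' \
  'api-7d9f   1/1 Running          0  3d' \
  'worker-x2  0/1 CrashLoopBackOff 12 3d' \
  'batch-9k   0/1 Pending          0  5m' \
  | non_running_as_json
```

When every pod is running the function prints an empty array, so the agent gets a parseable answer in both cases instead of having to interpret silence.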

3 Where on the spectrum does your task sit?

This isn't all-or-nothing. The single-tool design and the typed catalog are two ends of one axis, and most production systems land between them. Pick by what the task actually needs:

| Single run(command) wins | Typed tools win |
| --- | --- |
| High pretraining alignment: model knows the CLI | Strong parameter constraints: enums, bounded ranges |
| Composition via pipes is the natural shape | High-security surfaces needing per-tool gating |
| Discovery is cheap (--help exists) | Multimodal payloads: images, audio |

The honest middle: five well-designed tools plus shell access captures most of the benefit without unrestricted-execution risk. And the catalog's parameter constraints aren't free to give up — a free-form command string is the opposite of the poka-yoke from Lesson 2. When you do ship CLIs for agents, design them for machine consumption: a --json flag, distinct exit codes, --dry-run, and --yes/--force to kill interactive prompts an agent has no stdin to answer.
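The four machine-consumption features can be sketched in a single toy CLI. Everything here is hypothetical: purge_cache is an invented command written as a function so it can be tried inline, it deletes nothing, and the exit-code scheme (0 ok, 2 usage error, 3 confirmation required) is an assumed convention rather than a standard.

```shell
#!/bin/sh
# Hypothetical agent-friendly CLI. Assumed exit codes:
#   0 = ok, 2 = usage error, 3 = confirmation required.
purge_cache() {
  json=0; dry_run=0; yes=0; target=
  for arg in "$@"; do
    case $arg in
      --json)        json=1 ;;
      --dry-run)     dry_run=1 ;;
      --yes|--force) yes=1 ;;
      -*)            echo "unknown flag: $arg" >&2; return 2 ;;
      *)             target=$arg ;;
    esac
  done
  if [ -z "$target" ]; then
    echo "usage: purge_cache [--json] [--dry-run] [--yes] TARGET" >&2
    return 2
  fi

  # Never block on a prompt: without --yes, fail fast with a distinct code.
  if [ "$dry_run" -eq 0 ] && [ "$yes" -eq 0 ]; then
    echo "refusing to purge '$target' without --yes (try --dry-run first)" >&2
    return 3
  fi

  if [ "$dry_run" -eq 1 ]; then action=would_purge; else action=purged; fi
  # A real tool would delete here; this sketch only reports what it would do.
  if [ "$json" -eq 1 ]; then
    printf '{"target":"%s","action":"%s"}\n' "$target" "$action"
  else
    echo "$action $target"
  fi
}

purge_cache --json --dry-run images
```

Because the missing-confirmation case has its own exit code and a stderr hint, an agent can recover in one step: rerun with --dry-run to inspect, then with --yes to commit.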

The security surface is the real cost

One run(command) tool means arbitrary execution — the broadest possible security surface, where each typed tool was constrained by construction. This is exactly where Lesson 13's hooks earn their place: a PreToolUse matcher on the Bash / run tool is the deterministic gate that bounds what an otherwise-unbounded primitive may do. The single-tool design and the hook are complements — the CLI gives reach, the hook draws the line.
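A gate of this kind might look like the sketch below. It assumes the hook contract in which a PreToolUse script that exits with code 2 blocks the call and its stderr is fed back to the agent; check your harness's hook documentation before relying on that. The function name gate_command and the deny patterns are illustrative only, and a substring deny list is far from a complete policy.

```shell
#!/bin/sh
# Sketch of a deterministic gate for a run()/Bash tool.
# The deny patterns below are examples, not a complete or safe policy.
gate_command() {
  case $1 in
    *'rm -rf'*|*'git push --force'*|*'curl '*'| sh'*|*sudo*)
      echo "blocked by policy: $1" >&2
      return 2 ;;   # assumed convention: exit code 2 blocks the tool call
  esac
  return 0
}

# In a hook, the command would typically arrive as JSON on stdin, for example
# (assumes jq is installed and the payload carries .tool_input.command):
#   cmd=$(jq -r '.tool_input.command') && gate_command "$cmd"

gate_command 'kubectl get pods -n production' && echo "allowed"
```

The check is plain pattern matching with no model in the loop, which is what makes it deterministic. For high-risk environments an allowlist of permitted command prefixes is usually a tighter bound than a deny list.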

↪ Your win: treat typed-vs-CLI as a spectrum you choose

Retrieval practice — recall, don't peek

Question 1: A single run(command) tool works because models have dense pretraining on…

Question 2: The Unix mechanism that gives lazy, on-demand discovery without loading a schema is…

Question 3: The presentation layer's binary guard exists to stop non-text output from…

Question 4: The practical middle of the spectrum, capturing most of the upside, is…

Question 5 · spaced recall from Lesson 13: The deterministic gate that bounds what an arbitrary-execution tool may do is a…

Next, the Capstone: every surface and mechanism from the course in one decision table, plus a mixed review spanning selection to enforcement.