Part 2 · Shaping the Surface

Tool Engineering · ~7 min

Errors as a Teaching Signal

An agent has no intuition to fall back on. It takes your error message literally — so a cryptic failure leaves it stuck, and a well-shaped one tells it exactly what to do next.

Why this, for you: the cheapest reliability upgrade in the course. You don't change what fails — you change what the failure says. A structured, actionable error turns a dead-end retry loop into a one-call self-correction. For agents making many calls, that's the difference between recovery and a stuck session.

When a human hits Error: operation failed, they reason from experience about what probably went wrong. The agent can't. It has no intuition to repair the gap, so it either retries the same failing call or gives up. The error message is the recovery instruction — write it as one.

Errors should "clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks." An error that names what went wrong and what to try instead lets the agent self-correct on the next call — no human in the loop.

1 Diagnose, then direct

A good agent-facing error does two jobs: it diagnoses the specific problem, and it points at the fix. Generic status codes do neither.

# Useless — the agent learns nothing, retries blindly {"error": "400 Bad Request"} # Teaching — names the problem AND the correction {"error": "Invalid date range: end_date must be after start_date. Received start=2024-03-01, end=2024-02-01."}

The same applies to over-broad queries: not "0 results" but "Query too broad. Narrow by date range or add a status filter, e.g. status:open." The error is part of the tool description in practice — it's what the agent reads when a call fails, so treat it as guidance, not just a diagnostic.

2 Structure errors so the agent can branch

For HTTP APIs, prose isn't enough — the agent shouldn't pattern-match a 14,000-token Cloudflare HTML page to guess whether to retry. RFC 9457 (application/problem+json) defines a standard structure with five base fields — type, status, title, detail, instance — and operational extensions map directly to agent control-flow branches.

FieldLets the agent…
retryable (bool)Decide whether retrying the same call can ever succeed
retry_after (seconds)Wait the right amount before a retry, not guess
owner_action_required (bool)Escalate to a human instead of looping
error_category (string)Route: retry · escalate · fail fast

The agent branches on explicit signals rather than inferring intent from a status code. The token win is large too: a Cloudflare rate-limit page is ~14,252 tokens as HTML; the same as RFC 9457 JSON is ~256 — a ~56x reduction. Claude's tool-use API forwards these via is_error: true, so the model sees the structured fields directly.

Parse defensively — adoption is uneven

Most third-party APIs ignore the Accept: application/problem+json header and return HTML anyway, and middleware can strip the header before it reaches the origin. Always attempt structured extraction first, then fall back to plain-text. Emit RFC 9457 from your own agent-facing services, where you control the contract.

3 Don't delete the failure — it's a negative example

Shaping the error is half of it. The other half is what you do with it after. The instinct to clean up a failed call — strip the trace, clear the error — is usually wrong. A failed action plus its error is a negative example: the model sees the dead end and routes around it next turn.

# keep this in context — it's the guardrail, not clutter write_file("/etc/config.yaml") → PermissionError [Errno 13] # next turn the agent reroutes instead of retrying the same path: write_file("/tmp/config.yaml") → ok

Stripping reasoning traces measured a ~30% performance drop — lost subgoals, repeated re-derivation, doom loops. Preserve novel failures during active recovery; compact once recovery succeeds, or when the same error repeats 3+ times (a doom loop — break it and change strategy, don't keep retrying).

↪ Your win: make every failure teach

Retrieval practice — recall, don't peek

Question 1The core job of an agent-facing error message is to…

Question 2RFC 9457's retryable and error_category fields exist so the agent can…

Question 3Deleting a failed call's error trace from context tends to…

Question 4You should parse third-party error responses defensively because…

Question 5 · spaced recall from Lesson 04A graceful truncation marker survives the model's reading order when it's placed…

Ask me anything. Want the error-category taxonomy (which categories map to retry vs escalate vs fail-fast), or how to wire Accept: text/markdown for token-cheap structured errors? Next, Part 3 opens with Consolidation vs Sprawl — managing the whole tool set, not one tool.
✎ Feedback