Tool Engineering · ~7 min
An agent has no intuition to fall back on. It takes your error message literally — so a cryptic failure leaves it stuck, and a well-shaped one tells it exactly what to do next.
When a human hits Error: operation failed, they reason from experience about what
probably went wrong. The agent can't. It has no intuition to repair the gap, so it either retries the same failing
call or gives up. The error message is the recovery instruction — write it as one.
A good agent-facing error does two jobs: it diagnoses the specific problem, and it points at the fix. Generic status codes do neither.
The same applies to over-broad queries: not "0 results" but "Query too broad. Narrow by date range
or add a status filter, e.g. status:open." The error is part of the tool description in
practice — it's what the agent reads when a call fails, so treat it as guidance, not just a diagnostic.
For HTTP APIs, prose isn't enough — the agent shouldn't pattern-match a 14,000-token Cloudflare HTML page to
guess whether to retry. RFC 9457 (application/problem+json) defines a standard
structure with five base fields — type, status, title, detail,
instance — and operational extensions map directly to agent control-flow branches.
| Field | Lets the agent… |
|---|---|
retryable (bool) | Decide whether retrying the same call can ever succeed |
retry_after (seconds) | Wait the right amount before a retry, not guess |
owner_action_required (bool) | Escalate to a human instead of looping |
error_category (string) | Route: retry · escalate · fail fast |
The agent branches on explicit signals rather than inferring intent from a status code. The token win is large
too: a Cloudflare rate-limit page is ~14,252 tokens as HTML; the same as RFC 9457 JSON is
~256 — a ~56x reduction. Claude's tool-use API forwards these via is_error: true, so
the model sees the structured fields directly.
Most third-party APIs ignore the Accept: application/problem+json header and return HTML anyway,
and middleware can strip the header before it reaches the origin. Always attempt structured extraction first,
then fall back to plain-text. Emit RFC 9457 from your own agent-facing services, where
you control the contract.
Shaping the error is half of it. The other half is what you do with it after. The instinct to clean up a failed call — strip the trace, clear the error — is usually wrong. A failed action plus its error is a negative example: the model sees the dead end and routes around it next turn.
Stripping reasoning traces measured a ~30% performance drop — lost subgoals, repeated re-derivation, doom loops. Preserve novel failures during active recovery; compact once recovery succeeds, or when the same error repeats 3+ times (a doom loop — break it and change strategy, don't keep retrying).
retryable / retry_after / error_category so the agent branches deterministically.Retrieval practice — recall, don't peek
Question 1The core job of an agent-facing error message is to…
Question 2RFC 9457's retryable and error_category fields exist so the agent can…
Question 3Deleting a failed call's error trace from context tends to…
Question 4You should parse third-party error responses defensively because…
Question 5 · spaced recall from Lesson 04A graceful truncation marker survives the model's reading order when it's placed…
Accept: text/markdown for token-cheap structured errors? Next, Part 3
opens with Consolidation vs Sprawl — managing the whole tool set, not one tool.