Show Your Reasoning

One example fixes the shape of an answer. A worked reasoning trace fixes the shape of the thinking behind it — and on multi-step tasks, that's where the wins are.

Why this, for you: Lesson 6 used an example to anchor a format. This lesson uses an example to anchor a reasoning path — what to check, in what order, and why the wrong path is wrong. When your agent's answers look plausible but miss the failures you actually care about, this is the lever: stop describing good reasoning and show it, in your own domain.

A generic instruction like "reason carefully before acting" tells the model nothing about what good reasoning looks like in your domain. On the τ-Bench airline task, switching from a baseline prompt to one with detailed domain-specific guidance and worked examples produced a 54% relative improvement in pass rate — same model, same tools, only the prompt changed (Anthropic — the think tool).

1 Format anchoring vs. reasoning anchoring

Both rest on the same mechanism — an exemplar guides which path the model follows — but they aim at different targets. Lesson 6's one example fixes the output shape. A reasoning trace fixes the decision sequence that produces it.

A domain-specific reasoning example carries four things prose can't: the domain vocabulary (real resource types, not generic nouns), a worked reasoning chain for a real edge case, explicit edge-case guidance for ambiguous inputs, and a success/failure definition the agent can self-check against.

The research base is old and broad: chain-of-thought showed that surfacing intermediate steps beats stating rules on complex tasks, and few-shot prompting found worked examples beat task descriptions across many benchmarks. The 54% figure is that same effect, applied to a specific agent domain.

2 A trace, not a happy path

The example below is from a billing-support domain. The generic version names a goal; the domain version walks the exact tool sequence, the condition gating each step, and the cost of the wrong path.

# generic — names a goal, shows no reasoning: "Reason carefully. If a customer asks about their subscription, check the relevant account details." # domain-specific — the reasoning chain, in order: On a failed payment: 1. get_subscription_status — if past_due, retries are exhausted; do NOT promise "we'll retry soon." 2. get_payment_method — if the card is expired, send to the billing portal before anything else. 3. else get_payment_failure_reason — insufficient_funds and do_not_honor need different scripts. Wrong path: offering a retry on an expired card wastes the customer's time and logs a second failure event.

The after version encodes the call sequence, the gates, and why the wrong path causes a real problem. That last line matters: an example that shows only the correct path teaches the shape; one that names the failure teaches the boundary. Put this in the system prompt, not a tool description — system-prompt content applies across every reasoning step, while guidance buried in a tool's docs is fragmented and applies only at selection time.

3 Write traces from production, then maintain them

Invented examples miss real failure patterns — they encode what you imagine goes wrong, not what does. The reliable loop is instrumented:

Log the reasoning — capture the think-tool output or reasoning trace from real runs.
Find the low-quality cases — where reasoning reached a wrong conclusion or stalled.
Write a trace for each — the correct chain for exactly that case, then add it to the prompt.
Measure, then repeat — confirm the gain on an eval; the example set is never finished.

This is the spec-as-prompt instinct (Lesson 9) turned on reasoning itself: don't describe the good path in the abstract — capture a real one and point the model at it.

When a reasoning trace earns nothing

The 54% gain is not universal — it needs a multi-step chain to shape. Anthropic flags four dead zones: single or parallel tool calls (no sequential reasoning to guide); low-constraint tasks where the default behaviour is already adequate; insufficient production data, so the examples are guesswork that misses real failures; and high-churn domains, where vocabulary and rules shift fast enough to make the example set stale or actively misleading. The token cost is real — for short tasks already near ceiling, a worked trace adds latency and cost with nothing to show.

↪ Your win: anchor the thinking, not just the format

Show the reasoning chain — a worked trace lifts decision quality; an example only fixes shape.
Include the wrong path — naming why a path fails teaches the boundary, not just the happy case.
Keep traces in the system prompt — they apply across every step, not just at tool selection.
Write them from real failures — instrument production; invented examples miss the real modes.
Know the dead zones — single calls, low-constraint, thin data, high churn → skip the trace.

Retrieval practice — recall, don't peek

Question 1A worked reasoning trace differs from a format example because it anchors…

Question 2Domain-specific guidance and examples lifted τ-Bench pass rate by about…

Question 3Worked reasoning examples belong in the system prompt, not a tool description, because they then…

Question 4A worked reasoning trace adds little when the task involves…

Question 5 · spaced recall from Lesson 13In a production system prompt, temporal grounding (date, environment) sits at the head mainly so that it…

Ask me anything. Want to draft a reasoning trace from one of your agent's real failure logs, or judge whether your task even has the multi-step chain that makes a trace pay off? Next, the Capstone: The Compliance Stack — every lever in this course, wired into one decision procedure for a failing rule.