Observability · ~7 min
When an agent produces bad output, the bug is almost never the model. Classify which of four layers failed before you change anything.
Code bugs have stack traces. Agent bugs have context windows. When an agent fails, the problem is usually not the model — it's what the model was given. Four modes account for nearly all of it.
| Mode | What went wrong |
|---|---|
| Missing context | The agent never had the information it needed |
| Conflicting instructions | Two directives contradicted each other |
| Missing/blocked tools | The agent couldn't act and worked around the gap |
| Capability ceiling | The task exceeded the model tier's capability |
Walk the layers in order — each step answers a different question about what the agent experienced.
A subtle one lives in step 1: when Claude Code compacts a long session, path-scoped rules and nested CLAUDE.md files drop from context until a matching file is read again. An agent that followed a rule early may stop later — not because it ignored the rule, but because the rule fell out. Check session length before blaming the model.
Start a fresh session, give exactly the same inputs, watch for the same bad output.
Recurs → the problem is structural: instructions, skills, or tool config. Doesn't
recur → it was session-specific: context drift, overflow, or transient state. Transcripts under
~/.claude/projects/ let you trace when behavior diverged and whether a tool failure caused the pivot.
A worked example: an agent keeps writing *.spec.ts when the project wants *.test.ts.
/memory shows the project CLAUDE.md says *.test.ts but a stale user-level CLAUDE.md
says *.spec.ts — both loaded. Project scope wins, but the user-level entry is noise. Remove it; reproduce
in a fresh session; fixed. Root cause was a scope conflict, not the model or the prompt.
The sequence assumes a structural failure. Skip it for transient errors (rate limits, timeouts — just retry in a fresh session), for genuine capability gaps (no context tuning fixes a task beyond every tier), and for over-instrumented setups (when dense CLAUDE.md layers are the problem, reducing instruction surface area beats per-failure diagnosis).
/memory and the session transcript.Retrieval practice — recall, don't peek
Question 1When an agent produces bad output, the cause is usually…
Question 2An agent that followed a rule early but stops later may have hit…
Question 3Reproducing in a fresh session separates…
Question 4You should reach step 4 (model tier) only after…
Question 5 · spaced recall from Lesson 03Under ESAA, "fail-closed" validation means a malformed event…
/doctor surfaces config issues invisible during normal runs? Next in Part 3:
Stopping the Bleeding — loop detection and circuit breakers.