Part 4 · Many Agents, One Trace

Observability · ~7 min

One ID Across the Trace

Six signals per agent are useless if you can't ask "show me everything this subagent did." A stable agent_id on every header and span makes the trace queryable by identity — no tree walk required.

Why this, for you: Lesson 08 gave you per-agent signals; this lesson makes them addressable. Across 200 spans and 12 subagents, "which spans belong to this agent" and "what call chain caused this 429" are the two questions span nesting alone can't answer cheaply. A flat, propagated identifier answers both with one query — and it's already shipping in Claude Code.

Across 200 spans and 12 subagents, span hierarchy can answer "which spans belong to a subagent" via tree traversal — but only inside a single session, and only while lineage holds. Once work crosses a boundary the instrumentation doesn't cover — a shell-out, a webhook, a queue — lineage is gone. The fix is a flat, propagated identifier that survives where the trace context ends.

1 The dual-surface contract

Claude Code 2.1.139 instruments subagents on two surfaces at once, carrying the same identity pair on each.

SurfaceMechanism · value
Outgoing HTTPHeaders x-claude-code-agent-id / -parent-agent-id
OTEL spansclaude_code.llm_request attrs agent_id / parent_agent_id

agent_id identifies the subagent that issued a request (absent on the main session); parent_agent_id identifies the agent that spawned it. The pair is load-bearing: agent_id supports "all work by this subagent"; parent_agent_id reconstructs the dispatch hierarchy from a flat query — "which subagent spawned this 429-emitting child" — without walking the span tree.

Span lineage answers "what's the call structure inside this turn" — it needs the parent span live when the child starts. The propagated attribute answers "what work was caused by this identity, regardless of dispatch path". OpenTelemetry separates trace context from cross-cutting context for exactly this reason: neither alone suffices.

2 The queries it unlocks

With agent_id on every span and API event, the trace store becomes a per-agent analytics surface.

# questions that were a per-trace tree walk, now flat queries sum(input_tokens + output_tokens) group by agent_id # cost per subagent quantile(0.99, duration_ms) group by agent_id # p99 per subagent count(status='ERROR') / count(*) group by agent_id # error rate per subagent follow parent_agent_id upward from the failing span # the chain to a 429

Without the attribute, the same queries require walking parent links per-trace — feasible at one trace, expensive at scale.

3 ID for incidents, name for dashboards

The crucial discipline: agent_id is a per-instance identifier, so it's high cardinality. Use it as a span/event attribute, never a metric label — per-instance IDs create unbounded time series. Its metric-safe partner is agent.name, the subagent type, a bounded set.

Use thisFor
agent.nameDashboard aggregation — bounded cardinality, safe as a metric label
agent_idDrilling into one specific incident in the trace store

It's the same discipline Lesson 02 set for prompt.id: a per-instance correlation key belongs on spans, not on metrics.

Where the propagation breaks

The contract holds only over surfaces Claude Code controls. Shell-out via Bash — a curl — produces a tool span but carries no x-claude-code-agent-id header; the service sees an anonymous request. Subprocess work inherits TRACEPARENT but not the agent identity. Fire-and-forget queues discard the header on enqueue. Close the gap by lifting the call into an instrumented MCP tool, or wrapping the shell-out with an explicit header. And mind privacy: agent_id is opaque, but built-in agent.name values are verbatim — custom names redact to "custom", yet a skill or plugin name on the same span can still leak intent.

↪ Your win: a trace queryable by identity

Retrieval practice — recall, don't peek

Question 1Span hierarchy alone fails to correlate a subagent's work once…

Question 2The two propagation surfaces in the contract are…

Question 3To trace the dispatch chain up to a failing call, you follow…

Question 4For dashboard aggregation you slice by agent.name, not agent_id, because…

Question 5 · spaced recall from Lesson 08Failure-aware observability is best described as…

Ask me anything. Want the before/after Tempo trace where the 429 lands on fanout-3 spawned by orch, or how to wrap a Bash curl so it carries the agent header? Next, the Capstone: a decision table that ties single-agent and multi-agent observability into one chooser.
✎ Feedback