Part 4 · Many Agents, One Trace

Observability · ~8 min

Catching the Wasted Run

A multi-agent run burns thousands of tokens before the grader ever sees the answer. Six trace signals tell you why a run is failing while budget remains to intervene — not after.

Why this, for you: the course so far has been almost entirely single-agent, where a climbing token count and an advancing step counter both look like progress. In a multi-agent system, those same numbers can hide a planner stuck in a loop or a retrieval agent that has stopped finding anything. This lesson is the taxonomy that names which failure is happening, mid-trajectory, so you can stop paying for it.

Multi-agent systems burn tokens, tool calls, retries, and code-execution attempts before producing an answer. Final-answer evaluation reveals the endpoint but rarely the moment the trajectory stopped making recoverable progress. Failure-aware observability instruments a fixed set of online trace signals whose patterns precede final-answer failure — turning postmortem grading into mid-run diagnosis.

1 The six signals

Li et al. (arXiv 2606.01365) define six online trace signals, each tied to a distinct failure mechanism. The framework is taxonomic, not algorithmic: the contribution is the failure-mode → signal map, not a stopping rule.

Failure mode | Signal | What it diagnoses
Tool instability | Tool error rate, retries | Calls burn budget without usable state
Execution failure | Compile / import / timeout classes | Code fails without recovery
Repeated action loop | Repeated action keys, ABAB cycles | Computation, no strategy change
Low information gain | New URLs, fact count | Retrieval no longer adds task-relevant state
Evidence failure | Answer-citation similarity | The answer isn't supported by artefacts
Budget waste | Tokens, tool calls, budget pressure | The intervention window is closing

A single cost ceiling can trip for six different reasons. Knowing which one tells you whether to swap the model, re-prompt with explicit evidence requirements, or abort and retry with a smaller goal. That's the gap a stopping rule can't fill and a taxonomy can.
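As a rough illustration of one count-style signal, the sketch below tracks the "new URLs" signal from the low-information-gain row. The function names, the per-step URL lists, and the three-step window are all assumptions made for this example; the taxonomy names the signal but does not prescribe a window or a threshold.

```python
from typing import Iterable, List, Set


def new_url_counts(steps: Iterable[List[str]]) -> List[int]:
    """For each retrieval step, count URLs not seen in any earlier step."""
    seen: Set[str] = set()
    counts: List[int] = []
    for urls in steps:
        fresh = set(urls) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def stalled(counts: List[int], window: int = 3) -> bool:
    """True if each of the last `window` steps added zero new URLs.

    The window size is a hypothetical policy choice, not part of the taxonomy.
    """
    return len(counts) >= window and all(c == 0 for c in counts[-window:])


# Hypothetical trace: URLs returned by five consecutive retrieval calls.
trace = [
    ["a.com/1", "b.com/2"],
    ["b.com/2", "c.com/3"],
    ["a.com/1"],
    ["c.com/3", "b.com/2"],
    ["a.com/1", "c.com/3"],
]
counts = new_url_counts(trace)
print(counts)           # [2, 1, 0, 0, 0]
print(stalled(counts))  # True
```

The retrieval agent here keeps making calls, so a step counter still advances, but the per-step count of unseen URLs shows that nothing new has arrived for three steps.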

2 Two signals carry formulas

Most signals are counts; two are defined precisely, and cost is a weighted sum the harness tunes itself.

# tool reliability: error fraction over a run
ToolErr(r) = N_err(r) / N_tool(r)

# evidence support: answer sentences backed by a citation, tau = 0.65
Support_tau(r) = fraction of answer sentences whose embedding has cosine similarity >= 0.65 with some citation

# cost: coefficients left un-fixed; weight by your own marginal cost
C_r = a*tokens + b*tool_calls + g*retries + d*exec_attempts

The coefficients are deliberately un-fixed so the harness weights tokens, calls, retries, and execution attempts by its own marginal cost — the taxonomy supplies the signal, you supply the policy.
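The three definitions above can be sketched in plain Python. This is a minimal sketch under stated assumptions: sentence and citation embeddings are taken as already computed (the toy 2-D vectors stand in for real ones), the zero-division fallbacks are choices made here, and the cost coefficients in the usage lines are invented purely to show the arithmetic.

```python
import math
from typing import Sequence


def tool_err(n_err: int, n_tool: int) -> float:
    """ToolErr(r) = N_err(r) / N_tool(r). Returns 0.0 if no tools ran (assumption)."""
    return n_err / n_tool if n_tool else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def support(answer_vecs, citation_vecs, tau: float = 0.65) -> float:
    """Fraction of answer sentences with cosine similarity >= tau to some citation."""
    if not answer_vecs:
        return 0.0  # no answer sentences yet (assumption)
    backed = sum(
        1 for s in answer_vecs if any(cosine(s, c) >= tau for c in citation_vecs)
    )
    return backed / len(answer_vecs)


def cost(tokens, tool_calls, retries, exec_attempts, *, a, b, g, d) -> float:
    """C_r = a*tokens + b*tool_calls + g*retries + d*exec_attempts.

    The coefficients are required keyword arguments because the formula
    leaves them un-fixed; the caller supplies its own marginal costs.
    """
    return a * tokens + b * tool_calls + g * retries + d * exec_attempts


# Toy embeddings: three answer sentences, one citation.
answer_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
citation_vecs = [[1.0, 0.1]]

print(tool_err(7, 10))                               # 0.7
print(round(support(answer_vecs, citation_vecs), 3)) # 0.667
# Hypothetical weights, chosen only for illustration.
print(round(cost(16389, 20, 3, 2, a=0.001, b=0.5, g=1.0, d=2.0), 3))  # 33.389
```

Two of the three toy answer sentences clear the 0.65 threshold against the single citation, so support comes out at two thirds. The second sentence is nearly orthogonal to the citation and counts as unsupported.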

3 Why it beats single-signal stopping

Single-signal mechanisms — iteration caps, edit counters, cost ceilings — answer "when do I stop?". This framework answers "why is this run failing, now?". It sits a layer up from the breakers and loop detectors of Part 3, mapping six failure classes to six signal classes.
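To make the "when" versus "why" contrast concrete, here is a hedged sketch of a diagnosis step that returns named failure modes instead of a single stop decision. The `Snapshot` fields and every threshold are hypothetical. The framework supplies the failure-mode to signal map, not these numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    """Signal values at one point mid-run. Field names are illustrative only."""
    tool_err: float                  # error fraction over tool calls so far
    unrecovered_exec_failures: int   # consecutive code failures without recovery
    repeated_action_count: int       # times the latest action key has recurred
    steps_without_new_urls: int      # consecutive retrieval steps with nothing new
    support: Optional[float]         # evidence support; None until a draft answer exists
    budget_used: float               # fraction of the run's budget already spent


def diagnose(s: Snapshot) -> List[str]:
    """Return every failure mode whose signal crosses its (hypothetical) threshold."""
    checks = [
        ("tool instability", s.tool_err >= 0.5),
        ("execution failure", s.unrecovered_exec_failures >= 3),
        ("repeated action loop", s.repeated_action_count >= 3),
        ("low information gain", s.steps_without_new_urls >= 3),
        ("evidence failure", s.support is not None and s.support < 0.5),
        ("budget waste", s.budget_used >= 0.8),
    ]
    return [name for name, fired in checks if fired]


flaky_tools = Snapshot(0.7, 0, 1, 0, None, 0.4)
looping = Snapshot(0.1, 0, 4, 5, None, 0.85)

print(diagnose(flaky_tools))  # ['tool instability']
print(diagnose(looping))      # ['repeated action loop', 'low information gain', 'budget waste']
```

A cost ceiling would treat both runs the same way once the cap is reached. The diagnosis separates a run whose tools are failing from one that is looping, and the second example shows several signals firing together.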

The empirical basis: across 165 GAIA validation traces, per-level failure rates ran 38–46%, with mean token use rising from 8,152 to 16,389. Concurrent work found full execution traces improve failure-attribution accuracy by up to 76% over partial-observation baselines.

When two signals beat six

The taxonomy targets multi-agent systems where 16k-token trajectories with consecutive tool failures are the failure surface. For a single-agent harness under ten tool calls, loops and budget overrun surface directly — loop detection plus a circuit breaker cover that regime. Six signals also create six false-positive surfaces, and they correlate (a loop implies low information gain). Without a trace store and an intervention path, the signals reduce to postmortem instrumentation no faster than final-answer eval. The steelman: one hard budget cap plus one repetition detector, until your tooling can act on six dimensions independently.
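The two-signal steelman might look like the sketch below: one hard token cap plus one repetition detector that covers same-action streaks and ABAB cycles. The action-key strings, the streak length, the repeat count, and the 16,000-token cap are assumptions for illustration, not values from the source.

```python
from typing import Optional, Sequence


def same_action_streak(keys: Sequence[str], n: int = 3) -> bool:
    """True if the last `n` action keys are identical."""
    return len(keys) >= n and len(set(keys[-n:])) == 1


def has_abab_cycle(keys: Sequence[str], min_repeats: int = 2) -> bool:
    """True if the trace ends with a two-action pattern A, B repeated `min_repeats` times."""
    need = 2 * min_repeats
    if len(keys) < need:
        return False
    tail = keys[-need:]
    a, b = tail[0], tail[1]
    return a != b and all(k == (a if i % 2 == 0 else b) for i, k in enumerate(tail))


def should_stop(keys: Sequence[str], tokens_used: int, token_cap: int) -> Optional[str]:
    """Return the reason to stop, or None to keep going. The budget check runs first."""
    if tokens_used >= token_cap:
        return "budget cap"
    if same_action_streak(keys) or has_abab_cycle(keys):
        return "repetition"
    return None


# Hypothetical action keys of the form "<tool>:<argument>".
cycling = ["search:q1", "open:u1", "search:q1", "open:u1"]
progressing = ["search:q1", "open:u1", "read:u1"]

print(should_stop(cycling, 9_000, 16_000))       # repetition
print(should_stop(progressing, 17_000, 16_000))  # budget cap
print(should_stop(progressing, 100, 16_000))     # None
```

This covers the regime where loops and budget overrun are the dominant failures. It names only those two causes, which is the trade the steelman accepts until the tooling can act on more dimensions.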

↪ Your win: diagnose mid-run, not postmortem

Retrieval practice — recall, don't peek

Question 1: Failure-aware observability turns postmortem grading into…

Question 2: The framework's actual contribution is best described as…

Question 3: A streak of retrieval calls returning no new URLs is the signal for…

Question 4: The cost formula leaves its coefficients un-fixed so that…

Question 5 · spaced recall from Lesson 07: When a long agentic run is dominated by file reads and grep, the better attribution cut is…

Ask me anything. Want the GAIA Level-2 walkthrough where retrieval-agent-2 hits ToolErr = 0.7 at step 18 and the orchestrator reassigns the query, or how the evidence-support threshold re-baselines for numeric ground truth? Next in Part 4: One ID Across the Trace — making those per-agent signals queryable by identity.