Observability · ~8 min
A multi-agent run burns thousands of tokens before the grader ever sees the answer. Six trace signals tell you why a run is failing while budget remains to intervene — not after.
Multi-agent systems burn tokens, tool calls, retries, and code-execution attempts before producing an answer. Final-answer evaluation reveals the endpoint but rarely the moment the trajectory stopped making recoverable progress. Failure-aware observability instruments a fixed set of online trace signals whose patterns precede final-answer failure — turning postmortem grading into mid-run diagnosis.
Li et al. (arXiv 2606.01365) define six online trace signals, each tied to a distinct failure mechanism. The framework is taxonomic, not algorithmic: the contribution is the failure-mode → signal map, not a stopping rule.
| Failure mode | Signal · what it diagnoses |
|---|---|
| Tool instability | Tool error rate, retries — calls burn budget without usable state |
| Execution failure | Compile / import / timeout classes — code fails without recovery |
| Repeated action loop | Repeated action keys, ABAB cycles — computation, no strategy change |
| Low information gain | New URLs, fact count — retrieval no longer adds task-relevant state |
| Evidence failure | Answer-citation similarity — the answer isn't supported by artefacts |
| Budget waste | Tokens, tool calls, budget pressure — the intervention window is closing |
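Three of the signals in the table can be sketched directly from a step log. This is a minimal illustration, not the paper's definitions: the `Step` record, its field names, and the choice of a four-step window for ABAB detection are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One trace step. All field names here are hypothetical."""
    action_key: str                 # e.g. "search:<query>" or "run:test.py"
    is_tool_call: bool = False
    tool_error: bool = False
    urls: list = field(default_factory=list)  # URLs returned by a retrieval step


def tool_error_rate(steps):
    """Fraction of tool calls that returned an error (0.0 if there were none)."""
    calls = [s for s in steps if s.is_tool_call]
    if not calls:
        return 0.0
    return sum(s.tool_error for s in calls) / len(calls)


def has_abab_cycle(steps):
    """True if the last four action keys alternate A, B, A, B with A != B."""
    if len(steps) < 4:
        return False
    a1, b1, a2, b2 = (s.action_key for s in steps[-4:])
    return a1 == a2 and b1 == b2 and a1 != b1


def no_new_url_streak(steps):
    """Length of the trailing run of retrieval steps that added no unseen URL."""
    seen = set()
    streak = 0
    for s in steps:
        if not s.urls:
            continue  # not a retrieval step; the streak is left unchanged
        new = set(s.urls) - seen
        seen |= new
        streak = 0 if new else streak + 1
    return streak
```

Each function reads the whole trace for clarity; an online harness would more likely keep running counters and update them per step.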
Most signals are simple counts. Two are defined precisely, and cost is a weighted sum whose coefficients the harness sets for itself.
The coefficients are deliberately un-fixed so the harness weights tokens, calls, retries, and execution attempts by its own marginal cost — the taxonomy supplies the signal, you supply the policy.
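The weighted-sum cost described above might look like the following sketch. The four terms come from the text; the `CostWeights` name, the example weights, and the `budget_pressure` definition (fraction of budget consumed, capped at 1.0) are assumptions for illustration, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostWeights:
    """Per-unit costs chosen by the harness (hypothetical names and values)."""
    per_token: float
    per_tool_call: float
    per_retry: float
    per_exec_attempt: float


def run_cost(tokens, tool_calls, retries, exec_attempts, w):
    """Weighted sum of the four budget-consuming quantities."""
    return (
        w.per_token * tokens
        + w.per_tool_call * tool_calls
        + w.per_retry * retries
        + w.per_exec_attempt * exec_attempts
    )


def budget_pressure(cost, budget):
    """Assumed definition: share of the budget already spent, capped at 1.0."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return min(cost / budget, 1.0)


# Example: a harness where one tool call costs as much as 50 tokens.
weights = CostWeights(per_token=1.0, per_tool_call=50.0,
                      per_retry=25.0, per_exec_attempt=100.0)
cost = run_cost(tokens=1000, tool_calls=4, retries=2, exec_attempts=1, w=weights)
pressure = budget_pressure(cost, budget=2700.0)
```

Keeping the weights in one frozen object makes the policy explicit: a harness that pays heavily for sandboxed execution can raise `per_exec_attempt` without touching the signal code.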
Single-signal mechanisms — iteration caps, edit counters, cost ceilings — answer "when do I stop?". This framework answers "why is this run failing, now?". It sits a layer up from the breakers and loop detectors of Part 3, mapping six failure classes to six signal classes.
The taxonomy targets multi-agent systems where 16k-token trajectories with consecutive tool failures are the failure surface. For a single-agent harness under ten tool calls, loops and budget overrun surface directly — loop detection plus a circuit breaker cover that regime. Six signals also create six false-positive surfaces, and they correlate (a loop implies low information gain). Without a trace store and an intervention path, the signals reduce to postmortem instrumentation no faster than final-answer eval. The steelman: one hard budget cap plus one repetition detector, until your tooling can act on six dimensions independently.
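The steelman configuration, one hard budget cap plus one repetition detector, fits in a few lines. This is a sketch under stated assumptions: the `MinimalGuard` class, the `max_repeats` default, and the stop-reason strings are hypothetical, and repetition is counted over the whole run rather than a sliding window.

```python
from collections import Counter
from typing import Optional


class MinimalGuard:
    """Hard cost cap plus a repeated-action counter (hypothetical helper)."""

    def __init__(self, max_cost: float, max_repeats: int = 3):
        self.max_cost = max_cost
        self.max_repeats = max_repeats
        self.spent = 0.0
        self.counts = Counter()

    def observe(self, action_key: str, cost: float) -> Optional[str]:
        """Record one step; return a stop reason, or None to continue."""
        self.spent += cost
        self.counts[action_key] += 1
        if self.spent >= self.max_cost:
            return "budget_cap"
        if self.counts[action_key] >= self.max_repeats:
            return "repeated_action"
        return None
```

The budget check runs first, so a step that trips both conditions reports `budget_cap`. That ordering is a design choice, not something the source prescribes.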
Retrieval practice — recall, don't peek
Question 1: Failure-aware observability turns postmortem grading into…
Question 2: The framework's actual contribution is best described as…
Question 3: A streak of retrieval calls returning no new URLs is the signal for…
Question 4: The cost formula leaves its coefficients un-fixed so that…
Question 5 · spaced recall from Lesson 07: When a long agentic run is dominated by file reads and grep, the better attribution cut is…
… ToolErr = 0.7 at step 18 and the orchestrator reassigns the query, or how the evidence-support threshold re-baselines for numeric ground truth?
Next in Part 4: One ID Across the Trace, making those per-agent signals queryable by identity.