A study of 1,600+ production traces sorts the failures into three buckets — and most of them are design problems, not model problems. The prompt is the usual culprit.
Why this, for you: when a multi-agent system misbehaves, the instinct is to swap in a stronger
model. The data says the bigger lever is the design: 41.8% of failures trace to design issues, not model
capability. This lesson gives you the failure taxonomy as a diagnostic map and the single most common
self-inflicted wound — the over-prescriptive prompt.
The Multi-Agent System Failure Taxonomy (MAST) annotated over 1,600 production traces and sorted what
went wrong into three categories. Knowing which bucket a failure lands in tells you which fix to reach for.
1 The three failure categories
Category
What it looks like
Specification & design
Bad decomposition, unclear roles, missing constraints — 41.8% of failures trace to design, not capability
Inter-agent misalignment
Communication breakdowns, inconsistent goals, protocol violations — 36.9% of failures
Verification & termination
No independent check, premature or never-ending stop, self-verification bias
Three more failure modes appear across every topology: self-verification bias (an agent
confirms its own output — route to an independent evaluator), doom loops (10+ iterations on a broken
approach — needs loop detection and budget warnings), and context blindness (acting without
orientation — inject directory and tooling inventories at init).
The headline: capable frontier single-agents often match multi-agent orchestration
on SE tasks while avoiding the specification complexity. Adding agents multiplies LLM calls, attack surface, and
brittleness — when the quality delta over a strong single-agent is small, the cost rarely justifies it.
2 The self-inflicted wound: prescriptive prompts
Subagents receive only their own system prompt and the delegation message — not the lead's full context. So minor
wording changes in the lead's prompt cascade unpredictably. Anthropic observed it directly: "small
changes to the lead agent can unpredictably change how subagents behave." This isn't a bug — it's a property of
systems where agents interpret instructions rather than execute them.
A framework prompt defines division of labor, problem-solving heuristics, effort budgets, and quality criteria —
what "done" looks like, not how to get there. Effective multi-agent prompts encode "heuristics
rather than rigid rules."
# prescriptive — adding "thorough" makes subagents over-scale unpredictably"Thoroughly review every file for correctness, security, performance. List all findings."# framework — the effort budget caps behavior regardless of adjective"Review changed files only. Max 5 findings per category. One pass. Return structured JSON."
3 The mitigations that actually moved the number
Task granularity as isolation — decompose so each agent operates independently; shared state lets one behavior change cascade to all.
Explicit effort scaling — embed subagent count, search duration, and stopping conditions, or agents infer scope and over-invest (Anthropic's system once spawned 50 subagents for simple queries).
Cascade-aware testing — measure end-to-end behavior when prompts change; individual agent correctness does not predict system behavior.
Harness changes beat model changes
Loop detection, pre-completion checklists, and prompt adjustments — all harness-level, no model
change — collectively produced a 13.7-point improvement on Terminal-Bench 2.0. The fixes for multi-agent
failure live in the environment and the prompts, not in a bigger model.
↪ Your win: diagnose the bucket, fix the design
Sort failures into three buckets — specification/design, inter-agent misalignment, verification/termination.
Suspect the design before the model — 41.8% of failures are design, not capability.
Write framework prompts, not prescriptive ones — principles and effort budgets survive wording changes.
Embed effort budgets explicitly — or agents infer scope and spawn 50 workers for a simple query.
Test end-to-end on every prompt change — correct individual agents still combine into wrong systems.
Retrieval practice — recall, don't peek
Question 1The MAST taxonomy was built by annotating…
Question 2A large share of multi-agent failures trace to…
Question 3Compared with prescriptive prompts, framework prompts are…
Question 4Without explicit effort budgets, a research system once spawned…
Question 5 · spaced recall from Lesson 6Admission control fixes self-judgement bias by making "done" decided by…
Ask me anything. Want help rewriting a prescriptive orchestrator prompt into a framework prompt
with explicit effort budgets, or classifying a failure you've hit into the three MAST buckets? Next in Part 3:
Rainbow Deployments — shipping new agent versions without breaking live sessions.