A Field Guide to Multi-Agent Failure Modes

The MAST taxonomy from Cemri et al. (NeurIPS 2025) classifies 14 failure modes from 1,642 annotated traces. This is a guide to what breaks, where it enters a trace, and which interventions have measured effect sizes.

"The agents got confused."

"It went off the rails."

That is how most multi-agent post-mortems read. The vocabulary is too imprecise to be actionable. A planning failure requires a different fix from a communication failure or a verification failure, and treating them as a single category produces interventions that address none of them.

Cemri et al. (2025) built a taxonomy bottom-up from 1,642 annotated execution traces across seven popular frameworks. The result, published at NeurIPS 2025, is 14 failure modes in 3 categories, with inter-annotator agreement at a Cohen's kappa of 0.88.

The MAST taxonomy

The seven frameworks include AutoGPT, AgentVerse, MetaGPT, and ChatDev. Six expert annotators labeled each trace, with multiple rounds of refinement to resolve disagreements. A Cohen's kappa of 0.88 is conventionally read as strong agreement.

14 failure modes. 3 categories.

Category 1: Specification and system design (44.2%)

The largest category, and the most tractable.

These failures are introduced at design time. The agent does not break; it faithfully executes a flawed setup.

The five modes are: step repetition, the single most frequent at 15.7% of all failures; disobey task specification, at 11.8%; disobey role specification; loss of conversation history; and unaware of termination conditions.

All five are addressable before the coordination layer exists. A precise task spec, enforced roles, and explicit stop conditions prevent a large share of everything downstream.
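As a minimal sketch of what "precise spec, enforced roles, explicit stop conditions" could look like in code, the guard below checks each proposed step before it runs. Everything here is an assumption for illustration: `TaskSpec`, `SpecGuard`, the field names, and the default limits are hypothetical and do not come from the MAST paper or from any framework.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task spec. Field names are illustrative, not from MAST."""
    goal: str
    allowed_actions: dict[str, frozenset[str]]  # role -> actions that role may take
    max_steps: int = 20    # explicit stop condition (assumed default)
    max_repeats: int = 2   # identical steps tolerated before flagging (assumed default)


@dataclass
class SpecGuard:
    spec: TaskSpec
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def check(self, role: str, action: str, argument: str) -> str | None:
        """Return a violation label, or None if the step is allowed (and record it)."""
        # Role enforcement: reject actions outside the role's allowed set.
        if action not in self.spec.allowed_actions.get(role, frozenset()):
            return "disobey_role_specification"
        # Stop condition: the step budget is part of the spec, not left to the agent.
        if len(self.history) >= self.spec.max_steps:
            return "termination_condition_reached"
        # Step repetition: the same (role, action, argument) seen too many times.
        step = (role, action, argument)
        if self.history.count(step) >= self.spec.max_repeats:
            return "step_repetition"
        self.history.append(step)
        return None


spec = TaskSpec(
    goal="fix the failing test",
    allowed_actions={
        "coder": frozenset({"edit", "run_tests"}),
        "reviewer": frozenset({"comment"}),
    },
)
guard = SpecGuard(spec)
print(guard.check("coder", "edit", "app.py"))
print(guard.check("reviewer", "edit", "app.py"))
```

The point of the design is that the checks live outside the agents. A role or stop condition that exists only in a prompt is a request; one that is checked before each step is a constraint.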

Category 2: Inter-agent misalignment (32.3%)

These are the failures unique to having more than one agent.

Conversation resets. Agents that withhold information another agent needs. Task derailment. Agents that ignore each other's output. Reasoning-action mismatch (13.2%), where an agent decides one thing and does another.

Every one of these is impossible in a single-agent system.

Cognition identified the mechanism: "Actions carry implicit decisions, and conflicting decisions carry bad results." (Walden Yan, "Don't Build Multi-Agents," June 2025.) When agents operate from partial context, their decisions conflict in ways that are not visible to any individual agent.

Mitigation requires sharing full agent execution traces, not just inter-agent messages. This is architectural, not a configuration change.
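To make the difference concrete, here is a small sketch contrasting what an agent sees under message passing with what it sees under a shared trace. The `TraceEvent` and `SharedTrace` types, the event kinds, and the agent names are all hypothetical; real frameworks structure their traces differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TraceEvent:
    agent: str
    kind: str     # assumed kinds: "reasoning", "tool_call", "message"
    content: str


@dataclass
class SharedTrace:
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, agent: str, kind: str, content: str) -> None:
        self.events.append(TraceEvent(agent, kind, content))

    def messages_only(self, reader: str) -> list[TraceEvent]:
        """Message-passing view: only messages sent by other agents."""
        return [e for e in self.events if e.kind == "message" and e.agent != reader]

    def full_context(self) -> list[TraceEvent]:
        """Shared-trace view: every event from every agent, in order."""
        return list(self.events)


trace = SharedTrace()
trace.record("background_agent", "reasoning", "Going with a dark, pixel-art style.")
trace.record("background_agent", "tool_call", "render(style='pixel', palette='dark')")
trace.record("background_agent", "message", "Background is done.")

# The second agent must now pick a style for the foreground.
print(len(trace.messages_only("sprite_agent")))  # the style decision is not in this view
print(len(trace.full_context()))                 # the style decision is in this view
```

In the message-only view, the implicit decision about style never reaches the second agent, so it cannot avoid contradicting it. That is why the fix is architectural: the data path between agents has to change, not a setting.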

Category 3: Verification and termination (23.5%)

The smallest category, and often the highest-leverage to fix.

Premature termination. No verification. Incorrect verification.

Cemri et al. tested a direct intervention on ChatDev: adding a high-level verification step improved task success by 15.6 percentage points. Tightening role specifications improved it by 9.4 percentage points.

A verification step is contained, measurable, and has the strongest documented effect size in the MAST intervention studies.
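A verification step can be as small as a gate between "the agent says it is done" and "the system reports success". The sketch below is one possible shape, not the intervention Cemri et al. ran on ChatDev: `run_with_verification`, the stub producer, and the JSON check are all hypothetical stand-ins for a real agent call and a real task-level check.

```python
from __future__ import annotations

import json
from typing import Callable


def run_with_verification(
    produce: Callable[[str], str],
    verify: Callable[[str], list[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Loop produce -> verify. Report success only when verify finds no problems."""
    feedback = ""
    output = ""
    for _ in range(max_rounds):
        output = produce(task + feedback)
        problems = verify(output)
        if not problems:
            return output, True
        # Feed the concrete problems back instead of silently accepting the output.
        feedback = "\nFix these problems: " + "; ".join(problems)
    return output, False  # budget exhausted: fail loudly, do not claim success


def stub_produce(prompt: str) -> str:
    # Stand-in for an agent call: it only includes the result after feedback.
    if "Fix these problems" in prompt:
        return '{"status": "done", "result": 5}'
    return '{"status": "done"}'


def stub_verify(output: str) -> list[str]:
    # Task-level check: valid JSON object that actually contains a result.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict) or "result" not in data:
        return ["missing required key 'result'"]
    return []


output, ok = run_with_verification(stub_produce, stub_verify, "Return the sum of 2 and 3 as JSON.")
print(ok, output)
```

The gate covers all three modes in this category: the run cannot end before a check happens, the check always runs, and a check written against the task itself is harder to get wrong than one that only asks whether the agent reported being done.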

The detection problem

A taxonomy tells you what to look for. Detecting failures automatically after the fact is a separate, harder problem.

Zhang et al. (ICML 2025) benchmarked automated failure attribution across 127 multi-agent systems. The best method reached 53.5% accuracy at identifying the responsible agent and 14.2% at pinpointing the responsible step. Frontier reasoning models performed below the automated baseline on step attribution.

Failures are usually cascades. An early specification ambiguity surfaces ten steps later as a verification failure. The trace does not announce the link.
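The gap between the two accuracy figures is easier to see with the metrics written out. This is a sketch under assumptions: the `Attribution` type and the exact-match scoring are illustrative, and the benchmark's own scoring rules may differ in detail.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    agent: str  # agent blamed for the failure
    step: int   # index of the decisive step in the trace


def attribution_accuracy(
    predicted: list[Attribution], actual: list[Attribution]
) -> tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy) using exact match."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need one prediction per labeled trace")
    agent_hits = sum(p.agent == a.agent for p, a in zip(predicted, actual))
    step_hits = sum(p.step == a.step for p, a in zip(predicted, actual))
    return agent_hits / len(actual), step_hits / len(actual)


# Made-up labels for four traces, purely to exercise the function.
predicted = [
    Attribution("planner", 2),
    Attribution("coder", 5),
    Attribution("coder", 7),
    Attribution("tester", 9),
]
actual = [
    Attribution("planner", 2),
    Attribution("coder", 4),
    Attribution("planner", 1),
    Attribution("reviewer", 3),
]
print(attribution_accuracy(predicted, actual))
```

Naming the right agent only requires choosing among a handful of candidates. Naming the right step requires choosing one index out of the whole trace, and in a cascade the step where the failure becomes visible is often not the step where it entered.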

Where to spend effort, in order

Fix specification failures first. They are the cheapest, they are front-loaded, and preventing them reduces exposure across all three categories.

Add a verification step. It is contained, has a measured effect size, and is the most straightforward architectural addition.

Address misalignment structurally. Share full agent execution traces rather than individual messages. This requires architectural commitment, not a single patch.

Originally published on dev.to
