Do You Actually Need a Multi-Agent System?

Multi-agent AI fails on 41–87% of tasks and uses roughly 15x the tokens of a standard chat turn. Here is a structured framework for deciding when the complexity is worth it.

Multi-agent AI fails between 41% and 87% of the time across state-of-the-art open-source frameworks.

That is not a fringe finding. It is the headline result of a study that annotated 1,642 real execution traces across seven frameworks, published at NeurIPS 2025.

Before building a multi-agent system, it is worth asking whether you need one.

The four-rung ladder

Think of this as a ladder of increasing autonomy. Each rung adds capability. Each rung also adds cost and failure surface.

Rung 1: Single prompt. One model call, a good system prompt, maybe a few examples. More production value lives here than most teams expect.

Rung 2: Workflow. Multiple calls on control flow you wrote. The steps are fixed, legible, and debuggable. Cheap to run, cheap to fix.

Rung 3: Single agent. The model directs its own tool use. This is a meaningful capability increase for open-ended tasks that fixed workflows cannot handle. It is also where the first hard failures appear: loops, runaway tool use, premature completion.

Rung 4: Multi-agent system. Several agents coordinate, which brings a new category of failures that cannot occur on the rungs below.
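The difference between rungs 2 and 3 is easiest to see in code. The sketch below is illustrative only: the `Model` callable, the `FINAL:` reply convention, and the "tool name, space, argument" format are assumptions made for this example, not any framework's API. In the workflow, the code decides what happens next. In the agent, the model decides and the loop only enforces guards against two of the failures named above (loops and runaway tool use).

```python
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
Model = Callable[[str], str]
Tool = Callable[[str], str]


def workflow(model: Model, document: str) -> str:
    """Rung 2: the steps and their order are fixed in code."""
    summary = model(f"Summarize: {document}")
    return model(f"Translate to French: {summary}")


def single_agent(model: Model, tools: dict[str, Tool], task: str,
                 max_steps: int = 8) -> str:
    """Rung 3: the model picks each action; the loop only enforces guards."""
    history = [task]
    seen_calls: set[tuple[str, str]] = set()
    for _ in range(max_steps):  # guard against runaway tool use
        reply = model("\n".join(history))
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Assumed reply format: "<tool name> <argument>"
        name, _, arg = reply.partition(" ")
        if name not in tools:
            history.append(f"error: unknown tool {name!r}")
            continue
        if (name, arg) in seen_calls:  # guard against loops
            raise RuntimeError(f"loop detected: repeated call {name}({arg!r})")
        seen_calls.add((name, arg))
        history.append(f"{name} -> {tools[name](arg)}")
    raise RuntimeError("step budget exhausted without a final answer")
```

Neither guard catches premature completion. That failure needs a check on the answer itself, which is part of why it is harder to prevent.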

Anthropic's June 2025 report on its multi-agent research system puts the token multiplier at roughly 4x for a single agent and roughly 15x for a full multi-agent system, both relative to a standard chat turn.
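A back-of-the-envelope estimate makes those multipliers concrete. The multipliers below are the ones quoted above. The function name, the chat-turn size, and the price per million tokens are placeholders chosen for illustration, so substitute your own numbers.

```python
# Multipliers relative to a standard chat turn, as quoted above.
# The source gives no multiplier for rungs 1 and 2, so they are left out.
TOKEN_MULTIPLIER = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def estimate_cost(architecture: str, chat_tokens: int,
                  usd_per_million_tokens: float) -> float:
    """Estimate the cost of one task, scaled from a chat-turn baseline."""
    tokens = chat_tokens * TOKEN_MULTIPLIER[architecture]
    return tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: a 2,000-token chat turn at $5 per million tokens.
    for name in TOKEN_MULTIPLIER:
        print(f"{name}: ${estimate_cost(name, 2000, 5.0):.2f}")
```

With those placeholder inputs, the estimate comes to $0.01 for a chat turn, $0.04 for a single agent, and $0.15 for a multi-agent system. Because both multipliers share the chat-turn baseline, the step from single agent to multi-agent is about 3.75x (15 / 4), not 15x.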

The coordination tax

When you split a task across agents that do not share full context, each agent makes decisions from a partial view.

Cognition named the mechanism directly: "Actions carry implicit decisions, and conflicting decisions carry bad results." (Walden Yan, Don't Build Multi-Agents, June 2025.)

The MAST study sized it: inter-agent misalignment accounts for 32.3% of observed multi-agent failures. Roughly one in three failures is of a kind that cannot occur in a single-agent system.

Three signals that multi-agent earns its complexity

You should climb to rung four when at least one of these is true:

The work is genuinely parallel. Independent subtasks running concurrently convert latency into token spend. That is often the right trade for a user waiting on a result.

Contexts need isolation. When you want an independent review, or need to keep a noisy tool output out of the main reasoning thread, separation is the feature, not a workaround.

The task exceeds one window or one skillset. When the roles map to real boundaries in the work, decomposition is not optional.
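On the first signal, the latency-for-tokens trade can be simulated in a few lines. This is a toy model, not a benchmark: the subtasks, their latencies, their token counts, and the `brief_tokens` overhead (the cost of each subagent reading a shared brief) are all invented for illustration.

```python
import asyncio
import time

# Hypothetical subtasks: (name, simulated latency in seconds, tokens consumed).
SUBTASKS = [("search", 0.05, 1000), ("read", 0.05, 1000), ("compare", 0.05, 1000)]


async def run_subtask(name: str, seconds: float, tokens: int) -> int:
    """Stand-in for a subagent: sleeps to simulate latency, returns tokens used."""
    await asyncio.sleep(seconds)
    return tokens


async def run_sequential(subtasks):
    """One agent works through the subtasks in order, sharing one context."""
    start = time.perf_counter()
    used = [await run_subtask(*s) for s in subtasks]
    return time.perf_counter() - start, sum(used)


async def run_parallel(subtasks, brief_tokens: int):
    """Subagents run concurrently; each one also pays to read the shared brief."""
    start = time.perf_counter()
    used = await asyncio.gather(*(run_subtask(*s) for s in subtasks))
    return time.perf_counter() - start, sum(used) + brief_tokens * len(subtasks)


if __name__ == "__main__":
    seq_time, seq_tokens = asyncio.run(run_sequential(SUBTASKS))
    par_time, par_tokens = asyncio.run(run_parallel(SUBTASKS, brief_tokens=200))
    print(f"sequential: {seq_time:.2f}s, {seq_tokens} tokens")
    print(f"parallel:   {par_time:.2f}s, {par_tokens} tokens")
```

The parallel run finishes in about the time of the slowest subtask rather than the sum of all three, and it spends more tokens to do so. The trade only pays off when the subtasks really are independent.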

If none of these hold, a single agent or a fixed workflow will be more reliable and cost less.
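The rule above is simple enough to write down as a checklist. In this sketch the three multi-agent signals come straight from the list above. The lower-rung conditions (`open_ended`, `multi_step`) are one reading of the ladder described earlier, and the names `TaskProfile` and `recommend_rung` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    # The three signals that justify rung 4.
    parallel_subtasks: bool = False
    needs_context_isolation: bool = False
    exceeds_one_window_or_skillset: bool = False
    # Assumed criteria for the lower rungs, inferred from the ladder.
    open_ended: bool = False   # steps cannot be fixed in advance
    multi_step: bool = False   # needs more than one model call


def recommend_rung(p: TaskProfile) -> str:
    """Return the lowest rung that fits the task profile."""
    if (p.parallel_subtasks or p.needs_context_isolation
            or p.exceeds_one_window_or_skillset):
        return "multi-agent system"
    if p.open_ended:
        return "single agent"
    if p.multi_step:
        return "workflow"
    return "single prompt"
```

For example, `recommend_rung(TaskProfile(multi_step=True))` returns `"workflow"`, while setting any one of the three signals returns `"multi-agent system"`. A real decision also weighs budget and reliability targets, which this checklist ignores.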

The number worth keeping

The best published result on automated multi-agent failure attribution: 53.5% accuracy at identifying the responsible agent, 14.2% at pinpointing the responsible step. Frontier reasoning models performed below the automated baseline on both metrics. (Zhang et al., Who Causes Task Failures and When? ICML 2025, n=127 systems.)
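To make the two metrics concrete, here is a minimal sketch of how they could be computed. It assumes agent-level accuracy means naming the right agent and step-level accuracy means naming the exact step, so the paper's precise definitions may differ. The function name and the toy labels are made up for illustration and are not drawn from the study.

```python
def attribution_accuracy(predictions, ground_truth):
    """Compute (agent-level, step-level) accuracy over failed traces.

    Each item is an (agent_name, step_index) pair for one failed trace.
    """
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need one prediction per labeled trace")
    n = len(ground_truth)
    agent_hits = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth))
    step_hits = sum(p[1] == g[1] for p, g in zip(predictions, ground_truth))
    return agent_hits / n, step_hits / n


# Toy labels, invented for this example.
truth = [("planner", 2), ("coder", 7), ("coder", 4), ("verifier", 9)]
preds = [("planner", 5), ("coder", 7), ("coder", 6), ("planner", 3)]

agent_acc, step_acc = attribution_accuracy(preds, truth)
```

On this toy data the attributor names the right agent in three of four traces but the right step in only one, which mirrors the gap between the two published figures: blaming an agent is much easier than finding the step where things went wrong.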

Failures are usually cascades. An early specification ambiguity surfaces ten steps later as a verification failure. The trace does not announce the link.

The specification failure category accounts for 44.2% of all multi-agent failures in the MAST corpus, and every one of those modes is introduced at design time, before the coordination layer is built. That is where reliable prevention is feasible.

Multi-agent systems use roughly 15x the tokens of a standard chat turn (Anthropic, June 2025) and fail on 41–87% of tasks across seven popular frameworks (Cemri et al., NeurIPS 2025, n=1,642 traces). The argument for a simpler architecture is both economic and structural: inter-agent misalignment, a failure class absent from single-agent systems, accounts for 32.3% of multi-agent failures.

Originally published on dev.to

More on the failure taxonomy at /taxonomy. Detector benchmarks at /benchmarks/detectors. Framework adapters at /frameworks.