Empty Success: how cheaper multi-agent orchestration quietly loses your state
Splitting an AI system into multiple agents and having the cheap ones summarize what they pass along looks like a free way to cut costs. We tested it. When agents have to carry precise data, a short natural-language summary between them loses most or all of that data, and the cheaper the carrier and the longer the chain, the more often it loses everything. Every step still returns a normal success response, so nothing looks broken. We call that failure mode empty success, and it is the one multi-agent systems are least equipped to see.
The pitch for multi-agent orchestration is a cost pitch. Split the work across specialized agents, route the cheap legs to cheap models, and compress what each agent hands to the next so the expensive orchestrator never has to re-read a wall of intermediate output. Every one of those moves lowers the bill. Every one of them also erodes the thing the system exists to produce: an accurate answer assembled from state that passed through many hands.
This is the part the architecture diagrams leave out. The cost savings and the reliability are the same dial. Turn it toward savings and you are turning it away from fidelity, because the savings come from compression, and the one thing you cannot compress is state that has to arrive intact.
We ran the experiment to find where the dial breaks. It breaks earlier, and far more quietly, than the tidy diagrams suggest.
The test
We built a pipeline that has to carry a precise ledger. Twelve accounts, each with an exact balance and a status, handed from an extractor through a chain of relay agents to a final agent that has to answer questions only the state can settle. The ground truth is defined in code, so grading is arithmetic rather than opinion.
We varied exactly two things: the discipline of the handoff, and the model doing the carrying. Three disciplines. Verbatim: reproduce everything, no compression. Structured: emit the state as complete JSON. Summary: write a short natural-language note for the next agent, which is what “just have the cheap agent summarize” actually looks like in production. Two models: Claude Opus 4.8 and Claude Haiku 4.5, the latter one-fifth the price per token. Then we scaled the whole thing: twenty accounts instead of twelve, five hops instead of three. Ten runs per cell.
The result
The structured handoff never lost a record. Across all forty structured runs, both models, three hops and five, twelve records and twenty, every account arrived with its exact balance and status. Not one run dropped a single field. Structure survives distance, and it does so deterministically.
The summary handoff falls apart, and it falls apart in the worst possible way. There is no graceful slide toward lower fidelity. Each run is a weighted coin flip that either preserves the whole ledger or loses all of it, and the coin lands on total loss more often as the chain deepens and the carrier gets cheaper.
| Handoff | Model | 12 records, 3 hops | 20 records, 5 hops |
|---|---|---|---|
| Structured | Opus 4.8 | 100% | 100% |
| Structured | Haiku 4.5 | 100% | 100% |
| Summary | Opus 4.8 | 80% (1 in 5 runs empty) | 50% (half of runs empty) |
| Summary | Haiku 4.5 | 3% (4 in 5 runs empty) | 2% (9 in 10 runs empty) |
Record fidelity: fraction of accounts arriving with exact balance and status, mean of 10 runs. “Empty” is the share of runs that lost the entire ledger.
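The two numbers in each table cell can be computed along the following lines. This is a minimal sketch, not the harness we ran: the `Account` type, the function names, and the sample values are hypothetical, and treating a zero score as "empty" is a simplifying assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Account:
    account_id: str
    balance: str  # kept as a string so "exact" means character-for-character
    status: str


def record_fidelity(truth, delivered):
    """Fraction of ground-truth accounts that arrived with exact balance and status."""
    arrived = {a.account_id: a for a in delivered}
    exact = sum(1 for t in truth if arrived.get(t.account_id) == t)
    return exact / len(truth)


def summarize_cell(truth, runs):
    """Return (mean fidelity, share of runs in which no account arrived intact)."""
    scores = [record_fidelity(truth, delivered) for delivered in runs]
    empty_share = sum(1 for s in scores if s == 0.0) / len(scores)
    return mean(scores), empty_share


# Hypothetical two-account ledger and a bimodal cell: five perfect runs, five empty ones.
truth = [Account("A1", "1204.50", "active"), Account("A2", "0.00", "frozen")]
runs = [truth] * 5 + [[]] * 5
mean_fidelity, empty_share = summarize_cell(truth, runs)
```

Reporting the empty share next to the mean is the design choice that matters here: a mean of one half describes no individual run in a cell where every run is either perfect or empty.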
The averages hide the shape, and the shape is the finding. A summary handoff on Opus across five hops delivered a perfect ledger in half the runs and an empty one in the other half. Averaged, that reads as a fifty percent system; in practice it works in the demo and then, in production, silently zeroes out one request in two with no warning that it did so. On Haiku, the cheap carrier, summaries collapsed to nothing in eight or nine runs out of ten. A mean fidelity of three percent flatters it.
Now the part that should trouble anyone optimizing a bill. The summary handoff is the cheapest configuration by a wide margin. At scale, Haiku-summary moved 556 handoff tokens where structured moved 2,255, about seventy-five percent fewer. Call that seventy-five percent efficiency if you like; it is the state that went missing. The tokens you saved were the balances you dropped.
The cheapest, leanest, fastest configuration in the experiment is also the one that loses everything, and it does so while every single API call returns success.
Empty success
We call this failure mode empty success. Every agent in the summary chain returned a valid response. HTTP 200. Well-formed output. A green checkmark at every hop. The orchestrator has no signal that anything went wrong, because at the level orchestrators usually monitor, nothing did. The failure lives one layer down, in the payload, where the balances used to be.
The mechanism is not subtle. A summary is lossy compression, and lossy compression of a ledger is a ledger with the numbers rubbed out. Structured handoffs preserve state precisely because they refuse to compress the part that matters, and they cost more tokens for exactly that reason. Every token a handoff saves is taken from the fidelity of what it carries, and past a threshold there is nothing left to carry. Cheaper models cross that threshold sooner, because a smaller model told to “summarize” will paraphrase away precision faster than a larger one.
There is a second-order finding worth flagging before anyone declares structured handoffs a solved problem. Even perfect state is not sufficient. Haiku carrying twenty records preserved every one of them, then answered only seventy-two percent of the aggregate questions over them correctly, missing more than a quarter. The state arrived; the arithmetic did not. Accurate orchestration needs the acting agent to compute with tools rather than in its head, a separate discipline from carrying state and an equally easy one to skip.
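Computing with tools can be as small as handing the agent deterministic functions over the delivered ledger instead of asking it to add in prose. A minimal sketch, assuming the ledger arrives as a list of dicts with `account_id`, `balance`, and `status` keys; the function names and sample values are hypothetical.

```python
from decimal import Decimal
from typing import Optional


def total_balance(ledger, status: Optional[str] = None) -> Decimal:
    """Sum balances exactly, optionally restricted to a single status."""
    return sum(
        (
            Decimal(row["balance"])
            for row in ledger
            if status is None or row["status"] == status
        ),
        Decimal("0"),
    )


def count_by_status(ledger) -> dict:
    """Count accounts per status value."""
    counts = {}
    for row in ledger:
        counts[row["status"]] = counts.get(row["status"], 0) + 1
    return counts


# Hypothetical three-account ledger; balances are strings so Decimal parses them exactly.
ledger = [
    {"account_id": "A1", "balance": "1204.50", "status": "active"},
    {"account_id": "A2", "balance": "310.25", "status": "frozen"},
    {"account_id": "A3", "balance": "95.25", "status": "active"},
]
active_total = total_balance(ledger, status="active")
```

`Decimal` rather than `float` keeps the sums exact, which matters when the grader compares balances to the cent.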
A second road to the same place
The state experiment shows one path to empty success. We found another by accident, from a completely different mechanism, while benchmarking models for the same pipeline.
Claude Fable 5, Anthropic’s most capable released model, refused to fix a bug in a twelve-line interval-merging function. Ten times out of ten, deterministically, with the refusal category reported as “cyber” and the content empty. The identical request on Opus 4.8 succeeded five times out of five. The trigger was the code itself. The word “bug” and the request to fix were incidental: a request merely to explain what the same function does was refused just as reliably, while a description of the task in prose with no code attached was not refused at all. A second, unrelated benign snippet (a one-line string reverser) reproduced the effect.
Drop an agent like that into a pipeline and you have manufactured empty success by a new route. The model returns HTTP 200. The stop reason is “refusal,” a field almost no orchestration harness inspects. The content is empty. The next agent receives nothing, and if it sits in a summary chain, it will cheerfully summarize the nothing. Anthropic documents a fix, a server-side fallback to another model, which recovered the request cleanly in our test. The fix only helps integrators who know to configure it, which is the same population that already knows to check payloads instead of status codes.
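A harness can turn that silent case into a loud one by inspecting the stop reason and the content before anything is forwarded. The sketch below assumes a simplified, hypothetical response shape (a dict with `stop_reason` and `content` keys); real SDK response objects are structured differently, so treat the field access as illustrative only.

```python
class EmptySuccess(Exception):
    """The call succeeded at the transport level but delivered nothing usable."""


def check_agent_response(response: dict) -> str:
    """Return the response text, or raise if the agent refused or sent nothing.

    `response` is a hypothetical simplified shape:
    {"stop_reason": str, "content": str}.
    """
    if response.get("stop_reason") == "refusal":
        raise EmptySuccess("agent refused; nothing to hand off")
    text = (response.get("content") or "").strip()
    if not text:
        raise EmptySuccess("agent returned empty content")
    return text


# Usage: call this at every hop, before the next agent sees the payload.
ok_text = check_agent_response({"stop_reason": "end_turn", "content": "A1: 1204.50 active"})
```

Raising instead of returning an empty string means the orchestrator has to handle the case explicitly, for example by retrying on a fallback model.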
The same failure, two mechanisms
In one, an agent compresses state into oblivion. In the other, a classifier blanks it. The signature is identical: a successful response that carries nothing the next agent can use, and a monitoring layer that sees only success. This is the failure class multi-agent systems are most exposed to and least instrumented for. The MAST and Who&When taxonomies that catalog agent failures rank inter-agent information loss near the top for a reason. It is structurally invisible to the tools most teams point at their pipelines, because those tools watch the transport and the failure lives in the cargo.
The engineering response is plain and non-negotiable.
- Stop trusting status codes as proxies for delivered work. Validate the payload at every boundary against a schema that knows what the state is supposed to contain. A 200 with an empty or truncated body should page you, not pass.
- Keep authoritative state out of the model’s mouth. Hold the ledger in the orchestrator’s memory and pass each agent the slice it needs plus its decision back. State that is never regenerated is state that cannot be paraphrased away.
- Do not compress what has to arrive exact. Handoff compression is a real lever for real savings, but on state that has to arrive intact, the saving and the accuracy are the same tokens. You cannot keep both.
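The first two points above can be sketched together: a boundary check that rejects an empty or incomplete ledger, and an orchestrator that keeps the authoritative copy and hands out slices. This is a minimal sketch under stated assumptions; the field names, `HandoffError`, `validate_handoff`, and `slice_for_agent` are all hypothetical, and a production system would likely use a fuller schema validator.

```python
import json

REQUIRED_FIELDS = ("account_id", "balance", "status")


class HandoffError(Exception):
    """A handoff payload failed validation at an agent boundary."""


def validate_handoff(raw: str, expected_ids: set) -> list:
    """Parse a structured handoff and confirm every expected account is present."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise HandoffError(f"payload is not valid JSON: {exc}") from exc
    if not isinstance(payload, list) or not payload:
        raise HandoffError("payload is empty or not a list of records")
    for row in payload:
        if not isinstance(row, dict):
            raise HandoffError("record is not an object")
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            raise HandoffError(f"record is missing fields: {missing}")
    got = {row["account_id"] for row in payload}
    if got != expected_ids:
        raise HandoffError(
            f"missing accounts: {sorted(expected_ids - got)}; "
            f"unexpected accounts: {sorted(got - expected_ids)}"
        )
    return payload


def slice_for_agent(ledger: dict, account_ids: list) -> list:
    """Copy out only the records one agent needs; the orchestrator keeps the original."""
    return [dict(ledger[i]) for i in account_ids]


# Hypothetical orchestrator-held ledger, keyed by account id.
ledger = {
    "A1": {"account_id": "A1", "balance": "1204.50", "status": "active"},
    "A2": {"account_id": "A2", "balance": "0.00", "status": "frozen"},
}
checked = validate_handoff(json.dumps(list(ledger.values())), set(ledger))
```

Because `slice_for_agent` returns copies, an agent that mangles its slice cannot corrupt the authoritative ledger; only its decision needs to travel back.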
What we are and are not claiming
The usual caveats, stated plainly. This is one provider’s model family, a synthetic ledger, and ten runs per cell. The summary cells are bimodal, perfect or empty, so their means carry wide variance; that bimodality is itself the finding, not noise to be averaged away. We are less interested in the exact percentages than in the shape, and the shape is stable: structured survives, summary collapses, cheap models and deep chains both make it worse, and all of it happens behind a green checkmark. The production numbers for a given system will depend on that system’s state size, hop count, and task. The direction strengthened when we scaled from three hops to five, which is the property you want in a finding you intend to build on.
The plumbing was always the hard part
The industry is spending 2026 discovering that the difficulty of agents was never the reasoning. It is the plumbing between the reasoners. Every orchestration framework ships a cost story, and the cost story is true: you can make these systems dramatically cheaper by routing down and compressing across. What the cost story omits is that the savings are collateralized by fidelity, and that the collateral is called silently, in the payload, behind a wall of green checkmarks.
Building agents that work is, to a first approximation, the work of making that silent failure loud. That is the whole problem, and it is the one Pisama exists to solve.