Why your multi-agent system fails silently, and how to detect it

Most multi-agent failures are silent: no exception, no log line, just a wrong answer or a stuck run. Here is the structural taxonomy and how to catch them.

Single-agent failures are loud. The model raises an exception, returns malformed JSON, or hits a token limit. You see it in your logs and you fix it.

Multi-agent failures are quiet. Two agents loop on each other for 14 turns and the run "completes" with a useless answer. Shared state gets corrupted at step 7 and the final agent works from a wrong assumption. The planner agent decomposes the task badly and every downstream agent does its job correctly while the overall outcome is wrong.

No exception fires for any of these. Your trace looks fine. The user gets a bad answer.

Why silent multi-agent failures are different

After analysing 7,212 labelled agent traces from 13 external datasets, we found that silent multi-agent failures cluster into 17 distinct modes. Each has a structural signature. Each can be detected with pattern matching, no LLM judge required, in milliseconds.

The five highest-frequency modes account for 60% of silent failures: loops (state recurrence across turns), corruption (state schema changes mid-run), persona drift (agent output diverges from declared role), context neglect (key entities from upstream context never referenced), and coordination breakdown (agent A speaks, agent B never acknowledges).

The 17 failure modes cluster into a small set

For loops: hash the (sender, receiver, content-fingerprint) tuple per turn; recurrence within a window is a loop. Subsequence matching catches longer cycles. No LLM call needed.
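The loop check described above can be sketched roughly as follows. This is a minimal illustration, not the project's detector: the `Turn` type, the normalization step, the window size, and the function names are all assumptions made for the example.

```python
import hashlib
from collections import deque
from typing import NamedTuple, Optional, Sequence


class Turn(NamedTuple):
    sender: str
    receiver: str
    content: str


def fingerprint(turn: Turn) -> str:
    """Hash (sender, receiver, normalized content) into a short key."""
    # Assumed normalization: lowercase and collapse whitespace.
    normalized = " ".join(turn.content.lower().split())
    raw = f"{turn.sender}\x1f{turn.receiver}\x1f{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def find_recurrence(turns: Sequence[Turn], window: int = 6) -> Optional[int]:
    """Index of the first turn whose fingerprint already appeared
    within the previous `window` turns, or None."""
    recent = deque(maxlen=window)
    for i, turn in enumerate(turns):
        key = fingerprint(turn)
        if key in recent:
            return i
        recent.append(key)
    return None


def find_cycle(turns: Sequence[Turn], max_period: int = 4,
               repeats: int = 2) -> Optional[int]:
    """Shortest period p for which the trace ends with the same
    p-turn subsequence repeated `repeats` times, or None."""
    keys = [fingerprint(t) for t in turns]
    for period in range(1, max_period + 1):
        span = period * repeats
        if len(keys) < span:
            break
        tail = keys[-span:]
        if all(tail[i] == tail[i % period] for i in range(span)):
            return period
    return None


trace = [
    Turn("planner", "coder", "Please fix the failing test."),
    Turn("coder", "planner", "I need more details."),
    Turn("planner", "coder", "please fix the  failing test."),
    Turn("coder", "planner", "I need more details."),
]
loop_at = find_recurrence(trace)
cycle_period = find_cycle(trace)
```

Exact hashing only catches verbatim repeats after normalization; paraphrased loops would need a fuzzier content fingerprint, which is outside this sketch.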

Each mode has a structural signature

For corruption: snapshot state schema at every step transition; flag type changes, missing keys, or shape drift. The detector lives in tier 1 of a 5-tier escalation pipeline (hash, delta, embeddings, LLM judge, human review).
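As a rough sketch of the schema snapshot idea, the code below reduces a state object to its shape and diffs two snapshots. The helper names `schema_of` and `diff_schema`, the way lists are summarized, and the finding strings are illustrative assumptions, not the pipeline's actual interface.

```python
from typing import Any, List


def schema_of(value: Any) -> Any:
    """Reduce a value to its shape: nested keys and type names only."""
    if isinstance(value, dict):
        return {str(k): schema_of(v) for k, v in value.items()}
    if isinstance(value, list):
        # Assumption: a list is summarized by the set of element shapes.
        kinds = sorted({repr(schema_of(v)) for v in value})
        return "list[" + ",".join(kinds) + "]"
    return type(value).__name__


def diff_schema(before: Any, after: Any, path: str = "") -> List[str]:
    """Compare two schema snapshots and list structural changes."""
    findings: List[str] = []
    if isinstance(before, dict) and isinstance(after, dict):
        for key in before:
            child = f"{path}.{key}" if path else key
            if key not in after:
                findings.append(f"missing key: {child}")
            else:
                findings.extend(diff_schema(before[key], after[key], child))
        for key in after:
            if key not in before:
                child = f"{path}.{key}" if path else key
                findings.append(f"unexpected key: {child}")
    elif before != after:
        where = path or "<root>"
        findings.append(f"type change at {where}: {before} -> {after}")
    return findings


step_6 = {"order_id": 17, "items": [{"sku": "a", "qty": 1}], "total": 9.5}
step_7 = {"order_id": "17", "items": [{"sku": "b", "qty": 3}]}
findings = diff_schema(schema_of(step_6), schema_of(step_7))
```

Legitimate transitions, such as a field going from `None` to a value or an empty list being filled, would also be reported by this sketch, so a real detector would likely need an allowlist per state field.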

For persona drift: compare output against the agent's declared role/instructions using embedding similarity. Tune the threshold per agent. Escalate to an LLM judge only for genuinely ambiguous cases.
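A minimal sketch of that comparison is shown below. The embedding here is a bag-of-words placeholder so the example runs without a model; in practice `embed` would be swapped for a real embedding function. The two thresholds and the names `toy_embed` and `check_persona` are assumptions for illustration, and the values would need tuning per agent.

```python
import math
import re
from collections import Counter
from typing import Callable, Mapping

Vector = Mapping[str, float]


def toy_embed(text: str) -> Vector:
    """Placeholder embedding: token counts. Not a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def check_persona(role: str, output: str,
                  embed: Callable[[str], Vector] = toy_embed,
                  drift_below: float = 0.15,
                  ok_above: float = 0.35) -> str:
    """Return 'ok', 'drift', or 'escalate' for the ambiguous middle band."""
    score = cosine(embed(role), embed(output))
    if score >= ok_above:
        return "ok"
    if score < drift_below:
        return "drift"
    return "escalate"


role = ("You are a SQL reviewer. Review SQL queries for "
        "correctness and performance.")
on_role = ("This SQL query is missing an index; performance will suffer. "
           "Review the join for correctness.")
off_role = "Here is a poem about autumn leaves drifting over the river."

on_role_verdict = check_persona(role, on_role)
off_role_verdict = check_persona(role, off_role)
```

The middle band between the two thresholds is where a more expensive tier, such as an LLM judge, would be consulted; everything outside it is decided without a model call.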

For context neglect: extract critical entities from upstream context (numbers, dates, IDs, items tagged CRITICAL) and verify they appear in downstream agent outputs. Missing critical entities are flagged.
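The entity check above might look something like the following sketch. The regular expressions, the `CRITICAL:` tag syntax, and the case-insensitive substring match are assumptions chosen for the example; real traces would need patterns matched to their own ID and date formats.

```python
import re
from typing import List, Set, Tuple

# Assumed patterns for illustration only.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "id": re.compile(r"\b[A-Z]{2,}-\d+\b"),
    # Standalone numbers, excluding digits that are part of dates or IDs.
    "number": re.compile(r"(?<![\w.-])\d+(?:\.\d+)?(?![\w-])"),
    "critical": re.compile(r"CRITICAL:\s*([^\n.;]+)"),
}

Entity = Tuple[str, str]


def extract_entities(text: str) -> Set[Entity]:
    """Collect (kind, value) pairs for every pattern match in the text."""
    found: Set[Entity] = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            found.add((kind, match.strip()))
    return found


def neglected(upstream: str, downstream: str) -> List[Entity]:
    """Upstream entities that never appear in the downstream output."""
    haystack = downstream.lower()
    return sorted(
        entity for entity in extract_entities(upstream)
        if entity[1].lower() not in haystack
    )


upstream = ("Refund order ORD-4471 placed 2024-03-09 for 129.50 USD. "
            "CRITICAL: do not contact the customer by phone.")
downstream = "Issued refund for ORD-4471 of 129.50 USD."
missing = neglected(upstream, downstream)
```

Substring matching is deliberately crude: a short number such as `12` would count as present inside `129.50`, and a paraphrased CRITICAL instruction would be reported as missing.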

For coordination breakdown: count cross-agent reference rates. If agent A sends N messages and agent B never references any of them, that is a coordination failure even if agent B "responded".
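One way to approximate a cross-agent reference rate is sketched below. What counts as a "reference" is an assumption here: a later message from the listener that shares at least `min_overlap` longer terms with the speaker's message. The `Message` type, `key_terms`, and `reference_rate` are hypothetical names for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass
class Message:
    sender: str
    content: str


def key_terms(text: str, min_len: int = 5) -> Set[str]:
    """Lowercased tokens of at least `min_len` characters."""
    tokens = re.findall(r"[a-z0-9_-]+", text.lower())
    return {t for t in tokens if len(t) >= min_len}


def reference_rate(messages: Sequence[Message], speaker: str, listener: str,
                   min_overlap: int = 2) -> Optional[float]:
    """Fraction of the speaker's messages that some later listener
    message references, or None if the speaker sent nothing."""
    sent = 0
    referenced = 0
    for i, msg in enumerate(messages):
        if msg.sender != speaker:
            continue
        sent += 1
        terms = key_terms(msg.content)
        replies = (m for m in messages[i + 1:] if m.sender == listener)
        if any(len(terms & key_terms(r.content)) >= min_overlap
               for r in replies):
            referenced += 1
    return referenced / sent if sent else None


ignored = [
    Message("planner", "Fetch the invoice totals for customer acme-corp"),
    Message("worker", "Sure, starting work now."),
    Message("planner", "Also exclude refunded invoices from the totals"),
    Message("worker", "Done. Task complete."),
]
acknowledged = [
    Message("planner", "Fetch the invoice totals for customer acme-corp"),
    Message("worker", "Invoice totals for acme-corp: 3 invoices, 1,200 USD"),
]
ignored_rate = reference_rate(ignored, "planner", "worker")
acknowledged_rate = reference_rate(acknowledged, "planner", "worker")
```

In the first trace the worker "responds" twice but never picks up any of the planner's terms, so the rate is zero; a rate of zero over several messages is the signal the paragraph above describes.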

Heuristic detectors beat LLM judges on TRAIL

On the TRAIL benchmark, 20 of these heuristic detectors achieve 59.9% joint accuracy at $0 cost, 5x better than the best frontier LLM judge at 11.9%. The structural signatures are sharper than what an LLM picks up.

The full failure taxonomy and detector source are at github.com/Pisama-AI. The pipeline is MIT-licensed.

Originally published on dev.to

More on the failure taxonomy at /taxonomy. Detector benchmarks at /benchmarks/detectors. Framework adapters at /frameworks.