Open Source · MIT Licensed

Multi-Agent Failure Detection Taxonomy

Name: Pisama
Author: Pisama

Agent evaluation asks “did the agent give a good answer?” Failure detection asks “what went wrong and why?” These are different questions. Pisama answers the second one with 87 detectors across 4 frameworks, 6 externally validated at production grade.

vs. Frontier LLMs on TRAIL Benchmark

The TRAIL benchmark (148 traces, 841 errors) tests trace-level failure detection. Purpose-built heuristic detectors outperform general-purpose LLM reasoning at zero cost.

System	Score
Pisama (heuristic)	59.9%
GPT-5.5	11.6%
Claude Opus 4.7	6.7%
Claude Haiku 4.5	5.2%
Gemini 3.5 Flash	2.9%
Grok 4.3	1.7%

The tiered pipeline escalates uncertain cases to LLM judges (Tier 4) for better coverage.

Who&When: Failure Attribution (ICML 2025)

Given a multi-agent trace with a failure, identify which agent failed and at which step.

Method	Agent Accuracy	Step Accuracy
Pisama + Sonnet 4	60.3%	24.1%
GPT-5.4 Mini	60.3%	22.4%
Gemini 3.1 Flash-Lite	50%	19%
Pisama heuristic-only	31%	16.8%

Source: Who&When: Automated Multi-Agent Failure Attribution (ICML 2025).
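The two columns in the table above can be illustrated with a small scoring function. This is a minimal sketch, not the benchmark's official scorer: the `Attribution` record and `attribution_accuracy` function are hypothetical names, and it assumes a step only counts as correct when the agent is also correct, which the published scorer may or may not require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: which agent failed, and at which step."""
    agent: str
    step: int


def attribution_accuracy(predicted, actual):
    """Return (agent_accuracy, step_accuracy) as fractions in [0, 1]."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one labelled trace")
    pairs = list(zip(predicted, actual))
    # Agent accuracy: the blamed agent matches the labelled agent.
    agent_hits = sum(p.agent == a.agent for p, a in pairs)
    # Assumption: a step hit requires both agent and step index to match.
    step_hits = sum(p == a for p, a in pairs)
    n = len(actual)
    return agent_hits / n, step_hits / n
```

Step accuracy can never exceed agent accuracy under this assumption, which is consistent with the gap between the two columns in the table.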

6

Externally validated at production grade

4

Frameworks supported

5

Detection tiers

$0.05

Average cost per trace

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14).
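The numeric part of that bar can be written as a simple check. The sketch below is an illustration under stated assumptions, not Pisama's capability registry: `precision_recall_f1` and `meets_numeric_bar` are hypothetical names, and only the three numeric criteria (F1, precision, trace count) are modelled. The external-grounded thresholds and per-difficulty checks are left out.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def meets_numeric_bar(tp, fp, fn, external_traces):
    """Check only the three numeric criteria quoted above.

    Hypothetical helper: F1 >= 0.80, precision >= 0.70,
    and at least 30 external traces.
    """
    precision, _, f1 = precision_recall_f1(tp, fp, fn)
    return f1 >= 0.80 and precision >= 0.70 and external_traces >= 30
```

For example, 40 true positives, 5 false positives, and 10 false negatives over 50 external traces passes the numeric bar, while the same counts over 20 traces would not.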

Framework-Aware Detection

Detectors purpose-built for specific frameworks, calibrated per platform. No other tool has framework-aware failure detection.

n8n (6 detectors)

Detector	Score
Timeout analysis	96%
Resource exhaustion	95%
Complexity overflow	84%
Error propagation	81%
Cycle detection	81%
Schema validation	77%

LangGraph (6 detectors)

Detector	Score
Recursion depth	98%
Tool failure cascade	90%
Parallel sync	87%
Checkpoint corruption	87%
State corruption	81%
Edge misroute	84%

Dify (6 detectors)

Detector	Score
Model fallback	97%
Schema mismatch	95%
Iteration escape	95%
RAG poisoning	93%
Classifier drift	90%
Variable leak	87%

OpenClaw (6 detectors)

Detector	Score
Channel mismatch	99%
Session loop	99%
Spawn chain	93%
Elevated risk	89%
Tool abuse	87%
Sandbox escape	86%

5-Tier Detection Architecture

Fast heuristics (T1 to T3) handle more than 90% of detections at zero cost. Cases escalate to an LLM judge only when the heuristics are inconclusive.

Tier	Method	Latency	Cost	Description
T1	Hash	~0ms	$0	Exact match, structural fingerprints
T2	State Delta	~1ms	$0	Diff analysis, transition validation
T3	Embeddings	~10ms	$0	Semantic similarity, drift measurement
T4	LLM Judge	~200ms	~$0.02	Claude-based reasoning for ambiguous cases
T5	Human Review	async	--	Dashboard escalation for novel patterns
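The escalation order in the table can be sketched as a cascade in which each tier either returns a confident verdict or passes the trace on. This is an illustrative sketch only: `run_cascade`, `t1_hash`, `t2_state_delta`, and the trace layout (a dict with a `steps` list, each step carrying an `action` and a `state`) are all assumptions, not Pisama's API. The embedding and LLM tiers are omitted; they would slot into the same list.

```python
import hashlib
import json
from typing import Callable, Optional, Sequence, Tuple

# A tier returns True/False when confident, or None to escalate.
Tier = Callable[[dict], Optional[bool]]


def t1_hash(trace: dict) -> Optional[bool]:
    """Hypothetical T1: flag a failure if any step repeats exactly."""
    seen = set()
    for step in trace["steps"]:
        payload = json.dumps(step, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen:
            return True
        seen.add(digest)
    return None  # no exact repeat: inconclusive, so escalate


def t2_state_delta(trace: dict) -> Optional[bool]:
    """Hypothetical T2: flag a failure if a step drops a state key."""
    steps = trace["steps"]
    for prev, cur in zip(steps, steps[1:]):
        if set(prev["state"]) - set(cur["state"]):
            return True
    return None


def run_cascade(trace: dict, tiers: Sequence[Tuple[str, Tier]]):
    """Try tiers cheapest first; stop at the first confident verdict."""
    for name, tier in tiers:
        verdict = tier(trace)
        if verdict is not None:
            return name, verdict
    # No automated tier was confident: hand off to human review.
    return "T5", None


TIERS = [("T1", t1_hash), ("T2", t2_state_delta)]
```

Ordering the list by cost means the expensive tiers only run on the traces that the cheap ones could not decide.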

Evaluation vs. Detection: Different Problems

Agent Evaluation

Scores output quality against golden datasets: “Is this answer correct?” It is commoditizing fast and is now bundled free in AWS Bedrock, Azure AI Foundry, and Google Vertex AI.

Players: Arize, LangSmith, cloud platforms

Failure Detection

Classifies behavioral failure patterns in running multi-agent systems. “Is this agent looping? Corrupting state? Drifting from its persona?”

Player: Pisama
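To make the "Is this agent looping?" question concrete, here is a minimal heuristic sketch. It is not Pisama's loop detector: `find_trailing_cycle` is a hypothetical function that assumes a trace has already been reduced to a list of action names, and it only looks for a cycle repeated back to back at the end of that list.

```python
def find_trailing_cycle(actions, min_repeats=3):
    """Return the shortest action cycle that ends the trace, or None.

    A cycle of length k is reported when the last k * min_repeats
    actions consist of the same k actions repeated min_repeats times.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    actions = list(actions)
    n = len(actions)
    for k in range(1, n // min_repeats + 1):
        cycle = actions[-k:]
        window = actions[-k * min_repeats:]
        if window == cycle * min_repeats:
            return cycle
    return None
```

A behavioral check like this inspects what the agent did over time, whereas an output-quality score only sees the final answer.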

MAST Failure Taxonomy

Based on the MAST: Multi-Agent System Failure Taxonomy (2025). Showing calibrated detectors with published F1 scores.

FC1: Planning Failures

Task specification, decomposition, and workflow design

Failure mode	Detector	F1
Specification Mismatch	specification	80%
Poor Task Decomposition	decomposition	77%
Flawed Workflow Design	workflow	89%

FC2: Execution Failures

Derailment, withholding, coordination, and communication breakdown

Failure mode	Detector	F1
Coordination Failure	coordination	75%
Communication Breakdown	communication	77%
Information Withholding	withholding	87%
Context Neglect	context	73%

EXT: Cross-Cutting Detectors

Behavioral patterns across planning, execution, and verification

Failure mode	Detector	F1
Context Overflow	overflow	77%
Retrieval Quality	retrieval_quality	78%
State Corruption	corruption	79%
Loop Detection	loop	83%
Persona Drift	persona_drift	79%
Hallucination	hallucination	84%
Prompt Injection	injection	92%

Start Detecting Failures

pip install pisama. Analyze your first trace in 30 seconds.

