Pisama
Open Source · MIT Licensed

Multi-Agent Failure Detection Taxonomy

Agent evaluation asks “did the agent give a good answer?” Failure detection asks “what went wrong and why?” These are different questions. Pisama answers the second one with 51 production-grade detectors across 4 frameworks.

vs. Frontier LLMs on TRAIL Benchmark

The TRAIL benchmark (148 traces, 841 errors) tests trace-level failure detection. On it, purpose-built heuristic detectors, which run at zero cost, outperform general-purpose LLM reasoning.

Pisama (heuristic)   59.9%
GPT-5.5              11.6%
Claude Opus 4.7       6.7%
Claude Haiku 4.5      5.2%
Gemini 3.5 Flash      2.9%
Grok 4.3              1.7%

The tiered pipeline escalates uncertain cases to LLM judges (Tier 4) for better coverage.

Who&When: Failure Attribution (ICML 2025)

Given a multi-agent trace with a failure, identify which agent failed and at which step.

Method                   Agent Accuracy   Step Accuracy
Pisama + Sonnet 4        60.3%            24.1%
GPT-5.4 Mini             60.3%            22.4%
Gemini 3.1 Flash-Lite    50%              19%
Pisama heuristic-only    31%              16.8%

Source: Who&When: Automated Multi-Agent Failure Attribution (ICML 2025).
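
As a rough illustration of how the two accuracy columns could be computed, the sketch below assumes each trace has exactly one ground-truth failing agent and step, and one prediction. The `Attribution` type and `attribution_accuracy` function are hypothetical names, and scoring the step by index match alone is an assumption rather than the benchmark's confirmed scoring rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: which agent failed, and at which step index."""
    agent: str
    step: int


def attribution_accuracy(predicted, actual):
    """Return (agent_accuracy, step_accuracy) as fractions in [0, 1]."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one trace")
    pairs = list(zip(predicted, actual))
    agent_hits = sum(p.agent == a.agent for p, a in pairs)
    step_hits = sum(p.step == a.step for p, a in pairs)
    return agent_hits / len(pairs), step_hits / len(pairs)


# Made-up example data, not benchmark results.
actual = [
    Attribution("planner", 3),
    Attribution("coder", 7),
    Attribution("critic", 2),
    Attribution("coder", 5),
]
predicted = [
    Attribution("planner", 3),  # agent and step both right
    Attribution("coder", 4),    # right agent, wrong step
    Attribution("coder", 1),    # both wrong
    Attribution("coder", 5),    # both right
]

agent_acc, step_acc = attribution_accuracy(predicted, actual)
print(f"agent accuracy: {agent_acc:.0%}, step accuracy: {step_acc:.0%}")
# prints "agent accuracy: 75%, step accuracy: 50%"
```

Step accuracy is the harder target because a prediction can name the right agent and still miss the step, which is consistent with the gap between the two columns in the table.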

42      Production detectors (F1 ≥ 0.80)
4       Frameworks supported
5       Detection tiers
$0.05   Average cost per trace

Framework-Aware Detection

24 detectors purpose-built for specific frameworks. 22 at production quality. No other tool has framework-aware failure detection.

n8n (5/6 production)
  Timeout analysis        96%
  Resource exhaustion     95%
  Complexity overflow     84%
  Error propagation       81%
  Cycle detection         81%
  Schema validation       77%

LangGraph (6/6 production)
  Recursion depth         98%
  Tool failure cascade    90%
  Parallel sync           87%
  Checkpoint corruption   87%
  State corruption        81%
  Edge misroute           84%

Dify (5/6 production)
  Model fallback          97%
  Schema mismatch         95%
  Iteration escape        95%
  RAG poisoning           93%
  Classifier drift        90%
  Variable leak           87%

OpenClaw (6/6 production)
  Channel mismatch        99%
  Session loop            99%
  Spawn chain             93%
  Elevated risk           89%
  Tool abuse              87%
  Sandbox escape          86%
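
To give a sense of what a framework-aware check can look like, here is a minimal sketch of cycle detection on a workflow graph, in the spirit of the n8n "Cycle detection" entry. It is a generic depth-first search, not Pisama's implementation; the `find_cycle` function and the dictionary-of-lists workflow representation are assumptions made for illustration.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    `graph` maps each node name to the list of nodes it connects to.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge: the cycle runs from nxt along the path and back.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical workflow: "transform" routes back to "fetch".
workflow = {
    "trigger": ["fetch"],
    "fetch": ["transform"],
    "transform": ["fetch", "store"],
    "store": [],
}
print(find_cycle(workflow))  # prints ['fetch', 'transform', 'fetch']
```

The recursive search is fine for workflows with a few hundred nodes; a much larger graph would call for an explicit stack to stay under Python's recursion limit.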

5-Tier Detection Architecture

Fast heuristics handle 90%+ of detections at zero cost. Cases escalate to LLM judges only when needed.

Tier   Method         Latency   Cost
T1     Hash           ~0ms      $0
T2     State Delta    ~1ms      $0
T3     Embeddings     ~10ms     $0
T4     LLM Judge      ~200ms    ~$0.02
T5     Human Review   async     --
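
The escalation idea behind the tiers can be sketched as a chain of checks that runs cheapest-first and stops at the first confident verdict. Everything named here (`Verdict`, `hash_tier`, `llm_judge_tier`, `run_pipeline`, and the shape of the trace dictionary) is hypothetical, and both tiers are simple stand-ins rather than Pisama's detectors.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    failure: bool  # True if the tier believes the trace shows a failure
    tier: str      # which tier produced the verdict


# A tier inspects a trace and returns a Verdict, or None when it is uncertain.
Tier = Callable[[dict], Optional[Verdict]]


def hash_tier(trace: dict) -> Optional[Verdict]:
    """T1 stand-in: flag a trace in which any step repeats exactly."""
    steps = trace["steps"]
    if len(steps) != len(set(steps)):
        return Verdict(failure=True, tier="T1")
    return None  # no exact repeat found, so let a later tier decide


def llm_judge_tier(trace: dict) -> Optional[Verdict]:
    """T4 stand-in: a real system would call a model here (assumption)."""
    mentions_error = any("error" in step for step in trace["steps"])
    return Verdict(failure=mentions_error, tier="T4")


def run_pipeline(trace: dict, tiers: list[Tier]) -> Optional[Verdict]:
    """Try tiers in order and stop at the first confident verdict."""
    for tier in tiers:
        verdict = tier(trace)
        if verdict is not None:
            return verdict
    return None  # nothing was confident: queue for human review (T5)


tiers = [hash_tier, llm_judge_tier]
looping = {"steps": ["search", "search", "search"]}
subtle = {"steps": ["search", "tool error: timeout", "answer"]}

print(run_pipeline(looping, tiers).tier)  # prints "T1"
print(run_pipeline(subtle, tiers).tier)   # prints "T4"
```

Ordering the tiers by cost means the expensive judge only sees traces that the free checks could not settle, which is how the cost per trace stays low.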

Evaluation vs. Detection: Different Problems

Agent Evaluation

Scores output quality against golden datasets. “Is this answer correct?” This capability is commoditizing fast and is now bundled free in AWS Bedrock, Azure AI Foundry, and Google Vertex AI.

Players: Arize, LangSmith, cloud platforms

Failure Detection

Classifies behavioral failure patterns in running multi-agent systems. “Is this agent looping? Corrupting state? Drifting from its persona?”

Player: Pisama

MAST Failure Taxonomy

Based on the MAST: Multi-Agent System Failure Taxonomy (2025). Showing calibrated detectors with published F1 scores.

FC1: Planning Failures

Task specification, decomposition, and workflow design

Detector                  ID              F1
Specification Mismatch    specification   80%
Poor Task Decomposition   decomposition   77%
Flawed Workflow Design    workflow        89%

FC2: Execution Failures

Derailment, withholding, coordination, and communication breakdown

Detector                  ID              F1
Coordination Failure      coordination    75%
Communication Breakdown   communication   77%
Information Withholding   withholding     87%
Context Neglect           context         73%

EXT: Cross-Cutting Detectors

Behavioral patterns across planning, execution, and verification

Detector            ID                  F1
Context Overflow    overflow            77%
Retrieval Quality   retrieval_quality   78%
State Corruption    corruption          79%
Loop Detection      loop                83%
Persona Drift       persona_drift       79%
Hallucination       hallucination       84%
Prompt Injection    injection           92%
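
For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts and compares the result with the 0.80 production threshold quoted earlier on this page. The `f1_score` helper and the counts are illustrative assumptions, not Pisama's evaluation code or data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denom if denom else 0.0


PRODUCTION_F1 = 0.80  # threshold quoted on this page

# Illustrative counts only.
f1 = f1_score(true_positives=80, false_positives=12, false_negatives=18)
print(f"F1 = {f1:.2f}, production: {f1 >= PRODUCTION_F1}")
# prints "F1 = 0.84, production: True"
```

Because F1 penalizes both false alarms and misses, a detector cannot reach the threshold by flagging everything or by flagging almost nothing.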

Start Detecting Failures

Install with pip install pisama, then analyze your first trace in 30 seconds.

© 2026 Pisama. All rights reserved.