Multi-Agent Failure Detection Taxonomy
Agent evaluation asks “did the agent give a good answer?” Failure detection asks “what went wrong and why?” These are different questions. Pisama answers the second one with 51 production-grade detectors across 4 frameworks.
Pisama vs. Frontier LLMs on the TRAIL Benchmark
The TRAIL benchmark (148 traces, 841 errors) tests trace-level failure detection. Purpose-built heuristic detectors outperform general-purpose LLM reasoning at zero cost.
The tiered pipeline escalates uncertain cases to LLM judges (Tier 4) for better coverage.
Who&When: Failure Attribution (ICML 2025)
Given a multi-agent trace with a failure, identify which agent failed and at which step.
| Method | Agent Accuracy | Step Accuracy |
|---|---|---|
| Pisama + Sonnet 4 | 60.3% | 24.1% |
| GPT-5.4 Mini | 60.3% | 22.4% |
| Gemini 3.1 Flash-Lite | 50% | 19% |
| Pisama heuristic-only | 31% | 16.8% |
Source: Who&When: Automated Multi-Agent Failure Attribution (ICML 2025).
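To make the two columns concrete, here is a minimal sketch of how agent accuracy and step accuracy could be computed from labeled traces. It assumes one ground-truth (agent, step) pair per trace and exact-match scoring on each field. The `Attribution` type and `attribution_accuracy` function are hypothetical names for illustration, not part of Pisama or the benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: which agent failed, and at which step, for one trace."""
    agent: str
    step: int


def attribution_accuracy(predictions, ground_truth):
    """Return (agent_accuracy, step_accuracy) as fractions in [0, 1].

    Assumption: a prediction counts as correct on a field only when it
    matches the label exactly.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one labeled trace")
    pairs = list(zip(predictions, ground_truth))
    agent_hits = sum(p.agent == g.agent for p, g in pairs)
    step_hits = sum(p.step == g.step for p, g in pairs)
    return agent_hits / len(pairs), step_hits / len(pairs)
```

Step accuracy is the harder metric because a method can name the right agent and still miss the decisive step, which is consistent with the gap between the two columns in the table.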
Framework-Aware Detection
24 detectors purpose-built for specific frameworks. 22 at production quality. No other tool has framework-aware failure detection.
5-Tier Detection Architecture
Fast heuristics handle 90%+ of detections at zero cost. Cases escalate to LLM judges only when the cheaper tiers are uncertain.
| Tier | Method | Latency | Cost |
|---|---|---|---|
| T1 | Hash | ~0ms | $0 |
| T2 | State Delta | ~1ms | $0 |
| T3 | Embeddings | ~10ms | $0 |
| T4 | LLM Judge | ~200ms | ~$0.02 |
| T5 | Human Review | async | -- |
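The escalation idea in the table can be sketched as a chain of detectors tried cheapest first. This is an illustrative sketch under stated assumptions, not Pisama's implementation: the `Verdict` type, the `run_tiers` function, the 0.8 confidence threshold, and both stub tiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Verdict:
    tier: str               # which tier produced the verdict, e.g. "T1"
    failed: Optional[bool]  # None means undecided
    confidence: float       # 0.0 to 1.0


# A tier inspects a trace and returns a Verdict, or None to abstain.
Tier = Callable[[dict], Optional[Verdict]]


def run_tiers(trace: dict, tiers: Sequence[Tier], threshold: float = 0.8) -> Verdict:
    """Try tiers in order and stop at the first sufficiently confident verdict."""
    for detect in tiers:
        verdict = detect(trace)
        if verdict is not None and verdict.confidence >= threshold:
            return verdict
    # No automated tier was confident enough: queue for human review.
    return Verdict(tier="T5", failed=None, confidence=0.0)


def hash_tier(trace: dict) -> Optional[Verdict]:
    """Stub T1: flag a failure when any step string repeats exactly."""
    steps = trace["steps"]
    if len(set(steps)) < len(steps):
        return Verdict("T1", True, 1.0)
    return None  # nothing conclusive, so escalate


def llm_judge_tier(trace: dict) -> Optional[Verdict]:
    """Stub T4: stands in for a paid LLM call with a fixed low-confidence guess."""
    return Verdict("T4", False, 0.5)
```

Ordering the tiers by cost means the expensive judge is only reached by traces that the cheap checks could not settle, which is what keeps the average cost per trace low.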
Evaluation vs. Detection: Different Problems
Evaluation: Scores output quality against golden datasets. “Is this answer correct?” Commoditizing fast, and now bundled free in AWS Bedrock, Azure AI Foundry, and Google Vertex AI.
Detection: Classifies behavioral failure patterns in running multi-agent systems. “Is this agent looping? Corrupting state? Drifting from its persona?”
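As one example of a behavioral check, the "is this agent looping?" question can be approximated by fingerprinting each step and counting exact repeats. The sketch below is an assumption-laden illustration, not Pisama's detector: the step schema (`agent`, `action`, `args`), the function names, and the default of three repeats are all hypothetical.

```python
import hashlib
import json
from collections import Counter


def step_fingerprint(step: dict) -> str:
    """Hash a step so that identical content gives an identical fingerprint.

    sort_keys makes the fingerprint independent of dictionary key order.
    """
    canonical = json.dumps(step, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_loops(steps, min_repeats=3):
    """Return each distinct step that occurs at least min_repeats times."""
    fingerprints = [step_fingerprint(s) for s in steps]
    counts = Counter(fingerprints)
    seen = set()
    looping = []
    for step, fp in zip(steps, fingerprints):
        if counts[fp] >= min_repeats and fp not in seen:
            seen.add(fp)
            looping.append(step)
    return looping
```

Exact-match hashing only catches verbatim repetition. A loop in which the agent rephrases the same call each time would need a fuzzier comparison, such as the embedding tier.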
MAST Failure Taxonomy
Based on the MAST: Multi-Agent System Failure Taxonomy (2025). Showing calibrated detectors with published F1 scores.
FC1: Planning Failures
Task specification, decomposition, and workflow design
Detectors: specification, decomposition, workflow
FC2: Execution Failures
Derailment, withholding, coordination, and communication breakdown
Detectors: coordination, communication, withholding, context
EXT: Cross-Cutting Detectors
Behavioral patterns across planning, execution, and verification
Detectors: overflow, retrieval_quality, corruption, loop, persona_drift, hallucination, injection
Start Detecting Failures
pip install pisama. Analyze your first trace in 30 seconds.