80 calibrated multi-agent failure detectors
77 in the published scoreboard, plus 3 Q2 2026 additions calibrated against Pisama-Bench v0 (sycophancy, consensus_collapse, specification_compliance). Per-detector precision, recall, and F1, all published. Calibrated on a 14,665-trace dataset held out against Pisama-Bench v0 and the TRAIL benchmark. No surveyed competitor publishes per-detector calibration.
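For readers who want to check a published row, the three per-detector numbers are tied together by the standard definitions. The sketch below is only an illustration of those definitions; the function names and the example counts are hypothetical and are not taken from the Pisama code or dataset.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts for one detector on a labelled eval split.
precision, recall, f1 = prf_from_counts(tp=8, fp=2, fn=2)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A detector that never fires correctly (zero true positives) gets an F1 of 0.000 under this convention.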
Benchmark
On the TRAIL benchmark (Patronus, 2025), twenty Pisama heuristic detectors achieve 59.9% joint accuracy at zero LLM cost. The best frontier judge scores 11.6%. That is roughly a 5x lead (59.9 / 11.6 ≈ 5.2) from structural detection.
See Pisama vs Patronus for the side-by-side and the full scoreboard for sortable per-detector metadata (precision, recall, sample count, latency, tier, mode).
Multi-agent core · 13
- injection · F1 0.965 · Prompt injection or instruction override patterns in input.
- withholding · F1 0.957 · Agent has answer in working state but does not surface it.
- corruption · F1 0.869 · Shared state schema or types mutate mid-run.
- hallucination · F1 0.866 · Output contains content unsupported by sources.
- context · F1 0.838 · Upstream context entities never referenced downstream.
- coordination · F1 0.794 · Cross-agent entity reference rate below threshold.
- persona_drift · F1 0.772 · Agent output diverges from declared role, tone, or scope.
- completion · F1 0.739 · Subtask coverage falls short of declared success criteria.
- loop · F1 0.732 · State recurrence within a turn window.
- derailment · F1 0.633 · Agent output drifts from task definition mid-run.
- communication · F1 0.629 · Agent A sends information; agent B never acknowledges or references it.
- overflow · F1 0.588 · Context-window exhaustion or unsafe truncation.
- workflow · F1 0.433 · Execution diverges from declared workflow definition.
Retrieval and grounding · 9
- entity_confusion · F1 0.968 · Entities conflated across retrieval results.
- citation · F1 0.959 · Claims with citations point to unsupporting sources.
- rag_poisoning · F1 0.940 · Adversarial content in retrieval corpus affecting output.
- propagation · F1 0.899 · Errors in retrieved context propagate through downstream agents.
- retrieval_quality · F1 0.858 · Retrieved documents do not cover query intent.
- context_precision · F1 0.789 · Precision of retrieved context against query intent.
- chunk_relevance · F1 0.765 · Retrieved chunks irrelevant to the query.
- grounding · F1 0.758 · Output entities lack source attribution.
- chunk_attribution · F1 0.000 · Source chunks for cited claims mis-attributed or missing.
Reasoning and planning · 7
- multi_chain · F1 0.889 · Multiple reasoning chains contradict without resolution.
- adaptive_thinking · F1 0.885 · Reasoning depth fails to scale with problem difficulty.
- parallel_consistency · F1 0.860 · Parallel branches produce inconsistent results.
- convergence · F1 0.843 · Metric-aware: trajectory slope, regression, divergence on cost or accuracy.
- reasoning_consistency · F1 0.820 · Conclusions vary across re-runs on identical input.
- specification · F1 0.757 · Agent interprets task differently from user intent.
- decomposition · F1 0.602 · Planner breaks task into subtasks that miss requirements.
Safety and control · 6
- critic_quality · F1 0.966 · Critic agent fails to catch actor errors.
- computer_use · F1 0.960 · Computer-use agent performs destructive UI action.
- subagent_boundary · F1 0.894 · Subagent operates outside declared capability scope.
- cowork_safety · F1 0.800 · Cross-agent action triggers unsafe combined effect.
- exploration_safety · F1 0.727 · Agent explores state spaces beyond declared safety boundary.
- approval_bypass · F1 0.559 · Human-in-the-loop approval step skipped or spoofed.
Memory and context · 2
- compaction_quality · F1 0.844 · Context compaction loses load-bearing information.
- memory_staleness · F1 0.819 · Long-term memory references stale facts about user or world.
Orchestration · 7
- task_starvation · F1 0.980 · Queued tasks indefinitely deferred by higher-priority work.
- scheduled_task · F1 0.930 · Scheduled-task agent fires at wrong time or skips.
- delegation · F1 0.895 · Delegation chain loses context or authority.
- orchestration_quality · F1 0.880 · Top-level orchestrator dispatches to wrong specialist agent.
- dispatch_async · F1 0.827 · Asynchronous dispatch deadlock or starvation.
- routing · F1 0.779 · Request routed to wrong specialist or wrong model.
- model_selection · F1 0.675 · Model selected for task mismatches required capability.
LangGraph · 5
- langgraph_tool_failure · F1 0.920 · Tool node returns error or unexpected schema.
- langgraph_parallel_sync · F1 0.913 · Parallel branches merge with inconsistent state.
- langgraph_state_corruption · F1 0.846 · StateGraph mutation violates schema between nodes.
- langgraph_edge_misroute · F1 0.793 · Conditional edge routes to wrong node.
- langgraph_checkpoint_corruption · F1 0.640 · Checkpoint reload reconstructs invalid graph state.
n8n · 3
- n8n_resource · F1 0.909 · n8n node accesses or mutates an unauthorized resource.
- n8n_complexity · F1 0.876 · n8n workflow complexity exceeds maintainable threshold.
- n8n_error · F1 0.844 · n8n execution error pattern matched to known failure mode.
Dify · 2
- dify_variable_leak · F1 0.923 · Dify variable scope leak between agent runs.
- dify_classifier_drift · F1 0.900 · Dify classifier output drifts from declared categories.
OpenClaw · 5
- openclaw_channel_mismatch · F1 0.993 · OpenClaw inter-agent channel sender or receiver mis-bound.
- openclaw_spawn_chain · F1 0.933 · Agent spawn chain exceeds depth or fan-out limit.
- openclaw_elevated_risk · F1 0.886 · OpenClaw session enters elevated-risk control state.
- openclaw_sandbox_escape · F1 0.883 · Sandbox isolation boundary violation in OpenClaw runtime.
- openclaw_tool_abuse · F1 0.882 · OpenClaw tool invocation pattern matches abuse signature.
Protocols · 1
- mcp_protocol · F1 0.990 · MCP message schema violation or unsafe tool capability.
Other · 17
- impersonation_risk · F1 1.000 · Categorization in next refresh.
- openclaw_session_loop · F1 0.985 · Categorization in next refresh.
- over_refusal · F1 0.971 · Categorization in next refresh.
- consensus_collapse · F1 0.967 · Categorization in next refresh.
- specification_compliance · F1 0.966 · Categorization in next refresh.
- deception · F1 0.966 · Categorization in next refresh.
- under_refusal · F1 0.964 · Categorization in next refresh.
- reward_hacking · F1 0.939 · Categorization in next refresh.
- sycophancy · F1 0.902 · Categorization in next refresh.
- jailbreak_compliance · F1 0.897 · Categorization in next refresh.
- planning_fallacy · F1 0.892 · Categorization in next refresh.
- scope_escalation · F1 0.885 · Categorization in next refresh.
- role_usurpation_exec · F1 0.870 · Categorization in next refresh.
- multi_agent_contagion · F1 0.769 · Categorization in next refresh.
- authority_gradient · F1 0.679 · Categorization in next refresh.
- role_usurpation_canonical · F1 0.425 · Categorization in next refresh.
- role_usurpation · F1 0.226 · Categorization in next refresh.
CAIS 2026 additions · 3
Three detectors shipped in the Q2 2026 release on 2026-04-30, after the scoreboard snapshot above. F1 numbers reported against Pisama-Bench v0; full scoreboard integration in the next refresh.
- sycophancy · Agent uncritically agrees with user assertions against evidence. · Q2 2026 ship. Calibration F1 0.902 on Pisama-Bench v0. Next scoreboard refresh.
- consensus_collapse · Multi-agent debate amplifies rather than corrects errors. · Q2 2026 ship. Calibration F1 0.967 on Pisama-Bench v0. Next scoreboard refresh.
- specification_compliance · AgentPex pattern (Sharma et al., 2026). Extracts behavioral rules from system prompts and checks trace for compliance. · Q2 2026 ship under feature flag. Re-calibrating against τ2-bench traces from AgentPex paper.
Methodology
Each detector ships with a calibrated F1 on Pisama-Bench v0, a held-out set of multi-agent traces with detector-level ground-truth labels. Calibration uses an 80/20 train/eval split with per-difficulty stratification (easy / medium / hard) per the Anthropic Demystifying Evals methodology.
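An 80/20 split stratified by difficulty could look roughly like the sketch below. It is not the Pisama calibration code: the `stratified_split` name, the `difficulty` field on each trace, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(traces, eval_fraction=0.2, seed=0):
    """Split traces into (train, eval), holding out eval_fraction of each difficulty bucket."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_difficulty = defaultdict(list)
    for trace in traces:
        by_difficulty[trace["difficulty"]].append(trace)

    train, held_out = [], []
    for difficulty in sorted(by_difficulty):
        bucket = list(by_difficulty[difficulty])
        rng.shuffle(bucket)
        n_eval = round(len(bucket) * eval_fraction)
        held_out.extend(bucket[:n_eval])
        train.extend(bucket[n_eval:])
    return train, held_out


# Hypothetical traces: only an id and a difficulty label.
traces = (
    [{"id": f"e{i}", "difficulty": "easy"} for i in range(50)]
    + [{"id": f"m{i}", "difficulty": "medium"} for i in range(30)]
    + [{"id": f"h{i}", "difficulty": "hard"} for i in range(20)]
)
train, held_out = stratified_split(traces)
print(len(train), len(held_out))
```

Splitting inside each bucket keeps the easy / medium / hard mix the same in both halves, so an F1 measured on the eval half is not skewed toward one difficulty level.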
The scoreboard above is dated 2026-05-26. We recalibrate per sprint and version the report. Sycophancy and consensus_collapse shipped in the Q2 CAIS deploy on 2026-04-30. Specification compliance, an implementation of the AgentPex pattern (Sharma et al., 2026), ships under a feature flag and is being re-calibrated against the τ2-bench traces from the AgentPex paper.
Detection is tiered. Tier 1 hash and delta detectors run at zero cost in under ten milliseconds. Tier 2 embedding detectors run at near-zero cost. Tier 3 LLM judges handle genuinely ambiguous cases. Tier 4 human review handles the residue. Ninety percent of detections resolve at tiers 1 through 3.
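The tiering above can be read as a cascade in which each tier either decides or defers to the next. The following is a minimal sketch under stated assumptions, not Pisama's implementation: `run_cascade`, `tier1_hash_recurrence`, `tier2_token_overlap`, the trace shape, and the 0.8 / 0.2 thresholds are all hypothetical, and token-set Jaccard overlap stands in for a real embedding comparison.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A tier returns True (failure), False (no failure), or None (cannot decide, escalate).
TierFn = Callable[[dict], Optional[bool]]


@dataclass
class Verdict:
    tier: int                 # tier that decided, or the first tier left unhandled
    failure: Optional[bool]   # None means the trace still needs a higher tier


def tier1_hash_recurrence(trace: dict) -> Optional[bool]:
    """Hypothetical tier 1: flag an exact repeat of any state via its hash."""
    seen = set()
    for state in trace["states"]:
        digest = hashlib.sha256(state.encode("utf-8")).hexdigest()
        if digest in seen:
            return True
        seen.add(digest)
    return None  # no exact repeat; let a costlier tier look


def tier2_token_overlap(trace: dict, high: float = 0.8, low: float = 0.2) -> Optional[bool]:
    """Hypothetical tier 2: Jaccard overlap as a stand-in for embedding similarity."""
    token_sets = [set(state.split()) for state in trace["states"]]
    best = 0.0
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            union = token_sets[i] | token_sets[j]
            if union:
                best = max(best, len(token_sets[i] & token_sets[j]) / len(union))
    if best >= high:
        return True
    if best <= low:
        return False
    return None  # ambiguous band; escalate


def run_cascade(trace: dict, tiers: Sequence[TierFn]) -> Verdict:
    """Run tiers cheapest-first and stop at the first one that decides."""
    for index, check in enumerate(tiers, start=1):
        decision = check(trace)
        if decision is not None:
            return Verdict(tier=index, failure=decision)
    return Verdict(tier=len(tiers) + 1, failure=None)


tiers = [tier1_hash_recurrence, tier2_token_overlap]
print(run_cascade({"states": ["plan step", "call tool", "plan step"]}, tiers))
```

In this sketch an LLM judge would be appended as a third callable, and a verdict whose `failure` is still `None` after the last automated tier would go to human review.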
Detector source is open at github.com/Pisama-AI under the MIT license. The calibration dataset is published with the detector code.
What this enables
- Pick detectors by tier-gated F1 thresholds (production, beta, experimental) rather than running everything blindly.
- Audit calibration: every F1 is reproducible against the published dataset.
- Route at the right cost tier: a detector with F1 above 0.95 on hash recurrence does not need an LLM judge fallback.
- Compare detectors across releases: per-detector F1 lets you spot regressions before they ship.
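The first and last bullets can be sketched in a few lines. The code below is an illustration only: the `bucket_by_f1` and `regressions` helpers, the 0.80 / 0.60 cut-offs for production and beta, and the 0.01 tolerance are assumptions, not published Pisama thresholds. The three F1 values are copied from the scoreboard above.

```python
def bucket_by_f1(scores, production=0.80, beta=0.60):
    """Group detector names into production / beta / experimental by F1 cut-off."""
    buckets = {"production": [], "beta": [], "experimental": []}
    for name, f1 in sorted(scores.items()):
        if f1 >= production:
            buckets["production"].append(name)
        elif f1 >= beta:
            buckets["beta"].append(name)
        else:
            buckets["experimental"].append(name)
    return buckets


def regressions(previous, current, tolerance=0.01):
    """Return {name: (old_f1, new_f1)} for detectors whose F1 dropped by more than tolerance."""
    return {
        name: (previous[name], f1)
        for name, f1 in current.items()
        if name in previous and previous[name] - f1 > tolerance
    }


# F1 values taken from the scoreboard above.
scores = {"injection": 0.965, "loop": 0.732, "workflow": 0.433}
print(bucket_by_f1(scores))
```

Running `regressions` on two versioned scoreboard snapshots before a release would surface any detector whose F1 fell beyond the chosen tolerance.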