Detectors

80 calibrated multi-agent failure detectors

80 detectors in total: 77 in the published scoreboard, plus 3 Q2 2026 additions calibrated against Pisama-Bench v0 (sycophancy, consensus_collapse, specification_compliance). Precision, recall, and F1 are published per detector. All are calibrated on a 14,665-trace dataset held out against Pisama-Bench v0 and the TRAIL benchmark. No surveyed competitor publishes per-detector calibration.

  • Detectors: 80
  • Mean F1: 0.82
  • Production tier: 53
  • Calibration set: 14,665

Benchmark

On the TRAIL benchmark (Patronus, 2025), twenty Pisama heuristic detectors achieve 59.9% joint accuracy at zero LLM cost. The best frontier judge scores 11.6%. That is roughly a 5x lead from structural detection.

See Pisama vs Patronus for the side-by-side and the full scoreboard for sortable per-detector metadata (precision, recall, sample count, latency, tier, mode).

Multi-agent core · 13

  • injection
    0.965
    Prompt injection or instruction override patterns in input.
  • withholding
    0.957
    Agent has answer in working state but does not surface it.
  • corruption
    0.869
    Shared state schema or types mutate mid-run.
  • hallucination
    0.866
    Output contains content unsupported by sources.
  • context
    0.838
    Upstream context entities never referenced downstream.
  • coordination
    0.794
    Cross-agent entity reference rate below threshold.
  • persona_drift
    0.772
    Agent output diverges from declared role, tone, or scope.
  • completion
    0.739
    Subtask coverage falls short of declared success criteria.
  • loop
    0.732
    State recurrence within a turn window.
  • derailment
    0.633
    Agent output drifts from task definition mid-run.
  • communication
    0.629
    Agent A sends information; agent B never acknowledges or references it.
  • overflow
    0.588
    Context-window exhaustion or unsafe truncation.
  • workflow
    0.433
    Execution diverges from declared workflow definition.
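
The loop entry above describes state recurrence within a turn window. A minimal sketch of that idea follows, assuming each turn exposes a JSON-serializable state dict. The function name `detect_loop`, the default window of 5 turns, and the SHA-256 hashing are illustrative assumptions, not the shipped detector.

```python
import hashlib
import json
from collections import deque


def detect_loop(states, window=5):
    """Return the index of the first turn whose state repeats one of the
    previous `window` states, or None if no recurrence is found.

    `states` is assumed to be a list of JSON-serializable dicts (hypothetical
    trace format, one entry per turn).
    """
    recent = deque(maxlen=window)  # hashes of the last `window` turns
    for index, state in enumerate(states):
        # Canonical JSON so key order does not change the hash.
        payload = json.dumps(state, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest in recent:
            return index
        recent.append(digest)
    return None
```

Hashing keeps the check cheap: comparing fixed-size digests avoids deep comparisons of full state objects on every turn.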

Retrieval and grounding · 9

  • entity_confusion
    0.968
    Entities conflated across retrieval results.
  • citation
    0.959
    Claims with citations point to unsupporting sources.
  • rag_poisoning
    0.940
    Adversarial content in retrieval corpus affecting output.
  • propagation
    0.899
    Errors in retrieved context propagate through downstream agents.
  • retrieval_quality
    0.858
    Retrieved documents do not cover query intent.
  • context_precision
    0.789
    Precision of retrieved context against query intent.
  • chunk_relevance
    0.765
    Retrieved chunks irrelevant to the query.
  • grounding
    0.758
    Output entities lack source attribution.
  • chunk_attribution
    0.000
    Source chunks for cited claims mis-attributed or missing.

Reasoning and planning · 7

  • multi_chain
    0.889
    Multiple reasoning chains contradict without resolution.
  • adaptive_thinking
    0.885
    Reasoning depth fails to scale with problem difficulty.
  • parallel_consistency
    0.860
    Parallel branches produce inconsistent results.
  • convergence
    0.843
    Metric-aware: trajectory slope, regression, divergence on cost or accuracy.
  • reasoning_consistency
    0.820
    Conclusions vary across re-runs on identical input.
  • specification
    0.757
    Agent interprets task differently from user intent.
  • decomposition
    0.602
    Planner breaks task into subtasks that miss requirements.
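
The convergence entry above mentions trajectory slope on cost or accuracy. One way to read that is a least-squares slope over a per-step metric, sketched below. The function names and the zero-slope cutoff are assumptions for illustration; the actual detector may weigh regression and divergence differently.

```python
def trajectory_slope(values):
    """Ordinary least-squares slope of `values` against step index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_diverging(values, higher_is_better=True):
    """Flag a trajectory that trends the wrong way.

    For accuracy-like metrics (higher is better) a negative slope is flagged;
    for cost-like metrics (lower is better) a positive slope is flagged.
    """
    slope = trajectory_slope(values)
    return slope < 0 if higher_is_better else slope > 0
```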

Safety and control · 6

  • critic_quality
    0.966
    Critic agent fails to catch actor errors.
  • computer_use
    0.960
    Computer-use agent performs destructive UI action.
  • subagent_boundary
    0.894
    Subagent operates outside declared capability scope.
  • cowork_safety
    0.800
    Cross-agent action triggers unsafe combined effect.
  • exploration_safety
    0.727
    Agent explores state spaces beyond declared safety boundary.
  • approval_bypass
    0.559
    Human-in-the-loop approval step skipped or spoofed.

Memory and context · 2

  • compaction_quality
    0.844
    Context compaction loses load-bearing information.
  • memory_staleness
    0.819
    Long-term memory references stale facts about user or world.

Orchestration · 7

  • task_starvation
    0.980
    Queued tasks indefinitely deferred by higher-priority work.
  • scheduled_task
    0.930
    Scheduled-task agent fires at wrong time or skips.
  • delegation
    0.895
    Delegation chain loses context or authority.
  • orchestration_quality
    0.880
    Top-level orchestrator dispatches to wrong specialist agent.
  • dispatch_async
    0.827
    Asynchronous dispatch deadlock or starvation.
  • routing
    0.779
    Request routed to wrong specialist or wrong model.
  • model_selection
    0.675
    Model selected for task mismatches required capability.

LangGraph · 5

  • langgraph_tool_failure
    0.920
    Tool node returns error or unexpected schema.
  • langgraph_parallel_sync
    0.913
    Parallel branches merge with inconsistent state.
  • langgraph_state_corruption
    0.846
    StateGraph mutation violates schema between nodes.
  • langgraph_edge_misroute
    0.793
    Conditional edge routes to wrong node.
  • langgraph_checkpoint_corruption
    0.640
    Checkpoint reload reconstructs invalid graph state.

n8n · 3

  • n8n_resource
    0.909
    n8n node accesses or mutates an unauthorized resource.
  • n8n_complexity
    0.876
    n8n workflow complexity exceeds maintainable threshold.
  • n8n_error
    0.844
    n8n execution error pattern matched to known failure mode.

Dify · 2

  • dify_variable_leak
    0.923
    Dify variable scope leak between agent runs.
  • dify_classifier_drift
    0.900
    Dify classifier output drifts from declared categories.

OpenClaw · 5

  • openclaw_channel_mismatch
    0.993
    OpenClaw inter-agent channel sender or receiver mis-bound.
  • openclaw_spawn_chain
    0.933
    Agent spawn chain exceeds depth or fan-out limit.
  • openclaw_elevated_risk
    0.886
    OpenClaw session enters elevated-risk control state.
  • openclaw_sandbox_escape
    0.883
    Sandbox isolation boundary violation in OpenClaw runtime.
  • openclaw_tool_abuse
    0.882
    OpenClaw tool invocation pattern matches abuse signature.

Protocols · 1

  • mcp_protocol
    0.990
    MCP message schema violation or unsafe tool capability.

Other · 17

  • impersonation_risk
    1.000
    Categorization in next refresh.
  • openclaw_session_loop
    0.985
    Categorization in next refresh.
  • over_refusal
    0.971
    Categorization in next refresh.
  • consensus_collapse
    0.967
    Categorization in next refresh.
  • specification_compliance
    0.966
    Categorization in next refresh.
  • deception
    0.966
    Categorization in next refresh.
  • under_refusal
    0.964
    Categorization in next refresh.
  • reward_hacking
    0.939
    Categorization in next refresh.
  • sycophancy
    0.902
    Categorization in next refresh.
  • jailbreak_compliance
    0.897
    Categorization in next refresh.
  • planning_fallacy
    0.892
    Categorization in next refresh.
  • scope_escalation
    0.885
    Categorization in next refresh.
  • role_usurpation_exec
    0.870
    Categorization in next refresh.
  • multi_agent_contagion
    0.769
    Categorization in next refresh.
  • authority_gradient
    0.679
    Categorization in next refresh.
  • role_usurpation_canonical
    0.425
    Categorization in next refresh.
  • role_usurpation
    0.226
    Categorization in next refresh.

CAIS 2026 additions · 3

Three detectors shipped in the Q2 2026 release on 2026-04-30, after the scoreboard snapshot above. F1 numbers are reported against Pisama-Bench v0; full scoreboard integration follows in the next refresh.

  • sycophancy
    Agent uncritically agrees with user assertions against evidence.
    Q2 2026 ship. Calibration F1 0.902 on Pisama-Bench v0. Next scoreboard refresh.
  • consensus_collapse
    Multi-agent debate amplifies rather than corrects errors.
    Q2 2026 ship. Calibration F1 0.967 on Pisama-Bench v0. Next scoreboard refresh.
  • specification_compliance
    AgentPex pattern (Sharma et al., 2026). Extracts behavioral rules from system prompts and checks trace for compliance.
    Q2 2026 ship under feature flag. Re-calibrating against τ2-bench traces from AgentPex paper.

Methodology

Each detector ships with a calibrated F1 on Pisama-Bench v0, a held-out set of multi-agent traces with detector-level ground-truth labels. Calibration uses an 80/20 train/eval split with per-difficulty stratification (easy / medium / hard) per the Anthropic Demystifying Evals methodology.
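
The split and scoring described above can be sketched as follows. This is a minimal illustration, assuming each trace is a dict with a `difficulty` field (easy, medium, or hard) and binary labels; the field name, function names, and seed handling are hypothetical and not taken from the published calibration code.

```python
import random
from collections import defaultdict


def stratified_split(traces, eval_fraction=0.2, seed=0):
    """Split traces into (train, eval) so each difficulty level keeps
    roughly the same eval share. Assumes trace["difficulty"] exists."""
    rng = random.Random(seed)
    by_difficulty = defaultdict(list)
    for trace in traces:
        by_difficulty[trace["difficulty"]].append(trace)

    train, held_out = [], []
    for difficulty in sorted(by_difficulty):
        bucket = list(by_difficulty[difficulty])
        rng.shuffle(bucket)
        n_eval = round(len(bucket) * eval_fraction)
        held_out.extend(bucket[:n_eval])
        train.extend(bucket[n_eval:])
    return train, held_out


def precision_recall_f1(labels, predictions):
    """Binary precision, recall, and F1 for one detector."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Stratifying before splitting keeps hard traces from being over- or under-represented in the eval portion, which would otherwise skew the reported F1.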

The scoreboard above is dated 2026-05-26. We recalibrate per sprint and version the report. Sycophancy and consensus_collapse shipped in the Q2 CAIS deploy on 2026-04-30. Specification compliance, an implementation of the AgentPex pattern (Sharma et al., 2026), ships under a feature flag and is being re-calibrated against the τ2-bench traces from the AgentPex paper.

Detection is tiered. Tier 1 hash and delta detectors run at zero cost in under ten milliseconds. Tier 2 embedding detectors run at near-zero cost. Tier 3 LLM judges handle genuinely ambiguous cases. Tier 4 human review handles the residue. Ninety percent of detections resolve at tiers 1 through 3.
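
The tiering above amounts to a cascade: try the cheapest check first and escalate only when it cannot decide. The sketch below illustrates that control flow under stated assumptions. `run_tiers`, `hash_recurrence`, the `state_hashes` field, and the tier names are hypothetical, and the embedding and LLM tiers are left as caller-supplied callables.

```python
from typing import Callable, Optional, Sequence, Tuple

# True = failure detected, False = clean, None = ambiguous (escalate).
Verdict = Optional[bool]
Tier = Tuple[str, Callable[[dict], Verdict]]


def run_tiers(trace: dict, tiers: Sequence[Tier]) -> Tuple[str, Verdict]:
    """Run tiers in order of cost and stop at the first decisive verdict.
    If every tier is ambiguous, hand the trace to human review."""
    for name, check in tiers:
        verdict = check(trace)
        if verdict is not None:
            return name, verdict
    return "tier4_human_review", None


def hash_recurrence(trace: dict) -> Verdict:
    """Hypothetical tier 1 check: an exact repeated state hash is decisive.
    Absence of a repeat proves nothing here, so it escalates."""
    hashes = trace.get("state_hashes", [])
    if len(hashes) != len(set(hashes)):
        return True
    return None
```

Usage would look like `run_tiers(trace, [("tier1_hash", hash_recurrence), ("tier2_embedding", embed_check), ("tier3_llm", judge)])`, where `embed_check` and `judge` stand in for whatever tier 2 and tier 3 checks are available.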

Detector source is open at github.com/Pisama-AI under the MIT license. The calibration dataset is published with the detector code.

What this enables

  • Pick detectors by tier-gated F1 thresholds (production, beta, experimental) rather than running everything blindly.
  • Audit calibration: every F1 is reproducible against the published dataset.
  • Route at the right cost tier: a detector with F1 above 0.95 on hash recurrence does not need an LLM judge fallback.
  • Compare detectors across releases: per-detector F1 lets you spot regressions before they ship.
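
Two of the points above, tier-gated selection and cross-release comparison, can be sketched in a few lines. The cutoffs (0.80 for production, 0.60 for beta) and the 0.02 regression tolerance are placeholder assumptions, since this page does not state the actual thresholds; the F1 values in the usage note are taken from the scoreboard above.

```python
def tier_for(f1, production=0.80, beta=0.60):
    """Map an F1 score to a tier name using assumed cutoffs."""
    if f1 >= production:
        return "production"
    if f1 >= beta:
        return "beta"
    return "experimental"


def select_detectors(scoreboard, allowed_tiers=("production",)):
    """Return detector names whose tier is in `allowed_tiers`.
    `scoreboard` maps detector name to F1."""
    return sorted(
        name for name, f1 in scoreboard.items()
        if tier_for(f1) in allowed_tiers
    )


def regressions(previous, current, tolerance=0.02):
    """Return {name: (old_f1, new_f1)} for detectors present in both
    releases whose F1 dropped by more than `tolerance`."""
    return {
        name: (previous[name], f1)
        for name, f1 in current.items()
        if name in previous and previous[name] - f1 > tolerance
    }
```

For example, with `{"injection": 0.965, "loop": 0.732, "workflow": 0.433}`, `select_detectors` under these assumed cutoffs keeps only `injection` for production, while `allowed_tiers=("production", "beta")` also admits `loop`.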