Detectors

52 measured failure detectors across single-agent, multi-agent, and sub-agent systems

Name: Pisama
Author: Pisama

52 of the 87 detectors in the capability registry are measured on the external-only lane: real traces from public benchmarks and production integrations. Six are externally validated at production grade. Per-detector precision, recall, and F1 are all published, failing detectors included. No surveyed competitor publishes per-detector calibration.

Measured: 52
Mean F1: 0.75
Production grade: 6
In registry: 87

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14). Mean F1 spans every measured detector, failing ones included.
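The production-grade criteria above can be read as a single all-or-nothing gate. The following is a minimal sketch of that reading, not the registry's actual code: the `DetectorReport` record, its field names, and the sample values are all hypothetical, and the blind-spot check is taken as a precomputed boolean because the page does not define how it is derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorReport:
    """Hypothetical record; field names are illustrative, not the registry schema."""
    name: str
    f1: float
    precision: float
    external_traces: int
    external_grounded_thresholds: bool
    has_difficulty_blind_spot: bool


def is_production_grade(r: DetectorReport) -> bool:
    """Apply the five published criteria; every one of them must hold."""
    return (
        r.f1 >= 0.80
        and r.precision >= 0.70
        and r.external_traces >= 30
        and r.external_grounded_thresholds
        and not r.has_difficulty_blind_spot
    )


# Made-up inputs, used only to exercise the gate.
strong = DetectorReport("example_a", 0.86, 0.81, 120, True, False)
thin_sample = DetectorReport("example_b", 1.00, 1.00, 12, True, False)

print(is_production_grade(strong))       # True
print(is_production_grade(thin_sample))  # False: fewer than 30 external traces
```

Under a gate like this, a perfect F1 on a small sample would still fall short, which is consistent with only 6 detectors holding production grade while many more score 0.80 or higher.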

Benchmark

On the TRAIL benchmark (Patronus, 2025), twenty Pisama heuristic detectors achieve 59.9% joint accuracy at zero LLM cost. The best frontier judge scores 11.6%. That is roughly a 5x lead, and it comes from structural detection.

See Pisama vs Patronus for the side-by-side and the full scoreboard for sortable per-detector metadata (precision, recall, sample count, tier, mode).

Agent core · 14

persona_drift · F1 1.000 · Agent output diverges from declared role, tone, or scope.
workflow · F1 1.000 · Execution diverges from declared workflow definition.
consensus_collapse · F1 1.000 · Multi-agent debate amplifies rather than corrects errors.
corruption · F1 0.941 · Shared state schema or types mutate mid-run.
injection · F1 0.875 · Prompt injection or instruction override patterns in input.
sycophancy · F1 0.857 · Agent uncritically agrees with user assertions against evidence.
completion · F1 0.828 · Subtask coverage falls short of declared success criteria.
context · F1 0.812 · Upstream context entities never referenced downstream.
hallucination · F1 0.807 · Output contains content unsupported by sources.
loop · F1 0.600 · State recurrence within a turn window.
communication · F1 0.568 · Agent A sends information; agent B never acknowledges or references it.
coordination · F1 0.444 · Cross-agent entity reference rate below threshold.
derailment · F1 0.267 · Agent output drifts from task definition mid-run.
withholding · F1 0.000 · Agent has answer in working state but does not surface it.

Retrieval and grounding · 7

retrieval_quality · F1 1.000 · Retrieved documents do not cover query intent.
citation · F1 1.000 · Claims with citations point to unsupporting sources.
chunk_relevance · F1 1.000 · Retrieved chunks irrelevant to the query.
chunk_attribution · F1 1.000 · Source chunks for cited claims mis-attributed or missing.
rag_poisoning · F1 0.941 · Adversarial content in retrieval corpus affecting output.
grounding · F1 0.752 · Output entities lack source attribution.
context_precision · F1 0.667 · Precision of retrieved context against query intent.

Reasoning and planning · 4

convergence · F1 1.000 · Metric-aware: trajectory slope, regression, divergence on cost or accuracy.
decomposition · F1 0.940 · Planner breaks task into subtasks that miss requirements.
specification · F1 0.926 · Agent interprets task differently from user intent.
specification_compliance · F1 0.000 · AgentPex pattern (Sharma et al., 2026). Extracts behavioral rules from system prompts and checks trace for compliance.

Orchestration · 5

synthesis_failure · F1 0.965 · Parent agent mis-synthesizes or drops sub-agent results.
silent_cascade · F1 0.937 · Sub-agent failure propagates upward without surfacing to the parent.
delegation · F1 0.667 · Delegation chain loses context or authority.
redundant_delegation_conflict · F1 0.573 · Duplicate delegations produce conflicting results that go unreconciled.
routing · F1 0.444 · Request routed to wrong specialist or wrong model.

n8n · 3

n8n_complexity · F1 1.000 · n8n workflow complexity exceeds maintainable threshold.
n8n_error · F1 0.667 · n8n execution error pattern matched to known failure mode.
n8n_resource · F1 0.000 · n8n node accesses or mutates an unauthorized resource.

OpenClaw · 5

openclaw_tool_abuse · F1 1.000 · OpenClaw tool invocation pattern matches abuse signature.
openclaw_spawn_chain · F1 1.000 · Agent spawn chain exceeds depth or fan-out limit.
openclaw_channel_mismatch · F1 1.000 · OpenClaw inter-agent channel sender or receiver mis-bound.
openclaw_sandbox_escape · F1 1.000 · Sandbox isolation boundary violation in OpenClaw runtime.
openclaw_elevated_risk · F1 0.800 · OpenClaw session enters elevated-risk control state.

Other · 14

Categorization in next refresh.

openclaw_session_loop · F1 1.000
deception · F1 1.000
impersonation_risk · F1 1.000
scope_escalation · F1 1.000
reward_hacking · F1 1.000
role_usurpation_exec · F1 0.857
over_refusal · F1 0.835
role_usurpation · F1 0.692
multi_agent_contagion · F1 0.667
under_refusal · F1 0.578
jailbreak_compliance · F1 0.507
output_validation · F1 0.500
role_usurpation_canonical · F1 0.267
analytical_semantics · F1 0.000
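
The headline mean can be re-derived from the per-detector F1 values listed in the seven categories above. The sketch below copies those values by hand and averages them with no weighting; that the published mean is an unweighted average is an assumption, though the result agrees with the stated 0.75.

```python
from statistics import mean

# F1 values copied from the scoreboard on this page, grouped by category.
F1_BY_CATEGORY = {
    "Agent core": [1.000, 1.000, 1.000, 0.941, 0.875, 0.857, 0.828,
                   0.812, 0.807, 0.600, 0.568, 0.444, 0.267, 0.000],
    "Retrieval and grounding": [1.000, 1.000, 1.000, 1.000, 0.941, 0.752, 0.667],
    "Reasoning and planning": [1.000, 0.940, 0.926, 0.000],
    "Orchestration": [0.965, 0.937, 0.667, 0.573, 0.444],
    "n8n": [1.000, 0.667, 0.000],
    "OpenClaw": [1.000, 1.000, 1.000, 1.000, 0.800],
    "Other": [1.000, 1.000, 1.000, 1.000, 1.000, 0.857, 0.835,
              0.692, 0.667, 0.578, 0.507, 0.500, 0.267, 0.000],
}

all_f1 = [value for values in F1_BY_CATEGORY.values() for value in values]
zero_f1 = sum(1 for value in all_f1 if value == 0.0)

print(len(all_f1))             # 52 measured detectors
print(round(mean(all_f1), 2))  # 0.75
print(zero_f1)                 # 4 detectors at F1 0.000 stay in the mean
```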

Methodology

Each detector ships with a calibrated F1 on the external-only lane: real agent traces from public benchmarks (TRAIL, Who&When, MAST, GAIA, and others) and production integrations, with detector-level ground-truth labels. Calibration is cross-validated with per-difficulty stratification (easy / medium / hard) per the Anthropic Demystifying Evals methodology. TRAIL-derived material is part of the calibration corpus, so Pisama's TRAIL benchmark results are in-distribution rather than held out; the paper carries the full disclosure.
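
Per-difficulty stratification means scoring each stratum separately, so that a strong aggregate F1 cannot hide a weak one. The sketch below shows only the stratified scoring step, not cross-validation, using the standard definitions of precision, recall, and F1. The row layout `(difficulty, ground_truth, predicted)` and the toy labels are invented for illustration and are not the published dataset format.

```python
from collections import defaultdict


def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from counts; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_by_difficulty(rows):
    """rows: iterable of (difficulty, ground_truth, predicted) with boolean labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for difficulty, truth, predicted in rows:
        c = counts[difficulty]
        if predicted and truth:
            c["tp"] += 1
        elif predicted and not truth:
            c["fp"] += 1
        elif truth and not predicted:
            c["fn"] += 1
    return {d: prf1(**c) for d, c in counts.items()}


# Toy labels, invented for illustration only.
rows = [
    ("easy", True, True), ("easy", True, True), ("easy", False, False),
    ("medium", True, True), ("medium", True, False), ("medium", False, True),
    ("hard", True, False), ("hard", True, False), ("hard", False, False),
]
scores = f1_by_difficulty(rows)
for d in ("easy", "medium", "hard"):
    p, r, f1 = scores[d]
    print(f"{d}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy data the detector misses every hard positive, so the hard stratum scores F1 0.00 even though the easy stratum is perfect. That is the kind of gap a per-difficulty breakdown is meant to expose.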

The scoreboard above is dated 2026-06-14 and comes from the capability registry. We recalibrate every sprint and version the report. The sycophancy, consensus_collapse, and specification_compliance detectors shipped in the Q2 2026 release and now appear as measured rows.

Detection is tiered. Tier 1 hash and delta detectors run at zero cost in under ten milliseconds. Tier 2 embedding detectors run at near-zero cost. Tier 3 LLM judges handle genuinely ambiguous cases. Tier 4 human review handles the residue. Ninety percent of detections resolve at tiers 1 through 3.
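
One way to picture the tiering is as an escalation chain: each tier either returns a verdict or passes the trace on. The sketch below is illustrative only and is not Pisama's pipeline. The tier functions, the trace keys (`state_hashes`, `similarity`, `judge_verdict`), and the 0.9 and 0.3 cutoffs are all hypothetical, and the embedding and LLM tiers are stand-ins that read precomputed values rather than calling a model.

```python
from typing import Callable, Optional

# A tier returns True or False when it can decide, or None to escalate.
Tier = Callable[[dict], Optional[bool]]


def tier1_hash_recurrence(trace: dict) -> Optional[bool]:
    """Cheap structural check: a repeated state hash is a clear positive."""
    hashes = trace.get("state_hashes", [])
    if len(hashes) != len(set(hashes)):
        return True
    return None  # no recurrence seen; let a later tier decide


def tier2_embedding(trace: dict) -> Optional[bool]:
    """Stand-in for an embedding check; reads a precomputed similarity."""
    similarity = trace.get("similarity")
    if similarity is None:
        return None
    if similarity >= 0.9:   # hypothetical cutoff
        return True
    if similarity <= 0.3:   # hypothetical cutoff
        return False
    return None  # ambiguous band


def tier3_llm_judge(trace: dict) -> Optional[bool]:
    """Stand-in for an LLM judge; reads a precomputed verdict if present."""
    return trace.get("judge_verdict")


def detect(trace: dict, tiers: list[Tier]) -> tuple[int, Optional[bool]]:
    """Return (tier that decided, verdict); falls through to human review."""
    for level, tier in enumerate(tiers, start=1):
        verdict = tier(trace)
        if verdict is not None:
            return level, verdict
    return len(tiers) + 1, None  # the residue goes to human review


TIERS = [tier1_hash_recurrence, tier2_embedding, tier3_llm_judge]
print(detect({"state_hashes": ["a", "b", "a"]}, TIERS))                # (1, True)
print(detect({"state_hashes": ["a", "b"], "similarity": 0.5}, TIERS))  # (4, None)
```

Putting the cheapest check first means a trace only incurs embedding or LLM cost when the earlier tiers cannot decide.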

Detector source is open at github.com/Pisama-AI under the MIT license. The calibration dataset is published with the detector code.

What this enables

Pick detectors by tier-gated F1 thresholds (production, beta, experimental) rather than running everything blindly.
Audit calibration: every F1 is reproducible against the published dataset.
Route at the right cost tier: a detector with F1 above 0.95 on hash recurrence does not need an LLM judge fallback.
Compare detectors across releases: per-detector F1 lets you spot regressions before they ship.
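
Comparing releases comes down to diffing two per-detector F1 tables. The sketch below is a minimal illustration: `find_regressions` and its `tolerance` parameter are hypothetical names, the baseline values are taken from the scoreboard on this page, and the candidate values are invented to show the comparison.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02):
    """Return detectors whose F1 dropped by more than `tolerance`, worst first."""
    drops = {
        name: round(baseline[name] - candidate[name], 3)
        for name in baseline.keys() & candidate.keys()
        if baseline[name] - candidate[name] > tolerance
    }
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# Baseline F1 values are taken from the scoreboard on this page.
baseline = {"injection": 0.875, "hallucination": 0.807, "loop": 0.600}
# Candidate values are invented purely to illustrate the comparison.
candidate = {"injection": 0.880, "hallucination": 0.750, "loop": 0.590}

print(find_regressions(baseline, candidate))  # [('hallucination', 0.057)]
```

A tolerance keeps small run-to-run movement, such as the 0.010 drop on loop here, from being flagged as a regression.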

Compare

Pisama vs Patronus

Open detector pipeline vs proprietary judge-as-a-service. 59.9% on TRAIL.

Pisama vs Operama

Shipped calibrated detectors vs control-plane narrative.