Detector Calibration Scoreboard

Name: Pisama
Author: Pisama

Every Pisama detector, calibrated on golden datasets and reported per failure mode. Hyperscaler agent platforms advertise “anomaly detection” without publishing per-detector F1. We publish the whole table.

Detectors measured

52

of 87 in the registry

Mean F1

0.753

range 0.000 – 1.000

Production grade

6

7 beta · 18 experimental · 21 failing

Last calibrated

2026-06-14

external-only lane (real traces)

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14).
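As a rough illustration, the gate above can be read as a conjunction of five checks. The sketch below is a paraphrase under stated assumptions, not Pisama's registry code: the `DetectorResult` class, its field names, and the `failed_checks` and `meets_production_gate` helpers are hypothetical, and the two boolean flags stand in for checks that this page names but does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorResult:
    """Hypothetical record for one detector; field names are assumptions."""
    name: str
    f1: float                       # real-trace F1
    precision: float                # real-trace precision
    n_traces: int                   # number of external traces
    external_thresholds: bool       # thresholds grounded on external data
    no_difficulty_blind_spot: bool  # no per-difficulty blind spot


def failed_checks(r: DetectorResult) -> list[str]:
    """Return the names of the gate criteria that the record does not meet."""
    checks = {
        "f1 >= 0.80": r.f1 >= 0.80,
        "precision >= 0.70": r.precision >= 0.70,
        "traces >= 30": r.n_traces >= 30,
        "external-grounded thresholds": r.external_thresholds,
        "no per-difficulty blind spot": r.no_difficulty_blind_spot,
    }
    return [label for label, ok in checks.items() if not ok]


def meets_production_gate(r: DetectorResult) -> bool:
    """A record passes only when every criterion holds."""
    return not failed_checks(r)


# n8n_complexity as listed in the table (F1 1.000, both rate columns 100%,
# 2 traces). The two boolean flags are not published here; True is assumed.
small_sample = DetectorResult("n8n_complexity", 1.000, 1.000, 2, True, True)

# A purely hypothetical record that clears every threshold.
hypothetical = DetectorResult("example_detector", 0.85, 0.75, 40, True, True)

for record in (small_sample, hypothetical):
    print(record.name, meets_production_gate(record), failed_checks(record))
```

Under these assumptions, a detector with a perfect F1 on two traces still misses the gate on trace count alone, which is one way a 1.000 row can sit outside the production tier.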

Detector	Tier	Mode	F1	95% CI	Precision	Recall	Traces
persona_drift	experimental	heuristic	1.000	1.000–1.000	100.0%	100.0%	13
workflow	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	6
retrieval_quality	failing	heuristic	1.000	1.000–1.000	100.0%	100.0%	5
n8n_complexity	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	2
openclaw_tool_abuse	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	4
openclaw_spawn_chain	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	4
openclaw_channel_mismatch	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	4
openclaw_sandbox_escape	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	4
convergence	beta	heuristic	1.000	1.000–1.000	100.0%	100.0%	16
citation	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	2
openclaw_session_loop	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	4
consensus_collapse	experimental	heuristic	1.000	1.000–1.000	100.0%	100.0%	10
chunk_relevance	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	4
chunk_attribution	failing	heuristic	1.000	0.000–1.000	100.0%	100.0%	4
deception	experimental	heuristic	1.000	1.000–1.000	100.0%	100.0%	13
impersonation_risk	experimental	heuristic	1.000	1.000–1.000	100.0%	100.0%	13
scope_escalation	experimental	heuristic	1.000	1.000–1.000	100.0%	100.0%	13
reward_hacking	experimental	heuristic	1.000	1.000–1.000	100.0%	100.0%	8
synthesis_failure	production	heuristic	0.965	0.938–0.987	97.9%	95.0%	200
corruption	experimental	heuristic	0.941	0.769–1.000	100.0%	88.9%	14
rag_poisoning	beta	heuristic	0.941	0.778–1.000	88.9%	100.0%	16
decomposition	production	heuristic	0.940	0.894–0.976	98.4%	90.0%	104
silent_cascade	production	heuristic	0.937	0.899–0.968	98.9%	89.0%	200
specification	beta	heuristic	0.926	0.840–0.984	92.6%	92.6%	113
injection	beta	heuristic	0.875	0.667–1.000	77.8%	100.0%	15
role_usurpation_exec	experimental	heuristic	0.857	0.000–1.000	100.0%	75.0%	13
sycophancy	experimental	heuristic	0.857	0.400–1.000	100.0%	75.0%	8
over_refusal	beta	heuristic	0.835	0.767–0.914	91.5%	76.8%	112
completion	production	heuristic	0.828	0.759–0.889	94.2%	73.9%	106
context	production	heuristic	0.812	0.717–0.885	93.2%	71.9%	62
hallucination	production	heuristic	0.807	0.742–0.868	74.2%	88.5%	113
openclaw_elevated_risk	failing	heuristic	0.800	0.000–1.000	66.7%	100.0%	4
grounding	beta	heuristic	0.752	0.648–0.829	62.1%	95.3%	78
role_usurpation	beta	heuristic	0.692	0.455–0.867	100.0%	52.9%	26
n8n_error	failing	heuristic	0.667	0.000–1.000	50.0%	100.0%	3
delegation	failing	heuristic	0.667	0.000–1.000	66.7%	66.7%	7
context_precision	failing	heuristic	0.667	0.000–1.000	50.0%	100.0%	4
multi_agent_contagion	experimental	heuristic	0.667	0.222–1.000	75.0%	60.0%	10
loop	experimental	heuristic	0.600	0.429–0.750	79.0%	48.4%	49
under_refusal	experimental	heuristic	0.578	0.436–0.694	92.3%	42.1%	113
redundant_delegation_conflict	experimental	heuristic	0.573	0.474–0.667	95.3%	41.0%	200
communication	experimental	heuristic	0.568	0.424–0.691	75.0%	45.6%	61
jailbreak_compliance	experimental	heuristic	0.507	0.356–0.643	95.0%	34.5%	112
output_validation	experimental	hybrid	0.500	0.000–0.857	100.0%	33.3%	10
coordination	experimental	heuristic	0.444	0.235–0.557	45.2%	43.8%	113
routing	experimental	heuristic	0.444	0.000–0.800	50.0%	40.0%	10
derailment	failing	heuristic	0.267	0.103–0.425	85.7%	15.8%	72
role_usurpation_canonical	failing	heuristic	0.267	0.000–0.556	100.0%	15.4%	22
withholding	failing	heuristic	0.000	0.000–1.000	0.0%	0.0%	2
n8n_resource	failing	heuristic	0.000	0.000–1.000	0.0%	0.0%	2
specification_compliance	failing	heuristic	0.000	0.000–0.000	0.0%	0.0%	4
analytical_semantics	failing	hybrid	0.000	0.000–0.000	0.0%	0.0%	5
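
The F1 values in the table are consistent with the harmonic mean of the two percentage columns in each row. Because the harmonic mean is symmetric, the check below does not need to know which column is precision and which is recall. It is a minimal sketch for readers who want to spot-check rows by hand; the function name `f1_from_rates` and the `rows` dictionary are illustrative and not part of the Pisama codebase.

```python
def f1_from_rates(a: float, b: float) -> float:
    """Harmonic mean of two rates in [0, 1]; defined as 0.0 when both are 0."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# (first rate, second rate, reported F1), copied from rows of the table.
rows = {
    "hallucination": (0.742, 0.885, 0.807),
    "grounding": (0.621, 0.953, 0.752),
    "completion": (0.942, 0.739, 0.828),
    "n8n_error": (0.500, 1.000, 0.667),
    "withholding": (0.000, 0.000, 0.000),
}

for name, (a, b, reported) in rows.items():
    computed = f1_from_rates(a, b)
    # Published rates are rounded to one decimal place, so allow a small gap.
    status = "ok" if abs(computed - reported) < 0.002 else "mismatch"
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f} ({status})")
```

The rates in the table are rounded to one decimal place, so a recomputed F1 can differ from the published one in the last digit.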

Why publish this

Generic anomaly detection is becoming a feature in every hyperscaler agent platform. That makes “we have anomaly detection” a check-the-box claim, not a differentiator.

Pisama’s position is the opposite: a structured failure taxonomy with calibrated detectors, each tuned and reported on a per-mode basis. Coordination failure does not look like grounding failure, which in turn does not look like persona drift, and the eval surface should reflect that.

Numbers above come from the capability registry (external-only lane): real traces from public benchmarks and production integrations, cross-validated per detector. Tier is each detector's readiness level in the registry; the production gate is spelled out under the summary cards. Failing detectors stay in the table, since publishing only the winners would misrepresent the measured population.

Source code: github.com/Pisama-AI/pisama. Snapshot generated by backend/scripts/generate_scoreboard_snapshot.py.