Detector Calibration Scoreboard
Every Pisama detector, calibrated on golden datasets and reported per failure mode. Hyperscaler agent platforms advertise “anomaly detection” without publishing per-detector F1. We publish the whole table.
| Detector | Tier | F1 | Precision | Recall | Samples | LLM-judge cost |
|---|---|---|---|---|---|---|
| impersonation_risk | production | 1.000 | 100.0% | 100.0% | 85 | 0.0 |
| openclaw_channel_mismatch | production | 0.993 | 100.0% | 98.5% | 120 | 0.0 |
| mcp_protocol | production | 0.990 | 100.0% | 98.0% | 100 | 0.0 |
| openclaw_session_loop | production | 0.985 | 100.0% | 97.0% | 115 | 0.0 |
| task_starvation | production | 0.980 | 100.0% | 96.0% | 100 | 0.0 |
| over_refusal | production | 0.971 | 100.0% | 94.4% | 85 | 0.0 |
| entity_confusion | production | 0.968 | 100.0% | 93.9% | 100 | 0.0 |
| consensus_collapse | production | 0.967 | 96.7% | 96.7% | 60 | 0.0 |
| critic_quality | production | 0.966 | 100.0% | 93.5% | 95 | 0.0 |
| specification_compliance | production | 0.966 | 100.0% | 93.3% | 30 | 0.0 |
| deception | production | 0.966 | 93.3% | 100.0% | 85 | 0.0 |
| injection | production | 0.965 | 94.1% | 99.0% | 200 | 0.0 |
| under_refusal | production | 0.964 | 96.4% | 96.4% | 85 | 0.0 |
| computer_use | production | 0.960 | 96.0% | 96.0% | 57 | 0.0 |
| citation | production | 0.959 | 92.1% | 100.0% | 115 | 0.0 |
| withholding | production | 0.957 | 92.5% | 99.0% | 149 | 0.0 |
| rag_poisoning | production | 0.940 | 92.2% | 95.9% | 195 | 0.0 |
| reward_hacking | production | 0.939 | 95.8% | 92.0% | 50 | 0.0 |
| openclaw_spawn_chain | production | 0.933 | 92.7% | 94.0% | 110 | 0.0 |
| scheduled_task | production | 0.930 | 100.0% | 87.0% | 62 | 0.0 |
| dify_variable_leak | production | 0.923 | 98.2% | 87.1% | 94 | 0.0 |
| langgraph_tool_failure | production | 0.920 | 92.0% | 92.0% | 94 | 0.0 |
| langgraph_parallel_sync | production | 0.913 | 91.3% | 91.3% | 97 | 0.0 |
| n8n_resource | production | 0.909 | 87.0% | 95.2% | 129 | 0.0 |
| sycophancy | production | 0.902 | 88.5% | 92.0% | 50 | 0.0 |
| dify_classifier_drift | production | 0.900 | 100.0% | 81.8% | 100 | 0.0 |
| propagation | production | 0.899 | 100.0% | 81.6% | 97 | 0.0 |
| jailbreak_compliance | production | 0.897 | 86.7% | 92.9% | 85 | 0.0 |
| delegation | production | 0.895 | 81.0% | 100.0% | 145 | 0.0 |
| subagent_boundary | production | 0.894 | 100.0% | 80.8% | 95 | 0.0 |
| planning_fallacy | production | 0.892 | 86.1% | 92.5% | 72 | 0.0 |
| multi_chain | production | 0.889 | 80.0% | 100.0% | 63 | 0.0 |
| openclaw_elevated_risk | production | 0.886 | 79.5% | 100.0% | 120 | 0.0 |
| scope_escalation | production | 0.885 | 93.1% | 84.4% | 85 | 0.0 |
| adaptive_thinking | production | 0.885 | 79.4% | 100.0% | 80 | 0.0 |
| openclaw_sandbox_escape | production | 0.883 | 85.3% | 91.4% | 120 | 0.0 |
| openclaw_tool_abuse | production | 0.882 | 97.8% | 80.4% | 110 | 0.0 |
| orchestration_quality | production | 0.880 | 84.6% | 91.7% | 39 | 0.0 |
| n8n_complexity | production | 0.876 | 81.6% | 94.7% | 124 | 0.0 |
| role_usurpation_exec | production | 0.870 | 87.0% | 87.0% | 83 | 0.0 |
| corruption | production | 0.869 | 88.9% | 85.0% | 215 | 0.0 |
| hallucination | production | 0.866 | 77.0% | 99.0% | 300 | 0.0 |
| parallel_consistency | production | 0.860 | 75.4% | 100.0% | 100 | 0.0 |
| retrieval_quality | production | 0.858 | 88.4% | 83.3% | 236 | 0.0 |
| langgraph_state_corruption | production | 0.846 | 100.0% | 73.3% | 97 | 0.0 |
| n8n_error | production | 0.844 | 76.8% | 93.8% | 127 | 0.0 |
| compaction_quality | production | 0.844 | 80.7% | 88.5% | 98 | 0.0 |
| convergence | production | 0.843 | 86.0% | 82.7% | 120 | 0.0 |
| context | production | 0.838 | 87.0% | 80.8% | 300 | 0.0 |
| dispatch_async | production | 0.827 | 81.6% | 83.8% | 86 | 0.0 |
| reasoning_consistency | production | 0.820 | 83.7% | 80.4% | 99 | 0.0 |
| memory_staleness | production | 0.819 | 76.8% | 87.8% | 99 | 0.0 |
| cowork_safety | production | 0.800 | 70.2% | 93.0% | 85 | 0.0 |
| coordination | production-watch | 0.794 | 84.6% | 74.9% | 300 | 0.0 |
| langgraph_edge_misroute | production-watch | 0.793 | 78.6% | 80.0% | 100 | 0.0 |
| context_precision | production-watch | 0.789 | 65.2% | 100.0% | 30 | 0.0 |
| routing | production-watch | 0.779 | 63.7% | 100.0% | 100 | 0.0 |
| persona_drift | production-watch | 0.772 | 76.9% | 77.5% | 184 | 0.0 |
| multi_agent_contagion | production-watch | 0.769 | 90.9% | 66.7% | 60 | 0.0 |
| chunk_relevance | production-watch | 0.765 | 65.0% | 92.9% | 28 | 0.0 |
| grounding | production-watch | 0.758 | 67.6% | 86.2% | 300 | 0.0 |
| specification | production-watch | 0.757 | 87.5% | 66.7% | 300 | 0.0 |
| completion | production-watch | 0.739 | 67.8% | 81.0% | 300 | 0.0 |
| loop | production-watch | 0.732 | 80.9% | 66.8% | 300 | 0.0 |
| exploration_safety | production-watch | 0.727 | 80.0% | 66.7% | 100 | 0.0 |
| authority_gradient | beta | 0.679 | 81.8% | 58.1% | 62 | 0.0 |
| model_selection | beta | 0.675 | 61.9% | 74.3% | 102 | 0.0 |
| langgraph_checkpoint_corruption | beta | 0.640 | 50.8% | 86.5% | 100 | 0.0 |
| derailment | beta | 0.633 | 52.8% | 79.0% | 300 | 0.0 |
| communication | beta | 0.629 | 100.0% | 45.9% | 149 | 0.0 |
| decomposition | beta | 0.602 | 45.4% | 89.3% | 300 | 0.0 |
| overflow | beta | 0.588 | 81.0% | 46.1% | 150 | 0.0 |
| approval_bypass | beta | 0.559 | 88.9% | 40.8% | 149 | 0.0 |
| workflow | beta | 0.433 | 74.6% | 30.6% | 210 | 0.0 |
| role_usurpation_canonical | beta | 0.425 | 92.3% | 27.6% | 147 | 0.0 |
| role_usurpation | experimental | 0.226 | 100.0% | 12.7% | 263 | 0.0 |
| chunk_attribution | experimental | 0.000 | 0.0% | 0.0% | 20 | 0.0 |
Why publish this
Generic anomaly detection is becoming a feature in every hyperscaler agent platform. That makes “we have anomaly detection” a check-the-box claim, not a differentiator.
Pisama’s position is the opposite: a structured failure taxonomy with calibrated detectors, each tuned and reported per failure mode. A coordination failure does not look like a grounding failure, and neither looks like persona drift; the eval surface should reflect that.
The numbers above come from the Sprint 11 calibration run on golden datasets containing 14,665 entries in total. Confidence intervals come from cross-validation. LLM-judge escalation cost is the total spend to recalibrate the entire bench in one pass.
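As a reading aid, the F1 figures in the table can be checked against the two percentage columns next to them, since F1 is the harmonic mean of precision and recall. The sketch below is illustrative only and is not taken from the Pisama codebase: the helper name `f1_score` and the choice of rows are assumptions, and the values are copied by hand from the table. Because the formula is symmetric, the result does not depend on which of the two percentage columns is precision and which is recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (first percentage, second percentage, published F1), copied from the table.
# Hypothetical selection of rows, chosen only to illustrate the check.
rows = {
    "hallucination": (0.770, 0.990, 0.866),
    "injection": (0.941, 0.990, 0.965),
    "delegation": (0.810, 1.000, 0.895),
    "chunk_attribution": (0.000, 0.000, 0.000),
}

for name, (p, r, published) in rows.items():
    computed = f1_score(p, r)
    # The table rounds percentages to one decimal place, so allow a small
    # tolerance when comparing against the published three-decimal F1.
    matches = abs(computed - published) < 0.001
    print(f"{name}: computed={computed:.3f} published={published:.3f} match={matches}")
```

The zero guard matters for rows such as chunk_attribution, where both percentages are 0.0% and the plain formula would divide by zero.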
Source code: github.com/Pisama-AI/pisama. Snapshot generated by backend/scripts/generate_scoreboard_snapshot.py.