Pisama

Detector Calibration Scoreboard

Every Pisama detector, calibrated on golden datasets and reported per failure mode. Hyperscaler agent platforms advertise “anomaly detection” without publishing per-detector F1. We publish the whole table.

Detectors calibrated: 77 (53 at F1 ≥ 0.80)
Mean F1: 0.823 (range 0.000 to 1.000)
Judge cost / full run: $3.66 (2,175 LLM escalations)
Last calibrated: 2026-05-26 (14,665 golden samples)
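
The headline figures can be combined into two rough averages. The sketch below uses only the numbers quoted in the summary; the variable names are illustrative, and the page does not say how escalations are distributed across detectors or samples, so these are bench-wide averages only.

```python
# Headline figures copied from the summary above.
judge_cost_usd = 3.66
llm_escalations = 2175
golden_samples = 14665

# Bench-wide averages; the per-detector breakdown is not published here.
cost_per_escalation = judge_cost_usd / llm_escalations
escalations_per_sample = llm_escalations / golden_samples

print(f"average cost per LLM escalation: ${cost_per_escalation:.5f}")
print(f"escalations per golden sample:   {escalations_per_sample:.3f}")
```

On these figures, one judge escalation costs a fraction of a cent on average, and there is roughly one escalation for every seven golden samples.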
Detector                          Tier              F1
impersonation_risk                production        1.000   100.0%   100.0%   850.0
openclaw_channel_mismatch         production        0.993   100.0%    98.5%   1200.0
mcp_protocol                      production        0.990   100.0%    98.0%   1000.0
openclaw_session_loop             production        0.985   100.0%    97.0%   1150.0
task_starvation                   production        0.980   100.0%    96.0%   1000.0
over_refusal                      production        0.971   100.0%    94.4%   850.0
entity_confusion                  production        0.968   100.0%    93.9%   1000.0
consensus_collapse                production        0.967    96.7%    96.7%   600.0
critic_quality                    production        0.966   100.0%    93.5%   950.0
specification_compliance          production        0.966   100.0%    93.3%   300.0
deception                         production        0.966    93.3%   100.0%   850.0
injection                         production        0.965    94.1%    99.0%   2000.0
under_refusal                     production        0.964    96.4%    96.4%   850.0
computer_use                      production        0.960    96.0%    96.0%   570.0
citation                          production        0.959    92.1%   100.0%   1150.0
withholding                       production        0.957    92.5%    99.0%   1490.0
rag_poisoning                     production        0.940    92.2%    95.9%   1950.0
reward_hacking                    production        0.939    95.8%    92.0%   500.0
openclaw_spawn_chain              production        0.933    92.7%    94.0%   1100.0
scheduled_task                    production        0.930   100.0%    87.0%   620.0
dify_variable_leak                production        0.923    98.2%    87.1%   940.0
langgraph_tool_failure            production        0.920    92.0%    92.0%   940.0
langgraph_parallel_sync           production        0.913    91.3%    91.3%   970.0
n8n_resource                      production        0.909    87.0%    95.2%   1290.0
sycophancy                        production        0.902    88.5%    92.0%   500.0
dify_classifier_drift             production        0.900   100.0%    81.8%   1000.0
propagation                       production        0.899   100.0%    81.6%   970.0
jailbreak_compliance              production        0.897    86.7%    92.9%   850.0
delegation                        production        0.895    81.0%   100.0%   1450.0
subagent_boundary                 production        0.894   100.0%    80.8%   950.0
planning_fallacy                  production        0.892    86.1%    92.5%   720.0
multi_chain                       production        0.889    80.0%   100.0%   630.0
openclaw_elevated_risk            production        0.886    79.5%   100.0%   1200.0
scope_escalation                  production        0.885    93.1%    84.4%   850.0
adaptive_thinking                 production        0.885    79.4%   100.0%   800.0
openclaw_sandbox_escape           production        0.883    85.3%    91.4%   1200.0
openclaw_tool_abuse               production        0.882    97.8%    80.4%   1100.0
orchestration_quality             production        0.880    84.6%    91.7%   390.0
n8n_complexity                    production        0.876    81.6%    94.7%   1240.0
role_usurpation_exec              production        0.870    87.0%    87.0%   830.0
corruption                        production        0.869    88.9%    85.0%   2150.0
hallucination                     production        0.866    77.0%    99.0%   3000.0
parallel_consistency              production        0.860    75.4%   100.0%   1000.0
retrieval_quality                 production        0.858    88.4%    83.3%   2360.0
langgraph_state_corruption        production        0.846   100.0%    73.3%   970.0
n8n_error                         production        0.844    76.8%    93.8%   1270.0
compaction_quality                production        0.844    80.7%    88.5%   980.0
convergence                       production        0.843    86.0%    82.7%   1200.0
context                           production        0.838    87.0%    80.8%   3000.0
dispatch_async                    production        0.827    81.6%    83.8%   860.0
reasoning_consistency             production        0.820    83.7%    80.4%   990.0
memory_staleness                  production        0.819    76.8%    87.8%   990.0
cowork_safety                     production        0.800    70.2%    93.0%   850.0
coordination                      production-watch  0.794    84.6%    74.9%   3000.0
langgraph_edge_misroute           production-watch  0.793    78.6%    80.0%   1000.0
context_precision                 production-watch  0.789    65.2%   100.0%   300.0
routing                           production-watch  0.779    63.7%   100.0%   1000.0
persona_drift                     production-watch  0.772    76.9%    77.5%   1840.0
multi_agent_contagion             production-watch  0.769    90.9%    66.7%   600.0
chunk_relevance                   production-watch  0.765    65.0%    92.9%   280.0
grounding                         production-watch  0.758    67.6%    86.2%   3000.0
specification                     production-watch  0.757    87.5%    66.7%   3000.0
completion                        production-watch  0.739    67.8%    81.0%   3000.0
loop                              production-watch  0.732    80.9%    66.8%   3000.0
exploration_safety                production-watch  0.727    80.0%    66.7%   1000.0
authority_gradient                beta              0.679    81.8%    58.1%   620.0
model_selection                   beta              0.675    61.9%    74.3%   1020.0
langgraph_checkpoint_corruption   beta              0.640    50.8%    86.5%   1000.0
derailment                        beta              0.633    52.8%    79.0%   3000.0
communication                     beta              0.629   100.0%    45.9%   1490.0
decomposition                     beta              0.602    45.4%    89.3%   3000.0
overflow                          beta              0.588    81.0%    46.1%   1500.0
approval_bypass                   beta              0.559    88.9%    40.8%   1490.0
workflow                          beta              0.433    74.6%    30.6%   2100.0
role_usurpation_canonical         beta              0.425    92.3%    27.6%   1470.0
role_usurpation                   experimental      0.226   100.0%    12.7%   2630.0
chunk_attribution                 experimental      0.000     0.0%     0.0%   200.0
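
A reader can sanity-check any row by recomputing F1 from its two percentage columns. The sketch below assumes those two columns are precision and recall; the table does not label them, but F1 is symmetric in the two, so their order does not affect the check. The helper name `f1_score` and the tolerance are illustrative choices, not part of the Pisama codebase.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (detector, published F1, first rate, second rate), copied from the table.
# Assumption: the two rates are precision and recall, in either order.
rows = [
    ("openclaw_channel_mismatch", 0.993, 1.000, 0.985),
    ("hallucination", 0.866, 0.770, 0.990),
    ("role_usurpation", 0.226, 1.000, 0.127),
    ("chunk_attribution", 0.000, 0.000, 0.000),
]

# The rates are published to one decimal place, so allow a small rounding gap.
TOLERANCE = 0.001

for name, published, rate_a, rate_b in rows:
    recomputed = f1_score(rate_a, rate_b)
    status = "ok" if abs(recomputed - published) < TOLERANCE else "mismatch"
    print(f"{name:28s} published={published:.3f} recomputed={recomputed:.3f} {status}")
```

The harmonic mean is dominated by the weaker of the two rates, which is why role_usurpation sits at 0.226 even though one of its rates is 100.0%.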

Why publish this

Generic anomaly detection is becoming a feature in every hyperscaler agent platform. That makes “we have anomaly detection” a check-the-box claim, not a differentiator.

Pisama’s position is the opposite: a structured failure taxonomy with calibrated detectors, each tuned and reported on a per-mode basis. Coordination failure does not look like grounding failure does not look like persona drift, and the eval surface should reflect that.

Numbers above are from the Sprint 11 calibration run on golden datasets totaling 14,665 samples. Confidence intervals come from cross-validation. The LLM-judge cost is the total spend on escalations needed to recalibrate the entire bench in one pass.
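
As an illustration of how a cross-validation confidence interval can be formed, the sketch below takes per-fold F1 scores for one detector and builds a normal-approximation interval around their mean. This is a generic textbook approach, not a description of Pisama's calibration procedure, which the page does not spell out; the function name, the fold count, and the fold values are all hypothetical.

```python
import statistics


def cv_confidence_interval(fold_scores, z=1.96):
    """Return (mean, low, high) using a normal approximation over folds."""
    mean = statistics.fmean(fold_scores)
    if len(fold_scores) < 2:
        # A single fold gives no spread estimate; return a degenerate interval.
        return mean, mean, mean
    stderr = statistics.stdev(fold_scores) / len(fold_scores) ** 0.5
    return mean, mean - z * stderr, mean + z * stderr


# Hypothetical per-fold F1 values for one detector (not from the scoreboard).
fold_f1 = [0.84, 0.80, 0.86, 0.81, 0.83]

mean, low, high = cv_confidence_interval(fold_f1)
print(f"F1 = {mean:.3f} (approx. 95% CI {low:.3f} to {high:.3f})")
```

With only a handful of folds, a t-distribution multiplier would give a wider interval than z = 1.96, and fold scores are not fully independent, so intervals of this kind are best read as approximate.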

Source code: github.com/Pisama-AI/pisama. Snapshot generated by backend/scripts/generate_scoreboard_snapshot.py.

© 2026 Pisama. All rights reserved.