Pisama versus the frontier, on independent benchmarks.
Two third-party academic benchmarks approach the same problem from different angles: did a failure happen? (TRAIL) and which agent failed, at which step? (Who&When, ICML 2025). Same traces, same labels.
Exhibit B.1 · Detection: did a failure happen? · 59.9% vs 11.6% best frontier · TRAIL benchmark
Pisama on TRAIL: 59.9%
Joint accuracy: detector predictions matching ground-truth labels on the full TRAIL set (148 traces, 841 failures).
vs best frontier: +48 pts
p50 cost / trace: $0
Joint accuracy on TRAIL
Pisama (heuristic) | 59.9%
GPT-5.5 | 11.6%
Claude Opus 4.7 | 6.7%
Claude Haiku 4.5 | 5.2%
Gemini 3.5 Flash | 2.9%
Grok 4.3 | 1.7%
Source · TRAIL benchmark (arXiv:2505.08638)
148 traces · 841 labelled failures · frontier numbers from TRAIL paper
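To make the headline figure concrete, here is a minimal sketch of how a joint-accuracy score and a percentage-point gap could be computed. It assumes that a prediction counts only when trace, location, and category all match a labelled failure. The `Failure` record, the function names, and the toy data are hypothetical and are not the TRAIL or Pisama schema.

```python
from typing import NamedTuple


class Failure(NamedTuple):
    """Hypothetical label shape: where a failure is and what kind it is."""
    trace_id: str
    location: str   # e.g. a span or step identifier
    category: str   # e.g. a failure-type name


def joint_accuracy(predicted: set[Failure], labelled: set[Failure]) -> float:
    """Fraction of labelled failures matched on trace, location, and category."""
    if not labelled:
        return 0.0
    return len(predicted & labelled) / len(labelled)


def gap_in_points(ours_pct: float, theirs_pct: float) -> float:
    """Percentage-point difference between two accuracy figures."""
    return round(ours_pct - theirs_pct, 1)


# Toy data, invented for illustration only.
labels = {
    Failure("t1", "span-3", "tool_error"),
    Failure("t1", "span-7", "goal_drift"),
    Failure("t2", "span-1", "timeout"),
    Failure("t2", "span-4", "tool_error"),
}
preds = {
    Failure("t1", "span-3", "tool_error"),   # exact match
    Failure("t1", "span-7", "timeout"),      # right place, wrong category
    Failure("t2", "span-1", "timeout"),      # exact match
}

print(joint_accuracy(preds, labels))   # 0.5
print(gap_in_points(59.9, 11.6))       # 48.3
```

The second call reproduces the gap quoted above: 59.9% minus 11.6% is 48.3 points, shown on the page as +48 pts.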
Exhibit B.2 · Attribution: which agent failed, at which step? · Who&When · ICML 2025
Method | Agent accuracy | Step accuracy
Pisama + Sonnet 4 | 60.3% | 24.1%
GPT-5.4 Mini | 60.3% | 22.4%
Gemini 3.1 Flash-Lite | 50.0% | 19.0%
Pisama (heuristic-only) | 31.0% | 16.8%
Source · Who&When: Automated Multi-Agent Failure Attribution (ICML 2025) · given a trace with a known failure, identify which agent failed and at which step.
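The two columns can be read as two separate hit rates over the same traces. The sketch below assumes each trace has exactly one labelled failing agent and one labelled failing step, and that the two are scored independently. The `Attribution` record and the toy data are hypothetical, not the benchmark's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: the agent blamed and the step index blamed."""
    agent: str
    step: int


def attribution_accuracy(
    predictions: list[Attribution], truths: list[Attribution]
) -> tuple[float, float]:
    """Return (agent accuracy, step accuracy) over paired traces."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must be the same length")
    if not truths:
        return 0.0, 0.0
    agent_hits = sum(p.agent == t.agent for p, t in zip(predictions, truths))
    step_hits = sum(p.step == t.step for p, t in zip(predictions, truths))
    n = len(truths)
    return agent_hits / n, step_hits / n


# Toy data, invented for illustration only.
truths = [
    Attribution("planner", 4),
    Attribution("coder", 9),
    Attribution("reviewer", 2),
    Attribution("coder", 6),
]
predictions = [
    Attribution("planner", 4),   # agent and step both right
    Attribution("coder", 7),     # right agent, wrong step
    Attribution("reviewer", 5),  # right agent, wrong step
    Attribution("planner", 1),   # both wrong
]

print(attribution_accuracy(predictions, truths))  # (0.75, 0.25)
```

Scoring the two levels separately is what lets two methods tie on agent accuracy (60.3%) and still differ on step accuracy (24.1% vs 22.4%).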
Exhibit B.3 · Internal calibration · per-detector F1 · cross-validated · no mocks
Detectors calibrated: 60
Mean F1 (Sprint 11): 0.879
Judge cost / full run: $3.66
Golden dataset entries: 7,816
Frameworks covered
LangGraph, n8n, Dify, OpenClaw, Semantic Kernel, Managed Agents, OpenAI Assistants, Bedrock Agents
Per-detector F1, precision, recall, and confidence intervals are on the full scoreboard. A MAST-mapped view is on the taxonomy page.
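As a reminder of how per-detector F1 rolls up into a single mean, here is a minimal sketch. It assumes the mean is an unweighted (macro) average across detectors, which the page does not state. The detector names and the counts are invented for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), which simplifies to 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted (macro) mean of per-detector F1 (an assumption)."""
    scores = [f1_score(*c) for c in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical detectors with (true positives, false positives, false negatives).
toy_counts = {
    "loop_detector": (8, 2, 2),      # F1 = 16 / 20 = 0.8
    "timeout_detector": (9, 1, 1),   # F1 = 18 / 20 = 0.9
}

print(round(mean_f1(toy_counts), 3))  # 0.85
```

A macro average weights every detector equally regardless of how many golden entries it has. A support-weighted mean would be an equally plausible reading and would give a different number when detectors differ in sample size.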