Exhibit B · Detection benchmarks

Pisama versus the frontier, on independent benchmarks.

Name: Pisama
Author: Pisama

Two third-party academic benchmarks approach failure analysis from different angles: did a failure happen? (TRAIL) and which agent failed, at which step? (Who&When, ICML 2025). Same traces, same labels.

Exhibit B.1 · Detection: did a failure happen? · 59.9% vs 11.6% best frontier · TRAIL benchmark

Pisama on TRAIL

59.9%

Joint accuracy (heuristic detectors only, Tier 1 to 3, no LLM judge): detector predictions matching ground-truth labels on the full TRAIL set (148 traces, 841 failures).

vs best frontier

+48 pts

p50 cost / trace

Joint accuracy on TRAIL

Pisama (heuristic) · 59.9%
GPT-5.5 · 11.6%
Claude Opus 4.7 · 6.7%
Claude Haiku 4.5 · 5.2%
Gemini 3.5 Flash · 2.9%
Grok 4.3 · 1.7%

Source · TRAIL benchmark (arXiv:2505.08638)
148 traces · 841 labelled failures · frontier numbers from TRAIL paper
Best frontier shown is GPT-5.5 (current). The earlier GPT-5.4 scored marginally higher at 11.9%.
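
The headline figures can be reproduced in outline with the sketch below. It assumes a failure is identified by a (trace, location, category) triple and that a prediction counts only when all three match a ground-truth label. The `Failure` type, its field names, and the sample data are hypothetical, and the exact matching rule used by TRAIL or by Pisama may differ.

```python
from typing import NamedTuple


class Failure(NamedTuple):
    # Hypothetical record shape; the field names are assumptions for illustration.
    trace_id: str
    location: str  # e.g. a span or step identifier
    category: str  # e.g. a failure category label


def joint_accuracy(predicted: set[Failure], labelled: set[Failure]) -> float:
    """Share of labelled failures matched on trace, location, and category."""
    if not labelled:
        return 0.0
    return len(predicted & labelled) / len(labelled)


# Made-up toy data, not taken from the benchmark.
labelled = {
    Failure("t1", "span-3", "tool_error"),
    Failure("t1", "span-7", "loop"),
    Failure("t2", "span-1", "timeout"),
}
predicted = {
    Failure("t1", "span-3", "tool_error"),  # exact match
    Failure("t2", "span-1", "loop"),        # right location, wrong category
}

score = joint_accuracy(predicted, labelled)  # one of three labels matched

# Gap between the two headline numbers quoted above, in percentage points.
gap = round(59.9 - 11.6, 1)
```

With the published figures the gap comes to 48.3 points, which the exhibit rounds to +48.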

Exhibit B.2 · Attribution: which agent failed, at which step? · Who&When · ICML 2025

Method · Agent accuracy · Step accuracy
● Pisama + Sonnet 4 · 60.3% · 24.1%
GPT-5.4 Mini · 60.3% · 22.4%
Gemini 3.1 Flash-Lite · 50.0% · 19.0%
Pisama (heuristic-only) · 31.0% · 16.8%

Source · Who&When: Automated Multi-Agent Failure Attribution (ICML 2025) · given a trace with a known failure, identify which agent failed and at which step.
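
As an illustration of how the two columns relate, the sketch below scores attributions against ground truth. It assumes each trace has exactly one labelled (agent, step) pair and that agent and step are scored independently. The `Attribution` type and the sample values are hypothetical, and the benchmark's own scoring script may apply different rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    # Hypothetical shape: which agent failed and at which step index.
    agent: str
    step: int


def attribution_accuracy(
    predictions: list[Attribution], truths: list[Attribution]
) -> tuple[float, float]:
    """Return (agent accuracy, step accuracy) over paired traces."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must be the same length")
    if not truths:
        return 0.0, 0.0
    pairs = list(zip(predictions, truths))
    agent_hits = sum(p.agent == t.agent for p, t in pairs)
    step_hits = sum(p.step == t.step for p, t in pairs)
    return agent_hits / len(pairs), step_hits / len(pairs)


# Made-up toy data, not taken from Who&When.
truths = [
    Attribution("coder", 4),
    Attribution("planner", 2),
    Attribution("coder", 9),
    Attribution("browser", 1),
]
predictions = [
    Attribution("coder", 4),    # agent and step both correct
    Attribution("planner", 3),  # agent correct, step off by one
    Attribution("tester", 8),   # both wrong
    Attribution("browser", 1),  # agent and step both correct
]

agent_acc, step_acc = attribution_accuracy(predictions, truths)
```

Step accuracy is the stricter measure, since naming the right agent is not enough to locate the exact step; that is consistent with the step column sitting well below the agent column for every method above.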

Exhibit B.3 · Internal calibration · per-detector F1 · cross-validated · no mocks

Detectors measured

Production grade

0.88

Mean F1 · production

$3.66

Judge cost / full run

7,816

Golden dataset entries

Frameworks covered

LangGraph · n8n · Dify · OpenClaw · Managed Agents · OpenAI Assistants · Bedrock Agents

Counts come from the capability registry, the same source as the per-detector scoreboard: 87 detectors total, 52 measured on the real-trace (external-only) lane, 6 externally validated at production grade. The mean F1 above is over that production-grade set; the full measured set ranges from failing to production grade. A detector is externally validated at production grade when it has real-trace F1 of 0.80 or higher, precision of 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14). Per-detector F1, precision, recall, and confidence intervals are on the full scoreboard. A MAST-mapped view is on the taxonomy page.
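
The production-grade criteria above can be read as a simple gate. The sketch below encodes the stated thresholds. The `DetectorStats` record, its field names, and the sample detectors are hypothetical and are not drawn from the capability registry.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DetectorStats:
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    f1: float                           # real-trace F1
    precision: float
    external_traces: int
    external_grounded_thresholds: bool
    has_difficulty_blind_spot: bool


def is_production_grade(s: DetectorStats) -> bool:
    """Apply the five criteria listed in the text above."""
    return (
        s.f1 >= 0.80
        and s.precision >= 0.70
        and s.external_traces >= 30
        and s.external_grounded_thresholds
        and not s.has_difficulty_blind_spot
    )


# Made-up detectors and numbers, for illustration only.
detectors = [
    DetectorStats("detector_a", 0.90, 0.85, 120, True, False),
    DetectorStats("detector_b", 0.86, 0.74, 45, True, False),
    DetectorStats("detector_c", 0.91, 0.88, 12, True, False),   # too few traces
    DetectorStats("detector_d", 0.83, 0.65, 200, True, False),  # precision too low
]

production = [d for d in detectors if is_production_grade(d)]
production_mean_f1 = mean(d.f1 for d in production)
```

Averaging only over detectors that pass the gate mirrors how the mean F1 in this exhibit is described, so that figure should not be read as the average over all measured detectors.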