Use both. Here's where each one wins.

Observability tools see failures; Pisama acts on them. Observability is the precondition layer, and Pisama is the action layer above it. The comparison below states where each product is stronger and is not a zero-sum claim.

Pisama vs Patronus AI

Name: Pisama
Author: Pisama

Patronus published the TRAIL benchmark (the canonical dataset for agent failure detection) and ships Percival, a proprietary agent eval product. They are the closest competitor by problem framing.

On Patronus's own benchmark, Pisama's 20 core heuristic detectors achieve 59.9% joint accuracy at $0 cost. The best frontier LLM (the model class that underpins Percival's judge) achieves 11.6%. The roughly 5x lead (59.9 / 11.6 ≈ 5.2) comes from heuristic-first design: most failures have structural signatures that do not require an LLM to detect.
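
To make "structural signature" concrete, here is a minimal sketch of a heuristic check that flags a repeated-tool-call loop without calling an LLM. The trace shape (a list of steps with `tool` and `args` fields), the function name `detect_repeated_call_loop`, and the default of three repeats are assumptions for illustration only; this is not Pisama's actual detector code or trace schema.

```python
from typing import Any


def detect_repeated_call_loop(steps: list[dict[str, Any]], max_repeats: int = 3) -> bool:
    """Return True if the same tool is called with identical args
    at least `max_repeats` times in a row (hypothetical heuristic)."""
    run_length = 0
    previous = None
    for step in steps:
        # Signature = tool name plus a stable text form of its arguments.
        signature = (step.get("tool"), repr(sorted(step.get("args", {}).items())))
        run_length = run_length + 1 if signature == previous else 1
        previous = signature
        if run_length >= max_repeats:
            return True
    return False


# Hypothetical trace: the agent issues the same search three times in a row.
trace = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "search", "args": {"q": "refund policy"}},
]
print(detect_repeated_call_loop(trace))  # True
```

A check like this runs in linear time over the trace and needs no model call, which is the kind of property the $0 cost figure depends on.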

Where Patronus AI wins

Authored TRAIL: deepest expertise in agent failure taxonomy
Strong managed-service offering (Percival, Lynx, Glider)
Enterprise sales motion and design partner program

Where Pisama wins

59.9% on TRAIL vs 11.6% best frontier (the model class Percival relies on)
Open-source detectors: per-detector F1 is published from a versioned calibration program, and the dataset is reproducible; Patronus publishes model-level accuracy for its judges and TRAIL scores for frontier LLMs, but not per-failure-mode accuracy for Percival itself
Heuristic-first: median trace cost <$0.01, vs Percival's LLM-judge cost per call
Multi-framework: LangGraph, CrewAI, AutoGen, OpenAI Agents, Claude Agent SDK, Bedrock, ADK

At a glance

Dimension	Patronus AI	Pisama
TRAIL accuracy	11.6% (best frontier judge)	59.9% (heuristic detectors only)
Cost per trace	LLM judge cost per call	<$0.01 median (T1–T3 free)
Openness	Proprietary judges	MIT detectors, F1 per detector public
Framework coverage	API-based; framework-agnostic	12 first-class adapters + OTel ingest
Auditability	Black-box graders	Detector logic in repo, calibration data published

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14).
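
The numeric part of that bar can be written as a simple check. The sketch below computes precision, recall, and F1 from confusion counts and compares them with the thresholds stated above (F1 of 0.80 or higher, precision of 0.70 or higher, 30 or more external traces). The function names and the counts in the usage lines are hypothetical, and the non-numeric criteria (external-grounded thresholds, no per-difficulty blind spot) are not modeled here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def meets_production_bar(
    tp: int,
    fp: int,
    fn: int,
    n_external_traces: int,
    min_f1: float = 0.80,         # threshold quoted in the text
    min_precision: float = 0.70,  # threshold quoted in the text
    min_traces: int = 30,         # threshold quoted in the text
) -> bool:
    """Check only the numeric criteria; qualitative criteria are out of scope."""
    precision, _, f1 = precision_recall_f1(tp, fp, fn)
    return f1 >= min_f1 and precision >= min_precision and n_external_traces >= min_traces


# Hypothetical counts for one detector, evaluated on 60 external traces.
print(meets_production_bar(tp=40, fp=8, fn=6, n_external_traces=60))
# The same counts fail when fewer than 30 external traces are available.
print(meets_production_bar(tp=40, fp=8, fn=6, n_external_traces=20))
```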

Benchmark note: the competitor LLM baselines we cite (for example 11.6% best frontier on TRAIL) were measured by Pisama in April 2026 against the published benchmarks, not self-reported by the vendors. Pisama's own 59.9% on TRAIL is the heuristic-only (Tier 1 to 3) result. Raw results are in the open-source repo.

Recommendation

For teams that want the deepest agent failure taxonomy in production, Pisama is the open implementation. Patronus is strong for enterprise teams that want a managed service and are comfortable with proprietary graders.

FAQ

Did Pisama use the TRAIL dataset to calibrate?

TRAIL is the evaluation benchmark, not the calibration set. Pisama detectors are calibrated on 7,212 traces from 13 external sources (none of which are TRAIL). TRAIL is held out for measurement, which is why the 59.9% number is a fair test against the published benchmark.

Can I use both?

Yes. Run Patronus for managed evals on outputs and Pisama for in-flight structural detectors on the trace. The categories are complementary.