Peer-reviewable scholarly output.
DOI-anchored preprints with full code and data release.
- 2026-05-08
Do Human Cognitive Failures Map onto AI Agent Failures? A Structured Literature Review
A PRISMA 2020 structured review of 45 human cognitive failure categories mapped onto AI agent research. Headline: the 45 categories split 6 / 24 / 12 / 0 / 3 across Substantial / Partial / Nascent / Absent / Substrate-absent. Appendix C contains a complete crosswalk between MAST’s 14 multi-agent failure modes and the 45-category human cognitive failure taxonomy.
- 2026-05-08
Tiered Detection of Multi-Agent LLM Failures: An Empirical Calibration on TRAIL and Who&When
Empirical companion. 18-detector tiered pipeline, calibrated on 8,338 trace entries; ‘in-distribution’ TRAIL evaluation, held-out Who&When attribution, four-substrate Cohen’s κ analysis against MAST-released labels. Concept DOI 10.5281/zenodo.20027840 covers v1–v3.
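For readers unfamiliar with the agreement statistic named in the entry above, here is a minimal sketch of Cohen's κ between two label sequences. It is the textbook formula, not the paper's released code: the function name `cohen_kappa` and the toy labels are assumptions made for illustration, and the real analysis compares detector output against MAST-released labels.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each rater's marginals.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration (1 = failure flagged, 0 = not flagged).
detector_labels = [1, 1, 0, 0, 1, 0, 1, 0]
reference_labels = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(detector_labels, reference_labels)
```

In the toy case the two raters agree on 6 of 8 items (0.75) against a chance agreement of 0.5, so κ is 0.5. Raw agreement alone would overstate how well the raters agree when one label dominates.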
Empirical work on agent failure detection.
Reproducible experiments. Code and data linked from each report.
- 2026-06-02
Pisama vs Galileo Eval Engineer
A falsifiable side-by-side on three checkable differences. First, Pisama imports Langfuse, Arize / Phoenix, and raw JSON (OpenTelemetry and LangSmith behind a flag), whereas Eval Engineer reads its own logstream. Second, Pisama localizes failures to a step and sub-agent in multi-agent traces. Third, Pisama applies a fix to a live n8n workflow and verifies it, whereas Eval Engineer stops at a generated fix that you apply in the IDE. Every claim points at a file or endpoint.
- 2026-05-28
Pisama Bench v1: process quality vs task outcome
We ran the post-Phase-1-5 detector stack against five real-trace corpora through the live orchestrator. On m500 (multi-agent math reasoning, n=50), trajectory_score correlates with solver correctness at Pearson r = +0.45. On AgentRewardBench (n=100, single-agent web), the same score reports a mean of 0.93 on tasks that fail at cum_reward 0.11. One scalar, three relationships, because the score measures process, not outcome.
- 2026-05-27
Analytical semantics: catching wrong queries over user data
AgentFuel (Rockfish Data and CMU, CAIS 2026) shows that state-of-the-art data analysis agents collapse from 73% on stateless queries to 10% on incident queries. The failures are semantic, not behavioral, and no existing Pisama detector caught them. We ship a two-stage analytical_semantics detector that reaches F1 0.97 across 32 hand-authored seeds.
- 2026-05-27
Tool-error recovery replicates on tau2-bench
Snorkel AI's UNDERWRITE (AAAI/CAIS 2026) found that tool-error recovery rate predicts task success while raw error count does not. We replicated the finding on 10,832 tau2-bench simulations. The direction holds in all 14 of 14 model-domain pairs. Pisama ships the detector at F1 0.947.
- 2026-05-27
Pisama vs AgentPex on tau2-bench
AgentPex (MSR / UW, CAIS 2026) extracts behavioral specs from prompts. We ran Pisama on the same tau2 corpus. Pisama reproduces the paper’s per-model gradient and surfaces sycophancy that spec-extraction cannot.
- 2026-04-30
OpenAI vs Anthropic vs Google: same task, same detectors
One agentic task, three vendor APIs, the same Pisama detector pipeline. Cross-vendor parity in numbers.
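Several reports above relate a process score to task outcome, and the Pisama Bench v1 entry reports that relationship as a Pearson r. Here is a minimal sketch of that statistic for a continuous score against a binary correctness flag. The function name `pearson_r` and the toy values are assumptions made for illustration; this is not Pisama's code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values, invented for illustration: a per-trace process score and
# whether the final answer was correct (1) or not (0).
scores = [0.9, 0.4, 0.7, 0.8, 0.3, 0.6]
correct = [1, 0, 1, 0, 0, 1]
r = pearson_r(scores, correct)
```

With a binary outcome this is the point-biserial correlation. A positive value says higher-scoring traces are more often correct, but a moderate r leaves room for high-scoring traces that still fail, which is the gap between process and outcome that the bench report describes.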