Agent Reliability for Production AI

The thing that runs
while you’re not watching.

Name: Pisama
Author: Pisama

Your agents have no accountability layer. Pisama does the error analysis for you, reading every production trace and naming the first upstream failure across single-agent, multi-agent, and sub-agent runs, with framework-native coverage for LangGraph, OpenClaw, n8n, Dify, and Managed Agents. It catches the failures that still return 200: loops, silent corruption, scope creep, cascades.

Book a demo · See the orchestration layer

Failure mode 1 of 5 · illustrative

Infinite loop

An agent repeats the same state transitions. Every call returns 200.

A research agent calls search, gets insufficient results, rephrases, gets similar results, rephrases again. No exception, no timeout, and your dashboard stays green.

●caught at T1 · hash & subsequence matching · no LLM
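A minimal sketch of how hash and subsequence matching can catch this loop without an LLM. This is an illustration, not Pisama's detector: `state_hash`, `find_repeated_cycle`, and the `tool` / `outcome` fields are hypothetical names. It assumes each step is fingerprinted on structural fields only, so a rephrased query still hashes to the same transition.

```python
import hashlib
import json
from typing import Any, Optional


def state_hash(step: dict[str, Any]) -> str:
    """Stable fingerprint of one step, built from its structural fields."""
    payload = json.dumps(step, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def find_repeated_cycle(
    hashes: list[str], min_repeats: int = 3, max_cycle_len: int = 4
) -> Optional[tuple[int, int]]:
    """Return (start_index, cycle_length) of the first cycle that occurs
    at least min_repeats times back to back, or None if there is none."""
    for start in range(len(hashes)):
        for cycle_len in range(1, max_cycle_len + 1):
            window = hashes[start:start + cycle_len]
            repeats = 1
            pos = start + cycle_len
            # Count consecutive copies of the window.
            while hashes[pos:pos + cycle_len] == window:
                repeats += 1
                pos += cycle_len
            if repeats >= min_repeats:
                return start, cycle_len
    return None


# The research agent from the example: every step "succeeds".
trace = [
    {"tool": "plan", "outcome": "ok"},
    {"tool": "search", "outcome": "insufficient"},
    {"tool": "rephrase", "outcome": "ok"},
    {"tool": "search", "outcome": "insufficient"},
    {"tool": "rephrase", "outcome": "ok"},
    {"tool": "search", "outcome": "insufficient"},
    {"tool": "rephrase", "outcome": "ok"},
]
cycle = find_repeated_cycle([state_hash(step) for step in trace])
```

Here `cycle` points at the search / rephrase pair starting at step 1. The thresholds (`min_repeats`, `max_cycle_len`) are arbitrary defaults chosen for the example and would need tuning against real traces.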

San Francisco · MIT · TypeScript + Python · 87 detectors · 59.9% on TRAIL · +48 pts vs best frontier

§ 03 · What it does

Catch it. Explain it. Fix it.

The SDK runs on your machine and finds the failure. The platform writes the patch back into your repo. Both ship in the open — see exactly which parts cost money and which don't.

It catches what you missed

SDK · local · $0

Loops, hallucinated tool calls, persona drift, runaway costs, corrupted state. 87 detectors run on every trace. The ones tuned for structure (90%+ of them) cost nothing and never leave your machine.

It explains in plain English

SDK · per-issue

Each failure comes with what broke, where it broke (the exact agent and step), and a suggested fix. No stack-trace archaeology, no eyeballing a 4k-token transcript.

It writes the fix back into your code

Platform · LangGraph live

For LangGraph recursion limits, Pisama opens the fix as a GitHub PR you can review and merge. That ships today. Auto-fix for n8n, Dify, OpenClaw, and Anthropic Managed Agents is on the roadmap. Hosted at pisama.ai.

Detect & Diagnose ship in the open-source SDK. Heal is live on pisama.ai for LangGraph; more runtimes shipping.

§ Two pressures, one runtime

Operations and accountability.

Pisama runs while you’re not watching. That matters to the team shipping agents and to the people who answer for them.

Platform teams

You can’t watch every agent run in production.

Silent
Agent says it finished. Output is wrong. You find out from a user.
Token burn
Loops run uncontrolled while you sleep. The bill arrives later.
Trust break
After one bad incident you start checking every turn. Automation defeated.

Enterprise

Your agents have no accountability layer.

No audit
Legal asks what the agent did. The trail lives in four tools, not one.
Scope creep
Agent acquires more agency than it was shipped with. Nobody notices.
Cascade
One bad output poisons every downstream agent. No circuit breaker.

§ 04 · Compatibility

Reads what you already write.

Drop-in adapters for the agent frameworks, runtimes, and editors you already use — plus an MCP server and generic OpenTelemetry ingestion for everything else.

01 Cursor / Claude Desktop / Windsurf ● editor · MCP
02 Claude Code ● editor
03 Lovable ● AI builder
04 v0 ● AI builder
05 Bolt ● AI builder
06 Replit Agent ● AI builder
07 LangGraph ● framework
08 Claude Agent SDK ● framework
09 n8n ● workflow
10 Dify ● workflow
11 OpenClaw ● workflow
12 Claude Managed Agents ● runtime

+ any framework emitting OpenTelemetry — OpenAI Assistants · AWS Bedrock · Google ADK · LangChain Deep Agents · CrewAI · AutoGen · …

§ 05 · Exhibit B

Five times more failures caught than the best general-purpose AI.

On the academic TRAIL benchmark — 148 traces, 841 hand-labelled failures — Pisama catches 60%. GPT-5.5 catches 12%. Same traces, same labels. The engineer-friendly chart and a second benchmark (attribution) are below.

Exhibit B.1 · Detection: did a failure happen? · 59.9% vs 11.6% best frontier · TRAIL benchmark

Pisama on TRAIL

59.9%

Joint accuracy: detector predictions matching ground-truth labels on the full TRAIL set.

vs best frontier

+48 pts

p50 cost / trace

Joint accuracy on TRAIL

Pisama · 59.9%
GPT-5.5 (best frontier) · 11.6%
Claude Opus 4.7 · 6.7%
Gemini 3.5 Flash · 2.9%

Source · TRAIL benchmark
148 traces · 841 labelled failures · frontier numbers from TRAIL paper
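The two headline figures follow directly from the joint-accuracy numbers quoted above. A short check, using only the Pisama and GPT-5.5 scores stated on this page (the variable names are ours):

```python
pisama = 59.9         # joint accuracy on TRAIL, in percent
best_frontier = 11.6  # GPT-5.5, in percent

# Absolute gap in percentage points and relative ratio.
gap_pts = round(pisama - best_frontier, 1)
ratio = round(pisama / best_frontier, 2)

print(f"+{gap_pts} pts, {ratio}x")
```

The gap rounds to the "+48 pts" shown above, and the ratio of a little over five is where "five times more failures caught" comes from.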

Exhibit B.2 · Attribution: which agent failed, at which step? · Who&When · ICML 2025

Method · Agent accuracy · Step accuracy
● Pisama + Sonnet 4 · 60.3% · 24.1%
GPT-5.4 Mini · 60.3% · 22.4%
Gemini 3.1 Flash-Lite · 50.0% · 19.0%
Pisama (heuristic-only) · 31.0% · 16.8%

Source · Who&When: Automated Multi-Agent Failure Attribution (ICML 2025) · given a trace with a known failure, identify which agent failed and at which step.
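To make the two columns concrete, here is a sketch of how agent accuracy and step accuracy could be computed. It assumes each metric is simply the fraction of traces where the predicted agent (or the predicted step) equals the label; the benchmark's own scoring script may differ in detail. The function name and the toy data are hypothetical.

```python
Attribution = tuple[str, int]  # (agent name, step index)


def attribution_accuracy(
    predictions: list[Attribution], labels: list[Attribution]
) -> tuple[float, float]:
    """Return (agent_accuracy, step_accuracy) over paired predictions and labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    pairs = list(zip(predictions, labels))
    agent_hits = sum(pred[0] == label[0] for pred, label in pairs)
    step_hits = sum(pred[1] == label[1] for pred, label in pairs)
    return agent_hits / len(pairs), step_hits / len(pairs)


# Toy data: four traces, each with one known failure.
labels = [("planner", 2), ("coder", 5), ("coder", 7), ("reviewer", 1)]
predictions = [("planner", 2), ("coder", 4), ("planner", 3), ("reviewer", 1)]

agent_acc, step_acc = attribution_accuracy(predictions, labels)
```

Naming the right agent is the easier target: a prediction can blame the correct agent and still miss the step, which is why the step column is lower for every method in the table.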

§ 06 · The catalogue

87 detectors for single-agent, multi-agent, and sub-agent systems. Six categories.

The full registry. Each detector is a calibrated pattern-match against a specific failure shape — not a generic rubric. Plus framework-specific packs that know what goes wrong inside the runtimes themselves.

Planning & Decomposition

decomposition · specification · delegation · workflow · routing · dispatch_async

Execution & State

loop · corruption · overflow · propagation · memory_staleness · parallel_consistency · completion

Coordination

coordination · communication · multi_chain · subagent_boundary · orchestration_quality · task_starvation

Verification & Quality

hallucination · grounding · context · citation · entity_confusion · retrieval_quality · critic_quality

Behavior & Safety

persona_drift · derailment · withholding · injection · approval_bypass · cowork_safety · exploration_safety

Reasoning & Observability

convergence · reasoning_consistency · adaptive_thinking · compaction_quality · model_selection

Framework-specific packs · +5 LangGraph · +5 OpenClaw · +3 n8n · +2 Dify = 15 framework-specific detectors

Total · 53 detectors in the registry

§ 07 · The method

Five tiers. Heuristics first. LLMs and humans only when forced.

Fast detectors handle 90%+ of detections at zero cost. The pipeline escalates only when a tier can't conclude.

T1 · Hash · p50 ~0 ms · cost $0
Identity matching on transition graphs. Loops, deadlocks, repetition.

T2 · Delta · p50 ~1 ms · cost $0
Type, null, oscillation tracking. Element coverage on cross-agent payloads.

T3 · Embeddings · p50 ~10 ms · cost $0
Behavioral embedding of outputs vs. embedding of the role.

T4 · LLM judge · p50 ~200 ms · cost ~$0.02
Escalation tier. Invoked only when T1–T3 disagree or are ambiguous.

T5 · Human · p50 async · cost -
Async review for edge cases. Optional, opt-in.

90%+ of detections resolve in T1–T3 at $0. T4 uses your own ANTHROPIC_API_KEY when invoked. T5 is a human review queue.
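The escalation rule can be sketched in a few lines. This is an illustration under simplifying assumptions, not the pisama-core API: every name here is hypothetical, the confidence values are made up, "can't conclude" is modelled as a tier returning `None`, and the paid tier is a stub that only records that it was called.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    tier: str
    failed: bool
    confidence: float


# A tier returns a Verdict when it can conclude, or None to escalate.
Tier = Callable[[list[str]], Optional[Verdict]]

paid_calls: list[str] = []  # records every escalation to the paid tier


def t1_hash(steps: list[str]) -> Optional[Verdict]:
    """Exact repetition: the same step three times in a row."""
    for i in range(len(steps) - 2):
        if steps[i] == steps[i + 1] == steps[i + 2]:
            return Verdict("T1", True, 1.0)
    return None


def t2_delta(steps: list[str]) -> Optional[Verdict]:
    """Oscillation: the pattern A, B, A, B with A != B."""
    for i in range(len(steps) - 3):
        a, b, c, d = steps[i:i + 4]
        if a == c and b == d and a != b:
            return Verdict("T2", True, 0.9)
    return None


def t4_judge_stub(steps: list[str]) -> Optional[Verdict]:
    """Stand-in for a paid LLM call; always concludes 'no failure' here."""
    paid_calls.append(" > ".join(steps))
    return Verdict("T4", False, 0.5)


def run_pipeline(steps: list[str], tiers: list[Tier]) -> Verdict:
    """Try tiers cheapest first and stop at the first one that concludes."""
    for tier in tiers:
        verdict = tier(steps)
        if verdict is not None:
            return verdict
    return Verdict("unresolved", False, 0.0)


TIERS: list[Tier] = [t1_hash, t2_delta, t4_judge_stub]
looping = run_pipeline(["search", "search", "search"], TIERS)
flapping = run_pipeline(["draft", "revise", "draft", "revise"], TIERS)
unclear = run_pipeline(["plan", "search", "answer"], TIERS)
```

Only the third trace reaches the paid stub. The two structural failures are settled by the free tiers, which is the cost behaviour the tier list describes.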

§ 08 · Open source

Five public packages. MIT-licensed. Use what fits your stack.

pisama-core

Detection orchestrator and scoring engine. The 5-tier pipeline lives here.

github.com/Pisama-AI/pisama-core

pisama-detectors

87 detectors. Loops, hallucination, coordination, persona drift, withholding, injection.

github.com/Pisama-AI/pisama-detectors

pisama-auto

Zero-code auto-instrumentation. One line; LangGraph, CrewAI, AutoGen, OpenAI Agents SDK.

github.com/Pisama-AI/pisama-auto

pisama-agent-sdk

Real-time failure hooks for the Claude Agent SDK.

github.com/Pisama-AI/pisama-agent-sdk

pisama-claude-code

Trace capture for Claude Code sessions — tokens, cost, tool calls.

github.com/Pisama-AI/pisama-claude-code

§ 09 · Common questions

Common questions.

Q.1

How is this different from rubric-based LLM judges in Bedrock / Foundry / Vertex?

Those judge the artifact: was the output good? Pisama detects what happened during execution: loops, state corruption, persona drift, coordination breakdown. Different layer; complementary tools.

Q.2

Is Pisama an eval tool?

No. Evals grade output quality, and they sit a layer above Pisama. Pisama detects structural process failures in running agents and reports them as binary failure modes with calibrated confidence, not generic quality scores. If you run evals, keep them; Pisama watches the layer your rubrics cannot see.

Q.3

Does Pisama send my traces anywhere?

The structural detectors run locally in the SDK, so the traces they analyse stay on your machine. The T4 LLM judge is opt-in and uses your own API key. Pisama does not proxy your model traffic, and PII redaction runs before traces are stored.

Q.4

How is this different from a trace store like LangSmith or Langfuse?

Trace stores collect and visualize traces. Pisama is a detection layer. Point it at the same traces and you get specific failure-mode findings, not raw spans.

Q.5

What if I'm not on a supported framework?

If your traces have transitions, shared state, and message history, the detection methods apply. We ship dedicated adapters for 12 frameworks/runtimes/editors, plus generic OpenTelemetry ingestion; anything that emits OTel (CrewAI, AutoGen, others) works out of the box.

Q.6

Why heuristics over an LLM judge?

On TRAIL, Pisama reaches 59.9% joint accuracy. GPT-5.5 reaches 11.6%. Heuristics tuned to the structural shape of process failures simply see more, for $0.

Q.7

What does Pisama miss?

Genuinely ambiguous cases, where even careful human labellers disagree, are surfaced as advisory, not as flagged failures. We do not claim to catch everything.

§ Verdict · Platform

Stop finding out
from your users.

Pisama catches the failures that still return 200: loops, silent corruption, scope creep, cascades. Across every framework you orchestrate.

Book a demo · See the platform

§ 10 · Verdict · Enterprise

Stop explaining to legal
what your agent did.

Signed audit trail, scope containment, regulator-grade retention. The accountability layer your customers will eventually require.

Book an accountability review → · See the enterprise pitch
