Agent Reliability for Production AI

The thing that runs
while you’re not watching.

Your multi-agent systems have no accountability layer. Pisama is an orchestration layer with framework-native coverage for LangGraph, OpenClaw, n8n, Dify, and Managed Agents. It catches the failures that still return 200: loops, silent corruption, scope creep, cascades.

Failure mode 1 of 5 · illustrative
Infinite loop

An agent repeats the same state transitions. Every call returns 200.

A research agent calls search, gets insufficient results, rephrases, gets similar results, rephrases again. No exception, no timeout — and your dashboard stays green.

caught at T1 · hash & subsequence matching · no LLM
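A minimal sketch of how hash and subsequence matching can flag this kind of loop without an LLM. It is an illustration, not Pisama's implementation: `transition_hash`, `find_repeated_cycle`, and the thresholds are hypothetical names and values. It also assumes each step has already been normalized to a coarse outcome (for example "insufficient"), since rephrased queries would never hash identically as raw text.

```python
import hashlib
import json


def transition_hash(agent: str, tool: str, state: dict) -> str:
    """Hash one normalized state transition so identical steps compare equal."""
    payload = json.dumps({"agent": agent, "tool": tool, "state": state}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_repeated_cycle(hashes: list[str], min_repeats: int = 3, max_period: int = 8):
    """Return (period, repeats) if the trace ends in a cycle repeated
    at least min_repeats times, otherwise None."""
    n = len(hashes)
    for period in range(1, max_period + 1):
        if period * min_repeats > n:
            break
        cycle = hashes[n - period:]
        repeats = 1
        start = n - 2 * period
        # Walk backwards one period at a time while the window matches the cycle.
        while start >= 0 and hashes[start:start + period] == cycle:
            repeats += 1
            start -= period
        if repeats >= min_repeats:
            return period, repeats
    return None


# Hypothetical trace: one planning step, then search/rephrase three times over.
steps = [("researcher", "plan", {"status": "ok"})]
for _ in range(3):
    steps.append(("researcher", "search", {"result": "insufficient"}))
    steps.append(("researcher", "rephrase", {"status": "ok"}))

hashes = [transition_hash(*step) for step in steps]
print(find_repeated_cycle(hashes))  # (2, 3)
```

Every step in that trace would have returned 200. The only signal is the repeated two-step cycle.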
San Francisco · MIT · TypeScript + Python · 34 production detectors · 59.9% on TRAIL · +48 pts vs best frontier
§ 03 · What it does

Catch it. Explain it. Fix it.

The SDK runs on your machine and finds the failure. The platform writes the patch back into your repo. Both ship in the open — see exactly which parts cost money and which don't.

I

It catches what you missed

SDK · local · $0

Loops, hallucinated tool calls, persona drift, runaway costs, corrupted state. 34 production detectors run on every trace. The ones tuned for structure (90%+ of them) cost nothing and never leave your machine.

II

It explains in plain English

SDK · per-issue

Each failure comes with what broke, where it broke (the exact agent and step), and a suggested fix. No stack-trace archaeology, no eyeballing a 4k-token transcript.

III

It writes the fix back into your code

Platform · LangGraph live

For LangGraph recursion limits, Pisama opens the fix as a GitHub PR you can review and merge. That ships today. Auto-fix for n8n, Dify, OpenClaw, and Anthropic Managed Agents is on the roadmap. Hosted at pisama.ai.

Detect & Diagnose ship in the open-source SDK. Heal is live on pisama.ai for LangGraph; more runtimes shipping.
§ Two pressures, one runtime

Operations and accountability.

Pisama runs while you’re not watching. That matters to the team shipping agents and to the people who answer for them.

Platform teams

You can’t watch every agent run in production.

  • Silent

    Agent says it finished. Output is wrong. You find out from a user.

  • Token burn

    Loops run unchecked while you sleep. The bill arrives later.

  • Trust break

    After one bad incident you start checking every turn. Automation defeated.

Enterprise

Your agents have no accountability layer.

  • No audit

    Legal asks what the agent did. The trail lives in four tools, not one.

  • Scope creep

    Agent acquires more agency than it was shipped with. Nobody notices.

  • Cascade

    One bad output poisons every downstream agent. No circuit breaker.

§ 04 · Compatibility

Reads what you already write.

Drop-in adapters for the agent frameworks, runtimes, and editors you already use — plus an MCP server and generic OpenTelemetry ingestion for everything else.

01  Cursor / Claude Desktop / Windsurf (editor · MCP)
02  Claude Code (editor)
03  Lovable (AI builder)
04  v0 (AI builder)
05  Bolt (AI builder)
06  Replit Agent (AI builder)
07  LangGraph (framework)
08  Claude Agent SDK (framework)
09  n8n (workflow)
10  Dify (workflow)
11  OpenClaw (workflow)
12  Claude Managed Agents (runtime)
+ any framework emitting OpenTelemetry — OpenAI Assistants · AWS Bedrock · Google ADK · LangChain Deep Agents · CrewAI · AutoGen · Semantic Kernel · …
§ 05 · Exhibit B

Five times more failures caught than the best general-purpose AI.

On the academic TRAIL benchmark — 148 traces, 841 hand-labelled failures — Pisama catches 60%. GPT-5.5 catches 12%. Same traces, same labels. The engineer-friendly chart and a second benchmark (attribution) are below.

Exhibit B.1 · Detection: did a failure happen? · 59.9% vs 11.6% best frontier · TRAIL benchmark

Pisama on TRAIL: 59.9%

Joint accuracy: detector predictions matching ground-truth labels on the full TRAIL set.

vs best frontier: +48 pts
p50 cost / trace: $0

Joint accuracy on TRAIL

Pisama                     59.9%
GPT-5.5 (best frontier)    11.6%
Claude Opus 4.7             6.7%
Gemini 3.5 Flash            2.9%

Source · TRAIL benchmark · 148 traces · 841 labelled failures · frontier numbers from TRAIL paper
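For readers checking the headline figures, this short sketch reproduces the arithmetic behind "+48 pts" and "five times" from the two joint-accuracy numbers above. The variable names are illustrative only.

```python
pisama = 59.9          # joint accuracy on TRAIL, percent
best_frontier = 11.6   # GPT-5.5, percent

gap_points = pisama - best_frontier   # difference in percentage points
ratio = pisama / best_frontier        # relative multiple

print(f"{gap_points:.1f} pts, {ratio:.1f}x")  # 48.3 pts, 5.2x
```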
Exhibit B.2 · Attribution: which agent failed, at which step? · Who&When · ICML 2025

Method                     Agent accuracy   Step accuracy
Pisama + Sonnet 4          60.3%            24.1%
GPT-5.4 Mini               60.3%            22.4%
Gemini 3.1 Flash-Lite      50.0%            19.0%
Pisama (heuristic-only)    31.0%            16.8%

Source · Who&When: Automated Multi-Agent Failure Attribution (ICML 2025) · given a trace with a known failure, identify which agent failed and at which step.
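To make the two columns concrete, here is a hedged sketch of how agent accuracy and step accuracy could be scored. It assumes one ground-truth (agent, step) pair per trace and that a step counts as correct only on an exact index match. The `Attribution` type and the sample data are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    agent: str  # which agent is blamed for the failure
    step: int   # index of the step where the failure occurred


def attribution_accuracy(predicted: list[Attribution], truth: list[Attribution]):
    """Return (agent_accuracy, step_accuracy) as fractions in [0, 1]."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("need one prediction per labelled trace")
    agent_hits = sum(p.agent == t.agent for p, t in zip(predicted, truth))
    step_hits = sum(p.step == t.step for p, t in zip(predicted, truth))
    return agent_hits / len(truth), step_hits / len(truth)


# Hypothetical labels and predictions for four traces.
truth = [Attribution("planner", 2), Attribution("coder", 7),
         Attribution("critic", 4), Attribution("coder", 1)]
predicted = [Attribution("planner", 2), Attribution("coder", 5),
             Attribution("critic", 3), Attribution("planner", 0)]

print(attribution_accuracy(predicted, truth))  # (0.75, 0.25)
```

Naming the right agent is much easier than naming the right step, which is consistent with the gap between the two columns in the table.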
§ 06 · The catalogue

34 production detectors for multi-agent systems. Six categories.

The full registry. Each detector is a calibrated pattern-match against a specific failure shape — not a generic rubric. Plus framework-specific packs that know what goes wrong inside the runtimes themselves.

Planning & Decomposition

6
decomposition · specification · delegation · workflow · routing · dispatch_async

Execution & State

7
loop · corruption · overflow · propagation · memory_staleness · parallel_consistency · completion

Coordination

6
coordination · communication · multi_chain · subagent_boundary · orchestration_quality · task_starvation

Verification & Quality

7
hallucination · grounding · context · citation · entity_confusion · retrieval_quality · critic_quality

Behavior & Safety

7
persona_drift · derailment · withholding · injection · approval_bypass · cowork_safety · exploration_safety

Reasoning & Observability

5
convergence · reasoning_consistency · adaptive_thinking · compaction_quality · model_selection
Framework-specific packs: +5 LangGraph · +5 OpenClaw · +3 n8n · +2 Dify = 15 framework-specific detectors
Total · 53 detectors in the registry
§ 07 · The method

Five tiers. Heuristics first. LLMs and humans only when forced.

Fast detectors handle 90%+ of detections at zero cost. The pipeline escalates only when a tier can't conclude.

T1

Hash

Identity matching on transition graphs. Loops, deadlocks, repetition.

p50: ~0 ms · cost: $0
T2

Delta

Type, null, oscillation tracking. Element coverage on cross-agent payloads.

p50: ~1 ms · cost: $0
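One way to picture the delta tier is as a diff between the payload one agent hands off and the payload the next agent actually works with. The sketch below is an assumption about what such checks might look like, not the shipped detectors. `delta_findings` and `oscillates` are hypothetical helpers covering type changes, new nulls, dropped fields, and a value that flips back and forth.

```python
from typing import Any


def delta_findings(before: dict[str, Any], after: dict[str, Any]) -> list[str]:
    """Compare two state snapshots and report structural changes per field."""
    findings = []
    for key, old in before.items():
        if key not in after:
            findings.append(f"{key}: dropped")            # coverage gap
        elif old is not None and after[key] is None:
            findings.append(f"{key}: became null")
        elif old is not None and type(old) is not type(after[key]):
            findings.append(
                f"{key}: type {type(old).__name__} -> {type(after[key]).__name__}"
            )
    return findings


def oscillates(values: list, min_flips: int = 3) -> bool:
    """True when a field alternates A, B, A, B for min_flips consecutive steps."""
    flips = 0
    for i in range(2, len(values)):
        if values[i] == values[i - 2] and values[i] != values[i - 1]:
            flips += 1
            if flips >= min_flips:
                return True
        else:
            flips = 0
    return False


# Hypothetical handoff between two agents.
sent = {"order_id": 1042, "items": ["a", "b"], "total": 19.99, "customer": "c-7"}
received = {"order_id": "1042", "items": None, "total": 19.99}

print(delta_findings(sent, received))
# ['order_id: type int -> str', 'items: became null', 'customer: dropped']
print(oscillates(["draft", "review", "draft", "review", "draft"]))  # True
```

None of these checks needs a model call, which is why a tier like this can run on every trace.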
T3

Embeddings

Behavioral embedding of outputs vs. embedding of the role.

p50: ~10 ms · cost: $0
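The comparison between an output and its role can be sketched as a cosine distance between two vectors. To stay self-contained, this example swaps in a toy bag-of-words vector for a real embedding model, so the scores are only illustrative. `embed`, `cosine`, and `drift_score` are hypothetical names, and a production detector would use learned embeddings and a calibrated threshold.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def drift_score(role: str, output: str) -> float:
    """0.0 means the output matches the role vector; 1.0 means no overlap."""
    return 1.0 - cosine(embed(role), embed(output))


role = "billing support agent: answers questions about invoices, refunds and payments"
on_role = "refunds for unpaid invoices are issued once payments clear"
off_role = "here is a short poem describing the ocean at dawn"

print(drift_score(role, on_role) < drift_score(role, off_role))  # True
```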
T4

LLM judge

Escalation tier. Invoked only when T1–T3 disagree or are ambiguous.

p50: ~200 ms · cost: ~$0.02
T5

Human

Async review for edge cases. Optional, opt-in.

p50
async
cost
90%+ of detections resolve in T1–T3 at $0. T4 uses your own ANTHROPIC_API_KEY when invoked. T5 is a human review queue.
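The escalation rule can be sketched as a loop that stops at the first tier able to reach a conclusion. This is a simplified illustration under stated assumptions: the tier functions are stubs keyed on made-up trace fields, the T4 judge is a placeholder rather than a real model call, and the handling of tiers that disagree is left out. Only the per-tier costs come from the table above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    tier: str        # tier that reached the conclusion
    failure: bool    # True if a failure was detected
    cost_usd: float  # total spent to reach the verdict


# A tier returns True or False when it can conclude, None when it cannot.
Tier = Callable[[dict], Optional[bool]]


def t1_hash(trace: dict) -> Optional[bool]:
    return True if trace.get("repeated_hashes", 0) >= 3 else None


def t2_delta(trace: dict) -> Optional[bool]:
    return True if trace.get("null_fields") else None


def t3_embed(trace: dict) -> Optional[bool]:
    drift = trace.get("drift")
    if drift is None:
        return None
    if drift > 0.8:
        return True
    if drift < 0.2:
        return False
    return None  # ambiguous band: escalate


def t4_judge_stub(trace: dict) -> Optional[bool]:
    # Placeholder for an LLM judge; a real one would call a model here.
    return bool(trace.get("judge_says_failure", False))


TIERS: list[tuple[str, float, Tier]] = [
    ("T1", 0.0, t1_hash),
    ("T2", 0.0, t2_delta),
    ("T3", 0.0, t3_embed),
    ("T4", 0.02, t4_judge_stub),
]


def run_pipeline(trace: dict) -> Optional[Verdict]:
    spent = 0.0
    for name, cost, check in TIERS:
        spent += cost
        result = check(trace)
        if result is not None:
            return Verdict(name, result, spent)
    return None  # nothing concluded: candidate for the T5 review queue


print(run_pipeline({"repeated_hashes": 4}))
# Verdict(tier='T1', failure=True, cost_usd=0.0)
print(run_pipeline({"drift": 0.5, "judge_says_failure": True}))
# Verdict(tier='T4', failure=True, cost_usd=0.02)
```

Ordering the tiers from cheapest to most expensive means a trace only costs money when the free checks cannot settle it.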
§ 08 · Open source

Five public packages. MIT-licensed. Use what fits your stack.

§ 09 · Common questions

Common questions.

Q.1

How is this different from rubric-based LLM judges in Bedrock / Foundry / Vertex?

Those judge the artifact — was the output good? Pisama detects what happened during execution — loops, state corruption, persona drift, coordination breakdown. Different layer; complementary tools.

Q.2

Does Pisama send my traces anywhere?

The T4 LLM judge is opt-in and uses your own API key. Pisama does not proxy your model traffic, and PII redaction runs before traces are stored.

Q.3

How is this different from a trace store like LangSmith or Langfuse?

Trace stores collect and visualize traces. Pisama is a detection layer. Point it at the same traces and you get specific failure-mode findings, not raw spans.

Q.4

What if I'm not on a supported framework?

If your traces have transitions, shared state, and message history, the detection methods apply. We ship dedicated adapters for 12 frameworks/runtimes/editors, plus generic OpenTelemetry ingestion — anything that emits OTel (CrewAI, AutoGen, Semantic Kernel, others) works out of the box.

Q.5

Why heuristics over an LLM judge?

On TRAIL, Pisama reaches 59.9% joint accuracy. GPT-5.5 reaches 11.6%. Heuristics tuned to the structural shape of process failures simply see more, for $0.

Q.6

What does Pisama miss?

Genuinely ambiguous cases — where even careful human labellers disagree — are surfaced as advisory, not as flagged failures. We do not claim to catch everything.

§ Verdict · Platform

Stop finding out
from your users.

Pisama catches the failures that still return 200: loops, silent corruption, scope creep, cascades. Across every framework you orchestrate.

§ 10 · Verdict · Enterprise

Stop explaining to legal
what your agent did.

Signed audit trail, scope containment, regulator-grade retention. The accountability layer your customers will eventually require.