The thing that runs while you’re not watching.
Your multi-agent systems have no accountability layer. Pisama is an orchestration layer with framework-native coverage for LangGraph, OpenClaw, n8n, Dify, and Managed Agents. It catches the failures that still return 200: loops, silent corruption, scope creep, cascades.
An agent repeats the same state transitions. Every call returns 200.
A research agent calls search, gets insufficient results, rephrases, gets similar results, rephrases again. No exception, no timeout — and your dashboard stays green.
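A rough sketch of how a loop like this can be spotted from the trace alone. It is an illustration, not Pisama's detector: the step fields (`agent`, `tool`, `query`, `outcome`) and the repeat threshold are assumptions made for the example. Each step is hashed on what happened rather than on how the query was worded, so three rephrasings that all end in the same outcome count as the same transition.

```python
import hashlib
import json
from collections import Counter


def transition_key(step: dict) -> str:
    """Hash the parts of a step that define 'the same transition'."""
    # Assumption: agent, tool, and a coarse outcome identify a transition.
    # Query wording is left out on purpose so rephrasings hash identically.
    payload = json.dumps(
        {"agent": step["agent"], "tool": step["tool"], "outcome": step["outcome"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_loops(steps: list[dict], threshold: int = 3) -> list[dict]:
    """Return one summary per transition seen `threshold` or more times."""
    counts = Counter(transition_key(s) for s in steps)
    reported = set()
    loops = []
    for s in steps:
        key = transition_key(s)
        if counts[key] >= threshold and key not in reported:
            reported.add(key)
            loops.append({"agent": s["agent"], "tool": s["tool"], "repeats": counts[key]})
    return loops


# Hypothetical trace: every call "succeeds", yet nothing moves forward.
steps = [
    {"agent": "researcher", "tool": "search", "query": "agent failure modes", "outcome": "insufficient"},
    {"agent": "researcher", "tool": "search", "query": "failure modes of agents", "outcome": "insufficient"},
    {"agent": "researcher", "tool": "search", "query": "how agents fail", "outcome": "insufficient"},
    {"agent": "writer", "tool": "draft", "query": "", "outcome": "ok"},
]
loops = find_loops(steps)
```

No exception is involved anywhere: the signal comes from repetition in the transition history, which is why a status-code dashboard cannot see it.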
Catch it. Explain it. Fix it.
The SDK runs on your machine and finds the failure. The platform writes the patch back into your repo. Both ship in the open — see exactly which parts cost money and which don't.
It catches what you missed
Loops, hallucinated tool calls, persona drift, runaway costs, corrupted state. 34 production detectors run on every trace. The ones tuned for structure (90%+ of them) cost nothing and run entirely on your machine.
It explains in plain English
Each failure comes with what broke, where it broke (the exact agent and step), and a suggested fix. No stack-trace archaeology, no eyeballing a 4k-token transcript.
It writes the fix back into your code
For LangGraph recursion limits, Pisama opens the fix as a GitHub PR you can review and merge. That ships today. Auto-fix for n8n, Dify, OpenClaw, and Anthropic Managed Agents is on the roadmap. Hosted at pisama.ai.
Operations and accountability.
Pisama runs while you’re not watching. That matters to the team shipping agents and to the people who answer for them.
You can’t watch every agent run in production.
- Silent
Agent says it finished. Output is wrong. You find out from a user.
- Token burn
An agent loops unchecked while you sleep. The bill arrives later.
- Trust break
After one bad incident you start checking every turn. Automation defeated.
Your agents have no accountability layer.
- No audit
Legal asks what the agent did. The trail lives in four tools, not one.
- Scope creep
Agent acquires more agency than it was shipped with. Nobody notices.
- Cascade
One bad output poisons every downstream agent. No circuit breaker.
Reads what you already write.
Drop-in adapters for the agent frameworks, runtimes, and editors you already use — plus an MCP server and generic OpenTelemetry ingestion for everything else.
Five times more failures caught than the best general-purpose AI.
On the academic TRAIL benchmark — 148 traces, 841 hand-labelled failures — Pisama catches 60%. GPT-5.5 catches 12%. Same traces, same labels. The engineer-friendly chart and a second benchmark (attribution) are below.
Joint accuracy: detector predictions matching ground-truth labels on the full TRAIL set.
148 traces · 841 labelled failures · frontier numbers from TRAIL paper
34 production detectors for multi-agent systems. Six categories.
The full registry. Each detector is a calibrated pattern-match against a specific failure shape — not a generic rubric. Plus framework-specific packs that know what goes wrong inside the runtimes themselves.
Planning & Decomposition: 6
Execution & State: 7
Coordination: 6
Verification & Quality: 7
Behavior & Safety: 7
Reasoning & Observability: 5
Five tiers. Heuristics first. LLMs and humans only when forced.
Fast detectors handle 90%+ of detections at zero cost. The pipeline escalates only when a tier can't conclude.
T1 · Hash
Identity matching on transition graphs. Loops, deadlocks, repetition.
T2 · Delta
Type, null, oscillation tracking. Element coverage on cross-agent payloads.
T3 · Embeddings
Behavioral embedding of outputs vs. embedding of the role.
T4 · LLM judge
Escalation tier. Invoked only when T1–T3 disagree or are ambiguous.
T5 · Human
Async review for edge cases. Optional, opt-in.
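The escalation idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Pisama pipeline: the `Verdict` type, the two toy tiers, the trace fields (`agent`, `action`, `payload`), and the repeat threshold are all invented for the example. It models only "a tier that cannot conclude hands off to the next one", and it leaves out the embedding, LLM judge, and human tiers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    tier: str               # which tier produced the verdict
    failure: Optional[str]  # failure label, or None if nothing was found
    conclusive: bool        # False means "escalate to the next tier"


def hash_tier(trace: list[dict]) -> Verdict:
    """Toy hash tier: an identical (agent, action) pair seen 3+ times is a loop."""
    counts = Counter((step["agent"], step["action"]) for step in trace)
    if counts and max(counts.values()) >= 3:
        return Verdict("hash", "loop", True)
    return Verdict("hash", None, False)


def delta_tier(trace: list[dict]) -> Verdict:
    """Toy delta tier: a payload field that had a value and then becomes None."""
    for prev, curr in zip(trace, trace[1:]):
        for key, value in prev["payload"].items():
            # A key missing from `curr` is ignored; only an explicit None is flagged.
            if value is not None and curr["payload"].get(key, value) is None:
                return Verdict("delta", f"null_regression:{key}", True)
    return Verdict("delta", None, False)


def run_pipeline(trace: list[dict], tiers: list[Callable[[list[dict]], Verdict]]) -> Verdict:
    """Run tiers cheapest first and stop at the first conclusive verdict."""
    verdict = Verdict("none", None, False)
    for tier in tiers:
        verdict = tier(trace)
        if verdict.conclusive:
            break
    return verdict


# Hypothetical trace: no loop, but `budget` silently turns into None.
trace = [
    {"agent": "planner", "action": "plan", "payload": {"goal": "report", "budget": 10}},
    {"agent": "researcher", "action": "search", "payload": {"goal": "report", "budget": None}},
]
result = run_pipeline(trace, [hash_tier, delta_tier])
```

Here the hash tier finds nothing conclusive, so the trace moves on and the delta tier flags the lost `budget` value. A paid tier would sit at the end of the same list and run only when every cheaper tier returned an inconclusive verdict.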
Five public packages. MIT-licensed. Use what fits your stack.
Detection orchestrator and scoring engine. The 5-tier pipeline lives here.
34 production detectors. Loops, hallucination, coordination, persona drift, withholding, injection.
Zero-code auto-instrumentation. One line; LangGraph, CrewAI, AutoGen, OpenAI Agents SDK.
Real-time failure hooks for the Claude Agent SDK.
Trace capture for Claude Code sessions — tokens, cost, tool calls.
Common questions.
How is this different from rubric-based LLM judges in Bedrock / Foundry / Vertex?
Those judge the artifact — was the output good? Pisama detects what happened during execution — loops, state corruption, persona drift, coordination breakdown. Different layer; complementary tools.
Does Pisama send my traces anywhere?
The T4 LLM judge is opt-in and uses your own API key. Pisama does not proxy your model traffic, and PII redaction runs before traces are stored.
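To make "redaction runs before traces are stored" concrete, here is a minimal sketch. The patterns, the `content` field, and the in-memory `sink` are assumptions for illustration only. Pisama's actual redaction rules are not described here, and real PII detection needs far more than two regular expressions.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def store_trace(trace: list[dict], sink: list) -> None:
    """Redact each step's content first, then hand the copy to storage."""
    for step in trace:
        sink.append({**step, "content": redact(step["content"])})


# Hypothetical trace step and an in-memory stand-in for the trace store.
stored: list[dict] = []
store_trace(
    [{"agent": "support", "content": "Mail jane.doe@example.com or call +1 415 555 0100"}],
    stored,
)
```

The ordering is the point: the store only ever receives the redacted copy, so the raw values are never written.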
How is this different from a trace store like LangSmith or Langfuse?
Trace stores collect and visualize traces. Pisama is a detection layer. Point it at the same traces and you get specific failure-mode findings, not raw spans.
What if I'm not on a supported framework?
If your traces have transitions, shared state, and message history, the detection methods apply. We ship dedicated adapters for 12 frameworks/runtimes/editors, plus generic OpenTelemetry ingestion — anything that emits OTel (CrewAI, AutoGen, Semantic Kernel, others) works out of the box.
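As a rough picture of that three-part shape, the sketch below folds generic span-like records into transitions, shared state, and message history. Everything in it is hypothetical: the `AgentTrace` type and the span keys (`start`, `agent`, `state`, `message`) are made up for the example and are neither Pisama's schema nor OpenTelemetry attribute names.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    transitions: list[tuple[str, str]] = field(default_factory=list)  # (from_agent, to_agent)
    shared_state: dict = field(default_factory=dict)                  # last value written per key
    messages: list[dict] = field(default_factory=list)                # ordered message history


def from_spans(spans: list[dict]) -> AgentTrace:
    """Fold span-like dicts, ordered by start time, into the three-part shape."""
    trace = AgentTrace()
    previous_agent = None
    for span in sorted(spans, key=lambda s: s["start"]):
        agent = span["agent"]
        # A change of agent between consecutive spans counts as a transition.
        if previous_agent is not None and agent != previous_agent:
            trace.transitions.append((previous_agent, agent))
        previous_agent = agent
        trace.shared_state.update(span.get("state", {}))
        if "message" in span:
            trace.messages.append({"agent": agent, "text": span["message"]})
    return trace


# Hypothetical spans, deliberately given out of order.
spans = [
    {"start": 2, "agent": "writer", "message": "draft ready"},
    {"start": 1, "agent": "planner", "state": {"goal": "report"}, "message": "plan made"},
]
agent_trace = from_spans(spans)
```

Once a trace is in this form, structural checks on repeated transitions, state changes, or message history no longer need to know which framework produced it.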
Why heuristics over an LLM judge?
On TRAIL, Pisama reaches 59.9% joint accuracy. GPT-5.5 reaches 11.6%. Heuristics tuned to the structural shape of process failures simply see more, for $0.
What does Pisama miss?
Genuinely ambiguous cases — where even careful human labellers disagree — are surfaced as advisory, not as flagged failures. We do not claim to catch everything.
Stop finding out from your users.
Pisama catches the failures that still return 200: loops, silent corruption, scope creep, cascades. Across every framework you orchestrate.
Stop explaining to legal what your agent did.
Signed audit trail, scope containment, regulator-grade retention. The accountability layer your customers will eventually require.