# Pisama: Agent Forensics

> Process-level failure detection for AI agent systems. The layer between observability and incident response. Catches what your weekend agent ships silently — loops, state corruption, persona drift, runaway costs, coordination breakdown. Open source. TypeScript + Python. Free for side projects.

Pisama tells you what went wrong during the run, by name. Observability shows you what happened. Rubric graders score the artifact. Pisama is the missing layer in between.

## Two audiences

- **Vibe coders / AI builders.** Cursor, Claude Code, Lovable, v0, Bolt, Replit Agent. Paste a prompt into the builder, get a live dashboard URL. No signup, no card. `@pisama/sdk` on npm, `pisama` on PyPI.
- **AI engineers and platform teams (50–200 headcount).** LangGraph, CrewAI, AutoGen, Claude Agent SDK, n8n, Dify. 39 calibrated detectors (85 total, 24 production-grade), MIT-licensed, vendor-independent. Hosted platform with self-healing for LangGraph in private beta.

## Full version

- [Full Pisama LLM index (llms-full.txt)](https://pisama.ai/llms-full.txt): extended positioning, architecture, benchmark methodology, and per-framework detector lists.
- [Product documentation](https://docs.pisama.ai/): SDK reference, integration guides, deployment, API.

## Positioning

- **Process-level, not artifact-level.** Bedrock, Foundry, Vertex, and Anthropic Managed Agents all ship LLM-judge graders that score the output. They cannot tell you that two agents looped on each other for 14 turns or that shared state was corrupted at step 7.
- **Heuristics first.** A 5-tier pipeline: hash (~0 ms / $0), delta (~1 ms / $0), embeddings (~10 ms / $0), LLM judge (~200 ms / ~$0.02), human review (async). 90%+ of detections resolve in T1 to T3 at $0.
- **Vendor-independent.** Your model vendor cannot also be your eval vendor. Pisama is MIT-licensed; detector logic and calibration data are open.
- **TypeScript + Python.** `npm i @pisama/sdk` wraps any Vercel AI SDK / OpenAI / Anthropic model call with one `observe()`. `pip install pisama` ships a library, a CLI, and an MCP server.
- **Live dashboard on every project.** Anonymous projects spin up the first time the SDK posts to `api.pisama.ai`. Each gets a public URL at `pisama.ai/live/ps_xxx`. Claim later by signing up.
- **Open-source SDK + CLI; hosted Heal is live for LangGraph (private beta), with more runtimes shipping.**
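
The heuristics-first escalation described above can be sketched roughly as follows. This is a minimal illustration, not Pisama's implementation: the `Tier` class, the `escalate` function, and the toy checks are all hypothetical, and only the tier names and nominal costs come from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    name: str
    cost_usd: float  # nominal per-check cost quoted in the list above
    # A check returns True/False for a conclusive verdict, None if inconclusive.
    check: Callable[[dict], Optional[bool]]


def escalate(trace: dict, tiers: list[Tier]) -> tuple[str, Optional[bool], float]:
    """Run tiers cheapest-first and stop at the first conclusive verdict."""
    spent = 0.0
    for tier in tiers:
        spent += tier.cost_usd
        verdict = tier.check(trace)
        if verdict is not None:
            return tier.name, verdict, spent
    # Nothing was conclusive: hand off to asynchronous human review.
    return "T5 human", None, spent


# Toy stand-ins for real detectors (hypothetical logic).
def hash_check(trace: dict) -> Optional[bool]:
    turns = trace["turns"]
    return True if len(turns) != len(set(turns)) else None  # exact repeat seen


def inconclusive(trace: dict) -> Optional[bool]:
    return None


def judge_check(trace: dict) -> Optional[bool]:
    return False  # pretend the LLM judge found no failure


TIERS = [
    Tier("T1 hash", 0.0, hash_check),
    Tier("T2 delta", 0.0, inconclusive),
    Tier("T3 embeddings", 0.0, inconclusive),
    Tier("T4 LLM judge", 0.02, judge_check),
]

looping = escalate({"turns": ["a", "b", "a"]}, TIERS)
clean = escalate({"turns": ["a", "b", "c"]}, TIERS)
```

In this sketch the looping trace resolves at T1 for $0, while the clean trace falls through to the paid judge tier, which is the cost profile the pipeline is designed around.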

## Capabilities

- 39 calibrated detectors (85 total, 24 production-grade) across 7 categories: planning & decomposition, execution & state, coordination, verification & quality, behavior & safety, reasoning & observability, agentic-safety (scope escalation, jailbreak compliance, refusal quality, impersonation, deception)
- Framework-specific detector packs: +6 LangGraph, +5 OpenClaw, +6 n8n, +3 Dify
- 5-tier escalation pipeline: T1 hash, T2 delta, T3 embeddings, T4 LLM judge, T5 human
- Each detection returns a DiagnosisResult: type, severity, detection status, suggested fix, scoped to the agent and step that produced it
- Live dashboard streams detection verdicts over SSE within ~1s of ingest
- Self-healing (live today for LangGraph, in private beta): opens fix PRs to GitHub for recursion-limit bugs; auto-fix for n8n / Dify / OpenClaw / Anthropic Managed Agents is on the roadmap
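
To make the shape of a detection result concrete, here is a hedged sketch of a record carrying the fields listed above. The class name `DiagnosisResultSketch`, the field names, and the example values are assumptions for illustration; the actual `DiagnosisResult` type in the SDK may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DiagnosisResultSketch:
    """Illustrative stand-in for a detection result (field names assumed)."""
    type: str            # which failure was detected, e.g. "loop"
    severity: str        # e.g. "low" | "medium" | "high"
    detected: bool       # detection status
    suggested_fix: str   # human-readable remediation hint
    agent: str           # agent that produced the failing step
    step: int            # step index the detection is scoped to


result = DiagnosisResultSketch(
    type="loop",
    severity="high",
    detected=True,
    suggested_fix="Add a termination condition to the planner/coder exchange.",
    agent="planner",
    step=7,
)

# Serialise for logging or for sending to a dashboard.
payload = json.dumps(asdict(result), sort_keys=True)
```

Scoping each result to an agent and a step is what lets a dashboard point at the exact turn that failed rather than at the run as a whole.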

## Install paths

### AI builder (no terminal)

Paste this prompt into Lovable / v0 / Bolt / Replit Agent / Cursor / Claude Code chat:

> Add `@pisama/sdk` to this app so I can see live agent failures.
>
> 1. Run: `npm i @pisama/sdk`
> 2. Wrap every model call with `observe(...)` from `@pisama/sdk`
> 3. Add to `.env.local`: `PISAMA_PROJECT_ID=ps_yourapp`
> 4. Print this link in the README: `https://pisama.ai/live/ps_yourapp`

The builder edits the code for you. Hit your chat route once. Failures stream to the dashboard URL.

### TypeScript

```bash
npm i @pisama/sdk
```

```ts
import { observe } from "@pisama/sdk";
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

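// Wrap the model once with observe().
const model = observe(openai("gpt-4o"));

// Use the wrapped model like any other Vercel AI SDK model.
const result = streamText({ model, prompt: "Hello" });

// .env.local
// PISAMA_PROJECT_ID=ps_yourapp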
```

### Python

```bash
pip install pisama
```

```python
from pisama.langgraph import instrument
from langgraph.graph import StateGraph

# Build and compile your graph as usual; `...` stands in for your state schema.
graph = StateGraph(...).compile()
instrument(graph)  # auto-emit OTel spans + run detectors
```

Python 3.10+. Library, CLI, and MCP server in one package. Detection runs locally for $0.

## Benchmarks (current)

- **TRAIL detection benchmark** (Patronus AI): Pisama 59.9% joint accuracy vs 11.9% for the best frontier judge, a 48-percentage-point lead. 148 traces, 841 labelled failures. Heuristic-only, T1 to T3.
- **Who&When attribution benchmark** (ICML 2025 Spotlight): Pisama + Sonnet 4 ties the best LLM at 60.3% agent accuracy and leads on step accuracy at 24.1%.
- Heuristic-only Pisama on Who&When: 31% agent accuracy, 16.8% step accuracy. The LLM-judge tier matters for attribution, not for detection.
- Calibration set: 7,212 labelled traces from 13 external sources, separate from TRAIL/Who&When held-out evaluation.

## Cross-cutting detectors (apply to all frameworks)

| Detector              | Mode                           | F1     | Mechanism                                                |
|-----------------------|--------------------------------|--------|----------------------------------------------------------|
| loop                  | State recurrence               | 0.830  | Hash of (sender, receiver, fingerprint) tuple per turn   |
| state_corruption      | Schema/type drift              | 0.790  | State delta snapshot at every step transition            |
| persona_drift         | Role divergence                | 0.794  | Embedding similarity vs declared role/instructions       |
| coordination          | Cross-agent reference rate     | 0.746  | Entity reference count (sender to receiver) per window   |
| communication         | Back-and-forth without progress| 0.769  | Exchange count threshold per agent pair                  |
| context_neglect       | Critical entity dropped        | 0.731  | Upstream entity extraction + downstream coverage check   |
| context_overflow      | Context window saturation      | 0.769  | Token-density gradient over conversation window          |
| retrieval_quality     | Irrelevant retrieval           | 0.777  | Query and document semantic similarity threshold         |
| hallucination         | Unsupported claim              | 0.842  | Claim extraction + source coverage check                 |
| grounding             | Source vs output mismatch      | 0.850  | Word-overlap + entity grounding                          |
| injection             | Prompt injection signature     | 0.918  | Pattern library + structural anomaly                     |
| withholding           | Internal answer not surfaced   | 0.867  | Internal-state entity vs output coverage                 |
| derailment            | Output diverges from task      | 0.654  | Task and output entity overlap + LLM judge fallback      |
| decomposition         | Subtask coverage gap           | 0.768  | Required-element coverage in plan                        |
| specification         | Task interpretation drift      | 0.800  | User-intent vs spec entity match                         |
| workflow              | Bad workflow design            | 0.893  | DAG structure heuristics                                 |
| completion            | Premature completion           | 0.703  | Subtask completion + success-criteria coverage           |
| sycophancy            | Agreement without grounding    | 0.902  | Reference / agreement signal disagreement                |
| consensus_collapse    | Coordination herd behavior     | 0.967  | Cross-agent agreement variance over time                 |
| specification_compliance | AgentPex compliance check   | 0.966  | LLM-judge gated on declared task spec                    |
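
As an illustration of the `loop` row's mechanism (a hash of the sender, receiver, and fingerprint tuple per turn), the sketch below counts repeated turn hashes. The normalisation used for the fingerprint, the `min_repeats` threshold, and the function names are assumptions, not the calibrated detector.

```python
import hashlib
from collections import Counter


def turn_key(sender: str, receiver: str, content: str) -> str:
    """Hash one turn; the fingerprint here is just case/whitespace-normalised text."""
    fingerprint = " ".join(content.lower().split())
    raw = f"{sender}\x1f{receiver}\x1f{fingerprint}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def find_loops(turns, min_repeats: int = 3) -> dict:
    """Return {turn hash: count} for turns that recur at least min_repeats times."""
    counts = Counter(turn_key(*turn) for turn in turns)
    return {key: n for key, n in counts.items() if n >= min_repeats}


turns = [
    ("planner", "coder", "Please fix the failing test"),
    ("coder", "planner", "Done, please review"),
    ("planner", "coder", "please fix the   failing test"),
    ("coder", "planner", "Done, please review"),
    ("planner", "coder", "Please fix the failing test"),
]

loops = find_loops(turns)  # one recurring planner -> coder turn, seen 3 times
```

Because the check is a hash lookup, it costs effectively nothing per turn, which is why this kind of detector sits in the first tier.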

## Framework-specific detector packs

### LangGraph
- recursion_depth (F1 0.976): unbounded recursion in conditional edges
- tool_failure_cascade (F1 0.900): repeated tool errors across nodes
- parallel_sync (F1 0.874): diverged state across `Send` parallel branches
- checkpoint_corruption (F1 0.871): schema drift across checkpoints
- state_corruption_lg (F1 0.809): type/shape changes in `MessagesState`
- edge_misroute (F1 0.835): conditional edge to unintended node
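
Schema drift across checkpoints, as named in `checkpoint_corruption` and `state_corruption_lg`, can be approximated by comparing key-to-type maps between consecutive snapshots. The sketch below works on plain dictionaries rather than real LangGraph checkpoint objects, and the function names are hypothetical.

```python
def schema_of(state: dict) -> dict[str, str]:
    """Map each top-level key to the name of its value's type."""
    return {key: type(value).__name__ for key, value in state.items()}


def schema_drift(before: dict, after: dict) -> dict[str, tuple]:
    """Return {key: (old type, new type)} for keys whose type changed,
    appeared, or disappeared between two snapshots."""
    old, new = schema_of(before), schema_of(after)
    drift = {}
    for key in old.keys() | new.keys():
        if old.get(key) != new.get(key):
            drift[key] = (old.get(key), new.get(key))
    return drift


checkpoint_3 = {"messages": ["hi"], "retries": 0}
checkpoint_4 = {"messages": "hi", "retries": 0, "scratch": {}}

drift = schema_drift(checkpoint_3, checkpoint_4)
# "messages" changed from list to str; "scratch" appeared.
```

A production detector would presumably also look inside nested values and tolerate intentional schema changes; this version only flags top-level differences.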

### n8n
- timeout_analysis (F1 0.959), resource_exhaustion (F1 0.949)
- complexity_overflow (F1 0.841), error_propagation (F1 0.805)
- cycle_detection (F1 0.812), schema_validation (F1 0.765)
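
A detector like `cycle_detection` is, at its core, a graph problem. The sketch below uses a standard depth-first search with three-colour marking on an adjacency map; the workflow and node names are made up, and this is not the shipped n8n detector.

```python
def find_cycle(edges: dict[str, list[str]]):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the directed graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str):
        colour[node] = GREY
        path.append(node)
        for nxt in edges.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(edges):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


workflow = {
    "Webhook": ["Fetch"],
    "Fetch": ["Transform"],
    "Transform": ["Retry"],
    "Retry": ["Fetch"],  # routes back into the pipeline
}

cycle = find_cycle(workflow)
```

The recursive form is fine for workflow-sized graphs; a very deep graph would need an iterative traversal to avoid Python's recursion limit.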

### Dify
- workflow_validation, retrieval_quality, communication_breakdown

### OpenClaw
- 5 platform-specific detectors for OpenClaw multi-agent runtime

## Supported surfaces

### Framework adapters (7)
- LangGraph: `pip install pisama-langgraph`
- CrewAI: via `pisama-auto`
- AutoGen: via `pisama-auto`
- OpenAI Agents SDK: via `pisama-auto`
- Claude Agent SDK: `pip install pisama-agent-sdk` (lifecycle hooks)
- n8n: webhook ingest
- Dify: webhook ingest

### AI builder paste-in prompts (6)
- Lovable, v0, Bolt, Replit Agent, Cursor, Claude Code — paste a prompt into the chat; the builder runs `npm i @pisama/sdk` and edits the model call. Full prompts: https://pisama.ai/install

### Generic OpenTelemetry
- Anything emitting `gen_ai.*` semantic conventions: Semantic Kernel, LlamaIndex, AWS Bedrock, Google ADK, LangChain Deep Agents, custom orchestrators
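
For the generic OpenTelemetry path, the relevant spans are those that carry `gen_ai.*` attributes. The sketch below filters exported spans represented as plain dictionaries. The span layout is an assumption, and the attribute names reflect my reading of the GenAI semantic conventions, so check them against the current specification.

```python
def gen_ai_attributes(span: dict) -> dict:
    """Return only the attributes in the gen_ai.* namespace."""
    attributes = span.get("attributes", {})
    return {key: value for key, value in attributes.items() if key.startswith("gen_ai.")}


spans = [
    {
        "name": "chat gpt-4o",
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": "gpt-4o",
            "gen_ai.usage.input_tokens": 412,
            "http.method": "POST",
        },
    },
    {"name": "GET /health", "attributes": {"http.method": "GET"}},
]

# Keep only spans that describe a model or agent operation.
agent_spans = [span for span in spans if gen_ai_attributes(span)]
```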

## Comparisons (honest)

- **vs Langfuse**: artifact-level tracing vs process-level forensics. Run both. Pisama emits OTel spans Langfuse ingests directly. https://pisama.ai/vs/langfuse
- **vs LangSmith**: vendor-independent vs LangChain-native. The deepest LangChain integration is theirs; the structural detectors are ours. https://pisama.ai/vs/langsmith
- **vs Arize Phoenix**: ML-observability lineage vs agent-failure-taxonomy lineage. Compatible spans; complementary. https://pisama.ai/vs/arize
- **vs Patronus AI**: closest competitor by problem framing. On their TRAIL benchmark Pisama hits 59.9% (heuristic-only) vs 11.9% (best frontier judge). https://pisama.ai/vs/patronus
- **vs Braintrust**: eval workflow tooling vs in-flight failure detection. Different jobs. Run both. https://pisama.ai/vs/braintrust

## Open-source packages (MIT-licensed)

- `pisama-core`: detection orchestrator and scoring engine; the 5-tier pipeline lives here. https://github.com/Pisama-AI/pisama-core
- `pisama-detectors`: 39 calibrated detectors (85 total, 24 production-grade). https://github.com/Pisama-AI/pisama-detectors
- `pisama-auto`: zero-code auto-instrumentation. https://github.com/Pisama-AI/pisama-auto
- `pisama-agent-sdk`: real-time failure hooks for the Claude Agent SDK. https://github.com/Pisama-AI/pisama-agent-sdk
- `pisama-claude-code`: trace capture for Claude Code sessions (tokens, cost, tool calls). https://github.com/Pisama-AI/pisama-claude-code
- `@pisama/sdk` (npm): TypeScript / Vercel AI SDK wrapper. https://www.npmjs.com/package/@pisama/sdk
- `@pisama/detectors` (npm): TypeScript detector library. https://www.npmjs.com/package/@pisama/detectors
- `@pisama/cli` (npm): `npx @pisama/cli init` for one-command setup. https://www.npmjs.com/package/@pisama/cli

## Pricing

- **Free** ($0, forever) — Unlimited anonymous projects, public `/live` dashboard, core detectors on every trace, last 7 days of traces, no card required.
- **Pro** ($19/mo or $15/mo annual) — 10 projects, 5,000 daily runs, all detectors, API access, webhooks, Slack alerts, 30-day retention.
- **Team** ($79/mo or $66/mo annual) — 50 projects, 25,000 daily runs, ML tier, SSO, RBAC, 5 seats, 90-day retention.
- **Enterprise** (custom) — Unlimited everything, self-healing across LangGraph Platform / n8n / Dify / OpenClaw / Managed Agents, GitHub PR generation, SSO, SLA, on-prem option.

Full pricing + comparison table: https://pisama.ai/pricing

## CLI

```bash
pisama analyze trace.json          # run all detectors against a saved trace
pisama watch ./traces              # tail-mode: detect on new traces in dir
pisama benchmark trail             # reproduce TRAIL benchmark numbers locally
pisama-mcp                         # MCP server for Cursor / Claude Desktop / Windsurf
```

## API

- Base URL: https://api.pisama.ai/api/v1
- Span ingest (no auth required for anonymous projects): `POST /spans` with `x-pisama-project-id` header
- REST endpoints (Bearer token): trace list, detection details, project management
- Interactive docs: https://api.pisama.ai/docs
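
A span-ingest request for an anonymous project might be assembled as below, using only the standard library. The base URL, path, and header name come from the list above; the JSON body shape and the `build_span_request` helper are assumptions, so consult the interactive docs for the real schema.

```python
import json
import urllib.request

BASE_URL = "https://api.pisama.ai/api/v1"


def build_span_request(project_id: str, spans: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST /spans request."""
    body = json.dumps({"spans": spans}).encode("utf-8")  # body shape is assumed
    return urllib.request.Request(
        f"{BASE_URL}/spans",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-pisama-project-id": project_id,
        },
    )


req = build_span_request("ps_yourapp", [{"name": "chat", "attributes": {}}])

# Sending is left out so the sketch has no side effects:
# with urllib.request.urlopen(req) as response:
#     print(response.status)
```

Note that `urllib` normalises header capitalisation internally; HTTP header names are case-insensitive, so the server sees the same header either way.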

## Privacy

- The SDK and CLI run locally. They do not transmit traces to Pisama unless you set `PISAMA_PROJECT_ID` and explicitly use the hosted ingest.
- The Tier-4 LLM judge is opt-in and uses your own API key. Pisama does not proxy traffic.
- Anonymous `/live` dashboards are publicly viewable by anyone with the URL — treat the project ID as a shared secret.
- `/live` URLs are `noindex` / `nofollow` to keep them out of search engines.
- PII redaction runs in-process before bytes leave the machine (emails, phones, SSNs, JWTs, OpenAI/Anthropic/AWS API keys).
- Hosted platform stores only what you send it.
- Customer data is not used to train ML models.
- Full policy: https://pisama.ai/privacy
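
In-process redaction of the kind described above can be sketched with a few regular expressions applied before a payload is serialised. The patterns below are deliberately simple illustrations and not the SDK's actual rules; they will miss edge cases and cover only some of the listed categories.

```python
import re

# (label, pattern) pairs, applied in order. All patterns are illustrative.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("JWT", re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")),
    ("API_KEY", re.compile(r"\bsk-[A-Za-z0-9_-]{20,}")),
]


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Mail ada@example.com, SSN 123-45-6789, key sk-abcdefghijklmnopqrstuvwx"
clean = redact(sample)
```

Running redaction before serialisation, rather than on the server, is what makes the "before bytes leave the machine" guarantee possible.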

## Links

- Website: https://pisama.ai
- Live demo dashboard: https://pisama.ai/live/ps_demo
- Install (AI builder prompts): https://pisama.ai/install
- Frameworks: https://pisama.ai/frameworks
- Pricing: https://pisama.ai/pricing
- Compare: https://pisama.ai/vs
- Blog: https://pisama.ai/blog
- Research: https://pisama.ai/research
- Writing: https://pisama.ai/writing
- Taxonomy: https://pisama.ai/taxonomy
- Benchmarks: https://pisama.ai/benchmarks
- Detector scoreboard: https://pisama.ai/benchmarks/detectors
- API: https://api.pisama.ai
- API Docs: https://api.pisama.ai/docs
- GitHub: https://github.com/Pisama-AI
- npm: https://www.npmjs.com/org/pisama
- Privacy: https://pisama.ai/privacy
- Terms: https://pisama.ai/terms
- Contact: tuomo@pisama.ai

## License

MIT (open-source SDK + CLI). Hosted platform is a separate offering. Self-healing GitHub PR generation for LangGraph is in private beta as of 2026-05-20.