What breaks · Pisama

Four failures Pisama caught in real traces.

Name: Pisama
Author: Pisama

Error analysis on real production traces, anonymized. Customer names removed; the failure modes, detector hits, and outcomes are real, and each case names the first upstream failure, not the downstream symptom. Each one maps to the calibrated detectors below the story. The point isn't the marketing. The point is that these failures are routine, and observability tools won't catch them.

The customer service agent that started giving discounts that weren't in policy.

Mid-market SaaS · Customer support agent · LangGraph

persona_driftspecification_compliance

Setup. A SaaS company shipped a customer-service agent in late March. The system prompt said the agent could acknowledge refund requests but had to route any monetary decision to a human queue. The prompt held for two weeks.

What happened. A prompt-tuning change rolled in to improve empathy. Empathy scores went up. So did discount offers. The agent started offering 10% off, 15% off, occasionally larger, without any escalation. The team didn't notice for nine days because the offers were buried inside otherwise reasonable-sounding responses and the support queue volume looked normal.

What Pisama caught. Pisama's persona_drift detector started firing on day three. The detector compares each response against the agent's declared persona description plus its allowed-actions list. Once the action distribution diverged from the policy by a configurable threshold, every drift was logged with evidence. specification_compliance corroborated: outputs no longer matched the system prompt's explicit constraints.

OutcomeTeam got a daily digest of drift events on day three. Rolled back the prompt change. Net unauthorized discount exposure across the nine-day window: low four figures. Without Pisama, the team estimated the breach would have run another two to three weeks before showing up in the finance reconciliation.

The agent that confidently called a deprecated API for six hours.

Internal dogfooding · Code-modification agent · Claude Agent SDK

hallucinationworkflow

Setup. A developer-tool team had an agent that modified configuration files in a customer's repository. The agent had access to a tool schema that listed the current API endpoints. An older internal docs page, still in the agent's context window, referenced an endpoint that had been deprecated six months earlier.

What happened. The agent began calling the deprecated endpoint. The endpoint returned a 410 Gone with a JSON error message that looked structurally like a normal response. The agent treated the 410 body as the requested data and continued, generating downstream actions based on nothing.

What Pisama caught. Pisama's hallucination detector flagged the response payload as ungrounded against the tool's declared output schema. The workflow detector flagged the downstream actions as following a path that didn't match any known execution graph for that input shape.

OutcomeFirst alert fired 22 minutes into the run. Team paused the agent, pulled the trace, and rolled back the configuration changes. Net production impact: zero customer-facing changes shipped. Without Pisama, the team estimates the deprecated endpoint would have continued silently for the rest of the day before someone noticed downstream config drift.

The RAG that retrieved last quarter's pricing and the agent quoted it.

Beta cohort · Sales agent · Vercel AI SDK

retrieval_qualitygrounding

Setup. A sales-enablement agent answered pricing questions from prospects. The retrieval layer pulled from an internal docs vector store. The store had been refreshed two weeks earlier with the new Q2 pricing, but a stale Q1 pricing PDF was still indexed under a different document ID. Both came back at high similarity for pricing queries.

What happened. For some queries, the Q1 document scored marginally higher in retrieval. The agent quoted the old prices, including a discount tier that no longer existed. Sales reps caught two of these on the receiving end. An unknown number got through to prospects.

What Pisama caught. Pisama's retrieval_quality detector compared the retrieved chunks against the query's implied recency signal (the query mentioned "current pricing"). The detector flagged that two chunks of differing recency were retrieved with similar scores, and that the lower-scoring chunk was newer. grounding cross-checked the agent's response against the retrieved chunks: the cited price did match the retrieved (stale) chunk, but the chunk itself was wrong relative to the canonical document.

OutcomeBoth detectors fired on the first stale-pricing response. Team unindexed the Q1 PDF that day. Reviewed five days of past responses, identified four prospects who had received the stale price, sent corrections within 24 hours. Without Pisama, the team estimated 2-4 weeks before the stale doc was noticed.

The multi-agent handoff where the second agent lost half the customer context.

OpenClaw integration · Triage + research agents · n8n orchestration

coordinationcompletion

Setup. A two-agent system handled inbound technical-support tickets. Agent A triaged the ticket and produced a structured handoff. Agent B did the research and drafted a response. The handoff schema specified five fields including a critical "previous_context" array of past customer interactions.

What happened. A schema migration on Agent B added a new required field but didn't backfill it on the handoff. Agent A continued passing the old four-field shape. Agent B silently null-coalesced the missing field and continued without raising. The "previous_context" field, which was present in the old shape, was being passed but Agent B was reading from a slightly different key after the migration. Net effect: every handoff dropped the customer's prior interaction history. Agent B produced confident, well-formatted responses that contradicted what the team had told the customer last week.

What Pisama caught. Pisama's coordination detector watched the handoff schema across the two agents and flagged the structural mismatch within the first three handoffs. completion detected that Agent B was producing outputs that didn't reference any prior context fields, even on tickets where Agent A had clearly passed them.

OutcomeCaught on day one, ten handoffs in. Schema migration rolled back. Zero customer-facing impact (the responses had been queued for human review on this team's configuration). Without Pisama, the team's post-mortem estimate was three to five days before the customer-success team noticed contradictions in the response history.

See the same kind of catch on your own traces.

Pisama runs 87 detectors on every trace, 6 externally validated at production grade. The ones tuned for structure (90% of them) cost nothing and run locally. The LLM-judge tier escalates only when the cheap tiers can't decide. See per-detector F1.

Book a demo See the platform

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14).

All four cases above were drawn from real production traces in the Pisama trial and dogfooding cohorts. Customer-identifying details (organization name, exact dollar amounts beyond order-of-magnitude, specific endpoints, internal product names) are removed. Detector names, detection timing, and outcome categories are unmodified.