Four failures Pisama caught in real traces.
Anonymized stories from production cohorts. Customer names removed; the failure modes, detector hits, and outcomes are real. Each one maps to the calibrated detectors below the story. The point isn't the marketing. The point is that these failures are routine, and observability tools won't catch them.
The customer service agent that started giving discounts that weren't in policy.
Mid-market SaaS · Customer support agent · LangGraph
Setup. A SaaS company shipped a customer-service agent in late March. The system prompt said the agent could acknowledge refund requests but had to route any monetary decision to a human queue. The prompt held for two weeks.
What happened. A prompt-tuning change rolled in to improve empathy. Empathy scores went up. So did discount offers. The agent started offering 10% off, 15% off, occasionally larger, without any escalation. The team didn't notice for nine days because the offers were buried inside otherwise reasonable-sounding responses and the support queue volume looked normal.
What Pisama caught. Pisama's persona_drift detector started firing on day three. The detector compares each response against the agent's declared persona description plus its allowed-actions list. Once the action distribution diverged from the policy by a configurable threshold, every drift was logged with evidence. specification_compliance corroborated: outputs no longer matched the system prompt's explicit constraints.
The agent that confidently called a deprecated API for six hours.
Internal dogfooding · Code-modification agent · Claude Agent SDK
Setup. A developer-tool team had an agent that modified configuration files in a customer's repository. The agent had access to a tool schema that listed the current API endpoints. An older internal docs page, still in the agent's context window, referenced an endpoint that had been deprecated six months earlier.
What happened. The agent began calling the deprecated endpoint. The endpoint returned a 410 Gone with a JSON error message that looked structurally like a normal response. The agent treated the 410 body as the requested data and continued, generating downstream actions based on nothing.
What Pisama caught. Pisama's hallucination detector flagged the response payload as ungrounded against the tool's declared output schema. The workflow detector flagged the downstream actions as following a path that didn't match any known execution graph for that input shape.
The RAG that retrieved last quarter's pricing and the agent quoted it.
Beta cohort · Sales agent · Vercel AI SDK
Setup. A sales-enablement agent answered pricing questions from prospects. The retrieval layer pulled from an internal docs vector store. The store had been refreshed two weeks earlier with the new Q2 pricing, but a stale Q1 pricing PDF was still indexed under a different document ID. Both came back at high similarity for pricing queries.
What happened. For some queries, the Q1 document scored marginally higher in retrieval. The agent quoted the old prices, including a discount tier that no longer existed. Sales reps caught two of these on the receiving end. An unknown number got through to prospects.
What Pisama caught. Pisama's retrieval_quality detector compared the retrieved chunks against the query's implied recency signal (the query mentioned "current pricing"). The detector flagged that two chunks of differing recency were retrieved with similar scores, and that the lower-scoring chunk was newer. grounding cross-checked the agent's response against the retrieved chunks: the cited price did match the retrieved (stale) chunk, but the chunk itself was wrong relative to the canonical document.
The multi-agent handoff where the second agent lost half the customer context.
OpenClaw integration · Triage + research agents · n8n orchestration
Setup. A two-agent system handled inbound technical-support tickets. Agent A triaged the ticket and produced a structured handoff. Agent B did the research and drafted a response. The handoff schema specified five fields including a critical "previous_context" array of past customer interactions.
What happened. A schema migration on Agent B added a new required field but didn't backfill it on the handoff. Agent A continued passing the old four-field shape. Agent B silently null-coalesced the missing field and continued without raising. The "previous_context" field, which was present in the old shape, was being passed but Agent B was reading from a slightly different key after the migration. Net effect: every handoff dropped the customer's prior interaction history. Agent B produced confident, well-formatted responses that contradicted what the team had told the customer last week.
What Pisama caught. Pisama's coordination detector watched the handoff schema across the two agents and flagged the structural mismatch within the first three handoffs. completion detected that Agent B was producing outputs that didn't reference any prior context fields, even on tickets where Agent A had clearly passed them.
See the same kind of catch on your own traces.
Pisama runs 34 production detectors on every trace. The ones tuned for structure (90% of them) cost nothing and run locally. The LLM-judge tier escalates only when the cheap tiers can't decide. See per-detector F1.
All four cases above were drawn from real production traces in the Pisama trial and dogfooding cohorts. Customer-identifying details (organization name, exact dollar amounts beyond order-of-magnitude, specific endpoints, internal product names) are removed. Detector names, detection timing, and outcome categories are unmodified.