The 17 ways AI agents break in production
A taxonomy of agent failure modes derived from 7,212 labelled traces across LangGraph, CrewAI, AutoGen, n8n, and Dify.
Most "agent failure" content lists three or four failure modes (hallucination, infinite loop, tool error) and stops. That undercounts the failure space several times over.
Across 7,212 labelled traces from production deployments on LangGraph, CrewAI, AutoGen, n8n, and Dify, the failure space resolves into 17 distinct modes that recur consistently. Some are framework-specific. Most are not.
6 categories, 17 modes, all structural
The 17 modes group into 6 categories.
Planning failures: specification mismatch, poor task decomposition, flawed workflow design.
Execution failures: derailment, information withholding, coordination failure, communication breakdown, context neglect.
Verification failures: completion misjudgment, retrieval quality, hallucination, grounding mismatch.
Behavioural failures: persona drift, prompt injection.
Reasoning failures: loop detection, state corruption, context overflow.
Specification mismatch (FC1.1): the agent acts on an incorrect interpretation of the task. The user asked for a summary; the agent wrote a recommendation. Detected by comparing task statement entities against agent action targets.
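A minimal sketch of the entity comparison described above, assuming a crude keyword tokenizer stands in for real entity extraction. The names `extract_entities`, `spec_mismatch_score`, and the stopword list are hypothetical and are not taken from the pisama-detectors repository.

```python
import re

# Hypothetical stopword list; a real detector would use a proper entity extractor.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "and", "in", "on", "please", "write"}


def extract_entities(text: str) -> set:
    """Crude entity proxy: lowercase word tokens minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9_]+", text.lower()) if t not in STOPWORDS}


def spec_mismatch_score(task: str, action_targets: list) -> float:
    """Fraction of task entities that no action target mentions (0.0 = full match)."""
    task_entities = extract_entities(task)
    if not task_entities:
        return 0.0
    acted_on = set()
    for target in action_targets:
        acted_on |= extract_entities(target)
    missing = task_entities - acted_on
    return len(missing) / len(task_entities)


# The user asked for a summary; the agent produced a recommendation.
score = spec_mismatch_score(
    "Summarize the Q3 sales report",
    ["Recommendation for Q3 sales strategy"],
)
```

A score near 0.0 means the agent's actions touch the same things the task named; the threshold above which to flag a trace is a tuning choice that this sketch leaves open.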
Poor task decomposition (FC1.2): a planner agent breaks a task into the wrong subtasks. Each subtask completes correctly; the overall outcome is wrong. Detected by checking subtask coverage of declared task requirements.
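The coverage check can be sketched as a set difference, assuming each subtask carries the requirement IDs it is meant to address. The `Subtask` type and its `covers` field are hypothetical illustrations, not part of any framework's planner API.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    name: str
    # Hypothetical field: requirement IDs this subtask is meant to address.
    covers: set = field(default_factory=set)


def uncovered_requirements(requirements: set, subtasks: list) -> set:
    """Return declared requirements that no subtask covers."""
    covered = set().union(*(s.covers for s in subtasks))
    return requirements - covered


requirements = {"fetch_data", "clean_data", "plot"}
plan = [
    Subtask("download csv", {"fetch_data"}),
    Subtask("render chart", {"plot"}),
]
gaps = uncovered_requirements(requirements, plan)
```

Here every subtask can succeed on its own terms while `gaps` still reports `clean_data`, which is the signature of this failure mode.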
Coordination failure (FC2.1): agent A communicates information; agent B never references it. The classic "left hand does not know what the right hand is doing". Detected by measuring cross-agent entity-reference rates.
Information withholding (FC2.2): the agent has the answer in its working context but does not surface it. Detected by extracting key entities from internal state and checking output coverage.
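One way to compute a cross-agent reference rate, assuming the entities agent A sent have already been extracted and that a case-insensitive substring match counts as a reference. The function name `reference_rate` is hypothetical.

```python
def reference_rate(sent_entities: set, receiver_outputs: list) -> float:
    """Fraction of entities sent by agent A that appear in agent B's outputs."""
    if not sent_entities:
        return 1.0  # nothing was sent, so nothing could be ignored
    text = " ".join(receiver_outputs).lower()
    referenced = {e for e in sent_entities if e.lower() in text}
    return len(referenced) / len(sent_entities)


# Agent A passed along an order ID and an intent; agent B only picked up the intent.
rate = reference_rate({"order-4471", "refund"}, ["Processing refund now."])
```

A rate that stays low across turns suggests agent B is working without the information agent A supplied.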
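A rough sketch of the output-coverage check, assuming the agent's internal state is a nested dict or list and that every leaf value counts as a key entity. That assumption is deliberately naive; the helper names `leaf_values` and `withheld` are hypothetical.

```python
def leaf_values(state) -> list:
    """Flatten nested dicts, lists, and tuples into a list of leaf values as strings."""
    if isinstance(state, dict):
        return [v for child in state.values() for v in leaf_values(child)]
    if isinstance(state, (list, tuple)):
        return [v for child in state for v in leaf_values(child)]
    return [str(state)]


def withheld(state: dict, output: str) -> list:
    """Return leaf values present in internal state but absent from the output."""
    out = output.lower()
    return [v for v in leaf_values(state) if v.lower() not in out]


state = {"answer": {"invoice_id": "INV-203", "total": 1840}}
missing = withheld(state, "Your invoice INV-203 is ready.")
```

In this example the agent knew the invoice total but never surfaced it, so `missing` contains `"1840"`.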
Persona drift (FC4.1): agent output diverges from declared role, tone, or scope. The marketing agent starts writing legal copy. Detected by embedding-similarity drift against declared persona.
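The drift check can be sketched with plain cosine similarity, assuming the persona description and each output have already been embedded by some model. The toy vectors, the `threshold` default, and the function names are all hypothetical.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drifted_turns(persona_vec: list, output_vecs: list, threshold: float = 0.5) -> list:
    """Indices of outputs whose similarity to the persona falls below the threshold."""
    return [i for i, v in enumerate(output_vecs) if cosine(persona_vec, v) < threshold]


# Toy 2-d vectors stand in for real embeddings.
persona = [1.0, 0.0]
outputs = [[0.9, 0.1], [0.0, 1.0]]
flagged = drifted_turns(persona, outputs)
```

The second output is orthogonal to the persona vector and gets flagged; in practice the threshold would need calibrating per embedding model.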
Loop detection (FC6.1): the same state recurs across turns. Detected by hashing a state fingerprint and flagging recurrence within a window.
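A minimal sketch of fingerprint hashing with a sliding window, assuming the state is JSON-serializable and that a SHA-256 of its canonical JSON form is an adequate fingerprint. The `LoopDetector` class and its window size are hypothetical.

```python
import hashlib
import json
from collections import deque


class LoopDetector:
    """Flags a state whose fingerprint was already seen within the last `window` steps."""

    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)  # oldest fingerprints fall off automatically

    def observe(self, state: dict) -> bool:
        # sort_keys makes the fingerprint independent of dict key order.
        canonical = json.dumps(state, sort_keys=True, default=str)
        fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        seen = fingerprint in self.recent
        self.recent.append(fingerprint)
        return seen


detector = LoopDetector(window=3)
trace = [
    {"step": "search", "q": "a"},
    {"step": "read"},
    {"q": "a", "step": "search"},  # same state as turn 0, different key order
]
results = [detector.observe(s) for s in trace]
```

The window bounds memory and avoids flagging a legitimate return to an earlier state long after the fact; its size is a tuning choice.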
State corruption (FC6.2): shared-state schema or types mutate mid-run. Snapshot state at every step; flag type/shape changes.
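The snapshot comparison can be sketched as a diff over per-key type names, assuming the shared state is a flat dict. Nested shapes would need a recursive version; `shape` and `schema_changes` are hypothetical names.

```python
def shape(state: dict) -> dict:
    """Map each top-level key to the name of its value's type."""
    return {k: type(v).__name__ for k, v in state.items()}


def schema_changes(prev: dict, curr: dict) -> list:
    """Describe keys that were added, removed, or changed type between two snapshots."""
    p, c = shape(prev), shape(curr)
    changes = []
    for key in sorted(p.keys() | c.keys()):
        if key not in c:
            changes.append(f"{key}: removed")
        elif key not in p:
            changes.append(f"{key}: added")
        elif p[key] != c[key]:
            changes.append(f"{key}: {p[key]} -> {c[key]}")
    return changes


before = {"count": 3, "items": ["a"]}
after = {"count": "3", "items": ["a"], "tmp": None}
diff = schema_changes(before, after)
```

Running this between consecutive step snapshots surfaces silent mutations such as an integer counter turning into a string.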
Detection without an LLM judge
Each of these has a structural signature. None of them requires an LLM for first-pass detection; heuristics catch more than 90% of cases in milliseconds, with no inference cost.
The full taxonomy with F1 numbers per detector and the calibration dataset is open-source at github.com/Pisama-AI/pisama-detectors.
Originally published on dev.to
More on the failure taxonomy at /taxonomy. Detector benchmarks at /benchmarks/detectors. Framework adapters at /frameworks.