F6 · Execution
Task Derailment
Detects when an agent goes off-topic or deviates from its assigned task. One of the most common failure modes (20% prevalence in MAST-Data).
Examples
- Agent asked to write authentication docs starts writing about authorization instead
- Research agent asked about pricing analysis delivers feature comparison instead
- Code review agent starts implementing new features rather than reviewing
- Agent asked for a blog post delivers API documentation
Detection methods
- Semantic Similarity
- Compares embedding distance between task description and output
- Topic Drift Detection
- Tracks topic focus using keyword clustering
- Task Substitution
- Identifies when agent addresses a related but different task
- Coverage Verification
- Checks whether the core task requirements are addressed
Calibration accuracy
F1
0.820
Precision
0.702
Recall
0.985
From the Pisama calibration set. See detector scoreboard for the full table.
Detect this in production with the framework adapters (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, n8n, Dify). See the full taxonomy at /taxonomy.