F6 · Execution

Task Derailment

Detects when an agent goes off-topic or deviates from its assigned task. This is one of the most common failure modes, with 20% prevalence in MAST-Data.

Examples

  • Agent asked to write authentication docs starts writing about authorization instead
  • Research agent asked about pricing analysis delivers feature comparison instead
  • Code review agent starts implementing new features rather than reviewing
  • Agent asked for a blog post delivers API documentation

Detection methods

  • Semantic Similarity: compares the embedding distance between the task description and the output
  • Topic Drift Detection: tracks topic focus using keyword clustering
  • Task Substitution: identifies when the agent addresses a related but different task
  • Coverage Verification: checks whether the core task requirements are addressed
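
As a rough illustration of the first and last methods above, the sketch below scores how close an output is to its task and what fraction of required terms it covers. It is not Pisama's implementation: the function names (`bag_of_words`, `cosine_similarity`, `is_derailed`, `requirement_coverage`) and the 0.2 threshold are hypothetical, and word counts stand in for a real embedding model so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts. ASSUMPTION: a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_derailed(task: str, output: str, threshold: float = 0.2) -> bool:
    """Flag the output when its similarity to the task falls below the threshold.

    ASSUMPTION: 0.2 is an arbitrary illustrative cutoff, not a calibrated value.
    """
    return cosine_similarity(bag_of_words(task), bag_of_words(output)) < threshold


def requirement_coverage(required_terms: list[str], output: str) -> float:
    """Fraction of required terms that appear as tokens in the output."""
    if not required_terms:
        return 1.0
    tokens = bag_of_words(output)
    hits = sum(1 for term in required_terms if term.lower() in tokens)
    return hits / len(required_terms)


if __name__ == "__main__":
    task = "Write documentation for the authentication flow"
    on_topic = "The authentication flow documentation: users authenticate with a token"
    off_topic = "Quarterly pricing tiers compared across vendors"

    for label, output in [("on-topic", on_topic), ("off-topic", off_topic)]:
        print(label, is_derailed(task, output),
              requirement_coverage(["authentication", "token"], output))
```

A production detector would likely replace `bag_of_words` with sentence embeddings, since token overlap cannot tell authentication from authorization when the surrounding vocabulary is shared.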

Calibration accuracy

F1: 0.820
Precision: 0.702
Recall: 0.985

From the Pisama calibration set. See detector scoreboard for the full table.
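
The three calibration figures are related: F1 is the harmonic mean of precision and recall. The short check below recomputes it from the reported values; the helper name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values taken from the calibration figures reported above.
    print(f"{f1_score(0.702, 0.985):.3f}")
```

Read together, the numbers describe a detector tuned toward recall: it catches nearly every derailment in the calibration set, while roughly 30% of the cases it flags are false positives (1 - 0.702).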

Detect this in production with the framework adapters (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, n8n, Dify). See the full taxonomy at /taxonomy.