F1 · Planning
Specification Mismatch
Detects when task output doesn't match the user's original specification. Catches scope drift, missing requirements, language mismatches, and conflicting specifications.
Examples
- User requests Python code but agent delivers TypeScript implementation
- Task asks for 500-word summary but agent delivers 150 words
- Agent reformulates requirements and loses critical constraints
- Output uses deprecated API patterns that violate modern coding standards
Detection methods
- Semantic Coverage: measures how well the output covers each requirement using embeddings
- Keyword Matching: checks for the presence of required elements, topics, and constraints
- Code Quality Checks: validates the language match and flags deprecated syntax and stub implementations
- Numeric Tolerance: handles approximate constraints such as word counts (within 20%)
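As an illustration of the two simplest methods above, here is a minimal Python sketch of a numeric tolerance check and a keyword check. It is not Pisama's implementation: the function names (`within_tolerance`, `word_count`, `missing_keywords`) are hypothetical, and treating "within 20%" as a symmetric band around the target is an assumption.

```python
import re


def within_tolerance(actual: float, target: float, tolerance: float = 0.20) -> bool:
    """Return True when `actual` lies within +/- `tolerance` (a fraction) of `target`.

    Assumption: the 20% band is symmetric around the target value.
    """
    if target == 0:
        return actual == 0
    return abs(actual - target) / abs(target) <= tolerance


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def missing_keywords(output: str, required: list[str]) -> list[str]:
    """Return the required terms that do not appear in `output`.

    Matching is case-insensitive and whole-term; the lookarounds keep
    terms such as "c++" matchable, which a plain \\b boundary would not.
    """
    lowered = output.lower()
    return [
        kw
        for kw in required
        if not re.search(r"(?<!\w)" + re.escape(kw.lower()) + r"(?!\w)", lowered)
    ]


# A 150-word summary against a 500-word request falls outside the band.
print(within_tolerance(150, 500))  # False
print(within_tolerance(450, 500))  # True
print(missing_keywords("The summary covers pricing and latency.",
                       ["pricing", "latency", "security"]))  # ['security']
```

Literal keyword matching misses paraphrases, which is presumably why it is paired with an embedding-based coverage measure.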
Calibration accuracy
- F1: 0.703
- Precision: 0.592
- Recall: 0.866
From the Pisama calibration set. See detector scoreboard for the full table.
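The F1 score is the harmonic mean of precision and recall, so the three figures can be checked against each other. The short Python sketch below does that arithmetic; the function name `f1_score` is illustrative and not part of any Pisama API.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported calibration figures for this detector.
print(round(f1_score(0.592, 0.866), 3))  # 0.703
```

Read together, the numbers describe a detector tuned toward recall: it catches about 87% of true mismatches, while roughly 41% of its flags are false positives.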
Subtypes
- scope drift
- missing requirement
- ambiguous spec
- conflicting spec
Detect this in production with the framework adapters (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, n8n, Dify). See the full taxonomy at /taxonomy.