F1 · Planning

Specification Mismatch

Detects when task output doesn't match the user's original specification. Catches scope drift, missing requirements, language mismatches, and conflicting specifications.

Examples

  • User requests Python code but agent delivers TypeScript implementation
  • Task asks for 500-word summary but agent delivers 150 words
  • Agent reformulates requirements and loses critical constraints
  • Output uses deprecated API patterns that violate modern coding standards

Detection methods

  • Semantic Coverage: measures how well the output covers each requirement using embeddings
  • Keyword Matching: checks for the presence of required elements, topics, and constraints
  • Code Quality Checks: validates the language match and flags deprecated syntax and stub implementations
  • Numeric Tolerance: handles approximate constraints such as word counts (within 20%)
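
The two simplest checks above, numeric tolerance and keyword matching, can be sketched in a few lines of Python. This is an illustrative sketch only, not Pisama's implementation: the function names (within_tolerance, word_count, missing_keywords) are hypothetical, and it assumes "within 20%" means within plus or minus 20% of the requested target.

```python
from __future__ import annotations


def within_tolerance(actual: int, target: int, tolerance: float = 0.20) -> bool:
    """Return True when `actual` is within +/- `tolerance` of `target`.

    Assumption: the 20% band is measured relative to the requested target.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    return abs(actual - target) <= tolerance * target


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def missing_keywords(output: str, required: list[str]) -> list[str]:
    """Return the required terms that do not appear in the output.

    Uses a case-insensitive substring match, which is a deliberate
    simplification of real keyword matching.
    """
    lowered = output.lower()
    return [kw for kw in required if kw.lower() not in lowered]


# A 150-word answer to a 500-word request falls outside the 20% band.
print(within_tolerance(150, 500))  # False
print(within_tolerance(450, 500))  # True
print(missing_keywords("The report covers latency and cost.",
                       ["latency", "throughput"]))  # ['throughput']
```

Substring matching keeps the sketch short, but it will miss paraphrased requirements; that gap is presumably what the embedding-based Semantic Coverage check is meant to cover.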

Calibration accuracy

  • F1: 0.703
  • Precision: 0.592
  • Recall: 0.866

Measured on the Pisama calibration set. See the detector scoreboard for the full table.
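
As a consistency check on the reported figures, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two values given here; f1_score is a hypothetical helper written for this illustration, not part of any Pisama API.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for this detector.
print(round(f1_score(0.592, 0.866), 3))  # 0.703
```

The gap between recall (0.866) and precision (0.592) suggests the detector catches most real mismatches but also flags a fair number of outputs that do meet the specification.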

Subtypes

  • scope drift
  • missing requirement
  • ambiguous spec
  • conflicting spec

Detect this in production with the framework adapters (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, n8n, Dify). See the full taxonomy at /taxonomy.