F14 · Verification
Completion Misjudgment
Detects when an agent incorrectly determines task completion, including premature claims, partial delivery, and ignored subtasks. Most prevalent failure mode (40% in MAST-Data).
Examples
- Agent claims "all 10 endpoints documented" but only 8 are present
- Task marked complete with "planned for future work" items still pending
- JSON output has status: "complete" but documented: false for key items
- Agent delivers 80% of requirements and declares the task done
Detection methods
- Completion Markers
- Identifies explicit and implicit completion claims
- Quantitative Check
- Verifies numerical completeness ("all", "every", N items)
- Hedging Detection
- Flags qualifiers like "appears complete" or "seems done"
- JSON Indicators
- Checks structured output for incomplete flags
Calibration accuracy
F1
0.745
Precision
0.687
Recall
0.814
From the Pisama calibration set. See detector scoreboard for the full table.
Subtypes
- premature completion
- partial delivery
- ignored subtasks
- missed criteria
Detect this in production with the framework adapters (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, n8n, Dify). See the full taxonomy at /taxonomy.