F14 · Verification

Completion Misjudgment

Detects when an agent incorrectly determines task completion, including premature claims, partial delivery, and ignored subtasks. Most prevalent failure mode (40% in MAST-Data).

Examples

  • Agent claims "all 10 endpoints documented" but only 8 are present
  • Task marked complete with "planned for future work" items still pending
  • JSON output has status: "complete" but documented: false for key items
  • Agent delivers 80% of requirements and declares the task done

Detection methods

Completion Markers
Identifies explicit and implicit completion claims
Quantitative Check
Verifies numerical completeness ("all", "every", N items)
Hedging Detection
Flags qualifiers like "appears complete" or "seems done"
JSON Indicators
Checks structured output for incomplete flags

Calibration accuracy

F1
0.745
Precision
0.687
Recall
0.814

From the Pisama calibration set. See detector scoreboard for the full table.

Subtypes

  • premature completion
  • partial delivery
  • ignored subtasks
  • missed criteria

Detect this in production with the framework adapters (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, n8n, Dify). See the full taxonomy at /taxonomy.