Pisama
v0.1 · 10,832 traces · 2026-05-27· Research note

Tool-error recovery replicates on tau2-bench

UNDERWRITE (Dsouza, Ramakrishnan, Dickens, Pohani, Glaze; Snorkel AI; AAAI/CAIS 2026; arXiv:2602.00456) introduced a 300-task multi-turn insurance underwriting benchmark and found something counterintuitive about tool errors. Raw tool-error rate correlates only weakly with answer correctness (Pearson 0.256 max). The rate of recovery, defined as the share of tool errors followed by a successful retry of the same tool, correlates 0.41 to 0.84. Self-correction, not error avoidance, distinguishes top-tier agents.

That is a strong claim. If it generalizes, it changes what we should measure in agent observability. We tested whether the finding holds on tau2-bench, an entirely different domain (airline, retail, telecom customer support) across four models and 10,832 simulations.

Headline

Recovery rate is positively correlated with task reward in 14 of 14 model-domain pairs (signed). Tool-error count is negatively correlated in 13 of 14. Direction replicates cleanly. Magnitude is weaker than the paper. tau2 best per-model Pearson is 0.24 (gpt-4.1, errors-only); the strongest single (model, domain) pair is gpt-4.1 / telecom at 0.52 on 16 error-bearing sims. UNDERWRITE-grade signal (0.41+) shows up in pockets, not at the corpus level.

What we built

Pisama ships a new detector named tool_error_recovery. Given an ordered list of tool calls with status (ok or error), it walks the trace and, for each error, scans forward for the next same-tool call that succeeded. recovery_rate is recoveries divided by total errors, vacuous-pass at 1.0 when no errors occurred. The detector flags traces with recovery_rate below 0.5 (configurable per tenant). No LLM calls, no API budget. Calibrated on 20 hand-authored golden entries across easy / medium / hard difficulty:

MetricValueNotes
F10.947495% CI 0.82 to 1.00
Precision1.0000No false positives across 20 golden entries
Recall0.90001 false negative on a hard threshold-edge case
Readiness tierbetaMeets beta gate (F1 ≥ 0.65, samples ≥ 15)
pass^k consistency1All 4 trials cleared the tier gate. Deterministic, as expected.

tau2-bench replication

tau2-bench is a customer-support agent benchmark with three domains (airline, retail, telecom) and a task-success reward signal per simulation. We ran the detector across every released tau2 result file we had local: claude-3.7-sonnet, gpt-4.1, gpt-4.1-mini, and o4-mini. Tool errors in tau2 are marked by tool-message content beginning with Error. We paired tool_call.id with tool_message.id to associate each call with its outcome.

Overall stats over 10,832 sims:

Sims with at least one tool error1,067 / 10,832 (9.9%)
Mean reward (all sims)0.6629
Mean reward (sims with errors)0.5511
Pearson r(recovery_rate, reward), all sims+0.0997
Pearson r(recovery_rate, reward), errors-only+0.1875
Pearson r(error_count, reward)-0.0923
Pearson r(n_tool_calls, reward)-0.1661

Mean reward drops from 0.6629 to 0.5511 when a tool error appears in the trace. That alone is a 17-point absolute drop. Recovery rate among the error-bearing sims correlates +0.1875 with reward. The two effects compose: errors hurt, and recovering from them recovers some of the gap.

By model

ModelNN errorsr(rec, rew)r(rec, rew) errorsr(err#, rew)
claude-3.7-sonnet1112217+0.1879+0.2314-0.1480
gpt-4.14304266+0.0458+0.2381-0.0115
gpt-4.1-mini1112228+0.0368+0.1427-0.0038
o4-mini4304356+0.1354+0.0862-0.1625

Three of four models show error_count negatively correlated with reward (the fourth, gpt-4.1, is effectively zero). All four show recovery_rate positively correlated. claude-3.7-sonnet and gpt-4.1 reach the UNDERWRITE lower bound (0.23) on errors-only.

Strongest signals

The corpus-level numbers smooth over real structure. A few (model, domain) pairs do reach UNDERWRITE territory:

  • gpt-4.1 / telecom: errors-only r = +0.5222 (1824 sims, 16 with errors). Strongest errors-only Pearson; small error sample (n=16) so confidence interval is wide.
  • gpt-4.1 / telecom-workflow: errors-only r = +0.3380 (1824 sims, 150 with errors). Larger error sample; correlation in UNDERWRITE's lower range.
  • gpt-4.1 / retail: errors-only r = +0.2986 (456 sims, 88 with errors). Within UNDERWRITE's lower range.
  • o4-mini / telecom: all-sims r = +0.2273 (1824 sims, 222 with errors). Strongest all-sims recovery_rate correlation in the corpus.
  • claude-3.7-sonnet / telecom: all-sims r = +0.1918 (456 sims, 117 with errors). Recovery and error-count signals are similar magnitude, opposite sign — both informative.

What we make of it

UNDERWRITE finds recovery rate is the dominant signal in insurance underwriting because almost every task succeeds or fails on a chain of tool calls. tau2 customer support has many ways for a sim to fail that have nothing to do with tools (identity verification, policy compliance, dialogue adversariality). Only about 10% of tau2 sims have any tool errors at all. Recovery rate carries less of the variance there than it does in UNDERWRITE.

That is not a refutation. It is a generalization boundary. Direction is universal in our sample. Magnitude is sensitive to domain. The actionable takeaway is the same: if your application has frequent tool errors, measuring recovery is more informative than measuring error count. If your application rarely sees tool errors at all, recovery is a small-sample metric and you should weight it accordingly.

What Pisama added beyond the paper

  • A runnable detector. tool_error_recovery ships in the open Pisama detector library at F1 0.947, slotting in next to 53 other production detectors. UNDERWRITE measures recovery; Pisama operationalizes it.
  • pass^k consistency in the calibration pipeline.Per UNDERWRITE's reliability framing (20% drop from k=1 to k=4), Pisama's multi-trial calibration now reports a tier-aware pass^k binary alongside mean F1. A production detector failing pass^k means at least one trial dipped below the production F1 floor, an invisible reliability gap in single-run F1.
  • Trace fingerprinting at zero LLM cost. UNDERWRITE Fig. 6 showed transition counts (tool to tool, tool to user) cleanly distinguish reflective from rushed agents. Pisama now computes these on every ingested trace and stores them in detection_metadata, no judge call required.

Reproducibility

All numbers come from papers/head_to_head/run_tool_error_recovery.py against tau2 result files at /tmp/tau2-bench/data/tau2/results/final. Per-sim output at tau2_tool_error_recovery.jsonl, summary at tau2_tool_error_recovery_summary.json. The detector itself is app.detection.tool_error_recovery with unit tests at tests/detection/test_tool_error_recovery.py. No LLM cost, no API keys, deterministic across runs.