Tool-error recovery replicates on tau2-bench
UNDERWRITE (Dsouza, Ramakrishnan, Dickens, Pohani, Glaze; Snorkel AI; AAAI/CAIS 2026; arXiv:2602.00456) introduced a 300-task multi-turn insurance underwriting benchmark and found something counterintuitive about tool errors. Raw tool-error rate correlates only weakly with answer correctness (Pearson 0.256 max). The rate of recovery, defined as the share of tool errors followed by a successful retry of the same tool, correlates 0.41 to 0.84. Self-correction, not error avoidance, distinguishes top-tier agents.
That is a strong claim. If it generalizes, it changes what we should measure in agent observability. We tested whether the finding holds on tau2-bench, an entirely different domain (airline, retail, telecom customer support) across four models and 10,832 simulations.
Recovery rate is positively correlated with task reward in 14 of 14 model-domain pairs (signed). Tool-error count is negatively correlated in 13 of 14. Direction replicates cleanly. Magnitude is weaker than the paper. tau2 best per-model Pearson is 0.24 (gpt-4.1, errors-only); the strongest single (model, domain) pair is gpt-4.1 / telecom at 0.52 on 16 error-bearing sims. UNDERWRITE-grade signal (0.41+) shows up in pockets, not at the corpus level.
What we built
Pisama ships a new detector named tool_error_recovery. Given an ordered list of tool calls with status (ok or error), it walks the trace and, for each error, scans forward for the next same-tool call that succeeded. recovery_rate is recoveries divided by total errors, vacuous-pass at 1.0 when no errors occurred. The detector flags traces with recovery_rate below 0.5 (configurable per tenant). No LLM calls, no API budget. Calibrated on 20 hand-authored golden entries across easy / medium / hard difficulty:
| Metric | Value | Notes |
|---|---|---|
| F1 | 0.9474 | 95% CI 0.82 to 1.00 |
| Precision | 1.0000 | No false positives across 20 golden entries |
| Recall | 0.9000 | 1 false negative on a hard threshold-edge case |
| Readiness tier | beta | Meets beta gate (F1 ≥ 0.65, samples ≥ 15) |
| pass^k consistency | 1 | All 4 trials cleared the tier gate. Deterministic, as expected. |
tau2-bench replication
tau2-bench is a customer-support agent benchmark with three domains (airline, retail, telecom) and a task-success reward signal per simulation. We ran the detector across every released tau2 result file we had local: claude-3.7-sonnet, gpt-4.1, gpt-4.1-mini, and o4-mini. Tool errors in tau2 are marked by tool-message content beginning with Error. We paired tool_call.id with tool_message.id to associate each call with its outcome.
Overall stats over 10,832 sims:
| Sims with at least one tool error | 1,067 / 10,832 (9.9%) |
| Mean reward (all sims) | 0.6629 |
| Mean reward (sims with errors) | 0.5511 |
| Pearson r(recovery_rate, reward), all sims | +0.0997 |
| Pearson r(recovery_rate, reward), errors-only | +0.1875 |
| Pearson r(error_count, reward) | -0.0923 |
| Pearson r(n_tool_calls, reward) | -0.1661 |
Mean reward drops from 0.6629 to 0.5511 when a tool error appears in the trace. That alone is a 17-point absolute drop. Recovery rate among the error-bearing sims correlates +0.1875 with reward. The two effects compose: errors hurt, and recovering from them recovers some of the gap.
By model
| Model | N | N errors | r(rec, rew) | r(rec, rew) errors | r(err#, rew) |
|---|---|---|---|---|---|
| claude-3.7-sonnet | 1112 | 217 | +0.1879 | +0.2314 | -0.1480 |
| gpt-4.1 | 4304 | 266 | +0.0458 | +0.2381 | -0.0115 |
| gpt-4.1-mini | 1112 | 228 | +0.0368 | +0.1427 | -0.0038 |
| o4-mini | 4304 | 356 | +0.1354 | +0.0862 | -0.1625 |
Three of four models show error_count negatively correlated with reward (the fourth, gpt-4.1, is effectively zero). All four show recovery_rate positively correlated. claude-3.7-sonnet and gpt-4.1 reach the UNDERWRITE lower bound (0.23) on errors-only.
Strongest signals
The corpus-level numbers smooth over real structure. A few (model, domain) pairs do reach UNDERWRITE territory:
gpt-4.1 / telecom: errors-only r = +0.5222 (1824 sims, 16 with errors). Strongest errors-only Pearson; small error sample (n=16) so confidence interval is wide.gpt-4.1 / telecom-workflow: errors-only r = +0.3380 (1824 sims, 150 with errors). Larger error sample; correlation in UNDERWRITE's lower range.gpt-4.1 / retail: errors-only r = +0.2986 (456 sims, 88 with errors). Within UNDERWRITE's lower range.o4-mini / telecom: all-sims r = +0.2273 (1824 sims, 222 with errors). Strongest all-sims recovery_rate correlation in the corpus.claude-3.7-sonnet / telecom: all-sims r = +0.1918 (456 sims, 117 with errors). Recovery and error-count signals are similar magnitude, opposite sign — both informative.
What we make of it
UNDERWRITE finds recovery rate is the dominant signal in insurance underwriting because almost every task succeeds or fails on a chain of tool calls. tau2 customer support has many ways for a sim to fail that have nothing to do with tools (identity verification, policy compliance, dialogue adversariality). Only about 10% of tau2 sims have any tool errors at all. Recovery rate carries less of the variance there than it does in UNDERWRITE.
That is not a refutation. It is a generalization boundary. Direction is universal in our sample. Magnitude is sensitive to domain. The actionable takeaway is the same: if your application has frequent tool errors, measuring recovery is more informative than measuring error count. If your application rarely sees tool errors at all, recovery is a small-sample metric and you should weight it accordingly.
What Pisama added beyond the paper
- A runnable detector.
tool_error_recoveryships in the open Pisama detector library at F1 0.947, slotting in next to 53 other production detectors. UNDERWRITE measures recovery; Pisama operationalizes it. - pass^k consistency in the calibration pipeline.Per UNDERWRITE's reliability framing (20% drop from k=1 to k=4), Pisama's multi-trial calibration now reports a tier-aware pass^k binary alongside mean F1. A production detector failing pass^k means at least one trial dipped below the production F1 floor, an invisible reliability gap in single-run F1.
- Trace fingerprinting at zero LLM cost. UNDERWRITE Fig. 6 showed transition counts (tool to tool, tool to user) cleanly distinguish reflective from rushed agents. Pisama now computes these on every ingested trace and stores them in
detection_metadata, no judge call required.
Reproducibility
All numbers come from papers/head_to_head/run_tool_error_recovery.py against tau2 result files at /tmp/tau2-bench/data/tau2/results/final. Per-sim output at tau2_tool_error_recovery.jsonl, summary at tau2_tool_error_recovery_summary.json. The detector itself is app.detection.tool_error_recovery with unit tests at tests/detection/test_tool_error_recovery.py. No LLM cost, no API keys, deterministic across runs.
See also Pisama vs AgentPex on tau2-bench for an adjacent replication on the same corpus.