Pisama
v0 · 30 real reruns · 2026-06-13 · Research note

Pisama Fix-Efficacy Rerun Environment: measuring whether a fix actually works

In plain terms

Detecting a problem in an AI agent is only half the job. The other half is proving that the fix you propose actually removes the problem. Most tools, including an earlier version of ours, only predicted whether a fix would help by simulating it. This environment stops simulating. It re-runs the agent for real with the fix applied, looks at the genuine new output, and scores whether the failure is gone. When it cannot re-run something honestly, it abstains and says so, rather than inventing a number.

The companion essay argued that the reward channel in a reinforcement-learning environment is a measurement instrument, and that a grader deserves a datasheet. This piece takes the next step inside Pisama’s own product. Pisama detects failures in agent traces and proposes fixes. Until now, the fix verification was simulated: the code predicted whether applying a fix would lower the failure signal by reconstructing a patched state and re-scoring it. Simulated verification is exactly the weakness the essay warned about.

The Pisama Fix-Efficacy Rerun Environment replaces that simulation with measurement. For a real failing case it reconstructs the agent unit, appends a bounded guardrail to the system prompt, re-executes by really calling Claude, re-detects on the genuine new output, and scores a multi-component reward. The scope is deliberately narrow and the governing rule is strict: a published number must come from a real model call on a real, reconstructable unit. Everything else abstains.

Why re-execution, not simulation

A detector score tells you a failure was present. It does not tell you whether a proposed fix removes it. The honest test of a remediation is to apply it and run the agent again. Pisama’s previous in-loop verifier never did that. It carried a set of per-detector “replay shims” that approximated the post-fix state and re-scored the detector against the approximation. That is convenient and fast, and it is also the precise failure mode the verifier-datasheet essay describes: optimizing against an artifact of measurement rather than the task.

The rerun environment removes the approximation. It is closer to reliability engineering than to static classification: reconstruct the failing unit, apply the smallest safe change, run it for real, and check the genuine result.

The environment contract

For each case the environment runs a fixed pipeline, and any step that cannot proceed honestly returns an explicit abstention with a machine-readable reason:

  1. Reconstruct the executed triple: the system prompt the agent actually ran under, the user input, and the original failing output. If the trace does not carry a real system prompt, abstain.
  2. Reproduce the original detection on the unpatched reconstruction. If re-running the detector on the original output does not re-fire the failure within a confidence band, abstain. We never credit a fix for resolving a failure we could not first reproduce.
  3. Build a bounded guardrail delta, grounded in the agent’s own role and the detector’s evidence, capped at roughly fifty tokens.
  4. Re-execute for real. Append the guardrail to the system prompt and call Claude with the original user input. If no model call is available, abstain. The environment does not simulate the response.
  5. Re-detect on the genuine new output, through the production detector path.
  6. Score the reward.

The reward

The reward is multi-component, with failure resolution carrying the largest weight:

  • failure_resolved (0.55): the target failure no longer fires, or its confidence drops past the same acceptance bar the auto-apply path uses.
  • no_new_regression (0.20): the fix did not introduce a new single-turn failure category. Multi-agent and structural detectors that cannot meaningfully fire on a single re-prompt are excluded, so they do not manufacture false regressions.
  • safety_maintained (0.10): the fix did not introduce a new higher-severity single-turn failure.
  • cost_efficiency (0.10): the real token cost of the rerun, against a per-rerun budget.
  • quality_preserved: wired to the calibrated outcome model, but it abstains for this corpus. The trained outcome models cover the arb, swe, and tau2 domains; these persona scenarios are off-domain, so the component reports no value rather than a fabricated one. An in-domain rerun receives a real calibrated value.

The total is a weight-renormalized mean over the components that were actually measurable. A run is marked calibrated only when every component was measured, which is never the case while quality abstains. This is deliberate: a partial score is never presented as a complete one.

Scope and data

Version 0 covers two single-agent, prompt-governed detectors, persona_drift and task_derailment, because each is a failure a single re-prompt can faithfully reproduce and re-check. Role usurpation needs a multi-agent trace, which a single re-prompt cannot recreate, and the multi-agent derailment corpora are long trajectories produced by other frameworks with no recoverable system prompt, so they cannot be re-executed here. All of those abstain.

The corpus is a small set of authored scenarios, the same convention as Pisama’s synthetic test agents: 18 persona-drift cases and 12 task-derailment cases. Each scenario is verified to fire the relevant production detector before it is admitted. The inputs and the original output are authored. The rerun and every efficacy number are live: a real Claude call and a real re-detection. The provenance is recorded per case so a reader always knows which parts are authored and which are measured.

Baseline results

On 30 verified units across two detectors (18 persona-drift, 12 task-derailment), with a real rerun per unit:

n_real_rerun=30
abstention_rate=0.00
target_resolved_rate=0.93   (28 of 30 cleared the detector)
mean_reward=0.94
success_rate=0.90
regressions=0
total_cost=$0.11

per detector:
  persona_drift:    16/18 resolved, mean_reward 0.93
  task_derailment:  12/12 resolved, mean_reward 0.96

task_derailment separates sharply: the drift output fires high, and a back-on-task rerun does not fire at all, so all 12 cases resolve cleanly. persona_drift is the harder class and reached 16 of 18 only after two measured corrections, both surfaced by the environment itself. First, the reward stopped counting multi-agent and structural detectors that the full suite fires on a one-turn rerun trace but that cannot meaningfully apply to a single re-prompt; those had manufactured false regressions. Second, the persona_drift firing bar was recalibrated on a real separation set (verified drift outputs versus real on-persona Claude generations): at the old gate of 0.50 the detector flagged clearly on-persona outputs (precision 0.60), and raising the gate to 0.58 lifted precision to 0.78 while keeping recall at 1.0 on that set.

The honest caveat: persona_drift has no entries in the golden dataset, so its recalibration rests on a real but narrow env-derived set, not on broad production traffic, and it is reversible. The value of the environment is exactly this: it made each problem visible and accountable to a fixed, fingerprinted slice rather than to a story about the easier cases.

The gate, and how this stays honest

The environment ships with a regression gate that runs on every change, with no model calls and no secrets. It validates the committed artifact and compares it to a baseline on a content-and-config fingerprint. It fails on a mean-reward drop, a success-rate drop, an abstention spike, a drop in the number of real reruns below a floor, or a fingerprint mismatch. The floor matters: it stops a future change from quietly turning real reruns into abstentions and calling the result an improvement.

Two properties keep the numbers trustworthy. Abstentions are counted and never scored, so a strategy cannot raise its mean reward by abstaining more. And efficacy numbers are logged to a dedicated calibration history, separate from the detection calibration metrics, so a fix-efficacy reward can never be mistaken for a detection F1.

The local run, the gate, and the corpus build are all reproducible:

python scripts/build_fix_efficacy_corpus.py
python scripts/calibrate_fix_efficacy.py --output data/fix_efficacy/current.json
python scripts/calibrate_fix_efficacy.py --no-live --gate data/fix_efficacy/baseline.json

The next steps are to add an in-domain outcome model so quality_preserved is measured rather than abstained for these scenarios, to broaden the corpus beyond authored persona cases and beyond a single detector, and to add execution backends for frameworks where a faithful re-run is possible. The modest baseline stays in the record. It is part of the evidence.

Companion essay: Verifier Calibration in RL Environments, which argues that grader lineage and class-aware agreement should travel with every RL environment reward function.

Pisama is the open-source failure-detection platform behind this environment. Browse the detector registry with per-detector F1, or run it yourself from github.com/Pisama-AI/pisama.