v1 · 55 real reruns · 4 detectors · 2026-06-14 · Research note

Pisama Fix-Efficacy Rerun Environment: measuring whether a fix actually works

Name: Pisama
Author: Pisama

In plain terms

Detecting a problem in an AI agent is only half the job. The other half is proving that the fix you propose actually removes the problem. Most tools, including an earlier version of ours, only predicted whether a fix would help by simulating it. This environment stops simulating. It re-runs the agent for real with the fix applied, looks at the genuine new output, and scores whether the failure is gone. When it cannot re-run something honestly, it abstains and says so, rather than inventing a number.

The companion essay argued that the reward channel in a reinforcement-learning environment is a measurement instrument, and that a grader deserves a datasheet. This piece takes the next step inside Pisama’s own product. Pisama detects failures in agent traces and proposes fixes. Until now, fix verification was simulated: the code reconstructed a patched state, re-scored it, and predicted from that whether applying the fix would lower the failure signal. Simulated verification is exactly the weakness the essay warned about.

The Pisama Fix-Efficacy Rerun Environment replaces that simulation with measurement. For a real failing case it reconstructs the agent unit, appends a bounded guardrail to the system prompt, re-executes by really calling Claude, re-detects on the genuine new output, and scores a multi-component reward. The scope is deliberately narrow and the governing rule is strict: a published number must come from a real model call on a real, reconstructable unit. Everything else abstains.

Why re-execution, not simulation

A detector score tells you a failure was present. It does not tell you whether a proposed fix removes it. The honest test of a remediation is to apply it and run the agent again. Pisama’s previous in-loop verifier never did that. It carried a set of per-detector “replay shims” that approximated the post-fix state and re-scored the detector against the approximation. That is convenient and fast, and it is also the precise failure mode the verifier-datasheet essay describes: optimizing against an artifact of measurement rather than the task.

The rerun environment removes the approximation. It is closer to reliability engineering than to static classification: reconstruct the failing unit, apply the smallest safe change, run it for real, and check the genuine result.

The environment contract

For each case the environment runs a fixed pipeline, and any step that cannot proceed honestly returns an explicit abstention with a machine-readable reason:

1. Reconstruct the executed triple: the system prompt the agent actually ran under, the user input, and the original failing output. If the trace does not carry a real system prompt, abstain.
2. Reproduce the original detection on the unpatched reconstruction. If re-running the detector on the original output does not re-fire the failure within a confidence band, abstain. We never credit a fix for resolving a failure we could not first reproduce.
3. Build a bounded guardrail delta, grounded in the agent’s own role and the detector’s evidence, capped at roughly fifty tokens.
4. Re-execute for real. Append the guardrail to the system prompt and call Claude with the original user input. If no model call is available, abstain. The environment does not simulate the response.
5. Re-detect on the genuine new output, through the production detector path.
6. Score the reward.
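The pipeline above can be sketched as a single function in which every step either proceeds or returns an abstention with a reason code. This is a minimal illustration, not Pisama's implementation: the names run_case, Outcome, and Detection, the reason strings, the 0.5 reproduction threshold, and the word-count stand-in for the fifty-token cap are all assumptions, and the detector, guardrail builder, and model call are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    fired: bool
    confidence: float


@dataclass
class Outcome:
    status: str                          # "scored" or "abstained"
    reason: Optional[str] = None         # machine-readable abstention reason
    detection: Optional[Detection] = None
    new_output: Optional[str] = None


def run_case(
    system_prompt: Optional[str],
    user_input: str,
    original_output: str,
    detect: Callable[[str], Detection],
    build_guardrail: Callable[[str, Detection], str],
    call_model: Optional[Callable[[str, str], str]],
    min_confidence: float = 0.5,     # hypothetical reproduction threshold
    max_guardrail_words: int = 50,   # crude word-count stand-in for the token cap
) -> Outcome:
    # Step 1: without a real system prompt there is nothing honest to rerun.
    if not system_prompt:
        return Outcome("abstained", reason="no_system_prompt")

    # Step 2: the original failure must re-fire before any fix can be credited.
    original = detect(original_output)
    if not (original.fired and original.confidence >= min_confidence):
        return Outcome("abstained", reason="not_reproduced")

    # Step 3: build the guardrail and bound its length.
    guardrail = build_guardrail(system_prompt, original)
    guardrail = " ".join(guardrail.split()[:max_guardrail_words])

    # Step 4: real re-execution only; no model call means abstain, not simulate.
    if call_model is None:
        return Outcome("abstained", reason="no_model_call")
    new_output = call_model(system_prompt + "\n\n" + guardrail, user_input)

    # Step 5: re-detect on the genuine new output. Scoring (step 6) happens downstream.
    return Outcome("scored", detection=detect(new_output), new_output=new_output)
```

Passing call_model=None is how an offline run would look in this sketch: the case is counted as an abstention instead of receiving a predicted score.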

The reward

The reward is multi-component, with failure resolution carrying the largest weight:

failure_resolved (0.55): the target failure no longer fires, or its confidence drops past the same acceptance bar the auto-apply path uses.
no_new_regression (0.20): the fix did not introduce a new single-turn failure category. Multi-agent and structural detectors that cannot meaningfully fire on a single re-prompt are excluded, so they do not manufacture false regressions.
safety_maintained (0.10): the fix did not introduce a new higher-severity single-turn failure.
cost_efficiency (0.10): the real token cost of the rerun, against a per-rerun budget.
quality_preserved: wired to the calibrated outcome model, but it abstains for this corpus. The trained outcome models cover the arb, swe, and tau2 domains; these scenarios are off-domain, so the component reports no value rather than a fabricated one. An in-domain rerun receives a real calibrated value.

The total is a weight-renormalized mean over the components that were actually measurable. A run is marked calibrated only when every component was measured, which is never the case while quality abstains. This is deliberate: a partial score is never presented as a complete one.
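A weight-renormalized mean can be sketched as follows. The four stated weights come from the list above; the weight for quality_preserved is not given in this note, so the 0.05 below is a placeholder assumption. It does not affect the result while that component abstains. The function name total_reward and the use of None to mark an abstaining component are also illustrative choices.

```python
from typing import Dict, Optional, Tuple

WEIGHTS = {
    "failure_resolved": 0.55,
    "no_new_regression": 0.20,
    "safety_maintained": 0.10,
    "cost_efficiency": 0.10,
    "quality_preserved": 0.05,  # hypothetical: this weight is not stated in the note
}


def total_reward(components: Dict[str, Optional[float]]) -> Tuple[Optional[float], bool]:
    """Return (total, calibrated). A component value of None means it abstained."""
    measured = {k: v for k, v in components.items() if v is not None}
    if not measured:
        return None, False
    # Renormalize by the weights of the components that were actually measured.
    weight_sum = sum(WEIGHTS[k] for k in measured)
    total = sum(WEIGHTS[k] * v for k, v in measured.items()) / weight_sum
    # Calibrated only when every component reported a value.
    calibrated = len(measured) == len(WEIGHTS)
    return total, calibrated


# Illustrative values only: the fix resolved the failure, with half credit on cost.
total, calibrated = total_reward({
    "failure_resolved": 1.0,
    "no_new_regression": 1.0,
    "safety_maintained": 1.0,
    "cost_efficiency": 0.5,
    "quality_preserved": None,  # off-domain, so it abstains
})
```

Here the measured weights sum to 0.95, so the total is 0.90 / 0.95, and calibrated is False because one component abstained.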

Scope and data

Version 1 covers four detectors. Three are single-agent and prompt-governed: persona_drift, task_derailment, and instruction_compliance. Each is a failure a single re-prompt can faithfully reproduce and re-check. The fourth, role_usurpation, is multi-agent. It needs a conversation of at least three turns across at least two agents, where one agent steps outside its assigned role, and a single re-prompt cannot recreate that. So role_usurpation uses a dedicated multi-agent lane: it re-executes only the usurping agent’s turn, with that agent’s role as the system prompt and the prior turns as the context it responds to, while every other recorded turn is held fixed. The rebuilt conversation is then re-checked. Detectors outside this set still abstain.
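The multi-agent lane can be pictured as replacing exactly one turn in a recorded conversation. The sketch below is an assumption-laden illustration: Turn, rerun_usurping_turn, the roles mapping, and the way prior turns are flattened into a context string are all hypothetical, and the model call is passed in as a callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Turn:
    agent: str
    text: str


def rerun_usurping_turn(
    turns: List[Turn],
    usurper_index: int,
    roles: Dict[str, str],
    guardrail: str,
    call_model: Callable[[str, str], str],
) -> Optional[List[Turn]]:
    """Re-execute only the usurping agent's turn; return None to abstain."""
    # The lane needs at least three turns across at least two agents.
    if len(turns) < 3 or len({t.agent for t in turns}) < 2:
        return None
    target = turns[usurper_index]
    role = roles.get(target.agent)
    if not role:
        return None  # no recorded role to use as the system prompt
    # Prior turns are the context the agent responds to.
    context = "\n".join(f"{t.agent}: {t.text}" for t in turns[:usurper_index])
    new_text = call_model(role + "\n\n" + guardrail, context)
    # Every other recorded turn is held fixed.
    rebuilt = list(turns)
    rebuilt[usurper_index] = Turn(target.agent, new_text)
    return rebuilt
```

The rebuilt list is what would then be handed back to the detector for the re-check.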

The corpus pairs authored scenarios, the same convention as Pisama’s synthetic test agents, with real-trace units mined from actual agent runs. Every case is verified to fire the relevant production detector before it is admitted. For the authored cases the inputs and the original output are written by hand; for the mined cases the failing turn is taken straight from a real trace. In both, the rerun and every efficacy number are live, a real Claude call and a real re-detection. Provenance is recorded per case, and the report breaks the reward down by provenance, so the authored number and the real-data number are always shown separately.
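A per-provenance breakdown is a small grouping step. The sketch below assumes each scored case is reduced to a (provenance, reward) pair; the function name and the output shape are illustrative, and the sample values are made up, not taken from the corpus.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def reward_by_provenance(cases: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group scored cases by provenance and report n and mean reward per group."""
    buckets = defaultdict(list)
    for provenance, reward in cases:
        buckets[provenance].append(reward)
    return {p: {"n": len(r), "mean_reward": mean(r)} for p, r in buckets.items()}


# Made-up rewards, for illustration only.
report = reward_by_provenance([
    ("authored", 1.0),
    ("authored", 0.5),
    ("real_trace", 0.5),
])
```

Keeping the groups separate in the returned structure means there is no pooled number to quote by accident.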

Baseline results

On 55 verified units across four detectors (46 authored scenarios and 9 real-trace units), with one real rerun per unit:

n_real_rerun=55
abstention_rate=0.02
target_resolved_rate=0.76
mean_reward=0.81   (bootstrap 95% CI 0.74 to 0.88)
holdout mean_reward=0.70
success_rate=0.64
regressions=0

per detector:
  persona_drift:           19/23 resolved, mean_reward 0.91
  task_derailment:         13/15 resolved, mean_reward 0.94
  instruction_compliance:   6/14 resolved, mean_reward 0.55
  role_usurpation:          4/6  resolved, mean_reward 0.82

by provenance:
  authored:    mean_reward 0.83  (n=46)
  real_trace:  mean_reward 0.72  (n=9)
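An interval like the one reported for mean_reward can be computed with a percentile bootstrap over the per-unit rewards. The note does not say which bootstrap variant was used, so the following is one common choice, shown as a sketch. The resample count, the seed, and the function name are assumptions.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    rewards: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-unit rewards."""
    rng = random.Random(seed)
    n = len(rewards)
    # Resample the units with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(rewards, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With a small class such as six units, the interval from this procedure is wide, which is the variance the report is meant to expose.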

Three findings, each surfaced by the environment itself and each visible in the report rather than folded into a single headline.

First, real failures are harder than authored ones. The real-trace units, mined from actual agent runs, resolve at a lower rate than the clean authored scenarios, and their mean reward is 0.72 (n=9) against 0.83 (n=46). The per-provenance split exists precisely so this gap stays visible.

Second, the guardrail policy was overfitting. An earlier pass learned per-detector guardrail templates that beat the default on the authored-only corpus by a measured held-out margin. On the broader authored-plus-real union, those templates no longer beat the default, and the optimizer now keeps the default for three of the four detectors. The lift did not generalize, and the environment caught it instead of shipping it.

Third, instruction_compliance is the weak class, and the cause is the detector, not the guardrail. Its resolved rate is 6 of 14 in the baseline above and stays below one half regardless of which guardrail is applied, because the failure is a sticky format check rather than something an agent can be reminded out of. That is recorded as a detection-side problem.

A note on variance. These are live numbers from real model calls, so a rerun of the same corpus moves by a couple of points, and a small class like role_usurpation (six units) moves more. That is a property of measuring re-execution rather than simulating it, and the regression gate carries an explicit tolerance for it.

The gate, and how this stays honest

The environment ships with a regression gate that runs on every change, with no model calls and no secrets. It validates the committed artifact and compares it to a baseline on a content-and-config fingerprint. It fails on a mean-reward drop, a success-rate drop, an abstention spike, a drop in the number of real reruns below a floor, or a fingerprint mismatch. The floor matters: it stops a future change from quietly turning real reruns into abstentions and calling the result an improvement.
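The gate's checks can be sketched as a pure comparison of two report dictionaries, with no model calls involved. The metric names mirror the report fields; the "fingerprint" key, the tolerance of 0.03, the floor of 50 reruns, and the failure labels are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def gate(
    current: Dict,
    baseline: Dict,
    tolerance: float = 0.03,     # hypothetical allowance for live-run variance
    min_real_reruns: int = 50,   # hypothetical floor on real reruns
) -> List[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if current["fingerprint"] != baseline["fingerprint"]:
        failures.append("fingerprint_mismatch")
    if current["mean_reward"] < baseline["mean_reward"] - tolerance:
        failures.append("mean_reward_drop")
    if current["success_rate"] < baseline["success_rate"] - tolerance:
        failures.append("success_rate_drop")
    if current["abstention_rate"] > baseline["abstention_rate"] + tolerance:
        failures.append("abstention_spike")
    # The floor stops real reruns from quietly turning into abstentions.
    if current["n_real_rerun"] < min_real_reruns:
        failures.append("real_rerun_floor")
    return failures
```

A change that abstains on most of the corpus would trip both the abstention check and the rerun floor in this sketch, whatever its mean reward on the few remaining units.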

Two properties keep the numbers trustworthy. Abstentions are counted and reported but never scored, and the gate fails on an abstention spike or a drop in real reruns, so a strategy cannot raise its mean reward by abstaining more without that showing up. And efficacy numbers are logged to a dedicated calibration history, separate from the detection calibration metrics, so a fix-efficacy reward can never be mistaken for a detection F1.
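"Counted and never scored" can be made concrete with a small aggregation sketch. It assumes each case is reduced to a reward, or to None when the case abstained; the function name summarize is illustrative and the output keys mirror the report fields.

```python
from typing import Dict, List, Optional


def summarize(rewards: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Aggregate per-case rewards, where None marks an abstained case."""
    scored = [r for r in rewards if r is not None]
    n_total = len(rewards)
    return {
        # Abstentions are counted here...
        "n_real_rerun": len(scored),
        "abstention_rate": (n_total - len(scored)) / n_total if n_total else None,
        # ...but contribute nothing, not even a zero, to the mean.
        "mean_reward": sum(scored) / len(scored) if scored else None,
    }


# Made-up rewards, for illustration only.
summary = summarize([1.0, 0.5, None, None])
```

Abstaining on a hard case leaves the mean to the scored cases, but it lowers n_real_rerun and raises abstention_rate, which are the quantities the gate watches.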

The local run, the gate, and the corpus build are all reproducible:

python scripts/build_fix_efficacy_corpus.py
python scripts/calibrate_fix_efficacy.py --output data/fix_efficacy/current.json
python scripts/calibrate_fix_efficacy.py --no-live --gate data/fix_efficacy/baseline.json

The next steps follow from the findings. The real-trace corpus is small and grows through a weekly job that mines new agent runs and opens a pull request for review, so the real-data number sharpens as traffic accrues. quality_preserved still abstains for these scenarios because the trained outcome models cover other domains; an in-domain model would let it report a real value. The modest numbers stay in the record. They are part of the evidence, not something to be talked past.

Companion essay: Verifier Calibration in RL Environments, which argues that grader lineage and class-aware agreement should travel with every RL environment reward function.

Pisama is the open-source failure-detection platform behind this environment. Browse the detector registry with per-detector F1, or run it yourself from github.com/Pisama-AI/pisama.