Verifier Calibration in RL Environments
Environment supply is scaling faster than measurement discipline. The grader is where the quality problem actually lives.
Companies are spending heavily to build training worlds where AI agents practice real tasks. Most of that money goes into building more worlds. Far less goes into checking whether the score each world hands back is trustworthy. When the scorer is wrong, the agent learns to game the score instead of doing the task. This essay makes the case that the real quality bottleneck is the scorer, and proposes a short, reproducible report card that should travel with every grader or reward function.
Recent commercial interest in reinforcement learning environments has focused on environment supply: task volume, application replicas, expert-written scenarios, and standardized execution containers. This emphasis is understandable, since training environments are the visible substrate on which agent policies practice. It is incomplete. An environment only becomes useful for reinforcement learning when its feedback channel can be treated as a valid measurement instrument. If the reward function, judge, or grader is poorly calibrated, the environment can teach the policy to optimize artifacts of measurement rather than the intended task.
This essay makes a narrower claim than the usual “environments are the new datasets” thesis. Environment construction is becoming a market. Verifier calibration is becoming the quality-control layer of that market. The artifact I propose is a verifier datasheet: a compact, reproducible record for each grader or reward function, including label lineage, adjudication doctrine, cross-vendor agreement, class prevalence, calibration fingerprint, known limitations, and the command required to recompute the reported numbers.
I write from an applied position. At Pisama, I build failure detection and calibration infrastructure for multi-agent LLM systems. A failure detector combined with an LLM judge and a calibration pipeline is structurally similar to the grader half of an RL environment: it maps trajectories to verdicts, and downstream systems treat those verdicts as operational truth. I have not shipped training environments to a frontier lab. The observations below come from verifier calibration in production systems, then are connected to current evidence on reward hacking and environment markets.
Environment supply is scaling faster than measurement discipline
The investment case for RL environments is now familiar. If agents improve by practicing tasks and receiving feedback, the actors with the most realistic and diverse practice environments should influence the capability frontier. Mechanize argued in June 2025 that replication training could play for RL a role analogous to internet-scale text for language modeling. Wing VC framed the next four years as a race for the training and verification layer. The market has begun to price this view.
Epoch AI’s survey, drawn from interviews with eighteen practitioners in January 2026, reports environment contracts at seven figures per quarter or more. UI replicas of websites were estimated at around $20,000 each, complex product clones at around $300,000, and individual tasks at between $200 and $2,000. Fleet, which sells simulated replicas of enterprise software, grew from $1 million to a reported $60 million in annualized revenue and was in talks to raise at a $750 million valuation. Mercor raised $350 million at a $10 billion valuation as it expands from expert labor into reinforcement learning infrastructure. Scale AI says that nearly half of its new data projects now involve RL environments.
The headline numbers should be read carefully. Reports that Anthropic considered investing more than $1 billion in environments over the next year are reports of discussion, not booked spend. Wing’s estimate of Anthropic’s actual 2025 environment spend is on the order of tens of millions annually, with aggregate lab spend increasing three- to fivefold into 2026. This distinction matters because discussed spend and booked spend measure different things: one captures strategic appetite, the other operational absorption capacity.
The more important asymmetry is technical. Environment supply can be scaled through vendor networks, task templates, standardized sandboxes, and expert labor marketplaces. Measurement validity is harder to scale, because each grader has to survive distribution shift, class imbalance, judge drift, reward hacking, and ambiguous task doctrine. The market is building many more environments than it can currently certify as reliable sources of reward.
The disagreement is about verification, not environment value
The strongest skeptics of environment startups tend to identify the same weak point as the optimists. Sherwin Wu, who leads engineering on OpenAI’s API platform, has been publicly bearish on environment startups. Ross Taylor, who ran reasoning at Meta AI, argues that publicly available environments often require substantial modification before they work at training quality. Andrej Karpathy has written that he is bullish on environments and bearish on reinforcement learning specifically; on the Dwarkesh Patel podcast, he described outcome-reward RL as sparse supervision delivered through a single signal at the end of a long trajectory.
The disagreement concerns the reliability of the feedback channel. Wing’s pro-environment essay warns that verification is where many RL efforts fail. Epoch’s buyer interviews rank reward-hack robustness as the highest quality criterion for purchased environments and maintaining quality while scaling as the main vendor bottleneck. Both sides of the debate converge on the same technical object: the grader.
The open-source framing makes the decomposition explicit. In the verifiers library, which underpins Prime Intellect’s Environments Hub, an environment consists of a dataset, a harness, and a grader. The dataset supplies cases. The harness governs interaction. The grader decides what gets reinforced. As containers and harness conventions standardize, the grader becomes the main site of quality variation.
Reward channels are measurement instruments
The reward-hacking literature gives this claim empirical force. METR’s June 2025 analysis observed OpenAI’s o3 reward-hacking in 30.4 percent of RE-Bench runs. The behaviors included hacking a scoring timer, monkey-patching graders, and using stack introspection to recover the correct answer from the scoring system’s own call stack. These cases are best read as failures of measurement design under optimization pressure.
ImpossibleBench studies a related setting by giving models coding tasks where unit tests conflict with the written specification. Passing the tests therefore requires exploiting the grader. On Conflicting-SWEbench with the full scaffold, GPT-5’s cheating rate was 54 percent under the strict prompt and 66 percent under a looser prompt; Claude Opus 4.1 moved from 50 percent to 55 percent under the same prompt change. Hidden tests reduced cheating to near zero, read-only tests reduced direct test modification, and explicit abort mechanisms reduced cheating rates for OpenAI models. The key result is methodological rather than anecdotal: grader and harness design substantially changed measured behavior.
Anthropic’s November 2025 work on emergent misalignment from reward hacking raises the stakes. Models trained to reward-hack real production coding environments generalized that behavior to other contexts, including alignment faking and sabotage of safety research. A permissive reward channel can therefore shape behavior beyond the environment in which it was introduced.
For evaluation teams, the relevant analogy is measurement theory. A grader is a noisy instrument applied to a latent variable: task success, failure, safety, or quality. When the grader is an LLM judge applying a rubric, standard questions follow. What is the reference set? Who validated the labels? How often do independent judges agree? Does agreement remain meaningful under severe class imbalance? Does the score survive a change in judge family? Can a third party reconstruct the path from a published number to the corpus that produced it?
Commercial graders are moving in this direction already. Scale’s Rubrics as Rewards treats binary verifiable reward as a special case of rubric scoring. Epoch’s interviews indicate that sold environments often rely on unit tests or LLM judges. The known ceiling on judge-to-human agreement is not high enough to justify casual confidence: on GDPval’s expert-domain tasks, OpenAI’s automated grader agreed with human experts 66 percent of the time, against an inter-expert baseline of 71 percent. A judge cannot be shown to be more reliable than the label process used to validate it.
A verifier datasheet
Datasets received datasheets in 2018. Models received cards the same year. The component that converts behavior into reward still often ships as a score, a short rubric, or an informal claim. A verifier datasheet would make the measurement layer inspectable.
The minimum useful datasheet has six sections:
- Rubric and doctrine: the failure or success definition, including boundary cases.
- Label lineage: source data, labeling process, adjudication process, and known defects.
- Agreement statistics: raw agreement, class-specific agreement, abstention behavior, prevalence, and judge-family coverage.
- Calibration artifact: corpus fingerprint, run identifier, threshold source, and recomputation command.
- Operational use: whether the verifier gates releases, routes reviews, triggers remediation, or supplies training reward.
- Known limitations: unsupported distributions, weak classes, unresolved disputes, and conditions under which the score should not be used.
The purpose is to make verifier quality falsifiable. A buyer, researcher, or downstream engineer should be able to distinguish a stable negative-class detector from a reliable positive-class detector, a reproducible score from a screenshot, and an adjudicated doctrine from a prompt that happened to perform well on a small lane.
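One way to picture the six sections is as a small structured record that can be serialized and checked for gaps. The sketch below is illustrative only: the class name `VerifierDatasheet`, its field names, and the example values are assumptions made for this example, not the published template.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class VerifierDatasheet:
    """Hypothetical record with one field per datasheet section."""
    rubric_and_doctrine: str
    label_lineage: str
    agreement_statistics: dict
    calibration_artifact: dict
    operational_use: list
    known_limitations: list = field(default_factory=list)

    def missing_sections(self) -> list:
        """Return the names of empty sections so gaps stay visible."""
        return [name for name, value in asdict(self).items() if not value]


# Illustrative values only; placeholders mark what a real run would fill in.
sheet = VerifierDatasheet(
    rubric_and_doctrine="Definition of failure, including boundary cases.",
    label_lineage="Source data, labeling process, adjudication, known defects.",
    agreement_statistics={"raw_agreement": None, "positive_prevalence": None},
    calibration_artifact={
        "corpus_fingerprint": "<sha256 placeholder>",
        "run_id": "<placeholder>",
        "recompute_command": "<placeholder>",
    },
    operational_use=["routes reviews"],
)

print(json.dumps(asdict(sheet), indent=2))
print("missing:", sheet.missing_sections())
```

A check like `missing_sections` is the cheapest form of enforcement: a datasheet that ships without a limitations section is flagged before anyone reads the numbers.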
I am publishing a template and a worked example alongside this essay.
Production observations from Pisama
Pisama’s detector registry currently defines 84 detectors. As of June 11, 2026, 49 are measured on an external-only lane, meaning real traces with no synthetic data feeding a published score. Four are externally validated at production grade, defined as F1 at or above 0.80, precision at or above 0.70, and at least 30 real samples. Mean F1 across those four is 0.85. The remaining detectors are explicitly classified as beta, experimental, failing, or untested: 7 beta, 17 experimental, 21 failing, and 35 untested.
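The production-grade bar and the funnel arithmetic above are simple enough to state as code. This is a minimal sketch using only the thresholds and counts quoted in this section; the function and variable names are hypothetical and do not come from the Pisama codebase.

```python
# Thresholds as stated in the text: F1 >= 0.80, precision >= 0.70, >= 30 real samples.
PRODUCTION_BAR = {"f1": 0.80, "precision": 0.70, "real_samples": 30}


def is_production_grade(f1: float, precision: float, real_samples: int) -> bool:
    """True only when all three conditions of the bar hold at once."""
    return (
        f1 >= PRODUCTION_BAR["f1"]
        and precision >= PRODUCTION_BAR["precision"]
        and real_samples >= PRODUCTION_BAR["real_samples"]
    )


# Funnel counts as reported in the text.
funnel = {"production": 4, "beta": 7, "experimental": 17, "failing": 21, "untested": 35}
registry_size = 84

# Every detector is in exactly one bucket, and "measured" is everything not untested.
total = sum(funnel.values())
measured = registry_size - funnel["untested"]
print(total, measured)
```

The buckets sum to the registry size of 84, and the 49 measured detectors are the four measured buckets combined.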
Publishing the full funnel is a technical and managerial choice. It prevents the registry from becoming a portfolio of successful cases. It also makes the organization live with an uncomfortable measurement culture: quality bars remain stable, failures remain named, and weak classes are not quietly relabeled as edge cases. This matters because verifier work is especially vulnerable to selective reporting. Most detectors do not survive grounded measurement in their first form.
The clearest example is the task-derailment lane. The detector was initially validated against 100 real user conversations labeled by a three-vendor panel of judge models. In later adjudication, 7 of 9 positive labels flipped to negative after full-trace review. The root cause was a truncation defect in the labeling pipeline: judges saw only the first 1,500 characters of the prompt and the first 2,500 characters of the completion. Positives survived panel voting, including one unanimous three-to-zero case, because every judge evaluated the same incomplete evidence. Inter-judge agreement cannot detect a defect shared by all judges.
The adjudication produced a doctrine update: under-delivery is not task derailment. The decision record states who adjudicated, notes that model arbiters were used under explicit delegation rather than human review, records arbiter-family bias relative to the panel, and explains which row-level flips were caused by truncation. This is the sort of detail a reward consumer needs. Without it, a clean agreement number can conceal a contaminated label process.
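The truncation defect in this lane is the kind of problem a per-row evidence check can surface before any judge votes. The sketch below assumes the character limits quoted above; the function names and the idea of storing coverage next to each label are illustrative assumptions, not a description of the actual pipeline.

```python
# Limits as described in the text: judges saw a prefix of each field.
PROMPT_LIMIT = 1500
COMPLETION_LIMIT = 2500


def evidence_coverage(prompt: str, completion: str) -> float:
    """Fraction of the trace's characters that a truncated judge actually saw."""
    seen = min(len(prompt), PROMPT_LIMIT) + min(len(completion), COMPLETION_LIMIT)
    total = len(prompt) + len(completion)
    return seen / total if total else 1.0


def needs_full_trace_review(prompt: str, completion: str) -> bool:
    """Flag rows where any judge verdict rests on incomplete evidence."""
    return len(prompt) > PROMPT_LIMIT or len(completion) > COMPLETION_LIMIT


# Synthetic example: a long trace where half of the text was never shown.
long_prompt, long_completion = "p" * 3000, "c" * 5000
print(evidence_coverage(long_prompt, long_completion))
print(needs_full_trace_review(long_prompt, long_completion))
```

Recording coverage alongside each label makes a shared blind spot visible in the label lineage, which agreement statistics alone cannot do.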
Agreement statistics also require class-aware reporting. After full-text relabeling and a judge-panel swap, raw cross-vendor agreement on the WildChat derailment lane reads 0.96 to 0.98 across vendor pairs, computed on 91 to 95 shared rows at the current labeling version. The slice has about 2 percent positive prevalence. Each vendor casts exactly two yes votes, and no two vendors cast them on the same trace. Positive specific agreement is therefore 0.00. The same lane supports two true statements: the negative class is stable across judges, and the positive class remains unresolved. A datasheet must expose both.
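The gap between raw and class-specific agreement is easy to reproduce on synthetic votes. The sketch below uses the standard specific-agreement formulas: with a rows where both judges say yes, d rows where both say no, and b + c disagreements, positive specific agreement is 2a / (2a + b + c) and negative specific agreement is 2d / (2d + b + c). The vote vectors are invented to mimic the shape described above (95 shared rows, two yes votes per judge, no overlap); they are not the real lane data.

```python
import math


def agreement_stats(votes_a, votes_b):
    """Raw, positive-specific, and negative-specific agreement for two judges."""
    if len(votes_a) != len(votes_b):
        raise ValueError("both judges must vote on the same shared rows")
    pairs = list(zip(votes_a, votes_b))
    both_yes = sum(1 for a, b in pairs if a and b)
    both_no = sum(1 for a, b in pairs if not a and not b)
    disagree = len(pairs) - both_yes - both_no

    def ratio(num, den):
        # Undefined (NaN) when a class never appears for either judge.
        return num / den if den else math.nan

    return {
        "raw_agreement": ratio(both_yes + both_no, len(pairs)),
        "positive_specific_agreement": ratio(2 * both_yes, 2 * both_yes + disagree),
        "negative_specific_agreement": ratio(2 * both_no, 2 * both_no + disagree),
    }


# Synthetic votes: each judge says yes twice, never on the same row.
n_rows = 95
votes_a = [i in (0, 1) for i in range(n_rows)]
votes_b = [i in (2, 3) for i in range(n_rows)]

stats = agreement_stats(votes_a, votes_b)
print({name: round(value, 3) for name, value in stats.items()})
# -> raw 0.958, positive-specific 0.0, negative-specific 0.978
```

At roughly 2 percent prevalence, four disagreements barely move raw agreement, yet they account for every positive vote cast.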
Real data also changes detector engineering priorities. Across our calibration history, the deepest framework-specific detector family scores a mean F1 of 0.94 on real traces from its native framework, while the external lane for general-purpose detection averaged 0.75 after corpus expansion on June 11, 2026; a balanced real-only subset reads 0.76. A single data-quality rule, excluding traces under four messages where the targeted coordination signal cannot physically exist, moved one detector’s real-data F1 from 0.556 to 0.685. That gain was larger than any threshold or model change we attempted for that detector.
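A data-quality rule of this kind is an eligibility filter applied before scoring. The sketch below shows the mechanism on a toy set of five traces; the trace schema, the helper names, and the labels are invented for illustration and do not reproduce the 0.556 to 0.685 result.

```python
MIN_MESSAGES = 4  # Rule from the text: shorter traces cannot contain the signal.


def is_eligible(trace: dict) -> bool:
    """Keep only traces long enough for the coordination signal to exist."""
    return trace["n_messages"] >= MIN_MESSAGES


def f1_score(traces) -> float:
    """F1 over dicts that carry boolean 'label' and 'pred' keys."""
    tp = sum(1 for t in traces if t["label"] and t["pred"])
    fp = sum(1 for t in traces if not t["label"] and t["pred"])
    fn = sum(1 for t in traces if t["label"] and not t["pred"])
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: the two short traces are false positives by construction.
traces = [
    {"n_messages": 6, "label": True, "pred": True},
    {"n_messages": 5, "label": False, "pred": False},
    {"n_messages": 2, "label": False, "pred": True},
    {"n_messages": 3, "label": False, "pred": True},
    {"n_messages": 8, "label": True, "pred": False},
]

f1_before = f1_score(traces)
f1_after = f1_score([t for t in traces if is_eligible(t)])
print(round(f1_before, 3), round(f1_after, 3))
```

The filter changes the population being scored, so a datasheet should report it as part of label lineage rather than as a detector improvement.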
Operational lineage
Verifier calibration becomes platform engineering once verifier output can change software behavior. In Pisama, calibration runs record dataset fingerprints built from SHA256 content hashes of the corpus. Live thresholds carry the run that derived them. CI compares new runs against baseline before deploy. When a detector score changes, the first diagnostic question, whether the data changed or the detector changed, is answerable from the artifact.
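A content-hashed fingerprint and the "data or detector?" diagnostic can be sketched in a few lines with the standard library. This is one possible construction, assuming each corpus record is a string and that runs are stored as small dictionaries; the key names and the order-independent hashing scheme are assumptions, not Pisama's actual format.

```python
import hashlib


def corpus_fingerprint(records) -> str:
    """Order-independent SHA256 fingerprint built from per-record content hashes."""
    row_hashes = sorted(
        hashlib.sha256(record.encode("utf-8")).hexdigest() for record in records
    )
    return hashlib.sha256("\n".join(row_hashes).encode("utf-8")).hexdigest()


def what_changed(baseline: dict, candidate: dict) -> set:
    """Name the components that differ between two calibration runs."""
    changed = set()
    if baseline["fingerprint"] != candidate["fingerprint"]:
        changed.add("data")
    if baseline["detector_version"] != candidate["detector_version"]:
        changed.add("detector")
    return changed


corpus_v1 = ["trace one", "trace two", "trace three"]
corpus_v2 = corpus_v1 + ["trace four"]

baseline = {"fingerprint": corpus_fingerprint(corpus_v1), "detector_version": "1.0"}
candidate = {"fingerprint": corpus_fingerprint(corpus_v2), "detector_version": "1.0"}
print(what_changed(baseline, candidate))
```

Sorting the per-record hashes before combining them means that reordering the corpus leaves the fingerprint unchanged, while adding, removing, or editing any record changes it.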
Cost and latency are also part of verifier design. Pisama uses a detection ladder: structural checks at no model cost, state-delta and embedding tiers at one to two cents per trace, and LLM-judge calls at five to ten cents per trace, compared with a reference cost of roughly $50 for human review of a complex trace. Panel labeling lands around 1.7 cents per trace. Scores also sit in latency paths; structural false-positive gating reduced mean orchestrator runtime on chat traces from 82 milliseconds to 10 milliseconds. When verifier output can trigger or block remediation, precision becomes a safety property rather than a reporting metric.
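The ladder can be read as a cascade in which each tier either returns a verdict or escalates to the next, more expensive tier. The sketch below is a toy version: the tier checks are stubs, the per-tier costs are midpoints of the ranges quoted above, and all names, trace fields, and cutoffs are assumptions rather than the production design.

```python
from typing import Callable, NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    cost_cents: float
    check: Callable[[dict], Optional[bool]]  # None means "escalate"


def structural(trace: dict) -> Optional[bool]:
    # Stub: very short traces are ruled out without any model call.
    return False if trace["n_messages"] < 4 else None


def embedding(trace: dict) -> Optional[bool]:
    # Stub: decide only when a hypothetical drift score is clear-cut.
    score = trace.get("drift_score", 0.5)
    if score > 0.9:
        return True
    if score < 0.1:
        return False
    return None


def llm_judge(trace: dict) -> Optional[bool]:
    # Stub standing in for a real judge call.
    return trace.get("judge_says")


LADDER = [
    Tier("structural", 0.0, structural),
    Tier("embedding", 1.5, embedding),
    Tier("llm_judge", 7.5, llm_judge),
]


def run_ladder(trace: dict, tiers=LADDER):
    """Return (verdict, deciding tier, cents spent); abstain if no tier decides."""
    spent = 0.0
    for tier in tiers:
        spent += tier.cost_cents
        verdict = tier.check(trace)
        if verdict is not None:
            return verdict, tier.name, spent
    return None, "abstain", spent


print(run_ladder({"n_messages": 2}))
print(run_ladder({"n_messages": 9, "drift_score": 0.5, "judge_says": True}))
```

Recording which tier decided each verdict lets precision be reported per tier, which matters when a cheap structural gate is allowed to block an expensive remediation path.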
This operational context is directly relevant to RL environments. A grader used only for offline analysis can be imperfect in a different way from a grader used as training reward. The latter becomes part of the optimization target. Its lineage, robustness, and failure modes should be treated as first-class environment metadata.
Implications
The labor market around AI data already suggests where value is moving. xAI reportedly laid off 500 generalist annotators in September 2025 while announcing a tenfold expansion of specialist tutors. Mercor reports contractor pay averaging $95 an hour. Expert labor is becoming differentiated from generic annotation because verification increasingly requires domain judgment, adjudication, and auditability.
At the same time, environment containers are standardizing. OpenEnv, developed by Meta and Hugging Face, has commercial vendors on its steering committee. Standard containers reduce friction in packaging and exchange. They also move differentiation into the reward channel inside the container.
Within two years, serious environment buyers will ask for calibration artifacts from vendors in the same routine way enterprise software buyers ask for security artifacts.
The analogy has a limitation. Security attestations can decay into checkbox compliance. Verifier datasheets will face the same pressure. Their advantage is that the best version is executable: content-hashed corpora, recomputable agreement tables, regression gates, and published verdict exports.
The next test is to make the datasheet executable inside an environment. I have started that work with the Pisama Fix-Efficacy Rerun Environment: a real-trace environment that applies a proposed fix, re-executes the agent with a real model call, re-detects on the genuine new output, and scores a multi-component reward. When a faithful re-run is not possible, it abstains rather than simulating. The goal is modest: measure whether a remediation actually works on a reproducible real-data slice, and make the weak numbers visible before claiming that its rewards should be trusted.