v0.2 · 2026-05-27 · Research note

Pisama vs AgentPex on tau2-bench

Name: Pisama
Author: Pisama

AgentPex (Sharma, Barke, Zorn; Microsoft Research and University of Washington; CAIS 2026) extracts behavioral specifications from agent system prompts and tool schemas, then uses LLM judges to check trace compliance. The paper evaluates tau2-bench traces across three models and reports that a majority of Claude perfect-reward traces still contain procedural violations.

We ran Pisama detectors on the same tau2-bench corpus at paper scope, balanced across the three domains. Two questions:

Does Pisama reproduce the directional findings the paper reports?
Does Pisama surface failures the spec-extraction approach cannot?

Headline

Pisama reproduces AgentPex's headline findings directionally on the same substrate. Simultaneous-text-tool-call violations concentrate in Claude with a sharp drop-off across models, matching the paper's ordering. Omitted-calculation instances (paper section 4.3, bypassed calculator use in Claude traces) land close to the paper's count. The perfect-reward finding matches in direction too: a majority of Claude traces with a perfect tau2 reward still carry at least one Pisama flag. Pisama additionally flags sycophancy, a behavioral failure mode that AgentPex's spec extraction cannot surface.

Method

Pisama-bench is a per-detector fixture set, not an agent-trace corpus, so running AgentPex on Pisama-bench would be a category mismatch. We use tau2-bench as the common substrate. Pisama detectors run against the same trace shape AgentPex consumes (system prompt, tool schema, message log).

Sample: traces per model across three domains, balanced to match the paper's success-rate target, max 2 trials per task_id, fixed seed.
Models: Claude 3.7 Sonnet (closest available variant to the paper's Claude 3.5 Sonnet in the released tau2 result files), GPT-4.1, o4-mini.
Domains: airline, retail, telecom.
Free detector panel: simultaneous_call_text, sycophancy, derailment, loop, auth_bypass, omitted_calculation. All run with zero LLM cost.
LLM detector: specification_compliance on a trace subset, at a modest per-trace cost.
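
The sampling constraints above (per-domain balance, at most 2 trials per task_id, fixed seed) can be sketched as follows. This is an illustration, not the code in run_v2.py: the function name, the per_domain parameter, and the record fields domain and task_id are assumptions, and the success-rate balancing step is left out.

```python
import random
from collections import defaultdict


def sample_traces(traces, per_domain, max_trials_per_task=2, seed=0):
    """Pick up to `per_domain` traces per domain, capping trials per task_id.

    `traces` is assumed to be a list of dicts with "domain" and "task_id" keys.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    by_domain = defaultdict(list)
    for trace in traces:
        by_domain[trace["domain"]].append(trace)

    picked = []
    for domain in sorted(by_domain):  # sorted for a deterministic domain order
        pool = list(by_domain[domain])
        rng.shuffle(pool)
        trials_seen = defaultdict(int)
        kept = []
        for trace in pool:
            if trials_seen[trace["task_id"]] >= max_trials_per_task:
                continue  # this task already contributed its allowed trials
            trials_seen[trace["task_id"]] += 1
            kept.append(trace)
            if len(kept) == per_domain:
                break
        picked.extend(kept)
    return picked
```

Sorting the domains and seeding a private random.Random instance keeps the sample independent of dict ordering and of any other use of the global random state.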

1. Reproduction: per-model gradient

AgentPex reports that simultaneous-text-tool-call violations concentrate in Claude, with a sharp drop-off across models. Pisama at paper scope reproduces the same gradient and ordering: Claude highest, GPT-4.1 low, o4-mini at or near zero. The Claude rate is higher in our reproduction than in the paper, which we attribute to the version mismatch: Claude 3.7 has a higher base rate of this protocol violation than Claude 3.5.
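
A deterministic check for this violation needs no LLM. The sketch below assumes each message is a dict with role, content, and an optional tool_calls list; that shape and the function body are illustrative assumptions, not Pisama's actual detector.

```python
def simultaneous_call_text(messages):
    """Return indices of assistant turns that carry both text and tool calls."""
    hits = []
    for i, message in enumerate(messages):
        if message.get("role") != "assistant":
            continue
        has_text = bool((message.get("content") or "").strip())
        has_calls = bool(message.get("tool_calls"))
        if has_text and has_calls:
            hits.append(i)
    return hits
```

A per-model rate is then the share of traces for which the returned list is non-empty.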

2. Omitted-calculation reproduction (paper section 4.3)

The paper reports bypassed required-calculator use in Claude traces. Pisama's omitted_calculation detector flags when the agent does arithmetic in user-facing text without invoking the calculate tool. Our count on the same substrate lands close to the paper's.
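
One way to approximate this check is a regular expression over assistant text plus a scan for earlier calculator calls. The pattern, the message shape (role, content, tool_calls with a name field), and the rule that any earlier calculate call clears the flag are all assumptions made for illustration; the real detector may differ.

```python
import re

# Matches inline arithmetic such as "120 + 35 = 155" or "$120 + $35 = $155".
ARITHMETIC = re.compile(
    r"\d[\d,]*(?:\.\d+)?\s*[-+*/×]\s*\$?\d[\d,]*(?:\.\d+)?\s*=\s*\$?\d"
)


def omitted_calculation(messages, calculator_tool="calculate"):
    """Return indices of assistant turns that show arithmetic in text
    before any call to the calculator tool has appeared in the trace."""
    hits = []
    calculator_used = False
    for i, message in enumerate(messages):
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            if call.get("name") == calculator_tool:
                calculator_used = True
        text = message.get("content") or ""
        if ARITHMETIC.search(text) and not calculator_used:
            hits.append(i)
    return hits
```

Requiring an equals sign after the second operand keeps dates such as 2024-05-27 from matching, at the cost of missing arithmetic that is stated without an explicit result.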

3. Coverage extension: behavioral + structural detectors

Sycophancy, derailment, and loops are not encoded as policy rules in any tau2 domain prompt, so AgentPex's spec extractor has nothing to extract for them and its compliance checker has no rule to enforce. Pisama flags all three across the corpus: sycophancy in a meaningful minority of traces for all three models, derailment rarely (consistent with narrow customer-service tasks), and loops rarely under the current calibration (expected for short-horizon, policy-bounded interactions).

4. Loop detector calibration

An earlier pilot over-flagged tau2 traces as loops. Root cause: a single-key state_delta ({tool: name}) made every same-tool turn hash identically under structural matching. The current calibration uses a richer state representation (tool name, args-hash, content-fingerprint) that discriminates distinct turns, bringing the flag rate down to a realistic minority of traces.
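
The collision is easy to reproduce in a toy setting. In the sketch below, the turn shape (tool, args, content), the SHA-256 fingerprints, and the rule of three consecutive identical fingerprints are assumptions chosen for illustration; only the contrast between a tool-name-only state and the richer state comes from the description above.

```python
import hashlib
import json


def _digest(obj):
    """Stable short hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def coarse_fingerprint(turn):
    # Pilot-style state: tool name only, so every same-tool turn collides.
    return _digest({"tool": turn["tool"]})


def rich_fingerprint(turn):
    # Richer state: tool name, args hash, and a normalized content fingerprint.
    normalized = " ".join(turn["content"].lower().split())
    return _digest({
        "tool": turn["tool"],
        "args_hash": _digest(turn["args"]),
        "content_fp": _digest(normalized),
    })


def has_loop(turns, fingerprint, min_repeats=3):
    """Flag a trace when one fingerprint repeats min_repeats times in a row."""
    run_length, previous = 0, None
    for turn in turns:
        current = fingerprint(turn)
        run_length = run_length + 1 if current == previous else 1
        previous = current
        if run_length >= min_repeats:
            return True
    return False
```

Three lookups of different orders with the same tool trip the coarse fingerprint but not the rich one, while three byte-identical turns trip both.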

5. Perfect-reward traces still have procedural issues

AgentPex reports that a large majority of Claude tau2 traces that received a perfect tau2 reward still contain procedural violations. Pisama's six-detector free panel flags a comparable majority of perfect-reward Claude traces, and a smaller but still substantial share of perfect-reward traces overall across the three models.
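
The statistic itself is a conditional rate: among traces with a perfect reward, the share with at least one detector flag. A minimal sketch, assuming each record carries a model name, a reward, and a dict of per-detector booleans (field names are hypothetical):

```python
def perfect_reward_flag_rate(records, model=None):
    """Share of perfect-reward traces that carry at least one detector flag.

    Returns None when no perfect-reward trace matches the filter.
    """
    perfect = [
        r for r in records
        if r["reward"] == 1.0 and (model is None or r["model"] == model)
    ]
    if not perfect:
        return None
    flagged = sum(1 for r in perfect if any(r["flags"].values()))
    return flagged / len(perfect)
```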

6. LLM-backed specification_compliance

Pisama's specification_compliance detector implements the AgentPex pattern in Pisama's own stack: an LLM extracts behavioral rules from the domain policy, then per-rule deterministic and LLM checks evaluate trace events. It is externally validated at production grade. On the tau2 subset, after fixing a silent truncation bug in the rule extractor (a low default max-response-token setting cut off multi-rule policies mid-output), the detector successfully extracts several dozen rules per domain at a low per-trace cost.
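
Silent truncation of this kind can be turned into a loud failure by checking the stop reason and refusing unparseable output. The guard below is a sketch under assumptions: the function and exception names are hypothetical, the extractor is assumed to return a JSON list of rules, and the stop-reason values are the ones the OpenAI ("length") and Anthropic ("max_tokens") APIs report when a response hits its token limit.

```python
import json

# Stop reasons that indicate the response was cut off at the token limit.
TRUNCATION_REASONS = {"length", "max_tokens"}


class TruncatedExtraction(RuntimeError):
    """Raised when rule-extractor output appears to be cut off."""


def parse_rules(raw_text, stop_reason):
    """Parse extracted rules, failing loudly instead of dropping rules."""
    if stop_reason in TRUNCATION_REASONS:
        raise TruncatedExtraction("response hit the token limit: " + stop_reason)
    try:
        rules = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Output that stops mid-object usually fails to parse.
        raise TruncatedExtraction("rule JSON did not parse") from exc
    if not isinstance(rules, list):
        raise ValueError("expected a JSON list of rules")
    return rules
```

A caller can catch TruncatedExtraction and retry with a higher response-token limit rather than proceeding with a partial rule set.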

7. Cross-validation: AgentPex on the same sims

We ran AgentPex's full seven-evaluator pipeline (paper config) on the same simulations Pisama's spec_compliance evaluated, then joined per-trace flags. AgentPex treats aggregate_absolute_score < 100 as flagged; Pisama treats spec_compliance detected=true as flagged.

The large majority of sims get the same flag decision from both systems. The single disagreement is a Pisama spec_compliance false negative: AgentPex's output_spec_eval was driven by simultaneous text-and-tool-call violations that the LLM-judge spec_compliance missed. Pisama's deterministic simultaneous_call_text detector independently flagged the same trace, so on Pisama's combined free + LLM panel every sim is caught by at least one layer.

No sim is missed by both systems. Either the LLM-as-judge layer fires (most cases) or the deterministic structural check fires (the single disagreement). This is direct empirical support for running both layers rather than picking one.
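
The join behind this comparison can be sketched in a few lines. The two thresholds (aggregate_absolute_score < 100 and detected=true) come from the text above; the input shapes, the function names, and the boolean free-panel map are assumptions.

```python
def join_flags(agentpex, pisama_spec, pisama_free):
    """Join per-sim flag decisions from both systems on shared sim ids.

    agentpex:    {sim_id: {"aggregate_absolute_score": number}}
    pisama_spec: {sim_id: {"detected": bool}}
    pisama_free: {sim_id: bool}  # True if any free detector fired
    """
    rows = []
    for sim_id in sorted(agentpex.keys() & pisama_spec.keys()):
        agentpex_flag = agentpex[sim_id]["aggregate_absolute_score"] < 100
        spec_flag = bool(pisama_spec[sim_id]["detected"])
        free_flag = bool(pisama_free.get(sim_id, False))
        rows.append({
            "sim_id": sim_id,
            "agentpex": agentpex_flag,
            "pisama_spec": spec_flag,
            "pisama_any": spec_flag or free_flag,
        })
    return rows


def summarize(rows):
    """Agreement rate, disagreeing sims, and AgentPex flags no Pisama layer caught."""
    agree = sum(1 for r in rows if r["agentpex"] == r["pisama_spec"])
    return {
        "agreement": agree / len(rows) if rows else None,
        "disagreements": [r["sim_id"] for r in rows if r["agentpex"] != r["pisama_spec"]],
        "uncaught": [r["sim_id"] for r in rows if r["agentpex"] and not r["pisama_any"]],
    }
```

An empty uncaught list corresponds to the claim that every AgentPex-flagged sim is caught by at least one Pisama layer.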

Cost and runtime

AgentPex reports several API calls, tens of thousands of tokens, and roughly two minutes per trace. Pisama's six free detectors run with zero LLM cost across the full corpus. Pisama's LLM detector (specification_compliance) costs cents per trace and runs in seconds. The two pipelines target different points on the cost / depth curve; running both is additive, not exclusive.

Where the two pipelines diverge

AgentPex's strength is automated specification extraction from system prompts. Anything that can be stated as a prompt rule (must / must not / always / never / if-then) becomes a checkable constraint. The approach is general for single-agent, single-prompt traces with well-formed tool schemas.

Pisama covers a wider failure surface that is structurally outside spec extraction:

Behavioral failures not encoded as prompt rules: sycophancy, persona drift, consensus collapse, deception.
Metric-aware failures over numeric streams: convergence (plateau, regression, thrashing, divergence) and adaptive thinking (cost / latency disproportionate to task).
Multi-agent failures: coordination, communication mismatch, authority gradient, multi-chain entanglement.
In-app traces where there is no clean system prompt to extract specs from: the application is the spec.

The two are complementary on the same trace. For the procedural surface AgentPex covers, Pisama also has a specification_compliance detector built on the AgentPex pattern, externally validated at production grade. For the surface AgentPex does not cover, we are aware of no equivalent to Pisama in the literature as of the date of this note.

Limitations

Claude version mismatch: paper used Claude 3.5 Sonnet, the released tau2 result files we used contain Claude 3.7 Sonnet. We accept this drift for a directional reproduction.
Authority-bypass count is lower than the paper's for o4-mini. Our heuristic is a per-domain identity-tool allowlist; the paper's number derives from spec-extracted forbidden_edges. These are different signals.
Single benchmark: both AgentPex and this work evaluate on tau2-bench only. Generalization to SWE-bench, OSWorld, GAIA, or HardLLMAgentEval is unvalidated.
The specification_compliance LLM detector runs on a trace subset; full-corpus spec-compliance is feasible and is the v0.3 target. We also identified and fixed a Pisama bug where the rule extractor silently truncated output on policies that produce unusually long rule JSON.
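
To make the authority-bypass limitation concrete, here is one plausible reading of a per-domain identity-tool allowlist: a sensitive tool call is flagged if no identity tool from the domain's allowlist has been called earlier in the trace. The tool names, the sensitive-tool sets, and the message shape are hypothetical placeholders, and this is not the paper's forbidden_edges signal.

```python
# Hypothetical per-domain allowlists; real tool names may differ.
IDENTITY_TOOLS = {
    "airline": {"get_user_details"},
    "retail": {"find_user_id_by_email", "find_user_id_by_name_zip"},
}
SENSITIVE_TOOLS = {
    "airline": {"cancel_reservation"},
    "retail": {"cancel_pending_order"},
}


def auth_bypass(messages, domain):
    """Return indices of turns that call a sensitive tool before any
    identity tool from the domain's allowlist has been called."""
    identity = IDENTITY_TOOLS.get(domain, set())
    sensitive = SENSITIVE_TOOLS.get(domain, set())
    verified = False
    hits = []
    for i, message in enumerate(messages):
        for call in message.get("tool_calls") or []:
            name = call.get("name")
            if name in identity:
                verified = True
            elif name in sensitive and not verified:
                hits.append(i)
    return hits
```

Because this heuristic only looks at call order, it can undercount relative to a spec-derived check that also encodes which arguments and which user the verification applied to.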

Reproducing

Pilot script and aggregated results:

papers/head_to_head/run_v2.py
papers/head_to_head/pisama_on_tau2_v2.jsonl
papers/head_to_head/manuscript.md (workshop draft)

The AgentPex paper and source code: arXiv 2603.23806 and github.com/microsoft/agentpex. tau2-bench: github.com/sierra-research/tau2-bench.

Citation

Nikulainen, T. (2026). Pisama vs AgentPex on tau2-bench, v0.2.
Pisama research note, 2026-05-27.
https://pisama.ai/research/agentpex-head-to-head