Pisama vs AgentPex on tau2-bench
AgentPex (Sharma, Barke, Zorn; Microsoft Research and University of Washington; CAIS 2026) extracts behavioral specifications from agent system prompts and tool schemas, then uses LLM judges to check trace compliance. The paper evaluates 424 tau2-bench traces across three models and reports that 83% of Claude perfect-reward traces still contain procedural violations.
We ran Pisama detectors on the same tau2-bench corpus at paper scope (450 traces, 150 per model, balanced across the three domains). Two questions:
- Does Pisama reproduce the directional findings the paper reports?
- Does Pisama surface failures the spec-extraction approach cannot?
Pisama reproduces three of AgentPex's headline findings on the same substrate. (1) Simultaneous-text-tool-call gradient: paper 449 / 29 / 0 instances across Claude / GPT-4.1 / o4-mini, Pisama 739 / 19 / 0. (2) Omitted-calculation (paper section 4.3): paper reports 11 instances of bypassed calculator in Claude traces, Pisama flags 9 Claude traces (20 across all models). (3) Perfect-reward violation rate: paper 83% on Claude, Pisama 98% on Claude 3.7. Pisama additionally flags sycophancy in 144 traces, a behavioral failure mode AgentPex's spec-extraction cannot surface.
- Scale: 160 → 450 traces (paper scope: 150 per model across airline / retail / telecom, max 2 traces per task_id, 35% success-rate target, seed 42).
- Loop detector recalibrated: v0.1 over-fired on 100% of traces because the single-key { tool: name } state delta hashed identically for any same-tool turn. v0.2 adds args-hash and content-fingerprint to the state delta; flag rate drops to 7% with a sensible per-model distribution.
- Auth-bypass heuristic fixed: v0.1 read tc.function.name (OpenAI shape) but tau2 puts the name directly on tc.name. v0.2 uses per-domain identity-tool sets matching AgentPex's forbidden_edges semantics and now flags 2 cases.
- New detector: omitted_calculation mirrors AgentPex paper section 4.3 (agent does arithmetic in user-facing text without invoking the calculate tool). Flags 9 Claude traces vs paper's 11 instances.
- LLM detector in progress: Pisama's specification_compliance detector (the AgentPex-pattern detector calibrated to F1 0.966 on Pisama-bench) runs on a 32-trace subset. Required fix: bumped max_response_tokens from default 2000 to 8000 because rule extraction was silently truncating on multi-rule policies (returning 0 rules without error).
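The auth-bypass fix in the list above comes down to reading the tool name from the right place. A minimal sketch of a shape-tolerant accessor follows; the helper names `_get` and `tool_call_name` are hypothetical and this is not the code in run_v2.py, only an illustration of handling both the flat tc.name shape and the nested tc.function.name shape.

```python
from typing import Any, Optional


def _get(obj: Any, key: str) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(key)
    return getattr(obj, key, None)


def tool_call_name(tc: Any) -> Optional[str]:
    """Return the tool name for a flat (tc.name) or nested (tc.function.name) call."""
    flat = _get(tc, "name")
    if flat:
        return flat
    fn = _get(tc, "function")
    return _get(fn, "name") if fn is not None else None
```

Returning None rather than raising keeps a malformed tool call visible to the caller, which can then decide whether to skip or report it.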
Method
Pisama-bench is a per-detector fixture set, not an agent-trace corpus, so running AgentPex on Pisama-bench would be a category mismatch. We use tau2-bench as the common substrate. Pisama detectors run against the same trace shape AgentPex consumes (system prompt, tool schema, message log).
- Sample: 150 traces per model across three domains, balanced for ~35% success rate (matching paper), max 2 trials per task_id, seed 42. 450 traces total.
- Models: claude-3.7-sonnet, gpt-4.1, o4-mini. The paper used Claude 3.5 Sonnet; we used Claude 3.7 Sonnet (the closest available variant in the released tau2 result files).
- Domains: airline, retail, telecom.
- Free detector panel (6): simultaneous_call_text, sycophancy, derailment, loop, auth_bypass, omitted_calculation. All run on 450 traces with zero LLM cost.
- LLM detector: specification_compliance on a 32-trace subset. Cost target ~$2.30 at ~$0.07 per trace.
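The sampling constraints above can be sketched as a seeded, stratified draw. This is an illustration under stated assumptions, not the logic in run_v2.py: the field names (`model`, `domain`, `task_id`, `reward`), the 50-per-cell split, and the treatment of reward 1.0 as success are all hypothetical.

```python
import random
from collections import defaultdict


def sample_traces(traces, per_cell=50, max_per_task=2, success_pct=35, seed=42):
    """Draw per_cell traces per (model, domain) cell with a success-rate quota."""
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for t in traces:
        by_cell[(t["model"], t["domain"])].append(t)

    sample = []
    for cell in sorted(by_cell):
        pool = list(by_cell[cell])
        rng.shuffle(pool)
        n_success = per_cell * success_pct // 100  # integer quota, avoids float rounding
        quota = {True: n_success, False: per_cell - n_success}
        per_task = defaultdict(int)
        for t in pool:
            ok = t["reward"] >= 1.0
            if quota[ok] > 0 and per_task[t["task_id"]] < max_per_task:
                sample.append(t)
                quota[ok] -= 1
                per_task[t["task_id"]] += 1
    return sample
```

Seeding a local `random.Random` instead of the global generator keeps the draw reproducible even if other code in the same process also uses randomness.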
1. Reproduction: per-model gradient
AgentPex reports the simultaneous-text-tool-call violation concentrates in Claude with a sharp drop-off across models. Pisama at paper scope reproduces the same gradient and ordering.
| Model | Paper instances | Pisama instances (N=150) | Pisama traces flagged |
|---|---|---|---|
| Claude (3.5 paper / 3.7 ours) | 449 | 739 | 147/150 |
| GPT-4.1 | 29 | 19 | 16/150 |
| o4-mini | 0 | 0 | 0/150 |
Same ordering. Same drop-off shape. o4-mini matches exactly at zero. The Claude count is higher in our reproduction than in the paper (739 vs 449 instances), which may reflect a higher base rate of this protocol violation in Claude 3.7 than in Claude 3.5.
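A deterministic check for this violation can be very small. The sketch below assumes the violation means a single assistant message that carries both non-empty text and at least one tool call, and it assumes a message schema with `role`, `content`, and `tool_calls` keys; both are our illustrative assumptions rather than the exact definition used by either system.

```python
def count_simultaneous_text_tool_calls(messages):
    """Count assistant messages that contain both text and tool calls."""
    count = 0
    for m in messages:
        if m.get("role") != "assistant":
            continue
        has_text = bool((m.get("content") or "").strip())
        has_tools = bool(m.get("tool_calls"))
        if has_text and has_tools:
            count += 1
    return count
```

Counting instances per trace, rather than returning a single boolean, makes it possible to report both columns of the table above: total instances and traces flagged.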
2. Omitted-calculation reproduction (paper section 4.3)
The paper reports 11 instances of bypassed required calculator in Claude traces. Pisama's omitted_calculation detector flags when the agent does arithmetic in user-facing text without invoking the calculate tool. Our count on the same substrate:
| Model | Omitted calculation |
|---|---|
| claude-3.7-sonnet | 9/150 |
| gpt-4.1 | 3/150 |
| o4-mini | 8/150 |
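Paper Claude: 11 instances. Pisama Claude: 9 traces flagged. This is a close match, with the caveat that the paper counts instances while Pisama counts traces. Pisama flags 20 traces in total across all three models.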
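The detector's rule, arithmetic in user-facing text with no calculate call, can be sketched as a regex heuristic. The pattern, the message schema, and the trace-level (rather than turn-level) check are all illustrative assumptions; the actual omitted_calculation detector may use different signals.

```python
import re

# Hypothetical pattern: "<number> <operator> <number> =", with optional "$" and commas.
ARITH = re.compile(r"\$?\d[\d,]*(?:\.\d+)?\s*[+\-*/×]\s*\$?\d[\d,]*(?:\.\d+)?\s*=")


def omitted_calculation(messages, calc_tool="calculate"):
    """Flag a trace with inline arithmetic in assistant text and no calculator call."""
    used_calc = any(
        tc.get("name") == calc_tool
        for m in messages
        if m.get("role") == "assistant"
        for tc in (m.get("tool_calls") or [])
    )
    if used_calc:
        return False
    return any(
        m.get("role") == "assistant" and ARITH.search(m.get("content") or "")
        for m in messages
    )
```

Requiring the trailing "=" keeps dates and order numbers from matching, at the cost of missing arithmetic that is stated in words.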
3. Coverage extension: behavioral + structural detectors
Sycophancy, derailment, and loops are not encoded as policy rules in any tau2 domain prompt. AgentPex's spec extractor has nothing to extract for them; AgentPex's compliance checker has no rule to enforce.
| Model | Sycophancy | Derailment | Auth bypass | Loop (v0.2) |
|---|---|---|---|---|
| claude-3.7-sonnet | 50/150 | 0/150 | 0/150 | 8/150 |
| gpt-4.1 | 49/150 | 4/150 | 2/150 | 17/150 |
| o4-mini | 45/150 | 11/150 | 0/150 | 8/150 |
Sycophancy fires across all three models at 30-33% (roughly one in three traces). Derailment is rare (15/450), consistent with narrow customer-service tasks. Loops are rare on tau2 with v0.2 calibration (7%), as expected for short-horizon policy-bounded interactions.
4. Loop detector: v0.2 calibration
The v0.1 pilot flagged 100% of tau2 traces as loops. Root cause: a single-key state_delta ({tool: name}) made every same-tool turn hash identically under structural matching. v0.2 uses a richer state representation (tool name, args-hash, content-fingerprint) that discriminates distinct turns. Flag rate now 7% across 450 traces.
| Detection strategy | Loops flagged |
|---|---|
| structural | 25 |
| semantic_clustering | 8 |
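
The v0.1 collision and the v0.2 fix described in this section can be illustrated with a short sketch. The key functions, the content-fingerprint definition, and the `min_repeats` threshold are hypothetical stand-ins for the detector's real internals.

```python
import hashlib
import json
from collections import Counter


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def state_key_v01(turn: dict) -> str:
    """v0.1-style key: only the tool name, so same-tool turns collide."""
    return _sha(json.dumps({"tool": turn["tool"]}, sort_keys=True))


def state_key_v02(turn: dict) -> str:
    """v0.2-style key: tool name plus args hash and a content fingerprint."""
    delta = {
        "tool": turn["tool"],
        "args_hash": _sha(json.dumps(turn.get("args", {}), sort_keys=True)),
        # Hypothetical fingerprint: hash of lowercased, whitespace-normalized text.
        "content_fp": _sha(" ".join(turn.get("content", "").lower().split())),
    }
    return _sha(json.dumps(delta, sort_keys=True))


def has_structural_loop(turns, key_fn, min_repeats: int = 3) -> bool:
    """Flag a trace when any state key repeats at least min_repeats times."""
    counts = Counter(key_fn(t) for t in turns)
    return any(n >= min_repeats for n in counts.values())


# Three lookups of different orders: distinct work, same tool.
turns = [
    {"tool": "get_order_details", "args": {"order_id": f"#W{i}"}, "content": ""}
    for i in range(3)
]
print(has_structural_loop(turns, state_key_v01))  # True
print(has_structural_loop(turns, state_key_v02))  # False
```

Serializing with `sort_keys=True` before hashing keeps the key stable when the same arguments arrive in a different order.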
5. Perfect-reward traces still have procedural issues
AgentPex reports 83% of Claude tau2 traces that received a perfect tau2 reward (1.0) still contain procedural violations. Pisama's six-detector free panel flags 98% of perfect-reward Claude traces and 61% of perfect-reward traces overall across the three models.
| Model | Perfect-reward traces | With Pisama flag | Flag rate |
|---|---|---|---|
| claude-3.7-sonnet | 51 | 50 | 98% |
| gpt-4.1 | 51 | 23 | 45% |
| o4-mini | 51 | 21 | 41% |
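
The overall 61% figure is the pooled rate across the three rows, not the mean of the per-model rates. A short check using only the counts from the table:

```python
# (perfect-reward traces, traces with at least one Pisama flag), from the table above.
perfect = {
    "claude-3.7-sonnet": (51, 50),
    "gpt-4.1": (51, 23),
    "o4-mini": (51, 21),
}

rates = {model: flagged / total for model, (total, flagged) in perfect.items()}
total_perfect = sum(total for total, _ in perfect.values())
total_flagged = sum(flagged for _, flagged in perfect.values())
overall = total_flagged / total_perfect

for model, rate in rates.items():
    print(f"{model}: {rate:.0%}")
print(f"overall: {total_flagged}/{total_perfect} = {overall:.0%}")  # overall: 94/153 = 61%
```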
6. LLM-backed specification_compliance
Pisama's specification_compliance detector implements the AgentPex pattern in Pisama's own stack: an LLM extracts behavioral rules from the domain policy, then per-rule deterministic and LLM checks evaluate trace events. F1 0.966 on Pisama-bench. On the tau2 subset, after fixing a silent truncation bug in the rule extractor (the default 2000 max_response_tokens cut off multi-rule policies mid-output), the detector successfully extracts 30-40 rules per domain at ~$0.07 per trace. (See papers/head_to_head/pisama_on_tau2_v2spec.jsonl for the current run state; v0.2 LLM-detector subset is in progress.)
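The truncation bug was silent because a cut-off response produced zero rules instead of an error. One way to make that failure loud is to validate the extractor output before using it. The sketch below is illustrative only: `RuleExtractionError` and `parse_rules` are hypothetical names, and it assumes the extractor returns a JSON array of rule objects.

```python
import json


class RuleExtractionError(ValueError):
    """Raised when extractor output is unusable instead of returning no rules."""


def parse_rules(raw: str) -> list:
    """Parse extractor output, failing loudly on truncated or empty results."""
    try:
        rules = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RuleExtractionError(
            f"rule JSON did not parse (possibly truncated output): {exc}"
        ) from exc
    if not isinstance(rules, list) or not rules:
        raise RuleExtractionError("extractor returned no rules")
    return rules
```

With a guard like this, a response cut off mid-output by a token limit surfaces as an exception at extraction time rather than as a trace that appears to pass every check.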
7. Cross-validation: AgentPex on the same sims
We ran AgentPex's full seven-evaluator pipeline (gpt-5-mini at temperature 1.0, paper config) on the same 22 simulations Pisama's spec_compliance evaluated, then joined per-trace flags. AgentPex treats aggregate_absolute_score < 100 as flagged; Pisama treats spec_compliance detected=true as flagged.
| Cell | Count | Rate |
|---|---|---|
| Both flag | 21 | 95.5% |
| Both clear | 0 | 0.0% |
| Pisama only | 0 | 0.0% |
| AgentPex only | 1 | 4.5% |
21 of 22 sims (95.5%) get the same flag decision from both systems. The single disagreement (sim 287b28f7 airline) is a Pisama spec_compliance false negative: AgentPex's output_spec_eval scored 25 / 100 driven by simultaneous text-and-tool-call violations (3 instances) that the LLM-judge spec_compliance missed. Pisama's deterministic simultaneous_call_text detector independently flagged the same trace with 3 instances, so on Pisama's combined free + LLM panel the agreement is 22 of 22 (100%).
No sim is missed by both systems. Either the LLM-as-judge layer fires (most cases) or the deterministic structural check fires (the 1 disagreement). This is direct empirical support for running both layers rather than picking one.
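The per-trace join behind the agreement table can be sketched as follows. The flag thresholds come from the text above (aggregate_absolute_score < 100 for AgentPex, detected=true for Pisama); the function name and the dict-keyed-by-sim-id inputs are assumptions for illustration.

```python
from collections import Counter


def agreement_cells(agentpex_scores: dict, pisama_detected: dict) -> Counter:
    """Cross-tabulate per-sim flag decisions from the two systems."""
    cells = Counter()
    # Only sims evaluated by both systems can be compared.
    for sim_id in sorted(agentpex_scores.keys() & pisama_detected.keys()):
        agentpex_flag = agentpex_scores[sim_id] < 100
        pisama_flag = bool(pisama_detected[sim_id])
        if agentpex_flag and pisama_flag:
            cells["both_flag"] += 1
        elif agentpex_flag:
            cells["agentpex_only"] += 1
        elif pisama_flag:
            cells["pisama_only"] += 1
        else:
            cells["both_clear"] += 1
    return cells
```

Agreement is then `(cells["both_flag"] + cells["both_clear"]) / sum(cells.values())`, which for the counts in the table is 21 / 22.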
Cost and runtime
AgentPex reports ~9 API calls, ~77k tokens, and ~139 seconds per trace at roughly $0.019 per trace. Pisama's six free detectors run with zero LLM cost across the full 450-trace corpus. Pisama's LLM detector (specification_compliance) costs ~$0.07 per trace at ~25 seconds. The two pipelines target different points on the cost / depth curve; running both is additive, not exclusive.
Where the two pipelines diverge
AgentPex's strength is automated specification extraction from system prompts: anything that can be stated as a prompt rule (must / must not / always / never / if-then) becomes a checkable constraint. The approach is general for single-agent, single-prompt traces with well-formed tool schemas.
Pisama covers a wider failure surface that is structurally outside spec extraction:
- Behavioral failures not encoded as prompt rules: sycophancy, persona drift, consensus collapse, deception.
- Metric-aware failures over numeric streams: convergence (plateau, regression, thrashing, divergence) and adaptive thinking (cost / latency disproportionate to task).
- Multi-agent failures: coordination, communication mismatch, authority gradient, multi-chain entanglement.
- In-app traces where there is no clean system prompt to extract specs from: the application is the spec.
The two are complementary on the same trace. For the procedural surface AgentPex covers, Pisama also has a specification_compliance detector built on the AgentPex pattern (Pisama-bench F1 0.966). For the surface AgentPex does not cover, we are not aware of an equivalent to Pisama in the literature as of the date of this note.
Limitations
- Claude version mismatch: paper used Claude 3.5 Sonnet, the released tau2 result files we used contain Claude 3.7 Sonnet. We accept this drift for a directional reproduction.
- Authority-bypass count is lower than the paper's: Pisama flags 2 cases, both in gpt-4.1, versus the paper's 7 in o4-mini. Our heuristic is a per-domain identity-tool allowlist; the paper's number derives from spec-extracted forbidden_edges. These are different signals.
- Single benchmark: both AgentPex and this work evaluate on tau2-bench only. Generalization to SWE-bench, OSWorld, GAIA, or HardLLMAgentEval is unvalidated.
- The specification_compliance LLM detector runs on a 32-trace subset; full N=450 spec-compliance is feasible at ~$32 and is the v0.3 target. We also identified and fixed a Pisama bug where the rule extractor silently truncated output on policies that produce more than ~2000 tokens of rule JSON.
Reproducing
Pilot script and aggregated results:
- papers/head_to_head/run_v2.py
- papers/head_to_head/pisama_on_tau2_v2.jsonl (N=450 free + spec)
- papers/head_to_head/manuscript.md (workshop draft)
- frontend/src/data/agentpex_head_to_head.json (this page's data)
The AgentPex paper and source code: arXiv 2603.23806 and github.com/microsoft/agentpex. tau2-bench: github.com/sierra-research/tau2-bench.
Nikulainen, T. (2026). Pisama vs AgentPex on tau2-bench, v0.2. Pisama research note, 2026-05-27. https://pisama.ai/research/agentpex-head-to-head