Pisama
v0.2 · 450 traces · 2026-05-27 · Research note

Pisama vs AgentPex on tau2-bench

AgentPex (Sharma, Barke, Zorn; Microsoft Research and University of Washington; CAIS 2026) extracts behavioral specifications from agent system prompts and tool schemas, then uses LLM judges to check trace compliance. The paper evaluates 424 tau2-bench traces across three models and reports that 83% of Claude perfect-reward traces still contain procedural violations.

We ran Pisama detectors on the same tau2-bench corpus at paper scope (450 traces, 150 per model, balanced across the three domains). Two questions:

  1. Does Pisama reproduce the directional findings the paper reports?
  2. Does Pisama surface failures the spec-extraction approach cannot?

Headline (v0.2, N=450)

Pisama reproduces three of AgentPex's headline findings on the same substrate. (1) Simultaneous-text-tool-call gradient: paper 449 / 29 / 0 instances across Claude / GPT-4.1 / o4-mini, Pisama 739 / 19 / 0. (2) Omitted calculation (paper section 4.3): the paper reports 11 instances of a bypassed calculator in Claude traces; Pisama flags 9 Claude traces (20 across all models). (3) Perfect-reward violation rate: paper 83% on Claude, Pisama 98% on Claude 3.7. Pisama additionally flags sycophancy in 144 of 450 traces, a behavioral failure mode that AgentPex's spec extraction cannot surface.

v0.2 changelog vs v0.1 (2026-05-27)
  • Scale: 160 → 450 traces (paper scope: 150 per model across airline / retail / telecom, max 2 traces per task_id, 35% success-rate target, seed 42).
  • Loop detector recalibrated: v0.1 over-fired on 100% of traces because the single-key { tool: name } state delta hashed identically for any same-tool turn. v0.2 adds args-hash and content-fingerprint to the state delta; flag rate drops to 7% with a sensible per-model distribution.
  • Auth-bypass heuristic fixed: v0.1 read tc.function.name (OpenAI shape) but tau2 puts the name directly on tc.name. v0.2 uses per-domain identity-tool sets matching AgentPex's forbidden_edges semantics and now flags 2 cases.
  • New detector: omitted_calculation mirrors AgentPex paper section 4.3 (the agent does arithmetic in user-facing text without invoking the calculate tool). Flags 9 Claude traces vs the paper's 11 instances.
  • LLM detector in progress: Pisama's specification_compliance detector (the AgentPex-pattern detector calibrated to F1 0.966 on Pisama-bench) runs on a 32-trace subset. Required fix: bumped max_response_tokens from default 2000 to 8000 because rule extraction was silently truncating on multi-rule policies (returning 0 rules without error).
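
The auth-bypass fix above comes down to reading the tool name from the right place. A minimal sketch, assuming tool calls are available as plain dicts (the dict representation and the helper name tool_call_name are illustrative, not Pisama's actual code):

```python
from typing import Any, Mapping, Optional


def tool_call_name(tc: Mapping[str, Any]) -> Optional[str]:
    """Return the tool name from a tool-call record in either layout."""
    # tau2 layout (per this note): the name sits directly on the tool call.
    name = tc.get("name")
    if isinstance(name, str) and name:
        return name
    # OpenAI-style layout: the name is nested under "function".
    fn = tc.get("function")
    if isinstance(fn, Mapping):
        nested = fn.get("name")
        if isinstance(nested, str) and nested:
            return nested
    # Neither layout matched; the caller decides how to treat this.
    return None
```

Returning None instead of an empty string lets a caller distinguish "no name found" from a real tool name, which is the kind of silent mismatch that made the v0.1 heuristic miss every tau2 call.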

Method

Pisama-bench is a per-detector fixture set, not an agent-trace corpus, so running AgentPex on Pisama-bench would be a category mismatch. We use tau2-bench as the common substrate. Pisama detectors run against the same trace shape AgentPex consumes (system prompt, tool schema, message log).

  • Sample: 150 traces per model across three domains, balanced for ~35% success rate (matching paper), max 2 trials per task_id, seed 42. 450 traces total.
  • Models: claude-3.7-sonnet, gpt-4.1, o4-mini. The paper used Claude 3.5 Sonnet; we used Claude 3.7 Sonnet (the closest available variant in the released tau2 result files).
  • Domains: airline, retail, telecom.
  • Free detector panel (6): simultaneous_call_text, sycophancy, derailment, loop, auth_bypass, omitted_calculation. All run on 450 traces with zero LLM cost.
  • LLM detector: specification_compliance on a 32-trace subset. Cost target ~$2.30 at ~$0.07 per trace.
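
The sampling constraints above can be sketched as follows. This is an illustration under stated assumptions, not the contents of run_v2.py: the field names task_id and reward, the function name sample_traces, and the greedy quota strategy are all hypothetical, and domain balancing is left to the caller (for example by calling the function once per model-domain cell).

```python
import random
from collections import defaultdict


def sample_traces(traces, n=150, success_rate=0.35, max_per_task=2, seed=42):
    """Pick up to n traces with a target success share and a per-task cap."""
    rng = random.Random(seed)          # fixed seed keeps the sample reproducible
    shuffled = list(traces)
    rng.shuffle(shuffled)

    want_success = round(n * success_rate)
    want_failure = n - want_success
    per_task = defaultdict(int)
    picked = []

    for trace in shuffled:
        if per_task[trace["task_id"]] >= max_per_task:
            continue                   # cap on trials per task_id
        success = trace["reward"] == 1.0
        if success and want_success > 0:
            want_success -= 1
        elif not success and want_failure > 0:
            want_failure -= 1
        else:
            continue                   # this outcome's quota is already full
        per_task[trace["task_id"]] += 1
        picked.append(trace)
    return picked
```

If the pool cannot satisfy a quota, the function returns fewer than n traces rather than relaxing a constraint, so a short sample is visible to the caller.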

1. Reproduction: per-model gradient

AgentPex reports the simultaneous-text-tool-call violation concentrates in Claude with a sharp drop-off across models. Pisama at paper scope reproduces the same gradient and ordering.

Model                         | Paper instances | Pisama instances (N=150) | Pisama traces flagged
Claude (3.5 paper / 3.7 ours) | 449             | 739                      | 147/150
GPT-4.1                       | 29              | 19                       | 16/150
o4-mini                       | 0               | 0                        | 0/150

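Same ordering and the same drop-off shape; o4-mini matches exactly at zero. The Claude count is higher in our reproduction than in the paper, which suggests that Claude 3.7 has a higher base rate of this protocol violation than Claude 3.5, although the two counts come from different samples and model versions and are not directly comparable.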
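
A deterministic check for this violation can be very small. The sketch below assumes messages are dicts with role, a string content, and an optional tool_calls list; the function name and message shape are illustrative rather than Pisama's actual simultaneous_call_text implementation.

```python
def count_simultaneous_text_tool_calls(messages):
    """Count assistant turns that carry both user-facing text and tool calls."""
    count = 0
    for message in messages:
        if message.get("role") != "assistant":
            continue
        has_text = bool((message.get("content") or "").strip())
        has_tool_calls = bool(message.get("tool_calls"))
        if has_text and has_tool_calls:
            count += 1
    return count
```

Counting instances per trace, rather than returning a single boolean, is what allows both columns in the table above: total instances and traces with at least one instance.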

2. Omitted-calculation reproduction (paper section 4.3)

The paper reports 11 instances of bypassed required calculator in Claude traces. Pisama's omitted_calculation detector flags when the agent does arithmetic in user-facing text without invoking the calculate tool. Our count on the same substrate:

Model             | Omitted calculation
claude-3.7-sonnet | 9/150
gpt-4.1           | 3/150
o4-mini           | 8/150

Paper, Claude: 11 instances. Pisama, Claude: 9 traces flagged. The units differ (instances vs traces), but the counts are a close match. Pisama flags 20 traces in total across all three models.
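
One way to approximate this heuristic is a regular expression over assistant text combined with a check for calculator use. The sketch below is an assumption-laden illustration: the regex, the trace-level granularity, the message shape, and the function name are hypothetical, and the tool name "calculate" is taken from the description above.

```python
import re

# Matches inline arithmetic such as "2 x $50 = $100" or "120.50 + 30 =".
_NUMBER = r"\$?\d[\d,]*(?:\.\d+)?"
ARITHMETIC = re.compile(_NUMBER + r"\s*[-+*/x×]\s*" + _NUMBER + r"\s*=")


def omitted_calculation(messages, calculator_tool="calculate"):
    """Flag a trace that shows arithmetic in text but never calls the calculator."""
    used_calculator = any(
        tc.get("name") == calculator_tool
        for m in messages
        if m.get("role") == "assistant"
        for tc in (m.get("tool_calls") or [])
    )
    if used_calculator:
        return False
    return any(
        m.get("role") == "assistant"
        and ARITHMETIC.search(m.get("content") or "") is not None
        for m in messages
    )
```

A trace-level flag like this is coarser than a per-instance count: a trace that calls the calculator once and then does a second sum in text would not be flagged here.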

3. Coverage extension: behavioral + structural detectors

Sycophancy, derailment, and loops are not encoded as policy rules in any tau2 domain prompt. AgentPex's spec extractor has nothing to extract for them; AgentPex's compliance checker has no rule to enforce.

Model             | Sycophancy | Derailment | Auth bypass | Loop (v0.2)
claude-3.7-sonnet | 50/150     | 0/150      | 0/150       | 8/150
gpt-4.1           | 49/150     | 4/150      | 2/150       | 17/150
o4-mini           | 45/150     | 11/150     | 0/150       | 8/150

Sycophancy fires across all three models at 30-33% (one in three traces). Derailment is rare (15/450), consistent with narrow customer-service tasks. Loops are rare on tau2 with v0.2 calibration (7%), as expected for short-horizon policy-bounded interactions.

4. Loop detector: v0.2 calibration

The v0.1 pilot flagged 100% of tau2 traces as loops. Root cause: a single-key state_delta ({tool: name}) made every same-tool turn hash identically under structural matching. v0.2 uses a richer state representation (tool name, args-hash, content-fingerprint) that discriminates distinct turns. Flag rate now 7% across 450 traces.

Detection strategy  | Loops flagged
structural          | 25
semantic_clustering | 8
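
The difference between the two state representations can be illustrated with a toy structural matcher. Everything here is a sketch under assumptions: the turn fields (tool, args, content), the SHA-256 fingerprints, and the repeat threshold of 3 are hypothetical and are not Pisama's actual calibration.

```python
import hashlib
import json
from collections import Counter


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def state_key_v1(turn):
    # v0.1-style key: tool name only, so every same-tool turn collides.
    return (turn["tool"],)


def state_key_v2(turn):
    # v0.2-style key: tool name plus args hash plus a content fingerprint.
    args_hash = _digest(json.dumps(turn.get("args", {}), sort_keys=True, default=str))
    content_fp = _digest(" ".join(turn.get("content", "").lower().split()))
    return (turn["tool"], args_hash, content_fp)


def has_structural_loop(turns, key_fn, min_repeats=3):
    """Flag a trace when any state key repeats at least min_repeats times."""
    counts = Counter(key_fn(turn) for turn in turns)
    return any(n >= min_repeats for n in counts.values())
```

With three lookups of three different orders, the v0.1 key reports a loop and the v0.2 key does not; three byte-identical calls are flagged under both.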

5. Perfect-reward traces still have procedural issues

AgentPex reports 83% of Claude tau2 traces that received a perfect tau2 reward (1.0) still contain procedural violations. Pisama's six-detector free panel flags 98% of perfect-reward Claude traces and 61% of perfect-reward traces overall across the three models.

Model             | Perfect-reward traces | With Pisama flag | Flag rate
claude-3.7-sonnet | 51                    | 50               | 98%
gpt-4.1           | 51                    | 23               | 45%
o4-mini           | 51                    | 21               | 41%
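
The per-model and overall rates follow directly from the counts in the table. A short check of that arithmetic (variable names are illustrative):

```python
# (perfect-reward traces, traces with at least one Pisama flag), from the table.
perfect_reward = {
    "claude-3.7-sonnet": (51, 50),
    "gpt-4.1": (51, 23),
    "o4-mini": (51, 21),
}


def flag_rate_pct(total, flagged):
    return round(100 * flagged / total)


per_model = {m: flag_rate_pct(t, f) for m, (t, f) in perfect_reward.items()}
total = sum(t for t, _ in perfect_reward.values())       # 153
flagged = sum(f for _, f in perfect_reward.values())     # 94
overall = flag_rate_pct(total, flagged)
```

The overall figure is the pooled rate (94 of 153), not the mean of the three per-model percentages, although the two coincide here because each model contributes the same number of perfect-reward traces.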

6. LLM-backed specification_compliance

Pisama's specification_compliance detector implements the AgentPex pattern in Pisama's own stack: an LLM extracts behavioral rules from the domain policy, then per-rule deterministic and LLM checks evaluate trace events. F1 0.966 on Pisama-bench. On the tau2 subset, after fixing a silent truncation bug in the rule extractor (the default 2000 max_response_tokens cut off multi-rule policies mid-output), the detector successfully extracts 30-40 rules per domain at ~$0.07 per trace. (See papers/head_to_head/pisama_on_tau2_v2spec.jsonl for the current run state; v0.2 LLM-detector subset is in progress.)
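
Raising the token limit fixes the immediate symptom; a guard that refuses to treat truncated output as "zero rules" would make the same failure loud next time. The sketch below is hypothetical: it assumes the extractor can see the provider's stop reason ("length" in OpenAI-style responses, "max_tokens" in Anthropic-style responses) and that rules are returned as a JSON list. The names parse_rules and RuleExtractionError are illustrative.

```python
import json


class RuleExtractionError(RuntimeError):
    """Raised when rule extraction output cannot be trusted."""


# Stop reasons that indicate the response was cut off at the token limit.
TRUNCATION_REASONS = {"length", "max_tokens"}


def parse_rules(raw_text, stop_reason):
    """Parse extracted rules, failing loudly instead of returning an empty list."""
    if stop_reason in TRUNCATION_REASONS:
        raise RuleExtractionError(
            "rule output hit the token limit; raise max_response_tokens"
        )
    try:
        rules = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise RuleExtractionError(f"rule output is not valid JSON: {exc}") from exc
    if not isinstance(rules, list) or not rules:
        raise RuleExtractionError("extractor returned no rules")
    return rules
```

Treating an empty rule list as an error is a design choice for policy documents like the tau2 domain prompts, where zero extractable rules is far more likely to be a pipeline fault than a true result.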

7. Cross-validation: AgentPex on the same sims

We ran AgentPex's full seven-evaluator pipeline (gpt-5-mini at temperature 1.0, paper config) on the same 22 simulations Pisama's spec_compliance evaluated, then joined per-trace flags. AgentPex treats aggregate_absolute_score < 100 as flagged; Pisama treats spec_compliance detected=true as flagged.

Cell          | Count | Rate
Both flag     | 21    | 95.5%
Both clear    | 0     | 0.0%
Pisama only   | 0     | 0.0%
AgentPex only | 1     | 4.5%

21 of 22 sims (95.5%) get the same flag decision from both systems. The single disagreement (sim 287b28f7 airline) is a Pisama spec_compliance false negative: AgentPex's output_spec_eval scored 25 / 100 driven by simultaneous text-and-tool-call violations (3 instances) that the LLM-judge spec_compliance missed. Pisama's deterministic simultaneous_call_text detector independently flagged the same trace with 3 instances, so on Pisama's combined free + LLM panel the agreement is 22 of 22 (100%).

No sim is missed by both systems. Either the LLM-as-judge layer fires (most cases) or the deterministic structural check fires (the 1 disagreement). This is direct empirical support for running both layers rather than picking one.
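
The join behind the agreement table can be sketched as below. The flagging rules (aggregate_absolute_score < 100 for AgentPex, detected=true for Pisama) come from this section; the function names and the dict-based inputs keyed by simulation id are assumptions for illustration.

```python
from collections import Counter


def agreement_cells(agentpex_scores, pisama_detected):
    """Cross-tabulate per-simulation flags from the two systems."""
    cells = Counter(both_flag=0, both_clear=0, pisama_only=0, agentpex_only=0)
    # Only simulations evaluated by both systems are compared.
    for sim_id in sorted(agentpex_scores.keys() & pisama_detected.keys()):
        agentpex_flag = agentpex_scores[sim_id] < 100
        pisama_flag = bool(pisama_detected[sim_id])
        if agentpex_flag and pisama_flag:
            cells["both_flag"] += 1
        elif agentpex_flag:
            cells["agentpex_only"] += 1
        elif pisama_flag:
            cells["pisama_only"] += 1
        else:
            cells["both_clear"] += 1
    return cells


def agreement_rate(cells):
    total = sum(cells.values())
    return (cells["both_flag"] + cells["both_clear"]) / total if total else 0.0
```

Agreement here is raw percent agreement. With no "both clear" cases in the sample, it says the two systems flag the same traces but cannot show that they clear the same traces.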

Cost and runtime

AgentPex reports ~9 API calls, ~77k tokens, and ~139 seconds per trace at roughly $0.019 per trace. Pisama's six free detectors run with zero LLM cost across the full 450-trace corpus. Pisama's LLM detector (specification_compliance) costs ~$0.07 per trace at ~25 seconds. The two pipelines target different points on the cost / depth curve; running both is additive, not exclusive.
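
The spec-compliance cost figures quoted in this note are simple per-trace extrapolations. A quick check, using the ~$0.07 per-trace figure above (the helper name is illustrative):

```python
SPEC_COMPLIANCE_COST_PER_TRACE = 0.07  # USD, approximate, from this note


def projected_cost(n_traces, cost_per_trace=SPEC_COMPLIANCE_COST_PER_TRACE):
    """Linear extrapolation of LLM-detector cost to n_traces."""
    return round(n_traces * cost_per_trace, 2)


subset_cost = projected_cost(32)    # the 32-trace subset
full_cost = projected_cost(450)     # the full corpus
```

This gives about $2.24 for the 32-trace subset and $31.50 for all 450 traces, in line with the rounded ~$2.30 and ~$32 figures stated elsewhere in the note.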

Where the two pipelines diverge

AgentPex's strength is automated specification extraction from system prompts: anything that can be stated as a prompt rule (must / must not / always / never / if-then) becomes a checkable constraint. The approach is general for single-agent, single-prompt traces with well-formed tool schemas.

Pisama covers a wider failure surface that is structurally outside spec extraction:

  • Behavioral failures not encoded as prompt rules: sycophancy, persona drift, consensus collapse, deception.
  • Metric-aware failures over numeric streams: convergence (plateau, regression, thrashing, divergence) and adaptive thinking (cost / latency disproportionate to task).
  • Multi-agent failures: coordination, communication mismatch, authority gradient, multi-chain entanglement.
  • In-app traces where there is no clean system prompt to extract specs from: the application is the spec.

The two are complementary on the same trace. For the procedural surface AgentPex covers, Pisama also has a specification_compliance detector built on the AgentPex pattern (Pisama-bench F1 0.966). For the surface AgentPex does not cover, we are not aware of an equivalent to Pisama's detectors in the literature as of the date of this note.

Limitations

  • Claude version mismatch: paper used Claude 3.5 Sonnet, the released tau2 result files we used contain Claude 3.7 Sonnet. We accept this drift for a directional reproduction.
  • Authority-bypass count is lower than the paper (2 vs paper's 7 in o4-mini). Our heuristic is a per-domain identity-tool allowlist; the paper's number derives from spec-extracted forbidden_edges. These are different signals.
  • Single benchmark: both AgentPex and this work evaluate on tau2-bench only. Generalization to SWE-bench, OSWorld, GAIA, or HardLLMAgentEval is unvalidated.
  • specification_compliance LLM detector runs on a 32-trace subset; full N=450 spec-compliance is feasible at ~$32 and is the v0.3 target. We also identified and fixed a Pisama bug where the rule extractor silently truncates output on policies that produce more than ~2000 tokens of rule JSON.

Reproducing

Pilot script and aggregated results:

  • papers/head_to_head/run_v2.py
  • papers/head_to_head/pisama_on_tau2_v2.jsonl (N=450 free + spec)
  • papers/head_to_head/manuscript.md (workshop draft)
  • frontend/src/data/agentpex_head_to_head.json (this page's data)

The AgentPex paper and source code: arXiv 2603.23806 and github.com/microsoft/agentpex. tau2-bench: github.com/sierra-research/tau2-bench.

Citation
Nikulainen, T. (2026). Pisama vs AgentPex on tau2-bench, v0.2.
Pisama research note, 2026-05-27.
https://pisama.ai/research/agentpex-head-to-head