Pisama
v1 · 270 traces across 5 corpora · 2026-05-28 · Research note

Pisama Bench v1: process quality vs task outcome

The same Pisama trajectory_score correlates differently with task outcome on different corpus types: positive on multi-agent reasoning (m500, r=+0.45), inverse on roleplay dialogue (Sotopia, r=-0.44), and uncorrelated on single-agent web tasks (AgentRewardBench, r=-0.03). One scalar, three relationships, because the score measures process, not outcome.

TL;DR
  • m500 (50 multi-agent math reasoning traces, 4 agents/trace): trajectory_score correlates with solver correctness at Pearson r = +0.45. This is the first positive correlation on any labeled corpus.
  • AgentRewardBench (100 single-agent web tasks): trajectory_score mean 0.93 against cum_reward mean 0.11. This is the gap: customers running web agents would see green dashboards while their agents fail at the user's job.
  • Three detectors fire on essentially every multi-agent trace (workflow, loop, communication; ~100% on m500). They create a uniform penalty floor that doesn't differentiate runs.
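The penalty floor described in the last bullet can be illustrated with a minimal sketch. It assumes each trace is reduced to the set of detector names that fired on it; the function names, the 0.95 threshold, and the toy traces are hypothetical and are not taken from the Pisama codebase or the bench data.

```python
from collections import Counter


def detector_fire_rates(traces):
    """Return the fraction of traces on which each detector fired at least once."""
    counts = Counter()
    for fired in traces:
        counts.update(set(fired))
    n = len(traces)
    return {name: counts[name] / n for name in counts}


def floor_detectors(rates, threshold=0.95):
    """Detectors that fire on nearly every trace (threshold is an assumed cutoff).

    A detector that fires everywhere subtracts roughly the same penalty from
    every run, so it cannot help rank one run above another.
    """
    return sorted(name for name, rate in rates.items() if rate >= threshold)


# Toy data for illustration only, not bench traces.
toy_traces = [
    {"workflow", "loop", "communication", "task_derailment"},
    {"workflow", "loop", "communication"},
    {"workflow", "loop", "communication", "decomposition"},
    {"workflow", "loop", "communication"},
]

rates = detector_fire_rates(toy_traces)
print(floor_detectors(rates))
```

On the toy data the three always-on detectors are returned as floor detectors, while task_derailment and decomposition, which fire on only some traces, are the ones that can differentiate runs.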

Per-corpus results

| Corpus | n | score mean | OQ mean | multi-agent OQ | Pearson r |
|---|---|---|---|---|---|
| m500 (multi-agent math reasoning) | 50 | 0.360 | 0.655 | 50 / 50 | +0.45 |
| Sotopia (multi-agent role-play dialogue) | 50 | 0.297 | 0.477 | 45 / 50 | -0.44 |
| AgentRewardBench (single-agent web tasks) | 100 | 0.933 | 1.000 | 0 / 100 | -0.03 |
| MAST (human-annotated multi-agent traces) | 19 | 0.935 | 0.967 | 2 / 19 | -0.08 |
| Harbor baseline_52 (mixed corpus, post-Phase-1 reference) | 51 | n/a | n/a | n/a | n/a |
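For readers who want to recompute the Pearson r column from the per-trace data, the following is a minimal pure-Python sketch of the standard Pearson formula. It assumes each trace yields one trajectory_score and one numeric outcome label; the function name and the toy values are illustrative and do not reflect the actual scoring scripts.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when either variable is constant.
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only, not bench data.
scores = [0.30, 0.35, 0.40, 0.45]   # hypothetical trajectory_score per trace
outcomes = [0, 0, 1, 1]             # hypothetical outcome label (1 = correct)
r = pearson_r(scores, outcomes)
print(round(r, 3))
```

With a binary outcome label this is the point-biserial correlation, which is why a small difference in group means can still produce a moderate r when scores vary little within a corpus.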

Process quality vs task outcome

Pisama's trajectory_score is a process-quality scalar. It measures whether the run executed cleanly (consistent agents, no loops, well-structured workflow, complete handoffs). It does NOT directly measure whether the agent solved the user's task. Three of the corpora show three different relationships between the two axes:

  • m500 · r = +0.45 · process and outcome entangled. Multi-agent reasoning where bad coordination leads to a wrong answer. Here the score predicts outcome.
  • Sotopia · r = -0.44 · process noise IS the task. Long roleplay dialogues that the system flags as loopy or derailing are exactly the ones that reach the social goal. Inverse by design, not a bug.
  • AgentRewardBench · r = -0.03 · process orthogonal to outcome. Single-agent web tasks where coordination doesn't apply. Pisama scores clean while task completion is catastrophic. This gap motivates a separate task_completion_score scalar.
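One way to make the process/outcome split concrete is to carry both numbers per trace and flag runs that look clean but fail. The sketch below is hypothetical: the TraceScores type, the green_but_failing helper, and both thresholds are assumptions for illustration, and the outcome value is assumed to be scaled to [0, 1] (for example a cum_reward).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceScores:
    trajectory_score: float  # process quality, assumed in [0, 1]
    outcome: float           # task outcome, assumed in [0, 1]


def green_but_failing(trace, process_min=0.8, outcome_max=0.5):
    """True when a run looks clean on process but fails on outcome.

    Both thresholds are illustrative assumptions, not Pisama defaults.
    """
    return trace.trajectory_score >= process_min and trace.outcome <= outcome_max


# Toy traces for illustration only.
toy = [
    TraceScores(trajectory_score=0.93, outcome=0.0),  # clean process, failed task
    TraceScores(trajectory_score=0.95, outcome=1.0),  # clean process, solved task
    TraceScores(trajectory_score=0.40, outcome=0.0),  # noisy process, failed task
]

flagged = [t for t in toy if green_but_failing(t)]
print(len(flagged))
```

Keeping the two values as separate fields, rather than blending them into one number, preserves the distinction this note draws between process quality and task outcome.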

Per-corpus details

m500 (multi-agent math reasoning)

n=50 · shape: 4 agents (expert_recruiter, problem_solver, critic, evaluator)

Positive correlation. First labeled multi-agent corpus where Pisama's process score predicts task outcome.

Top detector fires: communication 100% · loop 100% · workflow 100% · task_derailment 46% · decomposition 34% · completion_misjudgment 24%
Correct (28) mean score 0.372; incorrect (22) mean score 0.343
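As a consistency check, the two group means above, weighted by group size, reproduce the corpus-level score mean of 0.360 reported in the per-corpus table, up to rounding of the group means:

```latex
\[
\bar{s}_{\text{m500}}
  = \frac{28 \times 0.372 + 22 \times 0.343}{50}
  = \frac{10.416 + 7.546}{50}
  \approx 0.359
\]
```

The gap between the group means is 0.372 - 0.343 = 0.029, so the separation is small in absolute terms even though the correlation is positive.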

Sotopia (multi-agent role-play dialogue)

n=50 · shape: 2 agents (named role-play characters)

Inverse correlation by design. Long, noisy roleplay dialogues that score high on Sotopia (goal reached, relationship developed) fire the loop, workflow, and derailment detectors, because the back-and-forth IS the social interaction.

Top detector fires: workflow 90% · loop 90% · task_derailment 66% · context 54% · communication 50%

AgentRewardBench (single-agent web tasks)

n=100 · shape: 1 agent per trace (assistantbench / webarena / visualwebarena / workarena)

The gap. Pisama scores clean (mean 0.93) on tasks that fail at outcome (cum_reward mean 0.11). Customers running web agents would see green dashboards while agents fail at the user's job. Motivates the outcome-detector family planned for Tier 2 follow-up.

Top detector fires: completion_misjudgment 41%

MAST (human-annotated multi-agent traces)

n=19 · shape: 2+ agents (current parser handles 2 of 19; mast_full's MetaGPT/ChatDev format needs Tier 3.3 parser)

Parser-limited. Of the 2 traces that parsed as multi-agent, both were labeled failure and both scored low. mast_full (1,242 traces) unlocks the real comparison once the Tier 3.3 parser ships.

Harbor baseline_52 (mixed corpus, post-Phase-1 reference)

n=51 · shape: chat single-agent (obaydata) + 2-agent harbor-golden + 1 synthetic

Score separates success from failure on this mixed corpus (+0.255 mean separation). Phase 1's chat-trace gate fixes are evident here: the pre-Phase-1 baseline had every trace firing ≥1 detector.
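The mean separation figure can be read as the mean score on successful traces minus the mean score on failed traces. That reading is an assumption, since the note does not define the term, and the helper name and toy values below are illustrative only.

```python
from statistics import mean


def mean_separation(scores, succeeded):
    """Mean score of successful traces minus mean score of failed traces."""
    ok = [s for s, y in zip(scores, succeeded) if y]
    bad = [s for s, y in zip(scores, succeeded) if not y]
    if not ok or not bad:
        raise ValueError("need at least one success and one failure")
    return mean(ok) - mean(bad)


# Toy values for illustration only, not Harbor data.
toy_scores = [0.9, 0.8, 0.5, 0.6]
toy_success = [True, True, False, False]
sep = mean_separation(toy_scores, toy_success)
print(round(sep, 3))
```

A positive value means successful traces score higher on average; unlike Pearson r, it is expressed in score units and does not account for spread within each group.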

What's next

  • Tier 1 (3-5 days): Ship this note + the public page; fix marketing copy across SDK and frontend to qualify trajectory_score as process quality.
  • Tier 2 (2-3 weeks): Build outcome-oriented detectors (task_failure, silent_failure, objective_unmet) populating a separate task_completion_score scalar. Early validation shows a text-only outcome score does not by itself close the AgentRewardBench gap; tool-state-aware detection is the open next step.
  • Tier 3 (5-7 days): Sharpen the existing signal: per-source_format severity_weight_profiles, detector-floor baseline adjustment, MetaGPT/ChatDev parser for MAST.
  • Tier 4 (1-2 weeks): Validate at scale on mpbench (175 labeled multi-agent math) + whowhen (507 multi-agent with failure tags). Refresh this note with 6-corpus table.

Reproducibility

Orchestrator commit 26960437. Full per-trace data committed under backend/data/*_production_scoring.json. Internal source-of-truth note at docs/research/bench-v1-multi-corpus-production-scoring-2026-05.md. Per-corpus scoring scripts: backend/scripts/score_sotopia_production.py and backend/scripts/score_multi_agent_production.py.

Bench v1 will be refreshed after Tier 4 lands (mpbench + whowhen sweeps). If the multi-corpus positive-correlation story strengthens, this note becomes the canonical Pisama Bench v1 entry.