v1 · 270 traces across 5 corpora · 2026-05-28 · Research note

Pisama Bench v1: process quality vs task outcome

Name: Pisama
Author: Pisama

The same Pisama trajectory_score correlates differently with task outcome depending on corpus type: positive on multi-agent reasoning (m500, r=+0.45), inverse on role-play dialogue (Sotopia, r=-0.44), and uncorrelated on single-agent web tasks (AgentRewardBench, r=-0.03). One scalar shows three relationships because the score measures process, not outcome.

TL;DR

m500 (50 multi-agent math reasoning traces, 4 agents per trace): trajectory_score correlates with solver correctness at Pearson r = +0.45. This is the first positive correlation on any labeled corpus.
AgentRewardBench (100 single-agent web tasks): trajectory_score mean 0.93; cum_reward mean 0.11. This is the gap: customers running web agents would see green dashboards while their agents fail at the user's job.
Three detectors fire on essentially every multi-agent trace (workflow, loop, communication; ~100% on m500). They create a uniform penalty floor that doesn't differentiate runs.
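
The "uniform penalty floor" in the last point can be illustrated with a short sketch: compute how often each detector fires across a corpus, then flag detectors whose fire rate is so high that they cannot separate good runs from bad ones. The input layout (one list of fired detector names per trace), the function names, and the 0.95 threshold are assumptions made for illustration, not Pisama's implementation.

```python
from collections import Counter


def fire_rates(traces):
    """Fraction of traces on which each detector fired at least once.

    `traces` is a list where each item is the list of detector names
    that fired on one trace (hypothetical layout).
    """
    if not traces:
        return {}
    counts = Counter()
    for fired in traces:
        counts.update(set(fired))  # count each detector once per trace
    total = len(traces)
    return {name: counts[name] / total for name in counts}


def floor_detectors(rates, threshold=0.95):
    """Detectors that fire on nearly every trace and so act as a constant penalty."""
    return sorted(name for name, rate in rates.items() if rate >= threshold)


# Toy data, not taken from any corpus in this note.
toy_traces = [
    ["loop", "workflow"],
    ["loop", "workflow", "task_derailment"],
    ["loop", "workflow"],
    ["loop", "workflow", "decomposition"],
]
rates = fire_rates(toy_traces)
floor = floor_detectors(rates)
print(rates)
print(floor)
```

A detector in the floor list contributes the same penalty to almost every run, so it shifts the score mean without changing how runs rank against each other.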

Per-corpus results

Corpus	n	score mean	OQ mean	multi-agent OQ	Pearson r
m500 (multi-agent math reasoning)	50	0.360	0.655	50 / 50	+0.45
Sotopia (multi-agent role-play dialogue)	50	0.297	0.477	45 / 50	-0.44
AgentRewardBench (single-agent web tasks)	100	0.933	1.000	0 / 100	-0.03
MAST (human-annotated multi-agent traces)	19	0.935	0.967	2 / 19	-0.08
Harbor baseline_52 (mixed corpus, post-Phase-1 reference)	51	n/a	n/a	n/a	n/a

Process quality vs task outcome

Pisama's trajectory_score is a process-quality scalar. It measures whether the run executed cleanly (consistent agents, no loops, well-structured workflow, complete handoffs). It does NOT directly measure whether the agent solved the user's task. Three of the corpora show three different relationships between the two axes:

m500 · r = +0.45

process and outcome entangled

Multi-agent reasoning where bad coordination leads to a wrong answer. The score correlates positively with outcome.

sotopia · r = -0.44

process noise IS the task

Long role-play dialogues that the system flags as loopy or derailing tend to be the ones that reach the social goal. Inverse by design, not a bug.

AgentRewardBench · r = -0.03

process orthogonal to outcome

Single-agent web tasks where coordination doesn't apply. Pisama scores these runs as clean while task completion is catastrophic. This gap motivates a separate task_completion_score scalar.
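
The r values above are Pearson correlations between trajectory_score and a per-trace outcome label. A minimal sketch of that calculation follows, assuming the outcome is encoded numerically (for example 1 for a correct answer, 0 for an incorrect one). The function name and the toy data are illustrative and not drawn from any corpus in this note.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Toy data: process scores and binary outcomes for four traces.
scores = [0.2, 0.4, 0.6, 0.8]
outcomes = [0, 0, 1, 1]
r = pearson_r(scores, outcomes)
print(f"r = {r:+.2f}")
```

With a binary outcome this is the point-biserial correlation, so the sign of r says only which outcome group has the higher mean score, not how large the difference is.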

Per-corpus details

m500 (multi-agent math reasoning)

n=50 · shape: 4 agents (expert_recruiter, problem_solver, critic, evaluator)

Positive correlation. First labeled multi-agent corpus where Pisama's process score predicts task outcome.

Top detector fires: communication 100% · loop 100% · workflow 100% · task_derailment 46% · decomposition 34% · completion_misjudgment 24%

Correct (28) mean score 0.372; incorrect (22) mean score 0.343
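
As a consistency check on the figures above, the two group means can be pooled by group size and compared with the m500 score mean of 0.360 reported in the per-corpus table. The sketch below uses only numbers stated in this note; the variable names are illustrative.

```python
# Figures reported in this note for m500.
n_correct, mean_correct = 28, 0.372
n_incorrect, mean_incorrect = 22, 0.343

# Pooled mean: weight each group mean by its group size.
n_total = n_correct + n_incorrect
pooled_mean = (n_correct * mean_correct + n_incorrect * mean_incorrect) / n_total

# Separation: difference between the two group means.
separation = mean_correct - mean_incorrect

print(f"n = {n_total}")
print(f"pooled mean = {pooled_mean:.3f}")
print(f"separation  = {separation:+.3f}")
```

The pooled value of about 0.359 agrees with the table's 0.360 to within the rounding of the two group means. The separation of about 0.03 is small in absolute terms even though the correlation is positive.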

Sotopia (multi-agent role-play dialogue)

n=50 · shape: 2 agents (named role-play characters)

Inverse correlation by design. Long, noisy role-play dialogues that score high on Sotopia (goal reached, relationship developed) fire the loop / workflow / derailment detectors, because the back-and-forth IS the social interaction.

Top detector fires: workflow 90% · loop 90% · task_derailment 66% · context 54% · communication 50%

AgentRewardBench (single-agent web tasks)

n=100 · shape: 1 agent per trace (assistantbench / webarena / visualwebarena / workarena)

This is the gap. Pisama scores runs as clean (trajectory_score mean 0.93) on tasks that fail at outcome (cum_reward mean 0.11). Customers running web agents would see green dashboards while agents fail at the user's job. This motivates the outcome-detector family planned for the Tier 2 follow-up.

Top detector fires: completion_misjudgment 41%
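
One way to surface this gap per trace is to list runs whose process score looks clean while the outcome reward is low. The sketch below assumes each trace is a dict carrying trajectory_score and cum_reward; that record layout, the function name, and both thresholds are hypothetical choices for illustration, not part of Pisama.

```python
def find_gap_traces(traces, score_min=0.8, reward_max=0.5):
    """Return traces that look clean on process but failed on outcome.

    Thresholds are illustrative defaults, not values from this note.
    """
    return [
        t for t in traces
        if t["trajectory_score"] >= score_min and t["cum_reward"] <= reward_max
    ]


# Toy data, not taken from AgentRewardBench.
toy_traces = [
    {"id": "a", "trajectory_score": 0.95, "cum_reward": 0.0},
    {"id": "b", "trajectory_score": 0.90, "cum_reward": 1.0},
    {"id": "c", "trajectory_score": 0.40, "cum_reward": 0.0},
    {"id": "d", "trajectory_score": 0.92, "cum_reward": 0.0},
]
gap = find_gap_traces(toy_traces)
gap_ids = [t["id"] for t in gap]
gap_rate = len(gap) / len(toy_traces)
print(gap_ids, gap_rate)
```

Traces "a" and "d" are the ones a process-only dashboard would show as green even though the task failed.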

MAST (human-annotated multi-agent traces)

n=19 · shape: 2+ agents (current parser handles 2 of 19; mast_full's MetaGPT/ChatDev format needs the Tier 3.3 parser)

Parser-limited. Both of the 2 traces that parsed as multi-agent were labeled failures, and both scored low. mast_full (1,242 traces) unlocks the real comparison once Tier 3.3 ships.

Harbor baseline_52 (mixed corpus, post-Phase-1 reference)

n=51 · shape: chat single-agent (obaydata) + 2-agent harbor-golden + 1 synthetic

Score separates success from failure on this mixed corpus (+0.255 mean separation). Phase 1's chat-trace gate fixes are evident here: in the pre-Phase-1 baseline, every trace fired at least one detector.

What's next

Tier 1

Ship this note + the public page; fix marketing copy across SDK and frontend to qualify trajectory_score as process quality.

3-5 days

Tier 2

Build outcome-oriented detectors (task_failure, silent_failure, objective_unmet) populating a separate task_completion_score scalar. Early validation shows a text-only outcome score does not by itself close the AgentRewardBench gap; tool-state-aware detection is the open next step.

2-3 weeks

Tier 3

Sharpen the existing signal: per-source_format severity_weight_profiles, detector-floor baseline adjustment, MetaGPT/ChatDev parser for MAST.

5-7 days

Tier 4

Validate at scale on mpbench (175 labeled multi-agent math traces) + whowhen (507 multi-agent traces with failure tags). Refresh this note with a 6-corpus table.

1-2 weeks

Reproducibility

Orchestrator commit 26960437. Full per-trace data committed under backend/data/*_production_scoring.json. Internal source-of-truth note at docs/research/bench-v1-multi-corpus-production-scoring-2026-05.md. Per-corpus scoring scripts: backend/scripts/score_sotopia_production.py and backend/scripts/score_multi_agent_production.py.

Bench v1 will be refreshed after Tier 4 lands (mpbench + whowhen sweeps). If the multi-corpus positive-correlation story strengthens, this note becomes the canonical Pisama Bench v1 entry.