Pisama Bench v1: process quality vs task outcome
The same Pisama trajectory_score correlates differently with task outcome depending on the corpus type: positively on multi-agent reasoning (m500, r=+0.45), inversely on role-play dialogue (Sotopia, r=-0.44), and not at all on single-agent web tasks (AgentRewardBench, r=-0.03). One scalar yields three relationships because the score measures process, not outcome.
- m500 (50 multi-agent math reasoning traces, 4 agents per trace): trajectory_score correlates with solver correctness at Pearson r = +0.45. This is the first positive correlation on any labeled corpus.
- AgentRewardBench (100 single-agent web tasks): mean trajectory_score is 0.93 while mean cum_reward is 0.11. This is the gap: customers running web agents would see green dashboards while their agents fail at the user's job.
- Three detectors fire on essentially every multi-agent trace (workflow, loop, communication; ~100% on m500). They create a uniform penalty floor that doesn't differentiate runs.
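To make the "uniform penalty floor" concrete, the sketch below computes a per-detector fire rate across a set of traces and lists the detectors that fire on essentially every one. It is a minimal illustration, not Pisama's implementation: the input shape (one list of fired detector names per trace), the function name `detector_fire_rates`, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def detector_fire_rates(traces):
    """Return the fraction of traces on which each detector fired at least once.

    `traces` is a hypothetical structure: one list of detector names per trace.
    """
    counts = Counter()
    for fired in traces:
        # Count each detector once per trace, even if it fired several times.
        counts.update(set(fired))
    n = len(traces)
    if n == 0:
        return {}
    return {name: counts[name] / n for name in counts}


# Made-up sample, not benchmark data.
example = [
    ["workflow", "loop", "communication"],
    ["workflow", "loop", "communication", "derailment"],
    ["workflow", "loop", "communication"],
    ["workflow", "loop", "communication"],
]

rates = detector_fire_rates(example)
# Detectors that fire on (nearly) every trace cannot differentiate runs.
always_on = sorted(name for name, rate in rates.items() if rate >= 0.99)
print(always_on)
```

A detector with a fire rate near 1.0 subtracts roughly the same amount from every run, so it lowers the corpus mean without changing how runs rank against each other.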
Per-corpus results
| Corpus | n | score mean | OQ mean | multi-agent OQ | Pearson r |
|---|---|---|---|---|---|
| m500 (multi-agent math reasoning) | 50 | 0.360 | 0.655 | 50 / 50 | +0.45 |
| Sotopia (multi-agent role-play dialogue) | 50 | 0.297 | 0.477 | 45 / 50 | -0.44 |
| AgentRewardBench (single-agent web tasks) | 100 | 0.933 | 1.000 | 0 / 100 | -0.03 |
| MAST (human-annotated multi-agent traces) | 19 | 0.935 | 0.967 | 2 / 19 | -0.08 |
| Harbor baseline_52 (mixed corpus, post-Phase-1 reference) | 51 | n/a | n/a | n/a | n/a |
Process quality vs task outcome
Pisama's trajectory_score is a process-quality scalar. It measures whether the run executed cleanly (consistent agents, no loops, well-structured workflow, complete handoffs). It does NOT directly measure whether the agent solved the user's task. The three corpora show three different relationships between the two axes: positive on m500, inverse on Sotopia, and none on AgentRewardBench.
Per-corpus details
m500 (multi-agent math reasoning)
Positive correlation. First labeled multi-agent corpus where Pisama's process score predicts task outcome.
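For readers who want to check a correlation of this kind on their own traces, here is a minimal sketch of the Pearson coefficient between a per-trace score and a per-trace outcome label. The function name `pearson_r` and the sample values are hypothetical, and the actual scoring scripts may compute the statistic differently.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when either variable is constant.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, not benchmark data: a process score and a 0/1 correctness label.
scores = [0.2, 0.4, 0.6, 0.8]
correct = [0, 0, 1, 1]
r = pearson_r(scores, correct)
print(round(r, 3))
```

With a binary outcome such as solver correctness, this is the point-biserial correlation, which is numerically the same as Pearson r.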
Sotopia (multi-agent role-play dialogue)
Inverse correlation by design. Long, noisy role-play dialogues that score high on Sotopia (goal reached, relationship developed) trigger the loop, workflow, and derailment detectors, because the back-and-forth IS the social interaction.
AgentRewardBench (single-agent web tasks)
The gap. Pisama scores these runs as clean (mean 0.93) even though the tasks fail at outcome (cum_reward mean 0.11). Customers running web agents would see green dashboards while their agents fail at the user's job. This motivates the outcome-detector family planned for the Tier 2 follow-up.
MAST (human-annotated multi-agent traces)
Parser-limited. Only 2 of the 19 traces parsed as multi-agent; both were labeled as failures and both scored low. Mast_full (1,242 traces) will unlock the real comparison once Tier 3.3 ships.
Harbor baseline_52 (mixed corpus, post-Phase-1 reference)
Score separates success from failure on this mixed corpus (+0.255 mean separation). Phase 1's chat-trace gate fixes are evident here: in the pre-Phase-1 baseline, every trace fired ≥1 detector.
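One plausible reading of "mean separation" is the mean score of successful traces minus the mean score of failed ones. The sketch below computes that quantity. The definition is an assumption made for illustration, since the report does not spell it out, and the function name `mean_separation` and the sample values are hypothetical.

```python
from statistics import mean


def mean_separation(scores, succeeded):
    """Mean score of successful traces minus mean score of failed traces.

    Assumed definition for illustration; the report does not define the metric.
    """
    success = [s for s, ok in zip(scores, succeeded) if ok]
    failure = [s for s, ok in zip(scores, succeeded) if not ok]
    if not success or not failure:
        raise ValueError("need at least one success and one failure")
    return mean(success) - mean(failure)


# Made-up values, not benchmark data.
scores = [0.9, 0.8, 0.5, 0.6]
succeeded = [True, True, False, False]
separation = mean_separation(scores, succeeded)
print(round(separation, 3))
```

A positive value means successful runs score higher on average. A value near zero means the score does not distinguish the two groups.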
What's next
Reproducibility
Orchestrator commit 26960437. Full per-trace data committed under backend/data/*_production_scoring.json. Internal source-of-truth note at docs/research/bench-v1-multi-corpus-production-scoring-2026-05.md. Per-corpus scoring scripts: backend/scripts/score_sotopia_production.py and backend/scripts/score_multi_agent_production.py.