Phoenix already has agent evals. What is different?

Name: Pisama
Author: Pisama

Phoenix agent evals are LLM-judge graders that score outputs against a rubric. Pisama detectors run on the trace structure itself: state-hash recurrence catches loops, subsequence matching catches cycles, embedding similarity catches persona drift. These run before the LLM judge ever fires.
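The three structural checks named above can be sketched in a few lines of Python. This is a minimal illustration, not Pisama's implementation: the function names, the trace shape (a list of state dicts or step names), and the thresholds are all assumptions made for the example.

```python
import hashlib
import json
import math
from typing import Optional


def state_hash(state: dict) -> str:
    """Stable hash of an agent state; key order does not change the result."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_state_loop(states: list, min_repeats: int = 3) -> bool:
    """State-hash recurrence: True if an identical state is seen min_repeats times."""
    counts: dict = {}
    for state in states:
        h = state_hash(state)
        counts[h] = counts.get(h, 0) + 1
        if counts[h] >= min_repeats:
            return True
    return False


def find_cycle(steps: list, min_len: int = 2, min_repeats: int = 2) -> Optional[list]:
    """Subsequence matching: shortest block of steps repeated back-to-back."""
    n = len(steps)
    for size in range(min_len, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            block = steps[start:start + size]
            if all(
                steps[start + k * size:start + (k + 1) * size] == block
                for k in range(1, min_repeats)
            ):
                return block
    return None


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def persona_drifted(baseline: list, current: list, threshold: float = 0.8) -> bool:
    """Embedding similarity: flag drift when similarity to the baseline drops
    below the (hypothetical) threshold."""
    return cosine_similarity(baseline, current) < threshold


# Hypothetical trace: the agent alternates search/read without progressing.
steps = ["plan", "search", "read", "search", "read", "search", "read", "answer"]
print(find_cycle(steps))
```

None of these checks calls a model, which is why they can run ahead of an LLM judge. In practice the embeddings would come from whatever embedding model the stack already uses; the two-element vectors in the tests are stand-ins.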

Use both. Here's where each one wins.

Observability tools see failures; Pisama acts on them. Observability is the precondition layer, and Pisama is the action layer above it. The comparison below states plainly where each tool is stronger; it is not a zero-sum claim.

Pisama vs Arize Phoenix

Arize is the strongest pure-play ML observability vendor and Phoenix is its open-source surface. Arize's strength is traditional ML monitoring (drift, performance, embeddings) extended to LLMs with OpenInference traces.

Pisama starts from the agent-failure problem rather than the ML-monitoring problem. The 87-detector taxonomy maps failures to known modes (loops, persona drift, coordination breakdown) rather than statistical anomalies. The two stacks compose well.

Where Arize Phoenix wins

Industrial-strength ML observability lineage and customer base
Embedding drift detection on managed feature stores
OpenInference standard contributor: traces are portable
Phoenix self-host story is mature
Per-template benchmark tables (for example hallucination on HaluEval: precision 0.93, recall 0.72, F1 0.82) make Phoenix the incumbent that comes closest to publishing accuracy evidence
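For readers comparing published numbers, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded HaluEval figures quoted in the list above. The helper name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted for the hallucination template on HaluEval.
f1 = f1_score(0.93, 0.72)
print(round(f1, 3))  # 0.812
```

The rounded inputs give roughly 0.81 rather than the quoted 0.82. A gap of that size is consistent with the published F1 having been computed from unrounded precision and recall, though that is an assumption rather than something the benchmark table states.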

Where Pisama wins

87 detectors covering structural agent failures Phoenix does not target
Heuristic-first pipeline: 90%+ at T1–T3 for free, vs Phoenix's LLM-judge default
Single-agent, multi-agent, and sub-agent failures as first-class concepts (Phoenix focuses on per-call traces)
Phoenix's per-template benchmark numbers are one-off runs; Pisama publishes per-detector F1 from a versioned calibration program (dataset fingerprints, public changelog), including the multi-agent failure modes that eval templates cannot see
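A heuristic-first pipeline of the kind described above can be sketched as an ordered list of tiers, where each tier either returns a verdict or declines and passes the trace to the next, more expensive tier. Everything here is an assumption made for illustration: the `Tier` type, the tier names, and the stubbed judge are hypothetical and do not describe Pisama's actual T1–T3 tiers.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# True = failure detected, False = clean, None = inconclusive (escalate).
Verdict = Optional[bool]


@dataclass
class Tier:
    name: str
    check: Callable[[list], Verdict]


def run_tiers(trace: list, tiers: Sequence[Tier]) -> tuple:
    """Run tiers in cost order and stop at the first conclusive verdict."""
    for tier in tiers:
        verdict = tier.check(trace)
        if verdict is not None:
            return tier.name, verdict
    return "none", False


def heuristic_repeat_check(trace: list) -> Verdict:
    """Cheap tier: conclusive only when a step repeats three times in a row."""
    for i in range(len(trace) - 2):
        if trace[i] == trace[i + 1] == trace[i + 2]:
            return True
    return None


judge_calls = []


def stub_llm_judge(trace: list) -> Verdict:
    """Stand-in for an LLM-judge call; records the call and reports clean."""
    judge_calls.append(trace)
    return False


tiers = [Tier("heuristic", heuristic_repeat_check), Tier("llm_judge", stub_llm_judge)]
print(run_tiers(["search", "search", "search"], tiers))
print(run_tiers(["plan", "search", "answer"], tiers))
```

The design point is that the judge is only invoked when the cheaper tiers cannot decide, so a trace that a heuristic catches never incurs model cost.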

At a glance

Dimension	Arize Phoenix	Pisama
Origin	ML observability extended to LLMs	Agent failure taxonomy as ground truth
Trace standard	OpenInference (Phoenix-native)	Ingest OTel `gen_ai.` + OpenInference; emit `pisama.` OTel spans
Detection	Drift + LLM-judge evals	Heuristic + embeddings + LLM-judge tiered pipeline
Agent failure coverage	Trace UI, no named detectors	Single-agent, multi-agent, and sub-agent: cross-cutting plus 6 framework-specific detector packs
Published accuracy	Per-template benchmark tables (one-off)	Per-detector F1, versioned calibration program
License	Elastic 2.0 (Phoenix)	MIT

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14).
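The production-grade criteria listed above can be read as a simple gate. The sketch below encodes the three numeric thresholds from the text and treats the two qualitative criteria as boolean flags. The `DetectorReport` type and its field names are hypothetical, not part of any published Pisama schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorReport:
    f1: float                             # F1 measured on real traces
    precision: float
    external_traces: int                  # number of external traces evaluated
    thresholds_externally_grounded: bool
    has_difficulty_blind_spot: bool


def is_production_grade(report: DetectorReport) -> bool:
    """Apply the thresholds stated in the text: F1 >= 0.80, precision >= 0.70,
    at least 30 external traces, grounded thresholds, no blind spot."""
    return (
        report.f1 >= 0.80
        and report.precision >= 0.70
        and report.external_traces >= 30
        and report.thresholds_externally_grounded
        and not report.has_difficulty_blind_spot
    )
```

A detector with strong F1 but only 12 external traces would fail this gate, which is the point of listing sample size alongside the accuracy thresholds.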

Recommendation

Phoenix for trace storage, embedding drift, and LLM-judge evals. Pisama for the structural detectors. They emit compatible spans; run both.
