v0.1 · Comparison · 2026-06-02 · Research note

Pisama vs Galileo Eval Engineer

Name: Pisama
Author: Pisama

Galileo Eval Engineer brings agent evaluation into the IDE. It adds slash commands to Claude Code and Codex that pull a trace, run a root-cause pass, and produce a generated fix you can apply without leaving the editor. Galileo describes the flow as going from "improvement insights and root causes, directly to generated solutions." It is a clean developer-experience play, and for a single agent wired into Galileo it is genuinely useful.

Pisama is failure detection and self-healing for multi-agent LLM systems. The two tools overlap on one question (my agent misbehaved, what do I do about it) and diverge on three that are easy to check. This note states each difference as a falsifiable claim and points at the exact file or endpoint that backs it.

Every claim about Pisama below resolves to a path in the public repo. Every claim about Galileo Eval Engineer is scoped to its published materials as of June 2026, and is written so that a reader who opens Galileo's own docs can confirm or refute it.

Three checkable differences

Trace portability. Galileo Eval Engineer reads agent traces from Galileo's own logstream. Its published IDE integration documents no path to import OpenTelemetry, Langfuse, Arize, Phoenix, or raw-JSON traces from another vendor. Pisama imports Langfuse, Arize / Phoenix, and raw JSON with no flag, OpenTelemetry and LangSmith behind a feature flag, and exposes a vendor-neutral POST /api/v1/atif/analyze.
Multi-agent localization. Galileo Eval Engineer's published worked example is a single agent. Pisama localizes each detected failure to a specific step and sub-agent (mistake_step, mistake_agent) in a multi-agent trace. (Accuracy caveat below.)
Closing the loop. Galileo Eval Engineer's workflow ends at a generated fix you apply in your editor to source. It documents no live-runtime apply, no verification pass against the running system, no pull request, no auto-merge. Pisama applies a fix to the live n8n workflow, runs a verification pass against it, and can roll back.

1. Trace portability

Eval Engineer fetches traces from Galileo. To use it, the agent has to be logging to Galileo first. That is a reasonable design for a Galileo customer and a hard stop for everyone else: if your traces live in Langfuse, Arize, Phoenix, an OpenTelemetry collector, or a home-grown JSON log, there is no documented importer to bring them in.

Pisama starts from the opposite assumption. The trace already exists somewhere, so the job is to read whatever shape it is in. The importer registry (backend/app/ingestion/importers/__init__.py) ships these formats:

Format	Availability	Notes
raw / json / generic	always	RawJSONImporter, vendor-neutral JSON traces
conversation	always	Conversation-shaped message logs
langfuse	always	Langfuse export
phoenix / arize	always	Arize Phoenix spans (one importer, two aliases)
mast / mast-data	always	MAST benchmark format
langsmith / langchain	flag	Behind otel_ingestion
otel / opentelemetry / otlp	flag	OpenTelemetry spans, behind otel_ingestion
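
The table can be read as an alias-to-importer lookup with a flag gate. The sketch below is a hypothetical illustration of that reading, not the contents of backend/app/ingestion/importers/__init__.py: the ImporterEntry type, the resolve_importer function, and the canonical names are assumptions made for the example. Only the aliases and the otel_ingestion flag name come from the table.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ImporterEntry:
    """Hypothetical registry row: one importer, its aliases, an optional flag."""
    name: str
    aliases: Tuple[str, ...]
    requires_flag: Optional[str] = None


# Aliases and flag name mirror the table; the structure itself is assumed.
REGISTRY = (
    ImporterEntry("raw", ("raw", "json", "generic")),
    ImporterEntry("conversation", ("conversation",)),
    ImporterEntry("langfuse", ("langfuse",)),
    ImporterEntry("phoenix", ("phoenix", "arize")),
    ImporterEntry("mast", ("mast", "mast-data")),
    ImporterEntry("langsmith", ("langsmith", "langchain"), "otel_ingestion"),
    ImporterEntry("otel", ("otel", "opentelemetry", "otlp"), "otel_ingestion"),
)


def resolve_importer(fmt: str, enabled_flags: frozenset = frozenset()) -> str:
    """Map a format alias to an importer name, honoring feature flags."""
    key = fmt.strip().lower()
    for entry in REGISTRY:
        if key in entry.aliases:
            if entry.requires_flag and entry.requires_flag not in enabled_flags:
                raise PermissionError(
                    f"{key!r} requires the {entry.requires_flag!r} flag"
                )
            return entry.name
    raise KeyError(f"unknown trace format: {fmt!r}")
```

Under these assumptions, "arize" and "phoenix" resolve to the same importer, and "otlp" resolves only when otel_ingestion is enabled.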

On top of the importers, Pisama exposes POST /api/v1/atif/analyze, which takes a vendor-neutral trajectory and runs the detector stack against it without any prior logging integration. If Galileo ships an OpenTelemetry, Langfuse, or Arize importer for Eval Engineer, this difference narrows, so it is worth watching. As published in June 2026, it holds.
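
A minimal sketch of what calling that endpoint might look like from Python, using only the standard library. The endpoint path comes from this note; the base URL, the trajectory fields, and the absence of authentication are assumptions for illustration, so check backend/app/api/v1/atif.py for the real request schema before relying on any of them.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, trajectory: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the vendor-neutral analyze endpoint."""
    url = base_url.rstrip("/") + "/api/v1/atif/analyze"
    body = json.dumps(trajectory).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical trajectory shape: the field names here are placeholders,
# not the documented schema.
example_trajectory = {
    "steps": [
        {"agent": "planner", "content": "Split the task into two searches."},
        {"agent": "researcher", "content": "Ran one search and stopped."},
    ]
}

request = build_analyze_request("http://localhost:8000", example_trajectory)
# To send it: urllib.request.urlopen(request) against a running instance.
```

Building the request separately from sending it keeps the example inspectable without a running backend.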

2. Multi-agent localization

Eval Engineer's published example is one agent. Single-agent root-cause is a tractable problem: there is one actor, and the question is which of its steps went wrong. Multi-agent systems add a second axis. When a planner hands off to a researcher that hands off to a writer, and the final answer is wrong, the failure has both a step and an owner, and neither is necessarily where the symptom appears.

Pisama's orchestrator localizes every detected failure to a (mistake_step, mistake_agent) pair before it returns. The fields are populated in place during detection (see the WS-C localization pass in backend/app/detection_enterprise/orchestrator.py), so a flagged trace tells you which agent and which turn to open first.

Honest caveat: primary attribution on adversarial multi-agent benchmarks (the Who&When attribution set) is hard, and Pisama's primary-localization accuracy there is well below 1.0. Read the pointer as "the step and agent to inspect first," not as a guaranteed blame assignment. The asymmetry is that Pisama emits a step-and-agent pointer at all on multi-agent traces; the published Eval Engineer example does not exercise that case.
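
Treating the pointer as "where to look first" suggests a simple triage pass over a flagged trace. The sketch below assumes each detection is a dict carrying mistake_step and mistake_agent, that mistake_step is an integer index, and that either field may be missing. Those shapes are assumptions for illustration, and inspection_order is a hypothetical helper rather than part of Pisama.

```python
from collections import defaultdict


def inspection_order(detections):
    """Group flagged steps by agent, ordered by each agent's earliest step.

    Detections without a localization pointer are skipped rather than guessed.
    """
    steps_by_agent = defaultdict(list)
    for detection in detections:
        step = detection.get("mistake_step")
        agent = detection.get("mistake_agent")
        if step is None or agent is None:
            continue
        steps_by_agent[agent].append(step)
    grouped = [(agent, sorted(steps)) for agent, steps in steps_by_agent.items()]
    # Earliest flagged step first: upstream mistakes often explain later ones.
    return sorted(grouped, key=lambda pair: pair[1][0])


# Hypothetical detections from a planner -> researcher -> writer trace.
flagged = [
    {"mistake_step": 7, "mistake_agent": "writer"},
    {"mistake_step": 3, "mistake_agent": "researcher"},
    {"mistake_step": 5, "mistake_agent": "researcher"},
    {"mistake_step": None, "mistake_agent": None},
]
order = inspection_order(flagged)
```

Sorting by earliest step is a design choice for the example, not a claim about how the orchestrator ranks failures.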

3. Closing the loop

This is the sharpest difference. Eval Engineer produces a diagnosis and a generated fix inside the IDE, and you (or your IDE assistant) tab-complete it into your source. That is the end of the documented workflow. There is no step that applies the fix to the running system, no verification pass that re-checks the live trace after the change, no pull request, and no auto-merge.

Pisama's healing API closes that loop for n8n workflows. Concretely:

POST /api/v1/healing/apply-to-n8n/{detection_id} applies the generated fix to the live n8n workflow.
POST /api/v1/healing/{healing_id}/verify runs a verification pass against the workflow after the change.
POST /api/v1/healing/{healing_id}/rollback and /api/v1/healing/versions/{version_id}/restore revert the change if the verification pass shows a regression.

Scope this claim precisely: the live apply-and-verify path runs on n8n workflows, which is the integration where Pisama can act on a real running system end to end. Pisama's broader healing engine generates fixes for many detectors, but n8n is where apply, verify, and rollback are wired to a live target today. The point of comparison is the verb: Eval Engineer recommends a fix, Pisama can apply one to a running workflow and check whether it worked.
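
The apply, verify, and rollback endpoints listed above compose into a short control flow. The sketch below takes the three paths from this note and assumes the rest: that the apply response carries a healing_id field, that the verify response carries a boolean passed field, and that a caller supplies a post function that sends the request and returns parsed JSON. All three are hypothetical, so confirm the real response shapes in backend/app/api/v1/healing.py.

```python
from typing import Callable, Dict

# Hypothetical transport: takes an API path, performs the POST, returns JSON.
Post = Callable[[str], Dict]


def heal_with_rollback(post: Post, detection_id: str) -> str:
    """Apply a fix, verify it, and roll back if verification does not pass."""
    applied = post(f"/api/v1/healing/apply-to-n8n/{detection_id}")
    healing_id = applied["healing_id"]  # assumed response field

    verdict = post(f"/api/v1/healing/{healing_id}/verify")
    if verdict.get("passed"):  # assumed response field
        return "verified"

    post(f"/api/v1/healing/{healing_id}/rollback")
    return "rolled_back"
```

Passing the transport in as a function keeps the flow testable against a stub before it is pointed at a live n8n workflow.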

A fourth, smaller difference: published accuracy

Pisama publishes per-detector F1 with calibration provenance on the public scoreboard. Those numbers are external-corpus measurements, calibrated on labeled trace datasets, and they are not production-validated on live customer traffic; treat them as capability evidence, not a field SLA. Eval Engineer publishes no detector-level accuracy for its diagnosis step. A reader who wants to know how often a tool is right has something to read on the Pisama side and nothing to read on the Galileo side. That is a difference in disclosure, and a smaller one than the three above.

What this comparison is not

Not a claim that Pisama is cheaper. Galileo ships a small-model evaluator (Luna) aimed squarely at cheap, high-volume scoring, and Pisama makes no cost-advantage claim against it. The two tools sit at different points on the price and depth curve.

Not a claim that Eval Engineer is bad at what it does. The IDE-native workflow is a real strength, the slash-command ergonomics are good, and for a Galileo customer running a single agent it removes real friction. The three differences above are about reach (which traces, how many agents, how far into the fix), not about whether the tool works.

Not a static snapshot. Eval Engineer is evolving. If it gains an OpenTelemetry or Langfuse importer, a multi-agent example with step-and-agent localization, or a live apply-and-verify step, the relevant row here narrows or closes. The claims are dated on purpose so they stay checkable.

Check it yourself

Every Pisama claim above maps to a file or endpoint in the public repo:

Claim	Where to look
Importer registry	backend/app/ingestion/importers/__init__.py
Vendor-neutral trajectory analyze	POST /api/v1/atif/analyze (backend/app/api/v1/atif.py)
Step-and-agent localization	mistake_step / mistake_agent in backend/app/detection_enterprise/orchestrator.py
Apply fix to live n8n workflow	POST /api/v1/healing/apply-to-n8n/{detection_id} (backend/app/api/v1/healing.py)
Verification pass	POST /api/v1/healing/{healing_id}/verify (backend/app/api/v1/healing.py)
Rollback / version restore	POST /api/v1/healing/{healing_id}/rollback, /api/v1/healing/versions/{version_id}/restore

For the Galileo side, read Galileo's own announcement of its IDE agent-evals integration: Bringing Agent Evals Into Your IDE. Look for an external-trace importer, a multi-agent example, and a step that applies a fix to a running system. If you find one, this note is out of date, and we will say so.

See also Pisama vs AgentPex on tau2-bench for a head-to-head on a shared corpus.
