Use both. Here's where each one wins.

Observability tools record failures; Pisama acts on them. They are the prerequisite layer; we are the action layer above it. The comparison below states where each tool is stronger and is not a zero-sum claim.

Pisama vs Langfuse

Name: Pisama
Author: Pisama

Langfuse is the leading open-source LLM observability platform. It captures traces, manages prompts, and runs evals on outputs. Pisama is a process-level failure detector: it tells you when two agents looped on each other, when shared state was corrupted, or when an agent drifted from its persona, while the run is happening.

These are different layers. Langfuse answers "what happened in this trace?" Pisama answers "did something go wrong during execution, and which step caused it?" Most teams running agents in production run both.

Langfuse was acquired by ClickHouse in January 2026; the open-source project remains MIT-licensed.

Where Langfuse wins

Mature trace UI, session management, and prompt versioning
Larger ecosystem of integrations (LangChain-native, OpenAI-native)
Self-hostable with battle-tested ClickHouse backend
LLM-judge eval framework with managed datasets

Where Pisama wins

87 detectors, 6 externally validated at production grade (Langfuse has none of these out of the box)
Heuristic-first pipeline: 90%+ of failures detected at $0 / sub-10ms
Process-level failures caught structurally, not via an LLM judge: loops, recursion, persona drift, coordination failures
59.9% on the TRAIL benchmark vs 11.6% for the best frontier LLM baseline; Langfuse eval quality depends on the LLM you wire in
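
To make "caught structurally" concrete, here is a minimal sketch of a loop check that needs no model call. It is not Pisama's actual detector: the `Step` type, the `find_loop` function, and the `max_period` / `min_repeats` thresholds are hypothetical names chosen for illustration. The idea is to flag a run whose most recent steps are a short cycle repeated several times in a row, which is what two agents bouncing a request back and forth looks like.

```python
from typing import NamedTuple, Optional, Sequence


class Step(NamedTuple):
    agent: str   # which agent acted
    action: str  # normalized action label, e.g. a tool name or message intent


def find_loop(steps: Sequence[Step], max_period: int = 4, min_repeats: int = 3) -> Optional[int]:
    """Return the shortest period p for which the run ends with the same
    p-step cycle repeated min_repeats times in a row, or None if there is none."""
    n = len(steps)
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if window > n:
            break  # not enough steps to show this many repeats
        tail = steps[n - window:]
        cycle = tail[:period]
        if all(tail[i] == cycle[i % period] for i in range(window)):
            return period
    return None


# Two agents handing the same request back and forth three times.
run = [Step("planner", "ask"), Step("coder", "clarify")] * 3
print(find_loop(run))  # 2
```

A real detector would also need to normalize actions (so near-identical messages compare equal) and tolerate legitimate retries; this sketch only compares exact labels.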

At a glance

Dimension	Langfuse	Pisama
Layer	Artifact-level (output scoring)	Process-level (execution forensics)
Detection mechanism	LLM-judge graders + manual rules	Heuristic detectors, embeddings, LLM judge, human (5 tiers)
Cost per trace	$0 (storage) + LLM cost for evals	Median <$0.01 (90%+ caught at T1–T3 for free)
Agent failure coverage	Trace tree visualization	Single-agent, multi-agent, and sub-agent: coordination, loops, persona drift, withholding, silent cascade, by name
TRAIL benchmark	Depends on judge model wired in	59.9% joint accuracy (best frontier: 11.6%)
License	MIT (acquired by ClickHouse)	MIT
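
The "Detection mechanism" and "Cost per trace" rows describe a cascade: cheap checks run first, and a trace only reaches a paid tier when the cheaper ones are inconclusive. The sketch below shows that control flow under stated assumptions. The `Tier` type, `run_cascade`, the two example tiers, and the per-call costs are all hypothetical and illustrative; they are not Pisama's tiers or pricing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tier:
    name: str
    cost_usd: float                          # illustrative per-call cost, not a real price
    check: Callable[[dict], Optional[bool]]  # True/False = verdict, None = inconclusive


def run_cascade(trace: dict, tiers: List[Tier]) -> Tuple[Optional[bool], str, float]:
    """Try tiers in order (cheapest first) and stop at the first verdict.
    Returns (verdict, name of the deciding tier, total cost spent)."""
    spent = 0.0
    for tier in tiers:
        spent += tier.cost_usd
        verdict = tier.check(trace)
        if verdict is not None:
            return verdict, tier.name, spent
    return None, "unresolved", spent


def repeated_tool_calls(trace: dict) -> Optional[bool]:
    """Free heuristic: the same tool called three times in a row is a failure."""
    calls = trace.get("tool_calls", [])
    if len(calls) >= 3 and len(set(calls[-3:])) == 1:
        return True
    return None  # inconclusive, escalate to the next tier


def stub_llm_judge(trace: dict) -> Optional[bool]:
    """Placeholder for a paid model call; always answers 'no failure' here."""
    return False


tiers = [
    Tier("heuristic", 0.0, repeated_tool_calls),
    Tier("llm_judge", 0.01, stub_llm_judge),
]
```

With this shape, a trace that trips the heuristic costs nothing, and only inconclusive traces pay for the judge, which is how a median cost can stay far below the cost of judging every trace.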

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14).
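
The numeric parts of that bar can be checked mechanically. The following sketch computes precision, recall, and F1 from confusion counts and applies the thresholds quoted above. The `DetectorEval` type and its field names are hypothetical; the two non-numeric criteria (external-grounded thresholds, no per-difficulty blind spot) are passed in as booleans because the text does not define how they are measured.

```python
from dataclasses import dataclass


@dataclass
class DetectorEval:
    tp: int                     # true positives on external traces
    fp: int                     # false positives
    fn: int                     # false negatives
    external_traces: int        # number of external traces evaluated
    thresholds_external: bool   # thresholds were grounded on external data
    no_blind_spot: bool         # no difficulty bucket where the detector fails


def precision(e: DetectorEval) -> float:
    return e.tp / (e.tp + e.fp) if (e.tp + e.fp) else 0.0


def recall(e: DetectorEval) -> float:
    return e.tp / (e.tp + e.fn) if (e.tp + e.fn) else 0.0


def f1(e: DetectorEval) -> float:
    p, r = precision(e), recall(e)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def is_production_grade(e: DetectorEval) -> bool:
    """Apply the stated bar: F1 >= 0.80, precision >= 0.70, >= 30 external traces,
    plus the two qualitative criteria supplied by the caller."""
    return (
        f1(e) >= 0.80
        and precision(e) >= 0.70
        and e.external_traces >= 30
        and e.thresholds_external
        and e.no_blind_spot
    )
```

For example, a detector with 40 true positives, 5 false positives, and 5 false negatives on 50 external traces has precision and recall of 40/45 each, so it clears both numeric thresholds; the same counts on only 20 external traces would not qualify.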

Benchmark note: the competitor LLM baselines we cite (for example 11.6% best frontier on TRAIL) were measured by Pisama in April 2026 against the published benchmarks, not self-reported by the vendors. Pisama's own 59.9% on TRAIL is the heuristic-only (Tier 1 to 3) result. Raw results are in the open-source repo.

Recommendation

Run both. Use Langfuse for trace storage, prompt management, and the UI. Use Pisama for the structural detectors that catch the failures Langfuse cannot see. Pisama emits standard OTel spans that Langfuse ingests directly, so you do not need to instrument your code twice.
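
As a rough sketch of what "run both" can look like with the OpenTelemetry Python SDK: register one span processor per destination on a single tracer provider, so every span is exported twice without instrumenting twice. Several things here are assumptions to verify before use: the Langfuse OTLP URL and Basic-auth scheme should be checked against the Langfuse documentation for your region or self-hosted instance, and the Pisama endpoint and Bearer header are hypothetical placeholders, since this page does not specify them. `otlp_targets` and `build_provider` are illustrative helper names.

```python
import base64
from typing import Dict, List


def otlp_targets(langfuse_public_key: str, langfuse_secret_key: str,
                 pisama_endpoint: str, pisama_api_key: str) -> List[Dict]:
    """Describe the two OTLP/HTTP destinations that should receive every span."""
    basic = base64.b64encode(
        f"{langfuse_public_key}:{langfuse_secret_key}".encode()
    ).decode()
    return [
        {
            "name": "langfuse",
            # Assumed Langfuse OTLP traces endpoint; confirm in the Langfuse docs.
            "endpoint": "https://cloud.langfuse.com/api/public/otel/v1/traces",
            "headers": {"Authorization": f"Basic {basic}"},
        },
        {
            "name": "pisama",
            # Hypothetical: supply the real URL and auth scheme from Pisama's docs.
            "endpoint": pisama_endpoint,
            "headers": {"Authorization": f"Bearer {pisama_api_key}"},
        },
    ]


def build_provider(targets: List[Dict]):
    """Attach one BatchSpanProcessor per target to a single TracerProvider."""
    # Imported lazily so otlp_targets() works without the SDK installed.
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    for target in targets:
        exporter = OTLPSpanExporter(endpoint=target["endpoint"], headers=target["headers"])
        provider.add_span_processor(BatchSpanProcessor(exporter))
    return provider
```

Calling `opentelemetry.trace.set_tracer_provider(build_provider(targets))` once at startup would then route spans from any OTel-instrumented code to both backends. `build_provider` requires the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages.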

FAQ

Can I use Langfuse and Pisama together?

Yes, this is the recommended setup. Pisama emits OTel spans with `gen_ai.*` semantic conventions. Configure Langfuse as one OTel exporter and Pisama as another; the same traces flow to both.

Why not just write detectors as Langfuse evals?

Langfuse evals run post-hoc on stored traces and use LLM judges by default. Pisama detectors run synchronously during execution, are heuristic-first (free), and are calibrated on a labelled dataset of 7,212 traces. Different problem.