Company · Pisama

Pisama exists because production AI agents fail in ways that humans alone can't keep up with.

Name: Pisama
Author: Pisama

Founder

Pisama is built by Tuomo Nikulainen, in San Francisco. He spent years shipping AI systems whose failure modes were obvious in retrospect but invisible in flight: agents looping, drifting, hallucinating, burning through tokens. The gap became clear. Observability tools were watching the right signals at the wrong layer.

The agent-semantic layer was missing. Most teams were trying to catch agent failures with LLM-as-judge or generic tracing, both of which read fluent prose and miss structural failures. Heuristics tuned to how agents actually fail (state recurrence, persona drift, retrieval gap, workflow divergence) were the obvious move that nobody had made yet.

Pisama is that move: 87 detectors, 6 of them externally validated at production grade, with public per-detector F1 and an open MIT-licensed SDK.

Mission

They become the feature. We become the category.

The category Pisama is building is agent reliability — the action layer above observability. Where existing tools (Patronus, Galileo, Langfuse, Phoenix) see failures, Pisama acts on them: detection tuned to agent behavior, self-healing in loop where it's safe, evidence-rich escalation where it isn't.

Observability becomes a precondition, not a competitor. The existing category sees; Pisama acts. That's the seam, and it's where the next durable category gets built.

Proof

01. 87 detectors, 6 externally validated at production grade, with public per-detector F1, precision, and recall (see the full scoreboard).
02. Pisama-bench v0 published (96 entries across 13 external sources). Reproducible calibration set.
03. 59.9% on TRAIL (Patronus benchmark) vs 11.6% for the best frontier LLM judge (see benchmarks).

Pre-seed. Small team. Operating in the open. The numbers above are the public substrate everything else is built on.

Externally validated at production grade: real-trace F1 0.80 or higher, precision 0.70 or higher, 30 or more external traces, external-grounded thresholds, and no per-difficulty blind spot (capability registry, external-only lane, 2026-06-14).
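
To make the bar above concrete, here is a minimal sketch of how the stated criteria could be checked for a single detector. It is an illustration only, not Pisama's SDK or registry code: the `DetectorResult` class, its field names, and `is_production_grade` are hypothetical, and the two qualitative criteria (external-grounded thresholds, no per-difficulty blind spot) are modelled as plain booleans because the page does not define how they are measured.

```python
from dataclasses import dataclass


@dataclass
class DetectorResult:
    """Hypothetical record of one detector's results on external traces."""
    true_positives: int
    false_positives: int
    false_negatives: int
    external_traces: int
    thresholds_externally_grounded: bool  # assumption: supplied by a reviewer
    has_difficulty_blind_spot: bool       # assumption: supplied by a reviewer


def precision(r: DetectorResult) -> float:
    flagged = r.true_positives + r.false_positives
    return r.true_positives / flagged if flagged else 0.0


def recall(r: DetectorResult) -> float:
    actual = r.true_positives + r.false_negatives
    return r.true_positives / actual if actual else 0.0


def f1(r: DetectorResult) -> float:
    # Harmonic mean of precision and recall.
    p, rc = precision(r), recall(r)
    return 2 * p * rc / (p + rc) if (p + rc) else 0.0


def is_production_grade(r: DetectorResult) -> bool:
    # Numeric thresholds are the ones stated on this page:
    # F1 >= 0.80, precision >= 0.70, at least 30 external traces.
    return (
        f1(r) >= 0.80
        and precision(r) >= 0.70
        and r.external_traces >= 30
        and r.thresholds_externally_grounded
        and not r.has_difficulty_blind_spot
    )


if __name__ == "__main__":
    example = DetectorResult(40, 8, 6, 60, True, False)
    print(f"precision={precision(example):.3f} "
          f"recall={recall(example):.3f} f1={f1(example):.3f}")
    print("production grade:", is_production_grade(example))
```

Every criterion is joined with `and`, so a detector with a strong F1 but fewer than 30 external traces still fails the check, which matches the "all of the above" reading of the definition.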

Above the noise

There are a thousand companies right now claiming to have the next great AI tool. The honest question every buyer is asking (and every advisor we trust keeps asking us) is how you cut through.

Our answer is three things you can verify in under five minutes without taking our word for it.

Per-detector F1 published. 87 detectors, 6 externally validated at production grade, each with public precision, recall, and F1 against a calibration set. None of the four best-funded competitors in this space publish equivalent numbers. See the scoreboard.
Reproducible benchmark. Pisama-bench v1-lite is 1,774 hard-difficulty entries sampled from a 32,100-entry internal golden set. Both the public slice and the methodology are open. Anyone can reproduce the numbers we report.
Self-healing in loop. Observability tools see failures. Pisama acts. Eleven framework applicators in production code, fixing the safe ones before they reach the user. No surveyed competitor ships this combination at the detection layer.

See the failures we catch, anonymized: What breaks.

Values

Open source by default. The SDK ships under MIT. Calibration data, benchmarks, detector taxonomy are published. The default direction is more open, not less.
Calibrated honesty. Per-detector F1 is published, tier (production / beta / experimental) is named on every detector, and we tell visitors when something isn't verified end-to-end yet.
Agent semantics first. Tools that work on artifacts (tokens, traces, prompts) miss the structural shape of agent failures. We tune detectors to how agents actually fail.
Action layer, not observation layer. Observability sees failures. Pisama acts. Detection, self-healing in loop, evidence-rich escalation — in that order.
Pre-seed honesty. We say where we are. No oversold platform claims, no fake enterprise polish, no design-partner names we haven't earned.