Essay · 2026-05-04

Runtime, not evals

Name: Pisama
Author: Pisama

The PocketOS frame, and what it tells you about where agent monitoring is headed.

A car-rental SaaS called PocketOS recently lost a production database in nine seconds. The agent was not adversarial, the model did not hallucinate, and no prompt-injection attack was involved. An engineer asked Cursor’s Plan Mode, marketed as read-only, to fix a credential mismatch in staging. The agent found an old CLI token from a different feature. One volumeDelete call to Railway erased the source data and every snapshot in the same operation. The latest off-volume backup was three months old. Recovery took thirty hours of manual rebuild from Stripe logs and email confirmations.

The post that documented this incident does the field a service by refusing the comfortable framing. This was not an alignment failure. The model did not go rogue. Three small assumptions stacked: a Plan Mode that was a UX hint rather than a permission system, a CLI token whose scope was broader than the operation it was authorizing, and a backup architecture that co-located snapshots with the volume they were meant to protect. Each assumption was individually defensible. Together they were nine seconds.

This essay argues that the PocketOS event names a market gap that the agent-monitoring industry has been politely refusing to acknowledge. The current generation of vendors, grouped under the label “evals,” sells post-hoc trace scoring. They tell you what an agent did, in a transcript, after the fact. The market is moving past them. What customers shipping agents to production this quarter actually need is runtime enforcement: the ability to block a destructive tool call before it lands, not the ability to score the loss after it has occurred.

Plan Mode is not a permission system. It is a UX hint.

This is the cleanest line in the original post and it deserves to be repeated until agent platforms change their defaults. A label on a UI tab that says “read only” does nothing to constrain the dispatcher underneath. The model can still generate tool calls; the dispatcher can still execute them; the API on the receiving end is still authorized to act. The Plan Mode promise is rhetorical, not architectural. The PocketOS team relied on it as if it were architectural, and it was not.

The distinction the field needs to internalize is between prompt-encoded constraint and harness-enforced constraint. A prompt that tells the model “do not delete production resources” is a suggestion. A harness that intercepts every tool-call dispatch, scores the operation against a destructive-action policy, and blocks calls above a configured threshold is a constraint. Suggestions are necessary for steerability. Constraints are necessary for safety. Treating the first as the second is exactly the substitution PocketOS made.
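To make the harness-enforced side of that distinction concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration: the `ToolCall` shape, the keyword-based `destructive_score` heuristic, and the 0.8 threshold are assumptions made for the example, not any platform's real API and not a description of the PocketOS setup. The only point is that the check lives in the dispatcher, where nothing the model generates can skip it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToolCall:
    """Hypothetical shape of a tool call emitted by an agent."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


class BlockedToolCall(Exception):
    """Raised when the harness refuses a call before it reaches the tool."""


# Illustrative policy only: substrings that hint at an irreversible operation.
DESTRUCTIVE_HINTS = {"delete": 1.0, "drop": 1.0, "truncate": 0.9, "update": 0.4}


def destructive_score(call: ToolCall) -> float:
    """Return a score in [0, 1]; higher means more destructive (toy heuristic)."""
    lowered = call.name.lower()
    matches = [weight for hint, weight in DESTRUCTIVE_HINTS.items() if hint in lowered]
    return max(matches, default=0.0)


class GuardedDispatcher:
    """Scores every call and blocks those at or above the threshold."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], block_threshold: float = 0.8):
        self._tools = tools
        self._block_threshold = block_threshold

    def dispatch(self, call: ToolCall) -> Any:
        score = destructive_score(call)
        if score >= self._block_threshold:
            # The tool function is never invoked for a blocked call.
            raise BlockedToolCall(
                f"{call.name} scored {score:.2f}, threshold {self._block_threshold:.2f}"
            )
        return self._tools[call.name](**call.args)


# Usage with stand-in tools; a real harness would wrap actual API clients.
deleted = []
tools = {
    "volumeList": lambda: ["vol-staging"],
    "volumeDelete": lambda volume_id: deleted.append(volume_id),
}
dispatcher = GuardedDispatcher(tools)
print(dispatcher.dispatch(ToolCall("volumeList")))
try:
    dispatcher.dispatch(ToolCall("volumeDelete", {"volume_id": "vol-prod"}))
except BlockedToolCall as exc:
    print("blocked:", exc)
```

A name-matching heuristic like this one would be far too crude for production. A real policy would more plausibly key on the target environment, the reversibility of the operation, and the credential in use. The structural point survives the simplification: the prompt can say anything, and the delete still does not execute.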

Token surface is the actual blast radius

The Railway CLI token PocketOS’s agent picked up had root permissions across the entire GraphQL API. The agent did not need root permissions; it needed only enough access to resolve a credential mismatch in staging. Token surface is the gap between the privilege a token carries and the privilege the immediate operation requires. Most production agent deployments today carry tokens whose scope is one to three orders of magnitude broader than the operation they nominally authorize.

This is, to be precise, an API-design problem before it is an agent-monitoring problem. A platform that issues only root tokens, regardless of intended use, has already decided the failure mode for its customers. The customer’s mitigation is not to argue with the platform; it is to instrument the agent layer with per-call scope inflation scoring. A monitor that reads the agent’s outgoing tool calls, knows the privilege scope each token carries, and computes the ratio between privilege available and privilege used is a tractable detector. It produces a per-call score. Above a threshold, escalate to approval. Above a higher threshold, block.
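The detector described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a shipped Pisama detector: permissions are modeled as plain strings, the score is a simple count ratio, and the thresholds of 10 and 100 are placeholders chosen to echo the orders-of-magnitude framing. The names `scope_inflation`, `decide`, and `Decision` are hypothetical.

```python
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"
    BLOCK = "block"


def scope_inflation(granted: FrozenSet[str], required: FrozenSet[str]) -> float:
    """Ratio of permissions the token carries to permissions this call needs."""
    if not required <= granted:
        raise ValueError("token lacks permissions the call requires")
    # max(..., 1) avoids division by zero for calls that need no permission.
    return len(granted) / max(len(required), 1)


def decide(
    granted: FrozenSet[str],
    required: FrozenSet[str],
    escalate_at: float = 10.0,  # assumed threshold: route to human approval
    block_at: float = 100.0,    # assumed threshold: refuse outright
) -> Decision:
    ratio = scope_inflation(granted, required)
    if ratio >= block_at:
        return Decision.BLOCK
    if ratio >= escalate_at:
        return Decision.ESCALATE
    return Decision.ALLOW


# A root-style token with 200 made-up permissions, used for a one-permission call.
root_token = frozenset(f"perm:{i}" for i in range(200))
print(decide(root_token, frozenset({"perm:0"})))

# A narrowly scoped token used for the same kind of call.
staging_token = frozenset({"staging:read", "staging:write", "staging:env"})
print(decide(staging_token, frozenset({"staging:env"})))
```

Counting permissions treats every permission as equally dangerous, which is plainly not true of a volume delete versus a log read. Weighting each permission by its destructive potential would be a natural refinement; the ratio-and-threshold shape stays the same.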

We shipped two new detectors at Pisama this week: planning-fallacy detection (agents systematically underestimating cost and time) and authority-gradient detection (sub-agents suppressing contradicting evidence to defer to an orchestrator). The next detector after those, by direct line from the PocketOS event, is scope-inflation detection. Same shape. Different layer of the privilege hierarchy.

Three small assumptions are the actual category of failure

No single failure caused the nine-second loss. The model bypassed Plan Mode because Plan Mode had no enforcement. The token was over-scoped because Railway issued root tokens. The backups disappeared because the backup volume was the same volume the destructive operation targeted. Three independent layers, each individually survivable, simultaneously failed. This is the Swiss-cheese model of accident causation that James Reason developed for industrial safety in 1990. The multi-agent-LLM context inherits the failure pattern unchanged.

The current agent-monitoring market is good at single-failure detection. Each detector fires on a single signal: a loop signature, a hallucinated citation, a persona drift. The harder problem the PocketOS event surfaces is stack detection: noticing when multiple detectors are simultaneously elevated on adjacent traces in ways that compound. An agent session showing elevated authority-gradient suppression plus elevated planning-fallacy underestimation plus an over-scoped token in the recent tool-call history is materially riskier than any one of those signals alone. The compound risk is not the sum; it is the product. If each layer is an independent safeguard, the chance that all of them fail together is the product of their individual failure chances, so raising several at once moves the joint risk far more than adding the scores would suggest. A session-level aggregator that watches per-detector scores and surfaces compound elevation is a small extension of the existing detector dispatch and a large reframe of the customer narrative.

The product question is whether the customer wants to read a list of every detector that fired this week or wants a single number per agent session telling them how many independent failure-mode dimensions were simultaneously elevated. The second framing is what stacked-risk monitoring looks like. The first framing is what current evals deliver. The market is past the first one.
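A small Python sketch of the session-level aggregation this section describes. It rests on an interpretive assumption that should be read as such: each detector score in [0, 1] is treated as the chance that one independent safeguard layer fails, so the joint score is the product of the scores. The function name, detector names, and the 0.5 elevation cutoff are hypothetical.

```python
import math
from typing import Dict, Tuple


def stacked_risk(scores: Dict[str, float], elevated_at: float = 0.5) -> Tuple[int, float]:
    """Return (count of elevated detectors, joint score) for one session.

    Assumption: scores are independent per-layer failure chances, so the
    chance that every layer fails at once is their product.
    """
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name} is outside [0, 1]: {score}")
    elevated = sum(1 for score in scores.values() if score >= elevated_at)
    # math.prod of an empty iterable is 1, so handle the empty session explicitly.
    joint = math.prod(scores.values()) if scores else 0.0
    return elevated, joint


quiet_session = {
    "authority_gradient": 0.1,
    "planning_fallacy": 0.1,
    "scope_inflation": 0.1,
}
stacked_session = {
    "authority_gradient": 0.6,
    "planning_fallacy": 0.7,
    "scope_inflation": 0.9,
}

for label, session in (("quiet", quiet_session), ("stacked", stacked_session)):
    count, joint = stacked_risk(session)
    print(f"{label}: {count} elevated, joint={joint:.3f}, sum={sum(session.values()):.1f}")
```

In this made-up example the sum of scores rises from 0.3 to 2.2, about sevenfold, while the joint score rises from roughly 0.001 to roughly 0.378, several hundredfold. The elevated count is the single per-session number; the joint score is one way to rank sessions that share the same count. Real detector scores are rarely independent probabilities, so the product should be read as a ranking heuristic, not a calibrated risk.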

What this means for the eval-first vendors

Patronus, Galileo, Langfuse, and the long tail of trace-observability tools are all positioned around the question “did the agent perform well?” They score finished transcripts. They produce dashboards. They flag anomalies after the fact. For the deployment-stage customer who has just read the PocketOS post, the after-the-fact answer is the wrong shape. Nine seconds does not give you an after-the-fact remediation window.

Two credible responses are open to the eval-first vendors. The first is to extend their product into the runtime-enforcement layer, which means hooking into customers’ tool-call dispatch path rather than just ingesting traces. This is a meaningful engineering investment and a different sales motion. The second is to partner with a runtime-enforcement vendor and concede that evaluation and enforcement are different products at different stages of the agent lifecycle. The remaining path, continuing to claim that post-hoc scoring is sufficient for production agent deployment, is the one the market is currently rejecting.

What Pisama is doing about it

The detectors we ship today read agent traces and produce per-trace scores. That is the correct first product. It is not the correct last product. The healing module in Pisama already has the right primitives for runtime enforcement: checkpoints, rollback, and approval policies for high-risk fixes. What is missing is wiring those primitives into the tool-call dispatch path itself, so a destructive call that exceeds policy is intercepted before it lands rather than diagnosed after it does.
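As a rough illustration of what that wiring could look like, the Python sketch below composes the three primitives around a single call. It is not Pisama's healing-module API. The `enforce` function, its parameters, and the 0.5 approval threshold are all hypothetical, and the checkpoint, rollback, and approval steps are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def enforce(
    action: Callable[[], T],
    *,
    risk: float,
    approve: Callable[[], bool],
    checkpoint: Callable[[], Any],
    rollback: Callable[[Any], None],
    approval_threshold: float = 0.5,  # assumed cutoff for requiring approval
) -> T:
    """Run `action` only after the approval and checkpoint steps have passed."""
    # Approval policy: high-risk calls need an explicit yes before anything runs.
    if risk >= approval_threshold and not approve():
        raise PermissionError("high-risk call was not approved")
    # Checkpoint is taken before the action so there is something to restore.
    handle = checkpoint()
    try:
        return action()
    except Exception:
        # Roll back to the checkpoint, then surface the original error.
        rollback(handle)
        raise


# Usage with stand-in callables.
events = []
result = enforce(
    lambda: "rotated staging credential",
    risk=0.2,
    approve=lambda: False,  # never consulted: risk is below the threshold
    checkpoint=lambda: events.append("checkpoint") or "cp-1",
    rollback=lambda handle: events.append(f"rollback {handle}"),
)
print(result, events)
```

One caveat follows directly from the PocketOS post: rollback is only as good as where the checkpoint lives. A checkpoint stored on the resource the action can destroy reproduces the snapshot co-location failure, and rollback covers calls that fail, not destructive calls that succeed, which is why the approval gate runs first.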

The PocketOS frame gives a clear roadmap. Token-surface scoring is the next detector. Stacked-risk aggregation is the next product surface. Pre-action blast-radius enforcement is the next architectural commitment. We are taking them in that order because each step compounds the leverage of the next.

Evals tell you what an agent did. Runtime enforcement tells you what an agent should not be allowed to do.

That distinction is the through-line of the PocketOS frame, and it is the wedge that separates the next generation of agent monitoring from the post-hoc-scoring incumbents. If you are shipping agents to production this quarter, your blast radius is not your code. It is your token surface, your API design, and your backup co-location, mediated by whatever runtime layer you can place between your agent and your tools. Choose that layer carefully. Nine seconds is not very long.