Essay · 2026-04-30

The conflicted incumbent

Author: Pisama

Why your model vendor cannot be your eval vendor.

In April 2026 at Google Cloud Next, Google reorganized its enterprise AI portfolio under a single brand. Vertex AI, the platform Google had spent four years positioning as the home of agentic workloads, gave way to the Gemini Enterprise Agent Platform. The announcement included a long list of new products: Agent Identity, Registry, Gateway, Anomaly Detection, Simulation, Evaluation, Observability, Optimizer. The center of gravity moved from “build agents” to “govern, evaluate, and monitor agents.” That was the right move strategically. It was also a moment that should make any serious AI buyer pause, because it formalizes a question the industry has been avoiding: can the company that sells you the model also be the company that grades it?

The same week, the same question was being asked, in a different register, on the public side of Cisco’s accounts. Cisco completed its acquisition of Galileo earlier this year and is integrating Galileo’s evaluation and observability suite into its security and platform stack. Galileo and Gemini Enterprise, two of the loudest brands in agent observability, now both belong to vendors whose own products the observability layer is meant to evaluate. The pattern is identical: the incumbent bundles a check on its own behavior.

This essay argues that for any organization shipping agents into a high-stakes context, that arrangement is structurally unacceptable. The reasoning is not novel. Every other domain where money or safety is at stake settled this question decades ago. AI evaluation is the last category to relearn the lesson.

Auditor independence is universal

The principle that a check should not be performed by the entity being checked is older than computing. It is the reason public companies have external auditors. Sarbanes-Oxley, passed in 2002 in response to Enron and Arthur Andersen, made auditor independence a federal requirement, with the SEC and PCAOB enforcing rules about which non-audit services an audit firm may simultaneously provide. The FDA’s GMP regulations require that manufacturing quality testing be performed by personnel independent of production. ISO/IEC 17025 makes laboratory accreditation conditional on the independence and impartiality of testing personnel. SOC 2 attestation, expected of nearly any SaaS vendor selling into regulated buyers, can only be issued by an independent auditor with no material relationship to the audited entity.

These are not stylistic preferences. They are the load-bearing assumptions of every modern regulatory regime that touches money, health, or safety. The reason is empirical. Same-supplier checks fail in predictable ways. They fail under quarterly revenue pressure. They fail under defect-rate denial. They fail under the simple human dynamic of an evaluator whose career depends on the evaluated party’s success. Every domain that has tried internal-only verification has eventually found it insufficient and moved to a separation-of-duties regime. The cost of that lesson is measured in financial collapses, drug recalls, and security breaches.

The agent industry is now arriving at the same juncture. The current default in the market, set by the bundled offerings of the largest cloud and the largest network-security vendor, is that one party sells the model, runs the agent, and grades the result. By the standards of every other regulated industry, this is the configuration that should be disqualified first.

Single-supplier failure mode

The independence argument is structural. There is also a practical one, and for engineering buyers it tends to land harder. When the same vendor supplies the model, the runtime, and the evaluator, your operational picture becomes coupled in ways that are invisible until something goes wrong.

Consider three concrete failure modes. First, the vendor outage. If Gemini’s API has a bad afternoon, your agents fail, and an evaluator that runs on the same infrastructure fails alongside them. You have no way to confirm whether agents are silently degraded or simply unobserved. The eval is available only when the shared infrastructure is healthy, which is exactly when you need it least.

Second, the silent reconfiguration. When the bundled vendor changes its escalation logic or its judge model, your historical detection rates may shift without any change in your agent code. You cannot distinguish a real shift in agent behavior from an artifact of the evaluator’s retuning. This problem will be recognizable to anyone who has tried to compare year-over-year metrics from a single SaaS dashboard whose definitions changed mid-year.

Third, the policy change. Vendors update content policies, refusal behavior, and safety filters on their own cadence. When the same vendor evaluates compliance with policies it sets, you are buying a tautology. The vendor’s eval will report compliance with the vendor’s current policy, regardless of whether that policy still serves your buyers.

In each case, an independent evaluator running on different infrastructure offers a way to triangulate. A single-supplier setup does not.
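One way to make the second failure mode concrete is a frozen-trace check. The sketch below is illustrative only and assumes you keep a fixed set of recorded agent traces and periodically re-score them with the same evaluator. Because the traces never change, any movement in the flag rate on that set must come from the evaluator and not from the agents. The names `flag_rate`, `attribute_shift`, and the `tolerance` value are hypothetical, not part of any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    frozen_rate_before: float  # flag rate on the frozen traces at the earlier scoring
    frozen_rate_after: float   # flag rate on the same traces at the later scoring
    evaluator_shifted: bool    # True if the evaluator itself appears to have changed


def flag_rate(verdicts: list[bool]) -> float:
    """Fraction of traces the evaluator flagged as failures."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def attribute_shift(
    frozen_before: list[bool],
    frozen_after: list[bool],
    tolerance: float = 0.02,  # hypothetical threshold; tune to your own noise floor
) -> Attribution:
    """Compare two scorings of the SAME frozen traces by the same evaluator."""
    if len(frozen_before) != len(frozen_after):
        raise ValueError("both scorings must cover the same frozen traces")
    before = flag_rate(frozen_before)
    after = flag_rate(frozen_after)
    return Attribution(before, after, abs(after - before) > tolerance)
```

If `evaluator_shifted` comes back true, a change in your production detection rate over the same period cannot be read as a change in agent behavior until the evaluator's retuning is accounted for. An independent evaluator pinned to a known version gives you the same comparison without depending on the vendor to announce its changes.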

Calibration is a moat the platform won’t ship

The third argument is the one buyers seem to discover last and like best. The bundled offerings are not particularly good at evaluation, because depth is not what they are optimized for.

Read any current product page from Galileo or from Gemini Enterprise’s evaluation suite. The headline phrase is “anomaly detection.” The technical claim is that the platform will surface unusual or low-confidence outputs and route them for review. What you will not find is a per-failure-mode F1 number. You will not find a published list of distinct failure types the platform can detect. You will not find a cross-validated precision-recall curve per failure mode. You will find vague gestures at “guardrails” and “drift,” and you will be asked to take it on faith that those gestures are quantitatively meaningful.

The opacity is not laziness. Calibration is a different product from feature-checkbox observability, and platform vendors are not structurally incentivized to ship it. A platform’s job is breadth: every customer, every framework, every workload, on the largest possible surface area. Per-mode calibration is the opposite kind of work. It requires a structured failure taxonomy, golden datasets per failure mode, cross-validated thresholds, and the discipline to publish the resulting F1 numbers even when they are humbling. That work scales linearly with the number of failure modes, and platforms tend not to invest in linear-scaling depth when there is exponential breadth still to capture.

The asymmetry shows up in numbers. Pisama publishes 84 detectors covering distinct failure modes including coordination, grounding, persona drift, completion, withholding, decomposition, and convergence. 49 are measured with publicly calibrated per-detector F1 and 95 percent confidence intervals; 6 are externally validated at production grade at mean F1 0.86. The full LLM-judge escalation cost for an end-to-end recalibration of the bench is $3.66 per pass. Those are public numbers, reproducible from open source. The platform vendors do not publish equivalent numbers because publishing them would invite a comparison they cannot win on the surface area where calibration matters most.
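For readers who have not computed one, a per-detector F1 with a 95 percent confidence interval can be sketched in a few lines. This is a generic percentile bootstrap over a labeled golden set, offered as an assumption about what such a number minimally involves; it is not a description of how Pisama's published figures are produced. The function names and the resample count are illustrative.

```python
import random


def f1_score(labels: list[bool], preds: list[bool]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives at all."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(
    labels: list[bool],
    preds: list[bool],
    n_resamples: int = 2000,  # illustrative default
    alpha: float = 0.05,      # 0.05 gives a 95 percent interval
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for one detector."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_resamples):
        # Resample golden-set examples with replacement and re-score.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return f1_score(labels, preds), lo, hi
```

Run once per failure mode against that mode's golden dataset, this yields the kind of table a buyer can compare across vendors. The interval matters as much as the point estimate: on a small golden set it will be wide, which is precisely the information an aggregate "anomaly detection" claim hides.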

This is the moat. It is unsexy. It is the discipline to commit to per-mode calibration and publish the results. Compute and proprietary data do not substitute for that work. It is the only category of moat that survives the platform vendor’s natural advantage in distribution.

Where this lands

Step back from the specifics. The pattern playing out in agent evaluation is the same pattern that has played out in every other category where the largest vendor in a market eventually offers a check on its own behavior. The first generation of buyers accepts the bundle because it is convenient. The second generation, the regulated and the high-stakes, demands separation of duties and pays a premium for independence. The third generation watches the first two and switches.

Vertical SaaS platforms shipping agentic features into healthcare, insurance, field services, legal, and financial workflows are about to become the second generation. Their customers do not tolerate “the platform vendor said it was fine” as a defense. Their auditors and underwriters will ask, in the same language they have used for every other technology in the past forty years: who graded this, and what is their relationship to the vendor?

When that conversation arrives, the answer “we use the model vendor’s bundled eval” will close as many sales as “we self-report our financial statements” closed in 2002. The conflicted incumbent will not be banned. It will simply lose the segment of the market that pays the most.