Analytical semantics: catching wrong queries over user data
AgentFuel (Maddi, Naval, Mande, Duan, Girish, Sekar; Rockfish Data and Carnegie Mellon; CAIS 2026; arXiv 2603.12483) evaluates six state-of-the-art data analysis agents (Databricks Genie, Snowflake Cortex Analyst, Nao on GPT-4.1, PandasAI on Sonnet 4.6 / Opus 4.6 / O4-mini) on three timeseries domains (e-commerce, IoT, telecom). Accuracy collapses with query complexity: 73% on stateless queries, 34% on stateful queries, 10% on incident-specific queries.
The paper documents the gap convincingly. Existing text-to-SQL benchmarks (Spider2-Snow, BIRD LiveSQLBench, BEAVER) are 92 to 96 percent stateless one-shot lookups. Real practitioners ask stateful, conditional, incident-aware questions. The agents have not been measured against that shape, and they fail at it.
More importantly, the paper's failure listings (Listings 2 through 17) show that the failures are semantic, not behavioral. The agent does not derail, does not loop, does not violate spec. It produces a SQL query that looks plausible but translates the user's analytical intent incorrectly: wrong time window, exact-sequence match when subsequence was implied, aggregation over the wrong slice, filter applied too broadly.
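To make one of these failure shapes concrete, here is a small illustration of the difference between an exact-adjacency match and an ordered-subsequence match. It is written in Python rather than SQL for brevity. The event names and both helper functions are invented for this example and do not come from the AgentFuel listings.

```python
from typing import Sequence


def contains_adjacent(events: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs as a contiguous run inside `events`."""
    n, m = len(events), len(pattern)
    return any(list(events[i:i + m]) == list(pattern) for i in range(n - m + 1))


def contains_subsequence(events: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs in order, with other events allowed in between."""
    remaining = iter(events)
    # `step in remaining` consumes the iterator up to the first match,
    # so each later step can only match strictly after the previous one.
    return all(step in remaining for step in pattern)


# Hypothetical session: the user added to cart, browsed again, then checked out.
session = ["view", "add_to_cart", "view", "checkout"]
pattern = ["add_to_cart", "checkout"]

print(contains_adjacent(session, pattern))     # False: a "view" sits in between
print(contains_subsequence(session, pattern))  # True: the order is preserved
```

A query that implements the first check when the user meant the second runs without error and silently undercounts.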
Pisama ships analytical_semantics, a new detector that catches this gap. F1 0.9697 on 32 hand-authored seeds, precision 0.9412, recall 1.0000, threshold 0.85. Confusion: 16 TP, 15 TN, 1 FP, 0 FN. Hard slice (17 entries) F1 1.0000. Production tier on the calibration registry. The detector covers five sub-modes lifted directly from the AgentFuel failure taxonomy.
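The headline numbers follow directly from the confusion counts. The snippet below is a quick arithmetic check, not part of the detector, and the function name is ours.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Standard precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported above: 16 TP, 15 TN, 1 FP, 0 FN (TN does not enter these metrics).
precision, recall, f1 = precision_recall_f1(tp=16, fp=1, fn=0)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9412 1.0 0.9697
```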
The behavioral-failure gap
Pisama's catalog covers 54 production failure modes. Almost all of them are behavioral: persona drift, decomposition error, coordination breakdown, derailment, hallucination, loop, scope escalation, sycophancy, consensus collapse. The agent does something it should not, or fails to do something it should.
AgentFuel's failures do not fit this shape. The agent generates a query, the query runs without error, the result comes back as a number, and the number is wrong. Nothing in the trace looks anomalous. A behavioral-failure detector cannot catch this.
This matters most for the in-app analytics chatbot. A SaaS founder ships a small "ask your data" assistant over their Postgres tables; the assistant answers user questions about their own data (workouts, sessions, orders, payments). When the assistant gets the query wrong, no behavioral detector fires. The user gets a confidently-stated wrong number.
What we built
Two-stage LLM pipeline, mirroring the AgentPex / specification compliance pattern (haiku-4.5 default, Anthropic-only). Stage 1 extracts the user's analytical intent into structured fields (time window, entities, aggregation type, filter predicates, expected result shape) and caches per query hash. Stage 2 takes the extracted intent plus the agent's SQL plus the data schema, runs a deterministic schema-existence check, and escalates to a second LLM call for semantic judgment.
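The two deterministic pieces of this pipeline, the per-query-hash cache and the schema-existence check, can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the shipped implementation. `Intent`, `IntentCache`, `missing_identifiers` and `stub_extractor` are hypothetical names, the whitespace and case normalization before hashing is an assumption, and the Stage 1 LLM call is replaced by a stub. The sketch also assumes that the identifiers referenced by the SQL have already been pulled out by a parser upstream.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple


@dataclass(frozen=True)
class Intent:
    # Fields mirror the structured intent described above.
    time_window: str
    entities: Tuple[str, ...]
    aggregation: str
    filters: Tuple[str, ...]
    result_shape: str


class IntentCache:
    """Caches extracted intents keyed by a hash of the user's question."""

    def __init__(self, extractor: Callable[[str], Intent]):
        self._extractor = extractor
        self._store: Dict[str, Intent] = {}
        self.calls = 0  # number of real extractor invocations

    @staticmethod
    def key(user_query: str) -> str:
        # Assumption: collapse whitespace and case before hashing.
        normalized = " ".join(user_query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, user_query: str) -> Intent:
        k = self.key(user_query)
        if k not in self._store:
            self.calls += 1
            self._store[k] = self._extractor(user_query)
        return self._store[k]


def missing_identifiers(referenced: Set[str], schema: Dict[str, Set[str]]) -> Set[str]:
    """Return referenced table or column names that the schema does not contain."""
    known = set(schema)  # table names
    for columns in schema.values():
        known |= columns
    return referenced - known


def stub_extractor(user_query: str) -> Intent:
    # Placeholder for the Stage 1 LLM call.
    return Intent("last 7 days", ("workouts",), "sum", (), "scalar")


cache = IntentCache(stub_extractor)
cache.get("Total workout minutes in the last 7 days?")
cache.get("total  workout minutes in the last 7 days?")  # same key after normalization
print(cache.calls)  # 1

schema = {"workouts": {"user_id", "duration_min", "started_at"}}
print(missing_identifiers({"workouts", "duration_min", "calories"}, schema))  # {'calories'}
```

Running the cheap set-difference check before the second LLM call means a provably missing column never needs a semantic judgment at all.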
The checker prompt is structured as a hard checklist. Before flagging any violation, the LLM must answer yes to three questions: did the user explicitly say the thing the agent contradicts; is the agent's choice clearly wrong rather than a valid alternative; can you cite a specific span of the user's question as evidence. A list of forbidden flag patterns (SQL conventions, calendar-week interpretation, unspecified years, lowercase filters) sits at the top of the prompt where the judge sees it first.
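The three-question gate can also be enforced on the judge's structured output after the call returns. The sketch below is one way to do that, under assumptions: `JudgeAnswer` and `accept_violation` are hypothetical names, and the substring check that verifies the cited span really appears in the question is our addition, not something the prompt description above specifies.

```python
from dataclasses import dataclass


@dataclass
class JudgeAnswer:
    user_was_explicit: bool  # did the user explicitly say the thing the agent contradicts?
    clearly_wrong: bool      # is the agent's choice clearly wrong, not a valid alternative?
    question_span: str       # span of the user's question cited as evidence


def accept_violation(answer: JudgeAnswer, user_query: str) -> bool:
    """Keep a flagged violation only if all three checklist answers hold."""
    span = answer.question_span.strip()
    # Assumption: a cited span must literally occur in the question.
    cited = bool(span) and span.lower() in user_query.lower()
    return answer.user_was_explicit and answer.clearly_wrong and cited


question = "How many orders were placed in March 2024?"
grounded = JudgeAnswer(True, True, "in March 2024")
ungrounded = JudgeAnswer(True, True, "during the outage")

print(accept_violation(grounded, question))    # True
print(accept_violation(ungrounded, question))  # False: the span is not in the question
```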
Sub-mode taxonomy
Every violation carries a typed sub-mode tag. The five sub-modes are stable strings referenced by golden seeds, tests, and downstream reporting:
- time_window_error: Wrong time interval, missing time filter, or filter on a window the user did not name.
- sequence_mismatch: Exact-adjacency match when the user asked for an ordered subsequence, or reversed ordering.
- schema_error: Reference to a column or table that is provably not in the data schema.
- scope_error: Aggregation over a strictly larger slice than the user named (e.g. whole dataset vs incident window).
- filter_error: Filter predicate missing, too broad, too narrow, or matching the wrong sub-population.
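Because the tags are stable strings, downstream code can pin them in an enum and reject anything else. The following is a sketch of that idea. The `SubMode` class and `parse_sub_mode` helper are illustrative names, not the detector's actual types. Only the five string values come from the list above.

```python
from enum import Enum


class SubMode(str, Enum):
    TIME_WINDOW_ERROR = "time_window_error"
    SEQUENCE_MISMATCH = "sequence_mismatch"
    SCHEMA_ERROR = "schema_error"
    SCOPE_ERROR = "scope_error"
    FILTER_ERROR = "filter_error"


def parse_sub_mode(tag: str) -> SubMode:
    """Map a raw tag to a SubMode; raises ValueError for unknown tags."""
    return SubMode(tag)


print(parse_sub_mode("scope_error").name)  # SCOPE_ERROR
```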
Calibration
32 hand-authored seeds: 16 positives covering each sub-mode at easy / medium / hard difficulty, and 16 negatives where the agent's SQL correctly translates the intent. The negatives include the deliberate trap cases: ambiguous "last week" that the conservative judge must NOT flag as a window error, ILIKE-broadening that's ambiguous-broad rather than wrong-broad, schema-aware platform filters that look suspicious but are correct given the column dictionary.
| Metric | Value | Notes |
|---|---|---|
| F1 | 0.9697 | 95% CI 0.90 to 1.00 |
| Precision | 0.9412 | 1 false positive on a borderline ILIKE filter |
| Recall | 1.0000 | Zero false negatives at threshold 0.85 |
| Samples | 32 | 16 positives, 16 negatives, 17 hard |
| Hard slice F1 | 1.0000 | Up from 0.00 when the bench was 18 seeds with only 3 hard |
| Readiness tier | production | Clears 30 samples, F1 >= 0.80, P >= 0.70 |
| Latency | 3.2s mean, 6.0s p95 | Two haiku-4.5 calls per detection |
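The readiness-tier row reduces to three threshold comparisons. The function below sketches that gate using the thresholds quoted in the table. The function name is ours, and treating "clears 30 samples" as "at least 30" is an assumption.

```python
def is_production_tier(samples: int, f1: float, precision: float,
                       min_samples: int = 30,
                       min_f1: float = 0.80,
                       min_precision: float = 0.70) -> bool:
    """True when all three production-tier thresholds are met."""
    return samples >= min_samples and f1 >= min_f1 and precision >= min_precision


print(is_production_tier(samples=32, f1=0.9697, precision=0.9412))  # True
print(is_production_tier(samples=18, f1=0.9697, precision=0.9412))  # False: too few samples
```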
What this is not
Not an evaluation framework for data analysis agents. AgentFuel is that. This is a runtime detector that catches a category of failure AgentFuel surfaces.
Not a SQL linter. The detector requires the user's question to ground every violation. SQL that looks ugly but answers the user's question correctly is not a violation.
Not yet validated on real-world traces. The 32 seeds are hand-authored in Pisama domains (a fitness app, a SaaS marketplace). AgentFuel itself uses synthetic data, and we follow that convention for the seed bench. Real-world Pisama integrations are next.
Limitations
- Synthetic seed bench. AgentFuel argues this is acceptable (deterministic ground truth) and we agree, but the next iteration should pull real failed-query traces from a deployed chatbot.
- Two-call latency. Each detection costs roughly 3 seconds of wall time and a small fraction of a cent. The intent-extraction cache amortizes the cost when many users ask the same question. For high-volume deployments, a cheaper grader pre-filter is worth adding.
- The lone false positive is a borderline ILIKE filter. The judge cannot distinguish "intentionally broad" from "accidentally broad" when the user's question is itself somewhat ambiguous. World knowledge would help; the current detector does not have it.
- Wide CI on the lower bound. The 95% confidence interval is 0.90 to 1.00. We are comfortably above the production tier threshold but more seeds will tighten the bound.
Code and data
- Detector source: app/detection/analytical_semantics.py
- Golden seeds: data/golden_dataset_analytical_semantics.json
- Unit tests: tests/detection/test_analytical_semantics.py (13 tests, all green)
- Calibration entry: python -m app.detection_enterprise.calibrate_cli --only analytical_semantics --tiered
- Capability registry: data/capability_registry.json, entry analytical_semantics
- AgentFuel paper: arXiv 2603.12483, code at github.com/Rockfish-Data/agentfuel_paper
Try it
The detector is behind a feature flag. Set FEATURE_ANALYTICAL_SEMANTICS=true and pass user_query, system_context, data_schema, and agent_query to the detection API. Pisama returns a structured result with a sub-mode tag, confidence, evidence span from the SQL, and the span from the user's question that grounds the call.
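As a rough illustration, a request body carrying those four fields might be assembled like this. Only the four field names and the flag name come from the text above. The helper names, the shape of `data_schema`, and the example values are assumptions, and no endpoint is shown because none is specified here.

```python
import json
from typing import Any, Dict, Mapping


def flag_enabled(env: Mapping[str, str]) -> bool:
    """Check the feature flag in an environment-like mapping (e.g. os.environ)."""
    return env.get("FEATURE_ANALYTICAL_SEMANTICS", "").lower() == "true"


def build_detection_request(user_query: str, system_context: str,
                            data_schema: Dict[str, Any], agent_query: str) -> Dict[str, Any]:
    """Bundle the four inputs the detector expects into one payload."""
    return {
        "user_query": user_query,
        "system_context": system_context,
        "data_schema": data_schema,
        "agent_query": agent_query,
    }


payload = build_detection_request(
    user_query="How many workouts did I log in March 2024?",
    system_context="Fitness app assistant answering questions about the user's own data.",
    # Hypothetical schema format: table name mapped to a list of column names.
    data_schema={"workouts": ["user_id", "started_at", "duration_min"]},
    agent_query="SELECT COUNT(*) FROM workouts WHERE user_id = 42",
)
print(json.dumps(payload, indent=2))
```

In this made-up example the agent's SQL omits the March 2024 filter the user named, which is the kind of case the time_window_error sub-mode describes.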