25 Mar, 2026


43% accuracy with Opus-4.6 & friends - will Text-to-SQL ever be good enough?

The recent Data Agent Benchmark (DAB) from Berkeley's EPIC Data Lab shows disappointing accuracy from state-of-the-art frontier models on real-world data questions.

Model          Score
Opus-4.6       43%
Gemini-3-Pro   38%
GPT-5.2        25%

Given how critical data is to day-to-day business operations and strategic business decisions, this accuracy suggests AI is a far cry from becoming useful with internal data.

What's perhaps most surprising is that AI models feel like nascent AGI for coding – yet somehow can't crack SQL, one of the oldest, most widely known, and closest-to-natural-language programming languages there is.

What gives?

Consider these quotes from the paper:

Agents typically select the right data, but fail at planning the computation or implementing it correctly.

Or

GPT-5-mini outperforms GPT-5.2 despite being the smaller and cheaper model, suggesting that model scale does not alone determine agent performance.

It seems the bottleneck is not the "IQ" of the AI or its ability to write sophisticated SQL, but its lack of the business and technical context required to solve a data task and produce analysis that is sensible, trustworthy, and accurate enough to be used for business operations.

Data tasks != coding tasks

In hindsight, perhaps it is not surprising that data tasks are not coding tasks.

Writing software is technically complicated but the domain context required to guide a frontier model to produce software that is functionally accurate is quite light in comparison. Write a product spec and boom, LLMs generate software that works.

"English is the new programming language" - Andrej Karpathy

And yes, we're still quite a way off from seeing code that is perfect or that aligns with our sense of taste, but getting generated software that adheres to the desired product spec is a massive achievement in and of itself. And given that AI, every single day, is the worst it's ever going to be, the future only looks brighter for code generation.

Yet this giant leap forward in software engineering tasks doesn't translate to data tasks.

If you think about it, this does pass the sniff test:

A good data analyst does far more business-context gathering and far less technical execution; a good software engineer does the opposite.
Context was always the job.

The market numbers reflect this reality too. Let's compare market sizes and market spend on "context layer" vs "execution layer" today in software vs data.

The developer tools TAM — IDEs, code assistants, debugging tools — sits at around ~$7B (Mordor Intelligence, 2025). The infrastructure those developers deploy to? ~$420B in annual cloud spend (Synergy Research Group, 2025). Software engineering is an execution-heavy discipline, and the money follows accordingly.

Data is the inverse. The data platform TAM — warehouses, catalogs, BI, governance, all the tooling that helps you understand your data — is ~$90B (Business Research Insights, 2025). But the actual compute spent running data workloads is a fraction of overall cloud infrastructure spend. Even Snowflake and Databricks, the two dominant data execution platforms, combine for only ~$10B in annual revenue (SaaStr, 2025).

           Context & Understanding      Execution & Infrastructure
Software   ~$7B (dev tools)             ~$420B (cloud infra)
Data       ~$90B (data platforms)       ~$10B (warehouse compute)

Software engineers spend a little on context and a lot on execution. Data analysts spend a lot on context and comparatively little on execution.

The tooling markets reflect the work itself, and any AI solution for data that skips the context-gathering step is solving the wrong problem.

How are we going to solve AI on data then?

Based on the empirical results of the DAB, and if the hypotheses above are right that context gathering is what differentiates data problems from coding problems, then it is clear that we need a system for capturing and maintaining shared context.

Unlike a product specification for software, the technical and business context for data tasks is spread across a team, is continuously changing, and is subject to dozens of nuances depending on who's asking and for what purpose.

We need an architecture and a new way of working to solve this shared context problem.

The individual AI form-factor that seems to have reached its zenith with ChatGPT on the consumer front and Claude Code on the SWE front will not be enough.

We need the team AI form-factor which solves the shared context problem and applies the latest advancements in frontier models to solve data problems.

Watch this space for Part 2 as we review and talk about our experiences with what's emerging in the AI landscape for teams, not just individuals.

Notes on the benchmark:

While I use the phrase "Text-to-SQL" in this post, the DAB actually measures the model's ability to reason through multi-step data analysis: discovering schemas, writing SQL, post-processing with Python, handling messy real-world data (unstructured text, ill-formatted joins, domain-specific logic), and composing a final answer.

Pure SQL generation is just one piece of it.
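To make that concrete, here is a minimal sketch of the shape of task such a benchmark poses. The table, column names, and data below are invented for illustration (they are not from the DAB itself): SQL alone retrieves the rows, but the messy values mean the real work happens in schema discovery and Python post-processing.

```python
# Hypothetical multi-step data task: (1) discover the schema, (2) write SQL,
# (3) post-process ill-formatted values in Python before composing an answer.
# Table and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, region TEXT, amount TEXT);
INSERT INTO orders VALUES
  (1, 'EMEA', '$1,200'),
  (2, 'emea', '1,800'),
  (3, 'APAC', '$950');
""")

# Step 1: schema discovery -- the agent must find out which columns exist.
cols = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]

# Step 2: SQL retrieves rows, but 'amount' is a text column with mixed
# formatting, so the aggregation can't be done purely in SQL.
rows = conn.execute("SELECT region, amount FROM orders").fetchall()

# Step 3: Python post-processing encodes the "context": region names are
# case-inconsistent and amounts mix '$' signs and thousands separators.
def parse_amount(s: str) -> float:
    return float(s.replace("$", "").replace(",", ""))

totals: dict[str, float] = {}
for region, amount in rows:
    totals[region.upper()] = totals.get(region.upper(), 0.0) + parse_amount(amount)

print(cols)    # ['id', 'region', 'amount']
print(totals)  # {'EMEA': 3000.0, 'APAC': 950.0}
```

A vanilla "generate one SQL query" approach would stop at step 2 and get the wrong total, which is why pure query generation scores near zero on tasks like these.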

Vanilla database query generation would yield close to 0% accuracy on the benchmark.

Tanmai Gopal
Tanmai is the co-founder of PromptQL.

See PromptQL in action on your data.