AI reliability needs a new kind of benchmark.
Most benchmarks today test techniques — not the questions real businesses ask. They don't reflect the messy, siloed data or the pressure to get it right the first time.
That's why we're excited to partner with UC Berkeley's EPIC Data Lab and Professor Aditya Parameswaran to change that.
Together, we're building the first benchmark for AI data agents focused on enterprise reliability — grounded in real-world datasets from finance, healthcare, retail, telecom, and more. This collaboration brings together Berkeley's deep research rigor and our frontline insights from deploying AI in production.
Why This Matters
Current benchmarks like GAIA, Spider, and FRAMES test AI capabilities in clean, controlled environments. But as Professor Parameswaran puts it, they suffer from the "1% problem" — they're built for tech giants and ignore the 99% of organizations grappling with real-world data complexity.
Meanwhile, 78% of organizations use AI, yet more than 80% haven't seen tangible business impact. The disconnect? We're measuring the wrong things.
What We're Building
This isn't another academic exercise. We're creating a benchmark that reflects:
- The complexity of federated, siloed enterprise data
- The reliability demands of mission-critical decisions
- The real questions businesses ask every day
- The pressure to get it right — not just most of the time, but every time
The Path Forward
The benchmark beta will be released later this year, with datasets drawn from our real-world deployments across industries. Organizations interested in contributing use cases or gaining early access can reach out to the research team.
It's not about who gets the highest score on a lab task. It's about whether AI can be trusted with mission-critical decisions.
Read the full announcement →