AI reliability needs a new kind of benchmark.
Most benchmarks today test techniques — not the questions real businesses ask. They don't reflect the messy, siloed data or the pressure to get it right the first time.
That's why we're excited to partner with UC Berkeley's EPIC Data Lab and Professor Aditya Parameswaran to change that.
Together, we're building the first benchmark for AI data agents focused on enterprise reliability — grounded in real-world datasets from finance, healthcare, retail, telecom, and more. This collaboration brings together Berkeley's deep research rigor and our frontline insights from deploying AI in production.
Why This Matters
Current benchmarks like GAIA, Spider, and FRAMES test AI capabilities in clean, controlled environments. But as Professor Parameswaran puts it, they suffer from the "1% problem" — they're built for tech giants and ignore the 99% of organizations grappling with real-world data complexity.
Meanwhile, 78% of organizations use AI, yet more than 80% haven't seen tangible business impact. The disconnect? We're measuring the wrong things.
What We're Building
This isn't another academic exercise. We're creating a benchmark that reflects:
- The complexity of federated, siloed enterprise data
- The reliability demands of mission-critical decisions
- The real questions businesses ask every day
- The pressure to get it right — not just most of the time, but every time
The Path Forward
The benchmark beta will be released later this year, with datasets drawn from our real-world deployments across industries. Organizations interested in contributing use cases or gaining early access can reach out to the research team.
It's not about who gets the highest score on a lab task. It's about whether AI can be trusted with mission-critical decisions.
Read the full announcement →