AWS re:Invent 2025 - Fixing AI’s Confidently Wrong Problem in the Enterprise
What's discussed in the video
All right. So I'm going to talk about this big challenge that we've seen over the last year that you've read about. I don't know how many of you saw the MIT report on the 95 percent of AI pilots that fail. Have people come across that report? It was all over the news a few months ago. And one of our takes on the problem is that the big issue we're not really solving for is this problem of AI being confidently wrong, right? So what do I mean by that? AI has been around for 2, 3 years now, right? We've all been using it. I don't think any new technology in history has gotten close to a billion users in such a short period of time. It's kind of insane. But think about that science-fiction reality of us as technologists going to our business people and saying: here's AI, go use AI. Ask questions on your data. Make decisions that will improve the business, right? Make better staffing decisions, make better pricing decisions, make decisions that improve the business in a meaningful way, in a way that you couldn't have done before. That's what AI can give you. It's finally accessible.

The big challenge there is that you're still not able to use AI connected to your data in a way that you can trust, right? Almost all of us will have tried some kind of AI-on-data project. You take an LLM, you connect it to a database, MCP it, whatever connectivity technology you use, and you can make it work. But when you're on the business side and you ask a question that is loaded, where the impact of that question is pretty serious, you know for a fact that you're not going to trust what it says. And the reason is that you know it would be hard even for a human to answer those questions, but at least the human would tell you that they don't know. When an AI tells you something, it's just extremely confident, and you have no idea whether to trust that confidence or not. If you think about this tendency AI has of being confident all the time, and its impact of being confidently wrong, that is what really hurts adoption as you start to use it. It goes from a fun project to: I can't actually trust this to do anything useful. Is this a better-models problem, or is this something we can do something about? Can we solve this problem ourselves? So I'm going to share our learnings, as we've built an AI product, on how we tackle this problem. Hopefully some of those learnings are useful to you as you think about your own projects, and give you some ideas on how you can tackle it.

How many of you listen to Dwarkesh Patel's podcast? All right, cool. For those of you who don't, you should listen to it. It's got a bunch of really fun stuff. I think the most recent episodes are with Andrej Karpathy and with Ilya Sutskever, the co-founder of OpenAI, talking about a bunch of topics around learning, continuous learning, and the utility of AI. And this is something that came up in Dwarkesh's blog post a while ago. I really, really like this quote: the reason a human being is useful in your business is not because they have five hundred points of IQ, not because they win the IMO gold medal every time you talk to them, right? It's not their raw intelligence. Raw intelligence was passed by LLMs a while ago.
It's not the raw intelligence that makes them useful. It's this ability they have of picking up context on the fly, of looking at a failure, thinking about it, and fixing that failure. It's this ability to incrementally improve that you can trust the human to do, even though the human is not as quote-unquote smart as DeepSeek V3, which just won the IMO gold; somehow the human is still more trustworthy, right? So this ability to continuously absorb tribal knowledge and improve is critical. This is obvious. We all know that we want AI to continuously improve, learn tribal knowledge, and so on. The key point I want to share is: you can only teach something when it says it doesn't know. And that's the critical thing missing in AI products. Confidently wrong hurts you. Even if AI got really good at learning continuously, the confidently-wrong problem hurts you, because if AI doesn't tell you when it's wrong, you can't teach it. And if you can't teach it, you can't use it. That's the core product loop that's missing with AI products, right?

Let me talk about what we do at PromptQL and show you a demonstration of how we approach this problem. Broadly, we're yet another chat-with-your-data product. It's fascinatingly amazing and new and differentiated from everybody else here. But the idea is this: when users interact with data, they typically talk to multiple people to get some kind of insight. When you want to solve a problem or make a decision as a business user, you talk to an analyst, you talk to a scientist, you talk to an engineer; they talk to each other, they talk to data, something happens. Some spaghetti-ness happens, and you get the result at the end of the day. Most often, when you ask a question, you'll probably get the result 2 weeks later, by which time the question is not very important anymore. And the promise of AI is that we'll have this cool AI thing sitting in the middle that users will use. The AI will do everything the humans do: it'll do data engineering, it'll do data analysis, it'll do data science, and it'll compose all of it magically and give you an answer. This might have sounded like rocket science a year ago, but today you can see very clearly that it's already possible. You can take Claude Code and make it write a pipeline for you, then make it write SQL for you, then make it write a data science program for you. It's already within reach.

The thing that's missing, though, is that you can't actually expose this to end users in a way that they can trust. If you look at all of these chat-with-data products, ultimately they all turn out to be text-to-SQL products where you need a technical person to look at the SQL to verify it. So you can never give it to a non-technical person, because a non-technical person can't read SQL. How do you trust it? You can't trust this AI system, right? And so our operating model is to get AI to admit that it doesn't know, so that experts, the red boxes on this slide, can teach it, so that the AI gets better. But most importantly, because the AI can tell you that it doesn't know, even on day 0 when it's only 50 percent accurate, it's fine. Users can trust it. Because if I can ask you a question and you tell me that you don't know, that's great.
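To make that operating model concrete, here is a minimal sketch of the admit-don't-know loop, assuming a toy wiki and a keyword-based concept extractor. None of these names are PromptQL's actual API; they only illustrate the shape of the loop described above.

```python
# Minimal sketch of the admit-don't-know loop. The wiki, the concept
# extractor, and the planner are stand-ins invented for illustration.

wiki = {
    "gross margin": "(revenue - cogs) / revenue, from the sales fact table.",
}

def extract_concepts(question: str) -> list[str]:
    # Stand-in for an LLM pass that pulls domain concepts out of a question.
    candidates = ["gross margin", "fiscal year", "region"]
    return [c for c in candidates if c in question.lower()]

def plan(question: str) -> dict:
    concepts = extract_concepts(question)
    known = {c: wiki[c] for c in concepts if c in wiki}    # the "blue links"
    assumptions = [c for c in concepts if c not in wiki]   # the "red links"
    return {"question": question, "known": known, "assumptions": assumptions}

def teach(concept: str, explanation: str) -> None:
    # An expert's clarification becomes a wiki entry, so the next plan
    # treats the concept as known instead of assumed.
    wiki[concept] = explanation

p = plan("What was gross margin last fiscal year?")
print(p["assumptions"])  # ['fiscal year'] -- flagged, not silently guessed
teach("fiscal year", "Feb 1 through Jan 31, globally (per Finance).")
print(plan("What was gross margin last fiscal year?")["assumptions"])  # []
```

The point of the sketch is the split: everything the system cannot ground in the wiki is surfaced as an explicit assumption rather than answered with unearned confidence.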
If I ask you 2 questions and you tell me half the time that you don't know, that's totally fine. I can still use your product, right? So I'm going to demo how we do this in the product. This is the core loop that I'd like you folks to watch out for. The first part is telling you what it knows and what it doesn't know; that's where this cycle begins in the product. The second piece of the cycle that's critical is that once it tells you that it doesn't know, you can ask it to fix its stuff. And then, once you do, it learns. Those are the 3 pieces I'd like you to anchor on.

All right, so I'll do a recorded demo and then a live demo, because live demos are fun and dangerous and who knows if they'll work. Let's say, for example, you ask a question: what was GM broken down by region yesterday, right? What our system does first is make a plan to solve the problem. But notice that it highlights, in blue links, what it knows and where that knowledge comes from. If you, as a user, click on that blue link, that is actually a wiki that powers the AI internally. This wiki contains concepts like gross margin that are pretty flexible. Let me roll that up for you. The gross margin entry, for example, says: this is what gross margin is, this is how you calculate it, these are our targets for this quarter, and this is the expected range of values. That's the wiki entry that backs this. And it's just like a wiki: there are multiple concepts connected to each other, blah, blah, blah, fun, fun, fun. That's the core piece of how it works. And this part is fine; we all have this today. You have some kind of knowledge base, you cite against the knowledge base, and you make it work.

What happens, and what is interesting, is when it doesn't work. So you say: what was GM last FY? And here, the first thing the AI does is tell you: well, FY is a new thing; it's a concept in this domain that is not known. That's a red link, and it will make an assumption about it. So it makes an assumption, tells the user what the assumption is, and then the user goes on, sees what the plan is, looks at the answer, right? And now, as a user, when I look at this answer and I look at the value, I'm like: this is fine, but I also don't really know what the fiscal year is. This is the most common thing you'll notice. Your business users will use the product and they'll be like: thank you for telling me that you don't know something, but, true facts, even I don't know. I know we have a fiscal year; I don't actually know what the exact period is, because we're a global MNC. Is it different in different regions? I have no idea. And so this is critical: you now bring another person into this conversation, an expert or a group of experts, and say, can you clarify what fiscal year means? This person joins our conversation. This is an actual human being, an expert, who joins the conversation and says: well, PromptQL, our fiscal year is whatever, whatever, whatever. And PromptQL is like: cool, that fills my gap; that fills the assumption I had. And it redoes the plan and gives you an answer. This was really important for 2 reasons. One, the user is actually happy. The user is happy that they got an answer that was right.
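For reference, here is a hypothetical shape for a wiki entry like the gross margin one just described. The field names and values are invented for illustration; the talk only says an entry carries the definition, the calculation, this quarter's targets, and an expected range of values.

```python
# Hypothetical shape of a concept wiki entry; all field names and numbers
# are invented for illustration, not PromptQL's actual schema.

from dataclasses import dataclass, field

@dataclass
class ConceptEntry:
    name: str
    definition: str                       # what the concept means here
    calculation: str                      # how the AI should compute it
    targets: dict[str, float]             # e.g. this quarter's goal
    expected_range: tuple[float, float]   # sanity bounds on answers
    related: list[str] = field(default_factory=list)  # links to other entries

gross_margin = ConceptEntry(
    name="gross margin",
    definition="Revenue minus cost of goods sold, as a share of revenue.",
    calculation="(SUM(revenue) - SUM(cogs)) / SUM(revenue) on the sales fact table.",
    targets={"Q4": 0.42},
    expected_range=(0.30, 0.55),
    related=["revenue", "cogs", "fiscal year"],
)

def in_expected_range(entry: ConceptEntry, value: float) -> bool:
    # A computed answer outside the expected range is a cheap signal to
    # lower confidence and flag the result rather than state it flatly.
    lo, hi = entry.expected_range
    return lo <= value <= hi

print(in_expected_range(gross_margin, 0.47))  # True
print(in_expected_range(gross_margin, 0.90))  # False -> flag, don't assert
```

An expected range is one plausible way an entry like this can do double duty: it documents the concept for humans and gives the AI a check on its own answers.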
The AI did something, it was not accurate, I brought in an expert, and I got the right answer. This is what I would have done even without AI. It's not surprising; it's natural to me. But what is really nice is that this entire conversation contains so much tribal knowledge that just got captured. This conversation is the seed of learning. So now there's another agent loop that goes and helps you create a wiki entry and update the wiki based on what just happened in this conversation. And because it's a wiki, literally a wiki like Wikipedia, AI and humans can collaboratively maintain it, and it can keep getting better over time so that you can use it, right? Then in the future, once that wiki entry goes live, you ask: what was GM last fiscal quarter, right? That's the key capability: users can reference these advanced concepts in the wiki however they phrase them, right? And this is how you surface it.

I'll take another example to show you how we think about what goes in the wiki. This is another finance data set, where I can say: true north revenue for Q3, and also forecast Q4. You can imagine what it's doing in the background: it goes to True North Revenue, the wiki article, and reads it. And the wiki article is literally just a wiki. It says: go to this table. This thing is in Google Drive; it's a PDF; you'll have to extract from the PDF. Subtract adjustment. What is an adjustment? Go read up about an adjustment, learn what an adjustment is, read up about all of these things. That's what it's doing for the first step of the plan. In the next step, it will try to read what forecast is. And forecast is a wiki entry that says: do this data science thing; these are the outliers you need to think about; whatever. These entries are progressively maintained. You can imagine that they start sparse and get more and more structured; they get better and better as the knowledge accumulates. Initially it's just: basic revenue forecasts use this model. Then later somebody used it and said: nope, Black Friday, bad outlier, please factor that in. Maybe a data science expert comes in and teaches that to it. But that's what powers the ability to create an accurate plan.

There's another really important piece here, because the uncertainty and lack of accuracy is not just in semantics and business concepts; it's also in the details of fairly specific technical problems, right? So what we do is also surface the fact that, say, the forecasting methodology assumes there are no refunds. Now, whether or not that assumption is important to you is something you can discuss, think about, and bring in an expert on. And they can say: yeah, you know what, this assumption was good, or this assumption was bad; let's not do this again. And so you teach it, fix it, and again the learning loop improves. The way we think about our product is that for non-technical users, we surface the wiki, and for technical experts, we surface what we call the confidence analysis, or the confidence score. Users are expected to think about the wiki, and experts are expected to look at the confidence score. And that creates the loop.
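Here is a rough sketch of that capture step: an expert's clarification inside the chat is lifted into a proposed wiki edit for human review. The message format and the drafting logic are assumptions, standing in for the agent loop the talk describes.

```python
# Rough sketch of conversation -> proposed wiki edit. The message format
# and drafting function are invented stand-ins for the agent loop above.

conversation = [
    {"role": "user",   "text": "What was GM last FY?"},
    {"role": "ai",     "text": "Assumption: FY = calendar year (concept unknown)."},
    {"role": "expert", "text": "Our fiscal year runs Feb 1 through Jan 31, globally."},
]

def draft_wiki_update(conversation: list[dict]) -> dict | None:
    expert_turns = [m["text"] for m in conversation if m["role"] == "expert"]
    if not expert_turns:
        return None  # nothing was taught; no edit to propose
    return {
        "concept": "fiscal year",      # would be inferred from the thread
        "proposed_text": " ".join(expert_turns),
        "status": "pending_review",    # humans approve before it goes live
        "source": "conversation",      # provenance keeps the wiki auditable
    }

print(draft_wiki_update(conversation))
```

Keeping the edit in a pending state until a human approves it is what makes the wiki collaboratively maintained rather than silently self-modifying.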
So that's a quick look at what this looks like. What we've observed, and this is our core pitch, is that whatever AI you're using today is doomed to fail. And the reason it will fail is, A, you've probably not managed to get a hundred percent of the tribal knowledge into your AI anyway, because how could you? It's not possible if your business is of any respectable size. It's not even possible to try to embed all that tribal knowledge. So you're doomed to fail, because not only do you not have the knowledge, you also don't have a system where people can trust what your AI is doing. What we want to put you on the path for is to say: it will always be inaccurate; can we tell you when it's inaccurate, and then can we learn? And so put you on this path of actual adoption.

That's what's allowed us to scale massively, and it's allowed us to focus on solving interesting technical problems. For example, the fact that our AI system is not restricted by the size of the schema. If you look at things like Databricks Genie or Snowflake Cortex, they support something like 20, 25 tables for AI on data. Who has 25 tables, dude? What enterprise has 25 tables that you work on? That's not even real, right? And of course it's a problem, because you can't squeeze that much into context. So we've focused on solving problems at the level of: a hundred thousand tables, ten thousand metrics, how do you solve for that? (There's a toy sketch of the retrieval idea below.) But you can only start to solve for that once you fix the product loop: whatever the size of your schemas, whatever the size of the context, unless you have that loop of admitting what's wrong and improving, you can't even get to those interesting problems. That's what has allowed us to scale, since we started early this year, across a bunch of really interesting use cases where there's a lot of high-velocity decision-making happening. Come stop by our booth at 1733 if you want to learn more. We've got a bunch of tech folks; we've got a bunch of sales folks. If you want to buy this and use this, I will never say no. But also, if you just want to stop by, exchange notes, and learn, please pop by and we can chat more. Thank you for your time, folks.
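On the schema-scale point above, one plausible reading is that the system retrieves only the wiki entries relevant to a question instead of putting a hundred thousand table definitions into context. Here is a toy sketch of that idea; a real system would use embeddings, and the word-overlap scoring is a deliberately crude stand-in.

```python
# Toy sketch of retrieval at schema scale: pull only the wiki entries
# relevant to a question rather than stuffing the whole schema in context.
# Word overlap is a crude stand-in for embedding-based retrieval.

def score(question: str, name: str, body: str) -> int:
    q = set(question.lower().split())
    return len(q & set((name + " " + body).lower().split()))

def retrieve(question: str, entries: dict[str, str], k: int = 3) -> list[str]:
    ranked = sorted(entries, key=lambda n: score(question, n, entries[n]), reverse=True)
    return ranked[:k]

entries = {
    "true north revenue": "revenue table minus adjustments; source PDF in Google Drive",
    "forecast": "time-series model; treat Black Friday as an outlier",
    "gross margin": "(revenue - cogs) / revenue",
    # ...imagine a hundred thousand of these
}

print(retrieve("true north revenue for Q3, forecast Q4", entries, k=2))
# ['true north revenue', 'forecast']
```

The loop matters here because retrieval will sometimes miss: when it does, a system that flags the gap gets taught a better wiki entry, while a system that guesses confidently just produces a wrong answer at larger scale.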



