The AI Transformation Blueprint with Trustworthy AI on Real Data

Would you trust an AI system tied to your enterprise data with mission-critical decisions? In the opening keynote, Tanmai Gopal and Rajoshi Ghosh explore what it takes for an AI copilot or agent to earn that trust.
Drawing on breakthroughs from the Hasura AI Lab, they share best practices for quantifying AI reliability in real-world scenarios — enabling dependable performance without waiting for perfect data readiness. They also showcase transformative AI use cases, from unlocking new revenue streams by monetizing data to empowering employees by democratizing data insights.

What's discussed in the video

How many of you have had these big dreams about what you can disrupt about the status quo at work with AI? Do you trust your AI to take actions for you, to make decisions for you? Are you able to take the next step based on what it's told you to do in mission-critical projects at work with enterprise data? If AI is reliable in little pieces on your enterprise, what happens when it's reliable on a big piece?

The 4 takeaways: reliability is pretty important, in case it escaped you. The second is that, for us, decoupling was a core insight and capability that, without degrading performance, and in fact while increasing performance, allowed us to control predictability and explainability to a level that was sufficient for mission-critical tasks, which is awesome. I believe that as these kinds of techniques start to show up more often in the industry, and we've already started seeing a few of them, they will unlock the next generation of automation and decision making. Analytics will transform into decision making, and software engineering and IT will transform into automation, which is what it already does, just levelled up massively over the next few years. And the last is to help people find their dancing partners. When we speak to tech people, we ask, who's the business person? When we speak to business people, we ask, who's the tech person? Because if you don't have that pairing, nothing's going to happen. And we're going to hear a little bit more today.

I'd like to introduce our keynote speakers today, Tanmai and Rajoshi. First up is Rajoshi, who'll be running through reliability and why it's not just a nice-to-have, but a fundamental component of building enterprise solutions with AI.

Welcome, everybody, to AI Disrupt, and we'll get right in. This is going to be a pretty packed 45 minutes, so I'm going to start off with what you can expect. These are the 4 takeaways. Reliability is the most important thing that you're going to hear about in 2025. We're going to go into what we have learned over the last 18 months building a reliable general data agent on enterprise data, so you're going to hear from our team about what we've been doing. Then we'll dive into the high-impact things you can do with reliable AI. And finally, how should we unlock these ambitious AI projects? We're at AI Disrupt, so let's get right into it.

Let's start off with a show of hands. How many of you have had these big dreams about what you can disrupt about the status quo at work with AI? Amazing, and you're at AI Disrupt, so I guess that goes to say you're really thinking about it. Now, raise your hands again if you have something deployed today that is close to what you think you can do with AI. Hey, nice. Some hands, all right. OK, good, good. But not too many, right? And you're not alone. This is pretty common when we speak to people. We've spoken to a lot of different business and technical leaders over the last 8 months asking them similar questions, and the reason they haven't been able to deploy things and get to what they can do with AI always ends up boiling down to the very same thing. You can guess it: reliability. So we're going to talk a lot about reliability over the next few hours. But in very simple terms, what do I mean by reliability? I guess trust. Do you trust your AI to take actions for you, to make decisions for you?
Are you able to take the next step based on what it's told you to do in mission-critical projects at work with enterprise data? When we've spoken to people, we've seen two broad camps of how these projects play out. Either they've built these projects out, but there's a gatekeeper, a VP of AI or chief AI officer, who isn't letting the projects ship because they're unreliable and nobody knows what will happen when they get into everybody's hands. Or people have shipped them, but adoption rates are really, really bad. You put it out, people start using it, they get a few wrong answers, and it's a terrible experience. It's like back in the day when you would click a button, the website would keep loading and never come up, and that was the end of that website. That's the experience today, so these things are not really getting rolled out in production.

There's a podcast you should listen to, a little old now, from 2023, where Ilya Sutskever, one of the co-founders of OpenAI and now the founder of Safe Superintelligence, was asked: if AI does not achieve the economic impact we think it can have by 2030, what would the reason be? He caveats it by saying he doesn't think that will be the case, but if it were, there's only one answer, and that's reliability. If you cannot trust your AI, that is the only reason the economic impact of AI would not be realized.

What I've been really excited about this year is that the conversations are shifting. The first few years were all about bad PR articles, companies that were frankly the early adopters of AI having really bad, embarrassing experiences that showed up in newspapers, with reliability dragging their brand value down. Now reliability is becoming something we're addressing as a community: from Hacker News having a pretty long, interesting thread about reliability, to the AI Engineer World Summit, a conference in this very city in a couple of months, having an entire track dedicated to reliability. If you're working on this stuff, you should submit a CFP. McKinsey also does a report every year, and though inaccuracy has always been the number one problem it highlights, what was interesting in the 2025 report is that inaccuracy is also the problem people are most actively working on. All of this is to say that reliability is the biggest topic, and the last thing to crack before we can really look at the impact of AI. With that, I invite Tanmai on stage to talk about the state of AI reliability.

Thank you, Rajoshi. All right, cool. So reliability is important. No shit. Let's take a quick look at what's happening, what the state of reliability is. What are the AI folks doing to make AI more reliable? Are they just throwing more GPUs at it? What's going on? How are these systems getting better? One of the latest pieces of work that gives a nice summary of what's happening is a really nice collection of papers, summarized by the Anthropic folks, on tracing the thoughts of a large language model.
This is a sneak peek into what's happening in LLM land when it comes to reliability, so I'll take a few snippets from that work. The first is that the model confidently hallucinates. This is something we've all experienced. It can do fake reasoning: it doesn't do what it promised it would do, but it can sound very convincing that it has indeed done it. And occasionally, when it doesn't have an answer and there's a missing value that ought to have been calculated, it will just make stuff up. The technical term they're using for that is bullshitting. The good news is that they've been coming up with interpretability techniques to understand whether the system is really hallucinating, whether it's making something up, filling in a gap that was never supposed to be filled. So there's good progress there.

As users and consumers of what the AI labs are putting out, our responsibility as people building and deploying AI solutions is to make that reliable on our enterprise data and our systems. These are the typical questions we're grappling with. How do I make AI reliable on data it's not been trained on? How do we deal with inherent non-determinism? There's a bit of mental gymnastics required here: we're used to clicking a button and something happening, and we're moving to a form factor where the thing we've taken for granted about machines since industrial-revolution times, that what I intend is what happens, no longer holds. What does it mean to build a product like that, or deploy a solution like that? How do we deal with the real-life messiness that exists not just inside our organization, but is now manifested in our data and our systems? And then, I think most importantly, how do we nudge user behaviour, our stakeholders internal and external, to actually wield AI well? These are some of the problems we're all thinking about.

For me, as we've been building in this space, talking to people, and helping them build systems, there are two bare-minimum, necessary things that must be addressed when it comes to making AI reliable. The first is predictability, which is essentially asking: what is it doing, and will it do the same thing every time? And specifically, we're talking about AI on enterprise data deployed for mission-critical scenarios. We're not talking about the purely generative use cases of help me write an email, generate a Studio Ghibli image, summarize something. It's more: I'm going to make a mission-critical decision or automate something, and it's going to have serious impact. So the first is predictability: I'd like you to do the same thing every time, and I'd like to be able to predict whether it's going to fail or not. I just don't want unexpected failures. That's one really important piece. The second is explainability, but not in the way we've thought about it from an audit and compliance point of view, where AI should be able to explain itself so that when something happens we can handle it. I mean explainability from the point of view of control.
If you can't explain, if you can't understand what the AI system is doing, then you can't change what the AI system is doing. You can't exert control. And if you can't exert control over your AI system, it's really hard to think about deploying that intelligence for something mission-critical. So those are the two big must-haves. Whether you build, you buy, or you co-build, those are the two most important things that need to exist, that you need a clear handle on.

So when we integrate AI with our data and systems, let's take a quick look at existing techniques and see how they stack up on predictability and explainability. I'm sure you're all familiar with RAG and the bunch of techniques that have evolved around it over the last few years. Broadly, it boils down to three really simple things: we create a store of something that is searchable; we search for relevant data as questions or tasks come in and add it to our context on the fly; and then we generate the desired result from that now-augmented context. Each of these steps can become increasingly sophisticated and increasingly complicated, but at its core the idea is pretty simple.

Now, what this results in, and I'm going to take a few examples that make the point in a slightly exaggerated way, so these are not necessarily the causes, but this is a typical experience you often have with RAG deployed on enterprise data. This is an email AI solution. When was my last trip with Uber? How much did I spend? It pops up the last email and gives me the right answer. That's great, I'm happy, things are working. I move to the next question, which is: when was this trip? And the answer suddenly turns into June 29th of 2024. I don't know what to do. I'm confused. So I ask, how did you get this answer? And it says: each product has its own strengths; deciding which one is best for you depends on what you're trying to do; with Gemini, you have access to Google's most capable AI models. Look, I'm not saying that's the actual cause, but this is similar to what you can expect, where something works and then something doesn't, and it's very hard to form an intuition for why it's not working if you're not the person who built it, if you're not the person who understands how these things work. So I would rate predictability as low. And even though it's always running through the same pipeline to process and answer, explainability is low as well, because I don't know what to change to make it improve.

The other point you'll have heard with RAG, and I'm taking this from the words of a customer we talked to, is that guaranteeing breadth and depth of tasks seems like an impossible thing to do. Whatever I do, whether I prepare a knowledge graph, which is brittle and has to be refreshed every time my data or use case changes, or whether I build RAG pipelines, it seems nearly impossible to reach the breadth of data we need to capture or the depth at which we need to retrieve the right pieces of data.
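To make those three steps concrete, here is a deliberately minimal sketch of the retrieve-augment-generate loop described above. It is illustrative only: embed and llm stand in for whatever embedding model and LLM sit underneath, and are not references to any specific product mentioned in this talk.

```python
# Minimal retrieve-augment-generate loop (illustrative sketch, not any specific product).
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b) or 1.0)

def rag_answer(question: str,
               documents: list[str],
               embed: Callable[[str], list[float]],
               llm: Callable[[str], str],
               k: int = 3) -> str:
    # 1. Store: embed the documents so they are searchable.
    index = [(doc, embed(doc)) for doc in documents]
    # 2. Search: rank documents by similarity to the incoming question.
    q_vec = embed(question)
    top = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)[:k]
    # 3. Augment + generate: stuff the retrieved chunks into the prompt.
    context = "\n---\n".join(doc for doc, _ in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```

Production RAG stacks elaborate each of these functions considerably, but the final answer is still free-form generation over whatever happened to be retrieved, which is where the experience described above comes from.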
Cool. Let's talk about technique number 2: tool composition. Especially if you've been keeping up with the MCP hype, MCP is a tool. Ultimately, when you have a lot of MCP systems, or a lot of different tools that can come together and do something, you can compose those tools to get work done for you. The core idea, again, can be made more sophisticated, but at its core it boils down to: the LLM calls a tool that does something, takes the output, decides whether it needs to call another tool, keeps doing that while collecting a bunch of data, and then generates the result you want.

So here's a popular tool for AI on sales data that we ran on our systems. On the first attempt, it's a complex question: when I ask what my average sales cycle length is, it tells me what to do, asks me for some more data, and then just stops. There's no more interaction I can have. On the second attempt, exact same question, it says it actually can't do this work. On the third attempt, I get an answer: 71 days. I ask how it was calculated and get an answer. Then I say, well, I don't actually have 4 stages in my pipeline, I have 7 stages you need to use to calculate the result, and it freezes again. On the fourth attempt, again the exact same question, it doesn't have an answer. In the same prompt I say, try again, you can do this, let's go, and I get 2.21 days. So: the same question 4 different times, just reloading the page and running it, and wildly different experiences. Again, I'm exaggerating the causation here, but this is a typical experience you can expect when you have complex tools. When you have a large number of tools, it becomes really hard to predict that the system will run the same way every single time, because different tools can be called dynamically as it runs. And when that happens, explainability drops. Since you don't understand what it has done or why, it becomes hard to change what it's doing or exert control over the AI system. The number one complaint you'll hear with tool composition is that it works when it's simple, but as soon as I have a complex tool, or a large number of tools, I'm not comfortable deploying this solution.
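For reference, the loop just described reduces to something like the sketch below. It's a toy, assuming a made-up decision format from the model; real agent frameworks and MCP clients dress this up considerably, but the essential property is the same: the control flow itself is decided turn by turn at run time.

```python
# Minimal tool-composition loop (illustrative sketch).
import json
from typing import Callable

def agent_loop(task: str,
               tools: dict[str, Callable[[dict], str]],
               llm: Callable[[list[dict]], dict],
               max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # The model returns either a tool call or a final answer.
        # Shape assumed for this sketch: {"tool": ..., "args": ...} or {"answer": ...}.
        decision = llm(messages)
        if "answer" in decision:
            return decision["answer"]
        tool_name, args = decision["tool"], decision.get("args", {})
        result = tools[tool_name](args)  # execute the chosen tool
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "tool", "content": result})
    return "Gave up after max_steps."
```

Every branch here is chosen by generation, which is exactly why predictability and explainability degrade as the number and complexity of tools grow.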
Finally, let's talk about text-to-SQL. I'm sure all of you have a text-to-SQL project running internally inside your organizations at this point, so this should be relatable. Text-to-SQL is a very simple idea: take the user's intent, convert it into a database query, run that database query, boom, answer, let's go. A typical example from a famous copilot: how many albums from this metal genre on a music database have a positive and happy-sounding title, and give me some examples. You cannot ask this question, because it will only do the SQL bits and then tell you to go do the AI bits yourself. It does generate the SQL for me, and that's going to keep getting better, which is nice, but it can't do the AI bits unless I wire it up to another step after that. So if I look at predictability and explainability for a text-to-SQL-like system: very predictable, very explainable, pretty high on both, because it's going to generate SQL, and I can look at the SQL and say, yes, you are right, or you are wrong. The caveat, though, is that it only works by narrowing the experience a fair bit. It doesn't do arbitrary generative things. It doesn't integrate with diverse data and systems the way tool calling does: suppose I want it to do something else, fetch data from somewhere else, or run a custom piece of business logic; that doesn't fit into the model. And finally, this explainability only works for analysts, because if you don't know the system, you can't really do much with the SQL that's generated. So the number one complaint you'll have with text-to-SQL systems is that only your analysts can use them, and they're limited to what's in the database. Your text-to-SQL system has not actually made it out into the rest of your organization; it's largely still inside the analyst org. It's still more of a copilot than a "we have democratized data for everybody and we're all making the best decisions and it's amazing" kind of thing. That's the top complaint we hear from people who've been building text-to-SQL. It's pretty much the first thing you built when you had AI and you had data and you thought, what should I do? But it's hard to roll it out beyond that.

You can also mix and match all of these techniques, which is what people actually do in real life. Nobody does just one of these; everybody mixes and matches them. Accuracy does improve, but at the cost of heavily increased complexity. You typically gravitate towards the same pitfalls you have with tool composition, broadly that bucket of pitfalls, plus a heavy increase in complexity.

So that's the state. What can you do to make AI work reliably in your enterprise? We're still back at the beginning: if we're tackling a breadth of use cases and we actually want the power of AI to materialize, what should we do? I'm going to share some learnings on what we tried and what worked well for us. The first bit: the thing I like most, more than my morning coffee, is asking who I can hold accountable and fire today. Using that principle, the first thing we want to do is create a thing that is responsible for reliability. Just have one thing that is responsible for whatever I mean by reliability. So let me build AI copilots and automation agents, but let's do it on top of a thing we'll call a data agent, because we can come up with as many new words as we want; it's a sweet spot right now between the last year and the next 2 to 3 years, whatever you want goes. Anyway: a data agent. This data agent is going to be the thing that's reliable.

Now, the third thing I like, after morning coffee and a neck to choke, is Zen sayings. And there's going to be a Zen saying, I think, 30 years from now, which is that the only thing we control with GenAI models is what they generate. If you think about your sphere of control, what do you really control with GenAI? You just control what it generates. You can't control the AI model itself; that's hard. But when you're using GenAI, what you can control, and what people are optimizing that control for, is what it generates. So when we started with those axioms, one of the core ideas we arrived at was: if our current approach is basically to take some input and put it into an AI system that generates a result, what if we flip this and decouple planning from execution?
What if the input generates a plan, not the result? And the plan is entirely programmatically executed. We just completely separate the two. The GenAI generates the plan; the plan is then run to generate the result. To do this, you need a way to represent the plan: what is the language that represents the plan, the language that LLMs speak, the language that humans speak? Let's call it a DSL, a domain-specific language. PromptQL is what we called it. PromptQL is the language that represents this ability to decouple planning and execution: we create and generate the plan, and then run the plan to get the output. We don't generate the output.

Let's take a look at what that looks like. Back to my email example. I say, hey, get me my work expenses, get me all of the taxi expenses like Uber, Lyft, et cetera, for March. I get 4 results. I'm not picking on any particular product here; I've tried 4 email clients and it's the exact same experience across all of them. You get 4. I did something like 50 trips in March, so it's definitely not 4. I get those 4 and I ask, okay, where are these rides? What cities? Help me figure out which ones I need to submit my expense report for. And I get a bunch of random answers that aren't connected.

Now let's change that into a plan-based experience; let me zoom in, hit replay, and see how that works. Same thing: a PromptQL data agent, some UI on top, connected to our email system. We ask: help me find all of these travel expenses in March so that I can submit my expense report. It does a bunch of things, and tells me it's doing a bunch of things. It runs into an error and fixes itself, and that's fine, I'll let it run and see. It found some rental confirmation emails. Okay, that's cool, you found the rental confirmation emails. Then it gives me an answer, so let's take a look at what this looks like. It says these are my travel-related expenses in March, which is enterprise, enterprise, and I'm like, what? What happened to all the Uber trips? This is from a ski trip, so I can't expense that, at least not right now. So how are we going to fix this? How do I understand what happened?

What I can do now is look at each step of the plan, because each step of the plan runs deterministically. I can actually see what happened: what did you even fetch? Why are you screwing this up? And I can see, okay, it queried, it used the data source of Gmail messages, it used these filters. Now I can start to see what's happening: it used these very specific filters for some reason, and that's what's causing this. If I take a quick look at the different emails I have, I can even start to look at this data and see which emails are missing.
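Abstracting away from the email demo for a moment, here is a toy sketch of what decoupling planning from execution can look like in code. The plan format and operation names are invented for illustration; this is the shape of the idea, not PromptQL's actual DSL.

```python
# Illustrative sketch of decoupling planning from execution (not PromptQL's real DSL).
from typing import Any, Callable

# A toy "plan" is an ordered list of steps over a small, known vocabulary, e.g.
# [{"op": "query", "source": "gmail", "filter": "uber OR lyft"},
#  {"op": "classify", "prompt": "is this a work expense?"},
#  {"op": "aggregate", "field": "amount"}]
Plan = list[dict[str, Any]]

def make_plan(task: str, llm: Callable[[str], Plan]) -> Plan:
    # Generation is only used here: the output is a plan, not the answer.
    return llm(f"Write a step-by-step plan (using only the allowed ops) for: {task}")

def execute(plan: Plan, ops: dict[str, Callable[..., Any]]) -> tuple[Any, list[dict]]:
    # Execution is fully programmatic: each step runs deterministically,
    # and every intermediate input/output can be inspected or replayed.
    data, trace = None, []
    for step in plan:
        fn = ops[step["op"]]  # unknown ops fail loudly instead of being improvised
        data = fn(data, **{k: v for k, v in step.items() if k != "op"})
        trace.append({"step": step, "output_preview": str(data)[:200]})
    return data, trace

# Because the plan is just data, it can be saved and re-run later without invoking
# the LLM again, which is what later turns a working plan into a plain API call.
```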
Continuing the demo: if I then go ahead and say, hey, were there any taxi things like Uber or Lyft, it runs that again, and this time I can see what plan was run and whether it was able to capture that or not. Now I can see, okay, cool, it's doing the right thing. And now that I understand what it's doing, I can also start to change it. There's a big difference in that experience: if I get a guarantee that each step it comes up with is going to be executed, and I can take that execution for granted, then I, as a stakeholder, can focus on what it's doing, improve what it's doing, and change that behaviour. Of course, I eventually get the answer, but I can also start to ask meta questions that I can expect answers to, like: why did this happen? Why did the original taxi results not come up? And I can expect a real answer, an answer that's useful to me, so that the next time I use this AI system, I can be more specific and say, I need to submit a travel expense report for these kinds of things, and that's what you should search for. This is a big change in how I would use these systems to get reliable answers, because it increases the level of predictability and explainability I have in the system: if it fails, it fails in explainable, expected, changeable ways. So on a predictability-explainability rating, I would rank this approach pretty high on both fronts, especially because end users can understand what it's doing without worrying about the technical details of how it was implemented. I don't actually care whether it's text-to-SQL or vector search or API calls or whatever underneath. I can see the data sources, I can see the filters, I can look at the input and output data at each step. I can build a feeling for what the system is doing, which lets me trust it and give it other tasks.

With PromptQL, this language that describes a plan can describe all kinds of things: data tasks like search retrieval and filtered retrieval; computational tasks like transformations and joins; generative tasks like extract and classify, which is what was happening in our email example to extract the amount from each email and convert it into a table; and it can compose higher-order operations. That's the technical detail of how we did it. But one of the big questions is: if we decouple planning from execution, does that deteriorate state-of-the-art performance relative to existing techniques? When we benchmarked this out of the box, doing nothing, just connecting the data agent to the data and saying have at it, with FRAMES, which is a pretty hard benchmark, and with good benchmarks on tool calling and on SQL, we're in almost all cases better than what state of the art does anyway. With GAIA, we did text only; multimodal is something we're still working on. We ran this with, I think, 3.5 Sonnet and 3.7 Sonnet; it doesn't really matter which underlying LLM the data agent uses, it stays close to state of the art, which is really nice to see. But it's not enough.
Because that level of accuracy, if you look at these numbers, is not enough for you to actually deploy in your organization. The real killer capability of this decoupling approach is that it allows us to exert a very precise amount of control on the AI system. It allows you, on your data, for your stakeholders, to utterly smash state-of-the-art performance on what you have. And we're talking about accuracy levels far in excess of 95 percent, easily. That's what you're able to do for your data, your systems, and your stakeholders.

Let's take a quick look at what that means. Here's a question from the FRAMES dataset, which is a hard dataset for RAG that state-of-the-art models don't perform well on. The dataset is Wikipedia articles only, so you're supposed to look at Wikipedia articles only. The question is: take the S&P 500 companies sorted alphabetically by ticker symbol, so you have A (Agilent), Apple, Adobe, and a bunch of companies; take the 10th from the top and the 10th from the bottom; what's the delta in the number of employees they have? This is an arbitrary question. Nobody asks this question; it's not something you care about, hopefully. But it represents the nature of the breadth and depth of tasks we were talking about on unstructured data.

So when I run this, what happens? I have a shared thread here from GPT-4o. Oh, sorry, this one is o1. I didn't even ask it to do RAG. I literally took the article, pasted the entire list of the S&P 500 into o1 pro, and said: Sam, take my money, take $200 and tell me, not even the employee count difference, just tell me the 2 companies. Just get me the 2 tickers, the 10th from the top and the 10th from the bottom. It reasoned for 12 minutes, contributed in a minor way to global warming, and gave me the wrong answer. It really tried; the reasoning is pretty intense, it's going through every single thing. And you know what? It does work if I ask for the first from the top and the last. It works if I ask for the second from the top and the second from the last. And it fails unexpectedly when I cross some random threshold that I don't know. So now that it's wrong, again I don't know what to do. I don't even know what system to build. What's the point of building RAG or tool calling or whatever, if one of the tools is going to return a bunch of data, I'm going to need some processing on top of it, and I don't know what to do? I'm feeling that lack of control again. The next time, I pointed it out: look, in your list of 10 you missed something; and then it fixed it. But that's this one particular case.

Try the same thing on 4o. With 4o, because 4o can run code, I said, let me zoom this up a little bit, same thing: use code to solve the problem where required, don't sample data in code, write it out exactly. This time it got the first ten right and the second one wrong. It did run code, but while running code it wasn't able to carry the context it previously had into the code: it messed up putting the tickers into the code to do the sort and the analysis. And this again puts me in a position where I don't know what to do. It has failed, I have suggested the fix, this ought to have been the fix, and we don't know how to fix it.
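For contrast, once the ticker list has actually been extracted from the article, the selection step that keeps tripping these models up is a couple of lines of ordinary code. A hypothetical sketch, using a small stand-in list rather than the full S&P 500:

```python
# The ticker-selection part of the question is trivially reliable as plain code
# once the list has been parsed out of the article.
def tenth_from_each_end(tickers: list[str]) -> tuple[str, str]:
    ordered = sorted(set(tickers))      # alphabetical by ticker symbol
    return ordered[9], ordered[-10]     # 10th from the top, 10th from the bottom

# Stand-in list for illustration; the real input would be the ~500 tickers
# parsed from the Wikipedia "List of S&P 500 companies" article.
sample = ["AAPL", "A", "ADBE", "ABT", "ABBV", "ACN", "AMD", "AES", "AFL",
          "APD", "ABNB", "ZTS", "ZBH", "ZBRA", "YUM", "XYL", "XOM", "WYNN",
          "WY", "WRB", "WMT", "WMB"]
print(tenth_from_each_end(sample))
```

Which is exactly the kind of work the next part of the talk argues should be pushed out of free-form generation and into the executed plan.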
Now, this is what the PromptQL experience looks like. Same question, but we augment it with a particular strategy that's good for this dataset. This dataset is unstructured data; there's no knowledge graph, there's nothing. The augmentation is: if you're extracting facts from articles and it's a small set of facts, use AI; if the facts or data you expect are literally present in the article, try to do it programmatically instead, so that you can get all of the data out. This hint is a scalable hint. It's something I would make part of our semantic layer, or part of a system instruction, and say: this is a property of my dataset. I'm not looking at a SQL database, I'm not taking an agent action here, I'm just looking at a bunch of unstructured documents, and here's a hint about how to think about this data. And when it goes through with that, it works extremely accurately. It runs the plan: it finds the List of S&P 500 companies article, reads and samples it, creates a parser, loads the entire article directly into the program, does the parsing, sorts it alphabetically, and then does the follow-up lookups to extract the employee information. That extraction is powered by AI, because you want to pull just the employee count out of a giant Wikipedia article. Then it does the subtraction and gives me a number. And then I can ask whatever I want, give me the 10th from the top, the 10th from the bottom, and I can expect exact, accurate answers. That's a nice property, because now, whatever is a property of my system, I can start to embed inside it.

The observant among you will have a question: hey Tanmai, all this is cute, but all of this knowledge and nudging we want to do on our data is inside our heads. How are we going to capture the tribal knowledge of ten thousand people, or whatever the scale of our deployment is, and put it into the system? That's the problem. If we had an infinite amount of time to prepare the semantic layer and do all of this work, we would have solved the problem already. We don't even have that knowledge today; we'll discover it when people start using the system. When they use it, and they have particular patterns of asking questions, doing analysis, or running certain actions, that's when we'll know what a dominant pattern is. So how do we fix that? With AI, is the answer. And this is the second piece of the DSL. The first piece of the DSL decouples planning from execution. But to drive planning accuracy, if the semantic layer can become something that improves by itself as the system is used, that completes a part of this puzzle: I no longer have to stress about whether it's perfect, whether it's accurate, whether it captures all the knowledge everybody has. If I can stop worrying about that and the system can improve itself, that would be a great world to live in. Our data teams can all take that giant offsite to Hawaii, and the system can run itself. So let's take a quick look at what that looks like.
Here's an example of a question on a deliberately bad dataset: which employees are working in departments with more than a $10,000 budget? The dataset is really bad. The tables are called things like mock, flu, and zorp; there is no semantic layer, there is no information, and the entities are all weirdly named. But the experience I go through with PromptQL is that I get very specific: no, you know what, sample the data, look at the entities you have, tell me what looks like the right entity to query. It figures out that zorp and plume are the right tables to query, queries those tables, and does the work. So this is me telling it what to do, because I can: I got to the answer I wanted deliberately, by getting into it, by being an analyst, by being the person who understands the data layer. The next thing we do is ask the system to improve from that. And there's more messiness: the budget is in cents when it should be in dollars. It's a tiny microcosm of messiness; whatever you have is weird in its own different ways.

So what I can now do, on the agent side, is say: hey, based on these recent conversations, improve the semantic layer. I'm doing it manually here, but this can also be done autonomously. It becomes an experience of saying: let's go through all of these conversations where the people who had the tribal knowledge did the right things with our data, extract that information, and improve our semantic layer. The tables are called zorp and plume, so let's update the semantic layer. Let's add things like: these columns are in cents, not dollars, so use this procedure to convert them. Let's apply that change. And every time you apply a change, you get a unique instance of the semantic layer. Now, when you ask that same question against the updated semantic layer, it just works in a single shot. It knows exactly which models to go to and exactly which procedure to run. So this is an example of how the semantic layer itself can improve and gradually absorb the learnings and context you have inside your organization. And this, again, works because you've decoupled the planning step from the execution step. By the way, if you're doing text-to-SQL, you've already decoupled planning and execution: you're planning in SQL and you're executing SQL, and that's why a lot of text-to-SQL folks talk about the semantic layer. It's the same idea, but now brought to general AI: PromptQL can plan any AI task, not just SQL, and here's the semantic layer that drives planning for that DSL. That's the second piece.
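As a rough sketch of that improvement loop, under the assumption that the semantic layer is just versioned, structured context that planning consumes (the function and field names below are invented for illustration, not PromptQL's API):

```python
# Illustrative sketch of an "agentic" semantic layer: learnings distilled from real
# conversations are appended as versioned context that future planning prompts can use.
import hashlib
import json
from typing import Callable

def improve_semantic_layer(conversations: list[str],
                           semantic_layer: dict,
                           llm: Callable[[str], list[str]]) -> dict:
    # Ask an LLM to distill durable facts (e.g. "table zorp holds employees",
    # "budget columns are in cents") out of conversations where a human analyst
    # steered the system to the right answer.
    learnings = llm(
        "Extract reusable facts about the data model from these conversations:\n"
        + "\n\n".join(conversations)
    )
    updated = dict(semantic_layer)
    updated["facts"] = sorted(set(updated.get("facts", [])) | set(learnings))
    # Version every change so each planning run pins an exact semantic layer instance.
    updated["version"] = hashlib.sha256(
        json.dumps(updated["facts"]).encode()
    ).hexdigest()[:12]
    return updated
```

The point is simply that the learnings live in data the planner reads, so each change is reviewable, and each planning run can pin an exact version of the layer.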
This model, in fact, is how we've been engaging with customers since we launched the alpha in January and February: we connect to customer data and systems, and on that particular eval set we're able to hit greater than 95 percent accuracy within a very short period of time, so we can actually start to offer an accuracy SLA. And because of our agent, we get to have this experience with our customers where you fire us for lack of reliability, and then we'll have a tough conversation with our data agent about customer churn because of lack of reliability. So that chain of trust solves Satya Nadella's question about liability. But what gives us the capability to start doing that is knowing we have precise control over the system, control to move reliability in the right direction and get it to the baseline we need.

So the quick summary is: decouple, decouple, decouple. If we decouple planning from execution, good things happen, because we can independently drive planning accuracy and deterministic execution, and separate those two concerns. Our learning was to do this with a DSL that can describe any AI task, plus an agentic semantic layer for that DSL. This can help you power copilots, agents, applications, or automation sparkle buttons that will actually work when you ship them. If you want to sound cooler when you summarize it, this would be my cocktail-conversation version: the generative output of reliable enterprise AI is not the result, because you can't control the result; it's the plan, because you can understand and control the plan. As we morph the way we think to focus on the plan, we can start to take the result for granted. And that is, almost philosophically, how you work with a human being as well. How do you trust, how do you have reliable intelligence in your organization? You can't just look at the result and hold your colleague accountable for the result; that's not going to be a productive engagement. You collaborate with your colleague on how they think and plan, and you set them up for success in how they plan. We're deploying intelligence here, not machines anymore, so in a way it makes sense.

Now that we have all of this reliable AI, what possibilities can we start to unlock? Two broad use cases I find very exciting. The first is unlocking on-demand automation, which basically means: if I can do anything, and do it reliably, let me automate the crap out of everything I see, especially the high-impact stuff. I'm taking the expense report as an example because it's relatable to all of us; in practice I'll give a few more examples of automations that are higher impact. But here, I just need to submit that monthly expense report every single time, and it's a pain to put all of that information together. So now I can interact with the system to create a prompt that says: help me create a work expense report. I can be very specific: get me the transaction amount, get me the expense type, classify it yourself, look at the email and tell me if you think it might be a personal spend or not. And it goes and looks at the email. I can iterate on this a little; it gives me a sense of what the expense type is, what the amount is, and whether it's likely business or not. And then I realize this format of data is super unhelpful for deciding whether these are work expenses or not, so I go in and make a bigger change. Sorry, let me load up the right thread. Yeah.
If I don't like the first format, I can go and change it: also include the vendor name, and add a one-line summary so I can actually see why you classified something as a business or a personal expense. I get to that point gradually. If there's a link to the invoice, add the link. I get this kind of summary, and I like it. And then, once I have a plan that works, each plan that works can be converted into an API call. That API call does not execute the LLM to create the plan, because the plan has already been created and frozen; it literally just runs what we call the PromptQL program. So now there's a program that goes over the last calendar month and does a whole bunch of things: some AI work, including classification and extraction; a bunch of structured work; some computation, like totalling. I can give it my expense policy, because the expense policy keeps changing; I can give it all of that and create this automation. Once this automation exists, it's just an API call that I can keep hitting every time I want the work done. And I have confidence that it will work, because I can verify each step in the middle and see how it's going.
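A minimal sketch of that last step, assuming the plan has been saved as plain data and that a deterministic executor like the toy one sketched earlier is available; the function names, file path, and parameter handling here are all hypothetical:

```python
# Hypothetical sketch: a saved ("frozen") plan re-run on demand, with no LLM call.
import json
from typing import Any, Callable

def run_frozen_plan(path: str,
                    execute: Callable[[list[dict[str, Any]]], Any],
                    overrides: dict[str, Any] | None = None) -> Any:
    """Replay a previously authored and verified plan; no planning happens here."""
    with open(path) as f:
        plan = json.load(f)                      # the plan is just data on disk
    for step in plan:
        for key, value in (overrides or {}).items():
            if key in step:                      # e.g. {"month": "2025-03"}
                step[key] = value
    return execute(plan)                         # deterministic replay of known steps
```

Whether this sits behind an HTTP endpoint or a cron schedule is an implementation detail; the property that matters is that every invocation replays steps that have already been inspected.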
That's a personal, relatable example, but we've partnered on a few initiatives that I'll give you a quick glimpse of. The first is customer onboarding, a common use case we've seen: you're onboarding a customer in the fintech data space, and it takes months, because the data in the customer's schema needs to be mapped to the data in the schema inside the organization, and there's a bunch of compliance work. It's a pain, and the process takes about 3 months. This creates an automation prompt that triangulates between the 2 data systems, uses a combination of AI and programmatic techniques to figure out what the mapping is, and then applies it. That automation can be reused again and again for that customer, because it works really well for that customer. It won't work for other customers, but it's high-impact enough to keep driving it. This results in a straight-up 3x increase in capacity.

The second is forecast accuracy improvement. This is a problem that we face and that all of you face: sales reps, and I'm looking at some of you here, hate updating Salesforce after their calls are done. And when you don't do that, and I don't know what's in stage 2 or stage 3, then for the life of me I have no idea what we're going to make this quarter. You didn't update stage 2 or stage 3; you didn't tell me whether you had a technical decision maker or an economic buyer identified. Now I can't do anything. It turns out that if you upload your playbook, connect it to Salesforce, and say these are the fields I need you to capture, you can keep running that automation, capture that information, and have it entered into Salesforce. Depending on the scale of your business, that drives an improvement in forecast accuracy; for us, it will end up being at least a ten percent increase. The larger your business, the more that ten percent means. If you're running a hundred million, a billion, or twenty billion through your books, a percentage change is a massive amount for a small automation you can give to a thousand people. And this automation is hard to build by hand, because it keeps changing: the playbook changes, the fields you need to capture change, and that kind of change is very real.

The last example I'll take is assisted scheduling, which is, again, a time-consuming process with lots of unique business rules. Somebody calls you, they want to schedule an appointment, and you have to figure out: what state are they in, what insurance do they have, what procedure do they want, who's the doctor, what medical code should I use, and oh, the regulation has changed, we don't do this in that state anymore. Twelve minutes comes down to 8 minutes because an AI runs a decision tree to figure out which decision code to use. Done reliably, you reduce that handling time, you have to train fewer people to do the work manually, and you drive a straight-up increase in the number of appointments you schedule. That's another example of the kind of on-demand automation we're working on with customers.

The final demo, which I really like: Q&A and retrieval and information lookup is cute, but here is the bigger version. When Deep Research first came out, I was so excited. I tried it and thought, maybe it's actually a researcher: it sits, creates a hypothesis, then conclusions and observations, and writes a little paper to solve my personal pet problem. It wasn't quite that; it was basically heavily parallelized web search, though much better than that sounds. It saved me a lot of time. I used it a lot, including for preparing our presentation today. But if AI is reliable in little pieces on your enterprise, what happens when it's reliable on a big piece? For example, if I'm connected to an e-commerce retail dataset and I just give it a vague question, I need to figure out how to improve my sales, come up with a few hypotheses and investigate them, so go do a bunch of slightly hard things, we instantly get into an experience where it comes up with various hypotheses, and for each hypothesis it collects information in a structured way; as hypotheses fail, it rewrites pieces of the plan, or captures the fact that something failed, and then starts to put that information together. Looking at the plans here: it comes up with a bunch of hypotheses. Some stores might be underperforming. Certain film categories, this is a movie rental database, might have underutilized revenue opportunities. Here's a hypothesis that it tried that ended up failing. Let's analyze different kinds of frequency patterns. Then you can start to see the results it gives you, and you can even ask it to run a forecast of what would happen if those recommendations were applied. The recommendations, and the quality of the recommendations, start to become really precise, because you can see, first of all, what the actual data is, what is happening.
You can go to any step and do a gut check: does this look right? Are we on the right track? And that's easy, because at each step you can see what the input was and what the output was. Then, when you look at the actual recommendations, they make sense in the context of the data you have: there's a particular category where you can increase stock; there's a particular category that's underperforming, so let's create a targeted campaign for it; let's identify high-value customers to see how much repeat business there is and how much impact that will have. These are now very specific to our data. And ultimately you can ask it to run a forecast, and it goes and runs a forecast to figure out how much impact a particular campaign would have. The cool thing, which I haven't done for this demo but which is very straightforward and which we have done, is that you can connect Wolfram to it so it can do basic statistical work for you on demand. You can say: rather than forecasting with code that might use a very simple predictor, or forecasting with GenAI, run a regression with this forecasting model and use that to create the forecast you want. That's the kind of thing you can start to do: once the bits are reliable, you can start to create larger and deeper plans that are reliable.

The last piece I'd like to talk about is a story: a technology leader story, a business leader story, and a let's-learn-to-dance-together story. For the technology leaders in this room: do you know that for every single question the business team asks you, and for every automation they request of you, there are ten they don't even bring to you, because they know you cannot do it? There are ten questions I have for any analyst, and ten automations I want at any point in time, that don't even reach your backlog, so you don't even know how valuable they are. That's one of the first things we've seen when we talk to technology teams about the impact of what they can do with AI. The business leader problem, on the other side, is that the impact of enterprise AI is only possible when silos start to get broken, and it's really hard to break silos without the partnership of your technology team. So you're buying AI to fix this problem, but it's not leveraging the rest of what you have, because that needs to be done more systematically. This collaboration is super critical, because there is no definition of done for this kind of work, and there is no button to hit that works all the time. Because of this, without collaboration between these two teams and their leaders at a level we've not seen before in our industry, nothing successful can ever happen.
A few weeks ago, at a public healthcare company, a product and field operations leader, not a digital product leader but someone running operations on the ground, came in along with somebody from the CTO team for a deep understanding of the existing system, and somebody from the CIO team covering compliance and security: completely different people. We came together for a 3-day hackathon connected to their actual live data, and we prototyped automation on their data that would have greater than a fifty-million-dollar impact. But this is not possible if those 3 people are not in the room. A lot of the time, when we're building and thinking about doing a cool transformation, we don't have these people in the room, and any one of them alone doesn't have enough perspective on how to build, what to build, and how to roll it out. That's been the biggest learning for us in how we think about what to build, how to build it, and who to collaborate with. And that is really important.

That's my time today, but just to recap the 4 takeaways: reliability is pretty important, in case it escaped you. The second is that, for us, decoupling was a core insight and capability that, without degrading performance, and in fact while increasing performance, allowed us to control predictability and explainability to a level that was sufficient for mission-critical tasks, which is awesome. I believe that as these kinds of techniques appear more often in the industry, and we've already started seeing a few of them, they will unlock the next generation of automation and decision making. Analytics will transform into decision making, and software engineering and IT will transform into automation, which is what it already does, just levelled up massively over the next few years. And the last is to help people find their dancing partners. When we speak to tech people, we ask, who's the business person? When we speak to business people, we ask, who's the tech person? Because if you don't have that pairing, nothing's going to happen, and we're going to hear a little bit more about that today from the rest of the talks. So that's my time. Thank you so much, folks.

What I'd love to do now is take a few questions. As we set up to bring Avlok from AngelList on stage for a quick chat, I can take a question or two. Go ahead.

No, meaning nothing systematic that we trust. We have a talk after lunch that covers some of the error-propagation issues in multi-agent systems, and we can go into more detail on our experiences there. But the TL;DR is that live multi-agent orchestration has just not been a trustworthy model for us. So instead of multi-agent orchestration happening live, we prefer to represent it as a PromptQL plan: the deep-research demo, for example, was multiple PromptQL agents running different hypotheses and putting that data together. So it was multi-agent, but the multi-agent plan was written in the DSL a priori. That freezes the contracts for how the different systems talk to each other and what they respond with, and it freezes the expectations a priori. That freezing of expectations has a benefit, because now you can actually reliably learn the system. If you don't freeze it, the most common thing you notice is that the agents just keep talking to each other and don't know when to exit the conversation.
We've crossed an error threshold, or we don't know if it's right anymore: those kinds of things start to happen. Abhinav from our team will get into more detail there.

So the semantic layer is the piece you need to exercise control on, because how accurate your plan is depends on how much context you have. As state-of-the-art LLMs improve, their reasoning is improving very well, so if I keep improving the context, in a language the LLM understands, its planning accuracy will go up. That's where the semantic layer being agentic becomes really important. If I can describe enough about my domain, I can expect plans to become more and more accurate. If you come from a finance background and you ask a very particular question about financial analysis, the plan might not be accurate; it might execute but not be correct, it might not compute a continuous date series, it might compute a different kind of rolling aggregate than the one you want. But if you can embed that procedure into the semantic layer, then boom, planning accuracy goes up for you. That's where the agentic semantic layer becomes really important for driving planning accuracy. And at least the good thing is that we don't have to worry about plan execution; that piece is done. Now I can focus on plan accuracy and not worry about plan execution.

Yes. Once you start freezing the context, the plan becomes very repeatable. If you don't improve the semantic layer, you still get different plans every time. But as soon as you start improving the semantic layer, plan repeatability goes up amazingly well.

Amazing. Thank you so much for the questions. Thank you so much, Tanmai.