Abstract
Retrieval-Augmented Generation (RAG) has emerged as a critical approach for enhancing Large Language Models (LLMs) with external knowledge. However, traditional RAG implementations demonstrate significant accuracy limitations when handling complex, multi-hop reasoning tasks. This paper presents a comprehensive evaluation of three RAG approaches using the FRAMES benchmark dataset, revealing that PromptQL's plan-based execution methodology achieves near-perfect accuracy (~100%) compared to approximately 40% for naive RAG and 60% for agentic RAG implementations. Through systematic analysis, we identify four primary failure modes in conventional RAG systems: extraction errors in processing, context size limitations, inexhaustive computation patterns, and computational reasoning failures. Our findings demonstrate that decoupling context-heavy plan generation from context-limited plan execution through programmatic environments fundamentally addresses these limitations, enabling reliable AI-driven data retrieval and analysis for enterprise applications.
Introduction
Background and Motivation
The integration of Large Language Models (LLMs) with external knowledge bases through Retrieval-Augmented Generation (RAG) has become increasingly critical for enterprise AI applications. While RAG systems promise to ground LLM outputs in factual information, their practical deployment reveals significant accuracy limitations, particularly for complex queries requiring multi-step reasoning and information synthesis from multiple sources.
The FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) benchmark, introduced by Google Research, provides a standardized evaluation framework with 824 challenging multi-hop questions that require retrieving and integrating information from 2-15 Wikipedia articles. These questions span diverse reasoning types including numerical computation, tabular data processing, temporal reasoning, and constraint satisfaction—representing the complexity typical of real-world enterprise queries.
Research Contributions
This paper makes the following contributions:
1. Comprehensive Benchmark Evaluation: We present results from running the FRAMES benchmark across three distinct RAG architectures, providing statistically significant evidence of performance differences.
2. Systematic Failure Analysis: We identify and categorize four primary failure modes in traditional RAG systems, with concrete examples demonstrating how each manifests in practice.
3. Architectural Innovation: We demonstrate how PromptQL's plan-based execution approach, which separates planning from execution, fundamentally addresses the limitations of in-context processing.
4. Practical Implications: We discuss the implications for enterprise AI deployment where accuracy and reliability are non-negotiable requirements.
Related Work
Evolution of RAG Systems
Traditional RAG architectures follow a pipeline approach: query embedding, vector similarity search, context retrieval, and LLM generation (Lewis et al., 2020). While effective for simple factual queries, these systems struggle with complex reasoning tasks that require multiple retrieval steps and sophisticated information integration.
Agentic Approaches to RAG
Recent work has explored giving LLMs direct control over retrieval through tool-calling mechanisms (Schick et al., 2023). These "agentic RAG" systems allow the LLM to formulate its own search queries iteratively. However, as our results demonstrate, this approach still suffers from fundamental limitations when processing retrieved information within the LLM's context window.
The FRAMES Benchmark
The FRAMES benchmark (Google Research, 2024) represents a significant advancement in RAG evaluation, focusing on multi-hop reasoning tasks that mirror real-world complexity. Questions are categorized by reasoning type:
- Numerical reasoning requiring mathematical computation
- Tabular data extraction and analysis
- Multiple constraint satisfaction
- Temporal reasoning and disambiguation
- Post-processing operations on retrieved data
Methodology
Experimental Setup
We evaluated three distinct RAG approaches using the complete FRAMES benchmark dataset:
Naive RAG
A traditional RAG pipeline implementing:
- Query embedding using state-of-the-art embedding models
- Vector similarity search retrieving the top 5 most relevant Wikipedia articles
- Direct provision of retrieved context to the LLM for answer generation
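As a concrete reference point, the sketch below shows the shape of this pipeline. The `embed`, `vector_store`, and `llm_complete` helpers are hypothetical stand-ins for the embedding model, vector index, and base-LLM call; this is an illustration of the pattern, not the exact implementation used in our evaluation.

```python
# Minimal sketch of the naive RAG pipeline described above.
# `embed`, `vector_store`, and `llm_complete` are hypothetical stand-ins
# for the embedding model, vector index, and base-LLM call.

def naive_rag(question: str, embed, vector_store, llm_complete, top_k: int = 5) -> str:
    # 1. Embed the query.
    query_vector = embed(question)

    # 2. Retrieve the top-k most similar Wikipedia articles.
    articles = vector_store.search(query_vector, k=top_k)

    # 3. Provide the retrieved context directly to the LLM for answer generation.
    context = "\n\n".join(article["text"] for article in articles)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```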
Agentic RAG
An LLM-driven approach where:
- The LLM has access to a vector search tool
- It can autonomously formulate search queries based on the question
- Retrieved articles (top 5 per query) are processed within the LLM's context
- The LLM iteratively refines its search strategy
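The loop below is a minimal sketch of this agentic pattern, assuming a hypothetical `llm_step` function that returns either a search request or a final answer; a real implementation would use the model's native tool-calling interface.

```python
# Conceptual sketch of the agentic RAG loop: the LLM decides when to search and
# what to search for. `llm_step` and `vector_search` are hypothetical stand-ins.

def agentic_rag(question: str, llm_step, vector_search, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        # Returns {"type": "search", "query": ...} or {"type": "answer", "answer": ...}.
        action = llm_step(messages)
        if action["type"] == "search":
            # Top-5 articles per tool call are appended to the conversation,
            # so the context grows with every retrieval step.
            results = vector_search(action["query"], k=5)
            messages.append({"role": "tool", "content": "\n\n".join(results)})
        else:
            return action["answer"]
    return "No answer produced within the turn limit"
```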
PromptQL
Utilizing the same retrieval infrastructure but implementing:
- LLM-generated query plans in a domain-specific language
- Programmatic execution of plans outside the LLM context
- Structured memory artifacts for storing intermediate results
- Referenceable outputs across multiple reasoning steps
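To make the separation concrete, the sketch below illustrates the general idea of plan-based execution with named, referenceable artifacts. The `execute_plan` function and its step format are hypothetical and do not reflect PromptQL's actual DSL or API; they only illustrate the pattern of executing an LLM-generated plan outside the LLM context.

```python
# Illustrative sketch of plan-based execution (hypothetical step format, not
# PromptQL's actual DSL or API). The LLM emits a plan as structured steps; a
# programmatic executor runs each step outside the LLM context and stores its
# result as a named artifact that later steps can reference via "$name".

def execute_plan(plan: list[dict], tools: dict) -> dict:
    artifacts: dict[str, object] = {}  # structured memory, lives outside the LLM context
    for step in plan:
        tool = tools[step["tool"]]  # e.g. "search", "extract", "compute"
        # Resolve "$name" references to previously stored artifacts.
        args = {
            key: artifacts[value[1:]] if isinstance(value, str) and value.startswith("$") else value
            for key, value in step["args"].items()
        }
        artifacts[step["store_as"]] = tool(**args)
    return artifacts
```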
All three systems used Claude 3.5 Sonnet as the base LLM to ensure fair comparison, with identical retrieval functions returning the same Wikipedia content.
Evaluation Protocol
Each system was evaluated on the FRAMES questions. Answers were scored as correct if they exactly matched the gold standard answer provided in the benchmark. For questions requiring numerical answers, we allowed for reasonable floating-point precision differences. No partial credit was awarded—this binary scoring reflects the requirements of enterprise applications where approximate answers are often unusable.
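A minimal sketch of this scoring rule is shown below; the specific numerical tolerance is an assumption made here for illustration.

```python
import math

# Sketch of the binary scoring rule: exact string match, with a small tolerance
# when both answers parse as numbers. The tolerance value is an assumption;
# no partial credit is awarded in any case.

def is_correct(predicted: str, gold: str, rel_tol: float = 1e-3) -> bool:
    predicted, gold = predicted.strip(), gold.strip()
    if predicted == gold:
        return True
    try:
        return math.isclose(
            float(predicted.replace(",", "")),
            float(gold.replace(",", "")),
            rel_tol=rel_tol,
        )
    except ValueError:
        return False
```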
Planning Prompt for PromptQL
To ensure optimal performance, PromptQL was provided with a planning prompt that guides its approach to FRAMES-style questions:
This problem requires the collection of many facts step by step and creating an algorithm to solve the problem. Don't rely on your own knowledge of facts, but instead try to collect facts step by step. To collect facts, retrieve relevant articles by making search queries. From a retrieved article use the extract function to extract the precise fact if its present in the article.
Results
Overall Accuracy
The evaluation on the FRAMES benchmark revealed dramatic differences in accuracy across the three approaches:
| Approach | Accuracy |
|---|---|
| Naive RAG | ~40% |
| Agentic RAG | ~60% |
| PromptQL | ~100% |
These results demonstrate that while giving LLMs control over retrieval (Agentic RAG) improves performance compared to simple pipeline approaches (Naive RAG), both traditional methods fail on a significant portion of questions. In contrast, PromptQL's plan-based execution achieves near-perfect accuracy, suggesting a fundamental architectural advantage rather than incremental improvement.
Sample Benchmark Question Results
Here are the detailed results for 10 sample questions from the FRAMES benchmark:
| Question | Agentic RAG | PromptQL | Correct Answer |
|---|---|---|---|
| If my future wife has the same first name as the 15th first lady of the United States' mother and her surname is the same as the second assassinated president's mother's maiden name, what is my future wife's name? | Eliza Ballou | Jane Ballou | Jane Ballou |
| I have an element in mind and would like you to identify the person it was named after. Here's a clue: The element's atomic number is 9 higher than that of an element discovered by the scientist who discovered Zirconium in the same year. | Context length error | Dmitri Mendeleev | Dmitri Mendeleev |
| A general motors vehicle is named after the largest ward in the country of Monaco. How many people had walked on the moon as of the first model year of the vehicle? Note: the model year is not the same as the year the model was first produced. | 2 | 4 | 4 |
| On March 7th, 2012, the director James Cameron explored a very deep underseas trench. As of August 3, 2024, how many times would the tallest building in San Francisco fit end to end from the bottom of the New Britain Trench to the surface of the ocean? The answer should be a rounded-off whole number. | 28 | 28 | 28 |
| In August of 2024, what is the first name of the mayor of the U.S. state capital city who attended the same university as at least one U.S. president and whose city is home to an outgoing or former full member of the Big 12 Conference | Cannot determine | Leirion | Leirion |
| According to the 1990 United States census, what was the total population of the cities in Oklahoma that had at least 100,000 residents according to the 2020 United States census? | 892,092 | 950,135 | 950,135 |
| As of July 1, 2024, if I wanted to give my daughter the middle name of the American woman who is the most decorated female in the history of American gymnastics as her first name and the full first name of the American woman who holds the world record in the 800-meter freestyle as her middle name, what would I name my daughter? | Simone Katie | Arianne Kathleen | Arianne Kathleen |
| As of 2023, how many more employees does the company alphabetically first by ticker symbol in the S&P500 have than the company alphabetically 2nd to last by ticker symbol in the S&P500? | 151,250 (Apple - Zebra) | 8,150 (Agilent - Zebra) | 8,150 (Agilent - Zebra) |
| The state, whose motto was adopted March 26, 1928, has 0.94% of the population in 2024 speaking a language that is native to which country? | Philippines | Philippines | Philippines |
| As of 2024, at the time of his birth, what was the middle name of the U.S. president who won Alaska, graduated from Yale University, and had a son named Michael? | No president found | Lynch | Lynch |
Failure Mode Analysis
Our analysis identified four primary failure modes in traditional RAG systems:
1. Extraction Errors in Processing
The following question failed on Agentic RAG because of errors in processing the data in context:
Question: "If my future wife has the same first name as the 15th first lady of the United States' mother and her surname is the same as the second assassinated president's mother's maiden name, what is my future wife's name?"
Correct answer: Jane Ballou
Agentic RAG final response:
Let me process all this information:
- The 15th First Lady was Mary Todd Lincoln, and I need to find her mother's first name
- The second assassinated president was James Garfield, and I need to find his mother's maiden name
From searching through the results:
- Mary Todd Lincoln's mother's name Eliza Parker
- James Garfield's mother was Eliza Ballou
Therefore, if your future wife has the first name "Eliza" (like Mary Todd Lincoln's mother) and the surname "Ballou" (like James Garfield's mother's maiden name), her name would be Eliza Ballou.
So your future wife's name would be Eliza Ballou.
The problem here is the incorrect extraction of the 15th First Lady from the Wikipedia article. Mary Todd Lincoln was actually the 16th First Lady, while Harriet Lane was the 15th. Extraction errors of this kind are common when large amounts of data are processed in-context.
2. Context Size Limitations
The following question failed on Agentic RAG due to context size limits:
Question: "I have an element in mind and would like you to identify the person it was named after. Here's a clue: The element's atomic number is 9 higher than that of an element discovered by the scientist who discovered Zirconium in the same year."
Correct answer: Mendelevium is named after Dmitri Mendeleev.
Agentic RAG final response:
Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'prompt is too long: 208600 tokens > 200000 maximum'}}
Since the top five articles are retrieved per tool call, and multiple tool calls can be issued in the course of a single question, context size limit errors are not uncommon.
3. Inexhaustive Computation
The following question failed on Agentic RAG because it responded prematurely with a plausible answer instead of exhaustively verifying it.
Question: "A general motors vehicle is named after the largest ward in the country of Monaco. How many people had walked on the moon as of the first model year of the vehicle? Note: the model year is not the same as the year the model was first produced."
Correct answer: 4
Agentic RAG response:
By July 1969, when the first 1970 model year Monte Carlo would have been introduced, two people had walked on the Moon - Neil Armstrong and Buzz Aldrin, who landed on July 20, 1969 as part of the Apollo 11 mission. They were the first and only people to have walked on the Moon when the first 1970 model year Monte Carlos went on sale.
The Apollo 12 mission also took place in 1969 and brought the total to four, but in-context processing simply skipped over it.
4. Errors in Computational Reasoning
The following question failed on Agentic RAG because it couldn't reason and compute the user intent correctly:
Question: "As of July 1, 2024, if I wanted to give my daughter the middle name of the American woman who is the most decorated female in the history of American gymnastics as her first name and the full first name of the American woman who holds the world record in the 800-meter freestyle as her middle name, what would I name my daughter?"
Correct answer: Arianne Kathleen
Agentic RAG answer:
Based on the search results, I can determine that as of July 1, 2024, the name would be:
The first name would be "Simone", as Simone Biles is clearly the most decorated American female gymnast in history, with 32 Olympic and World Championship gold medals among her many accomplishments.
The middle name would be "Katie", as Katie Ledecky is the world record holder in the 800m freestyle with a time of 8:04.79, which she set at the 2016 Rio Olympics and still stands as of 2024.
Therefore, the name would be "Simone Katie".
Although the athletes identified are correct, in-context processing fails to construct the right computational plan for the final answer: the question asks for Simone Biles's middle name and Katie Ledecky's full first name, not their first names.
The PromptQL Approach
Architectural Overview
PromptQL fundamentally reimagines RAG by separating the planning phase from the execution phase:
1. Planning Phase: The LLM generates a detailed query plan in PromptQL's domain-specific language, focusing solely on the logical steps needed to answer the question.
2. Execution Phase: The plan is executed programmatically in a controlled environment, with each step producing structured outputs stored as referenceable artifacts.
Key Innovations
Structured Memory Management
Unlike traditional RAG where all information must fit within a single context window, PromptQL maintains structured memory artifacts that can be referenced across unlimited reasoning steps. This enables:
- Processing of arbitrarily large datasets
- Systematic verification of intermediate results
- Precise tracking of information provenance
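A toy illustration of such an artifact store is sketched below; the `Artifact` and `ArtifactStore` classes are hypothetical and are not PromptQL internals.

```python
from dataclasses import dataclass, field

# Toy illustration of structured memory (hypothetical, not PromptQL internals):
# each intermediate result lives outside the LLM context and carries its
# provenance, so later steps can reference and verify it.

@dataclass
class Artifact:
    name: str
    data: object
    source: str  # e.g. the search query or article the data came from

@dataclass
class ArtifactStore:
    _items: dict = field(default_factory=dict)

    def put(self, name: str, data: object, source: str) -> None:
        self._items[name] = Artifact(name, data, source)

    def get(self, name: str) -> object:
        return self._items[name].data

    def provenance(self, name: str) -> str:
        return self._items[name].source
```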
Programmatic Execution Environment
By executing plans outside the LLM context, PromptQL achieves:
- Deterministic computational operations
- Elimination of arithmetic and logical errors
- Consistent handling of structured data (tables, lists, etc.)
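For example, once the qualifying cities for the Oklahoma census question from the results table have been extracted into an artifact, the aggregation runs as ordinary code rather than as in-context arithmetic. The helper below is a hypothetical illustration of that step.

```python
# Hypothetical illustration: the aggregation for the Oklahoma census question
# runs as ordinary code over data already extracted into artifacts, so the
# arithmetic is exact rather than approximated inside the LLM's context.

def total_1990_population(population_1990: dict[str, int], qualifying_cities: list[str]) -> int:
    # population_1990 maps city name -> 1990 census population (retrieved earlier);
    # qualifying_cities lists the Oklahoma cities with at least 100,000 residents
    # in the 2020 census.
    return sum(population_1990[city] for city in qualifying_cities)
```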
User-Guided Planning
The system allows users to provide planning guidance, ensuring the approach aligns with domain-specific requirements or preferences.
Example: Solving a Complex FRAMES Question
Here's how PromptQL's planner breaks down the retrieval and reasoning for one of the benchmark questions:
Question: "If my future wife has the same first name as the 15th first lady of the United States' mother and her surname is the same as the second assassinated president's mother's maiden name, what is my future wife's name?"
PromptQL's Approach:
1. Initial Planning:
   - Find the 15th First Lady of the United States
   - Search for articles about the 15th First Lady
   - Extract her identity from the articles
2. First Execution:
   - Search query: "15th First Lady"
   - Result: Harriet Lane was the 15th First Lady
3. Second Planning:
   - Search for articles specifically about Harriet Lane's mother
   - Extract her mother's first name from the articles
4. Second Execution:
   - Search query: "Harriet Lane mother"
   - Result: Harriet Lane's mother's first name was Jane
5. Third Planning:
   - Search for articles about assassinated US presidents to identify the second one
   - Once identified, search for information about his mother's maiden name
6. Third Execution:
   - Search query: "assassinated US presidents"
   - Result: James A. Garfield was the second assassinated president
7. Fourth Planning:
   - Search for articles about James Garfield's mother
   - Extract his mother's maiden name from the articles
8. Final Execution:
   - Search query: "James Garfield mother"
   - Result: James Garfield's mother's maiden name was Ballou
9. Answer Construction:
   - First name: Jane (from Harriet Lane's mother)
   - Surname: Ballou (from James Garfield's mother's maiden name)
   - Final Answer: Jane Ballou
You can see the PromptQL thread here.
This programmatic approach eliminates the extraction and reasoning errors that plague in-context processing.
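For illustration, the same steps can be written as a static plan in the hypothetical step format sketched in the Methodology section (illustrative pseudo-syntax, not PromptQL's actual DSL). In the actual run the planning was iterative: each later search query was formulated only after the previous execution returned its result.

```python
# The steps above written as a static plan for the hypothetical executor sketched
# earlier (illustrative pseudo-syntax, not PromptQL's DSL).

plan = [
    {"tool": "search",  "args": {"query": "15th First Lady"}, "store_as": "fl_articles"},
    {"tool": "extract", "args": {"articles": "$fl_articles", "fact": "Who was the 15th First Lady?"}, "store_as": "first_lady"},
    {"tool": "search",  "args": {"query": "Harriet Lane mother"}, "store_as": "mother_articles"},
    {"tool": "extract", "args": {"articles": "$mother_articles", "fact": "First name of Harriet Lane's mother"}, "store_as": "first_name"},
    {"tool": "search",  "args": {"query": "assassinated US presidents"}, "store_as": "president_articles"},
    {"tool": "extract", "args": {"articles": "$president_articles", "fact": "Second assassinated US president"}, "store_as": "president"},
    {"tool": "search",  "args": {"query": "James Garfield mother"}, "store_as": "garfield_articles"},
    {"tool": "extract", "args": {"articles": "$garfield_articles", "fact": "Maiden name of James Garfield's mother"}, "store_as": "surname"},
    {"tool": "concat",  "args": {"first": "$first_name", "second": "$surname"}, "store_as": "answer"},  # -> "Jane Ballou"
]
```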
Discussion
Implications for Enterprise AI
The near-perfect accuracy achieved by PromptQL, compared to roughly 40% for traditional RAG, represents a paradigm shift for enterprise applications. The accuracy attainable with traditional approaches is insufficient for:
- Financial analysis where errors have monetary consequences
- Healthcare applications where patient safety is paramount
- Legal research where precision is mandatory
- Supply chain optimization where small errors compound
Scalability Advantages
By moving data processing outside the LLM's context window, PromptQL can handle:
- Datasets with billions of records
- Complex queries requiring dozens of retrieval steps
- Real-time processing of streaming data
- Integration with multiple heterogeneous data sources
Theoretical Insights
Our results suggest that the fundamental limitation of traditional RAG lies not in retrieval quality but in the conflation of planning and execution within a probabilistic context. Separating these concerns allows each component to operate optimally:
- LLMs excel at high-level reasoning and plan generation
- Programmatic environments excel at precise execution and data manipulation
Conclusion
This comprehensive evaluation using the complete FRAMES benchmark demonstrates that achieving near-perfect accuracy in RAG systems is possible through architectural innovation. PromptQL's approach of separating plan generation from execution addresses fundamental limitations in traditional RAG systems:
- Extraction errors are eliminated through programmatic processing
- Context limitations are overcome with structured memory management
- Inexhaustive computation is prevented by systematic execution
- Reasoning errors are avoided through deterministic operations
For organizations deploying AI systems where accuracy is non-negotiable, the plan-based execution paradigm offers a path to reliable, enterprise-grade AI. As we move toward increasingly complex AI applications, the principles demonstrated here—separation of concerns, structured memory, and programmatic execution—will become essential for trustworthy AI systems.
The near-perfect accuracy achieved on FRAMES is not merely a benchmark result but a demonstration that reliable AI is achievable when we fundamentally rethink how LLMs interact with external knowledge. The future of enterprise AI lies not in larger context windows or more sophisticated prompting, but in architectural approaches that leverage the complementary strengths of neural and symbolic computation.
References
[1] Google Research. (2024). FRAMES: Factuality, Retrieval, And reasoning MEasurement Set. Hugging Face Datasets. https://huggingface.co/datasets/google/frames-benchmark
[2] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
[3] Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv preprint arXiv:2302.04761.
[4] Anthropic. (2024). Claude 3.5 Sonnet: Technical Specifications. Anthropic Documentation.
[5] Hasura. (2025). PromptQL: A Platform for Reliable AI on Enterprise Data. https://promptql.hasura.io/