Abstract
Retrieval-Augmented Generation (RAG) has emerged as a critical approach for enhancing Large Language Models (LLMs) with external knowledge. However, traditional RAG implementations demonstrate significant accuracy limitations when handling complex, multi-hop reasoning tasks. This paper presents a comprehensive evaluation of three RAG approaches using the FRAMES benchmark dataset, revealing that PromptQL's plan-based execution methodology achieves near-perfect accuracy (~100%) compared to approximately 40% for naive RAG and 60% for agentic RAG implementations. Through systematic analysis, we identify four primary failure modes in conventional RAG systems: extraction errors in processing, context size limitations, inexhaustive computation patterns, and computational reasoning failures. Our findings demonstrate that decoupling context-heavy plan generation from context-limited plan execution through programmatic environments fundamentally addresses these limitations, enabling reliable AI-driven data retrieval and analysis for enterprise applications.
Introduction
Background and Motivation
The integration of Large Language Models (LLMs) with external knowledge bases through Retrieval-Augmented Generation (RAG) has become increasingly critical for enterprise AI applications. While RAG systems promise to ground LLM outputs in factual information, their practical deployment reveals significant accuracy limitations, particularly for complex queries requiring multi-step reasoning and information synthesis from multiple sources.
The FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) benchmark, introduced by Google Research, provides a standardized evaluation framework with 824 challenging multi-hop questions that require retrieving and integrating information from 2-15 Wikipedia articles. These questions span diverse reasoning types including numerical computation, tabular data processing, temporal reasoning, and constraint satisfaction—representing the complexity typical of real-world enterprise queries.
Research Contributions
This paper makes the following contributions:
1. Comprehensive Benchmark Evaluation: We present results from running the FRAMES benchmark across three distinct RAG architectures, providing statistically significant evidence of performance differences.
2. Systematic Failure Analysis: We identify and categorize four primary failure modes in traditional RAG systems, with concrete examples demonstrating how each manifests in practice.
3. Architectural Innovation: We demonstrate how PromptQL's plan-based execution approach, which separates planning from execution, fundamentally addresses the limitations of in-context processing.
4. Practical Implications: We discuss the implications for enterprise AI deployment where accuracy and reliability are non-negotiable requirements.
Related Work
Evolution of RAG Systems
Traditional RAG architectures follow a pipeline approach: query embedding, vector similarity search, context retrieval, and LLM generation (Lewis et al., 2020). While effective for simple factual queries, these systems struggle with complex reasoning tasks that require multiple retrieval steps and sophisticated information integration.
Agentic Approaches to RAG
Recent work has explored giving LLMs direct control over retrieval through tool-calling mechanisms (Schick et al., 2023). These "agentic RAG" systems allow the LLM to formulate its own search queries iteratively. However, as our results demonstrate, this approach still suffers from fundamental limitations when processing retrieved information within the LLM's context window.
The FRAMES Benchmark
The FRAMES benchmark (Google Research, 2024) represents a significant advancement in RAG evaluation, focusing on multi-hop reasoning tasks that mirror real-world complexity. Questions are categorized by reasoning type:
- Numerical reasoning requiring mathematical computation
- Tabular data extraction and analysis
- Multiple constraint satisfaction
- Temporal reasoning and disambiguation
- Post-processing operations on retrieved data
Methodology
Experimental Setup
We evaluated three distinct RAG approaches using the complete FRAMES benchmark dataset:
Naive RAG
A traditional RAG pipeline implementing:
- Query embedding using state-of-the-art embedding models
- Vector similarity search retrieving the top 5 most relevant Wikipedia articles
- Direct provision of retrieved context to the LLM for answer generation
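As a concrete reference point, the sketch below shows the shape of this pipeline. The `embed`, `vector_store`, and `llm_complete` helpers are hypothetical stand-ins for the embedding model, vector index, and base-LLM call; this is an illustration of the pattern, not the exact implementation used in our evaluation.

```python
# Minimal sketch of the naive RAG pipeline described above.
# `embed`, `vector_store`, and `llm_complete` are hypothetical stand-ins
# for the embedding model, vector index, and base-LLM call.

def naive_rag(question: str, embed, vector_store, llm_complete, top_k: int = 5) -> str:
    # 1. Embed the query.
    query_vector = embed(question)

    # 2. Retrieve the top-k most similar Wikipedia articles.
    articles = vector_store.search(query_vector, k=top_k)

    # 3. Provide the retrieved context directly to the LLM for answer generation.
    context = "\n\n".join(article["text"] for article in articles)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```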
Agentic RAG
An LLM-driven approach where:
- The LLM has access to a vector search tool
- It can autonomously formulate search queries based on the question
- Retrieved articles (top 5 per query) are processed within the LLM's context
- The LLM iteratively refines its search strategy
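The loop below is a minimal sketch of this agentic pattern, assuming a hypothetical `llm_step` function that returns either a search request or a final answer; a real implementation would use the model's native tool-calling interface.

```python
# Conceptual sketch of the agentic RAG loop: the LLM decides when to search and
# what to search for. `llm_step` and `vector_search` are hypothetical stand-ins.

def agentic_rag(question: str, llm_step, vector_search, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        # Returns {"type": "search", "query": ...} or {"type": "answer", "answer": ...}.
        action = llm_step(messages)
        if action["type"] == "search":
            # Top-5 articles per tool call are appended to the conversation,
            # so the context grows with every retrieval step.
            results = vector_search(action["query"], k=5)
            messages.append({"role": "tool", "content": "\n\n".join(results)})
        else:
            return action["answer"]
    return "No answer produced within the turn limit"
```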
PromptQL
Utilizing the same retrieval infrastructure but implementing:
- LLM-generated query plans in a domain-specific language
- Programmatic execution of plans outside the LLM context
- Structured memory artifacts for storing intermediate results
- Referenceable outputs across multiple reasoning steps
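To make the separation concrete, the sketch below illustrates the general idea of plan-based execution with named, referenceable artifacts. The `execute_plan` function and its step format are hypothetical and do not reflect PromptQL's actual DSL or API; they only illustrate the pattern of executing an LLM-generated plan outside the LLM context.

```python
# Illustrative sketch of plan-based execution (hypothetical step format, not
# PromptQL's actual DSL or API). The LLM emits a plan as structured steps; a
# programmatic executor runs each step outside the LLM context and stores its
# result as a named artifact that later steps can reference via "$name".

def execute_plan(plan: list[dict], tools: dict) -> dict:
    artifacts: dict[str, object] = {}  # structured memory, lives outside the LLM context
    for step in plan:
        tool = tools[step["tool"]]  # e.g. "search", "extract", "compute"
        # Resolve "$name" references to previously stored artifacts.
        args = {
            key: artifacts[value[1:]] if isinstance(value, str) and value.startswith("$") else value
            for key, value in step["args"].items()
        }
        artifacts[step["store_as"]] = tool(**args)
    return artifacts
```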
All three systems used Claude 3.5 Sonnet as the base LLM to ensure fair comparison, with identical retrieval functions returning the same Wikipedia content.
Evaluation Protocol
Each system was evaluated on the FRAMES questions. Answers were scored as correct if they exactly matched the gold standard answer provided in the benchmark. For questions requiring numerical answers, we allowed for reasonable floating-point precision differences. No partial credit was awarded—this binary scoring reflects the requirements of enterprise applications where approximate answers are often unusable.
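A minimal sketch of this scoring rule is shown below; the specific numerical tolerance is an assumption made here for illustration.

```python
import math

# Sketch of the binary scoring rule: exact string match, with a small tolerance
# when both answers parse as numbers. The tolerance value is an assumption;
# no partial credit is awarded in any case.

def is_correct(predicted: str, gold: str, rel_tol: float = 1e-3) -> bool:
    predicted, gold = predicted.strip(), gold.strip()
    if predicted == gold:
        return True
    try:
        return math.isclose(
            float(predicted.replace(",", "")),
            float(gold.replace(",", "")),
            rel_tol=rel_tol,
        )
    except ValueError:
        return False
```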
Planning Prompt for PromptQL
To ensure optimal performance, PromptQL was provided with a planning prompt that guides its approach to FRAMES-style questions:
This problem requires the collection of many facts step by step and creating an algorithm to solve the problem. Don't rely on your own knowledge of facts, but instead try to collect facts step by step. To collect facts, retrieve relevant articles by making search queries. From a retrieved article use the extract function to extract the precise fact if its present in the article.
Results
Overall Accuracy
The evaluation on the FRAMES benchmark revealed dramatic differences in accuracy across the three approaches:
| Approach | Accuracy |
|---|---|
| Naive RAG | ~40% |
| Agentic RAG | ~60% |
| PromptQL | ~100% |
These results demonstrate that while giving LLMs control over retrieval (Agentic RAG) improves performance compared to simple pipeline approaches (Naive RAG), both traditional methods fail on a significant portion of questions. In contrast, PromptQL's plan-based execution achieves near-perfect accuracy, suggesting a fundamental architectural advantage rather than incremental improvement.
Sample Benchmark Question Results
Here are the detailed results for 10 sample questions from the FRAMES benchmark:
| Question | Agentic RAG | PromptQL | Correct Answer |
|---|---|---|---|
| If my future wife has the same first name as the 15th first lady of the United States' mother and her surname is the same as the second assassinated president's mother's maiden name, what is my future wife's name? | Eliza Ballou | Jane Ballou | Jane Ballou |
| I have an element in mind and would like you to identify the person it was named after. Here's a clue: The element's atomic number is 9 higher than that of an element discovered by the scientist who discovered Zirconium in the same year. | Context length error | Dmitri Mendeleev | Dmitri Mendeleev |
| A general motors vehicle is named after the largest ward in the country of Monaco. How many people had walked on the moon as of the first model year of the vehicle? Note: the model year is not the same as the year the model was first produced. | 2 | 4 | 4 |
| On March 7th, 2012, the director James Cameron explored a very deep underseas trench. As of August 3, 2024, how many times would the tallest building in San Francisco fit end to end from the bottom of the New Britain Trench to the surface of the ocean? The answer should be a rounded-off whole number. | 28 | 28 | 28 |
| In August of 2024, what is the first name of the mayor of the U.S. state capital city who attended the same university as at least one U.S. president and whose city is home to an outgoing or former full member of the Big 12 Conference | Cannot determine | Leirion | Leirion |
| According to the 1990 United States census, what was the total population of the cities in Oklahoma that had at least 100,000 residents according to the 2020 United States census? | 892,092 | 950,135 | 950,135 |
| As of July 1, 2024, if I wanted to give my daughter the middle name of the American woman who is the most decorated female in the history of American gymnastics as her first name and the full first name of the American woman who holds the world record in the 800-meter freestyle as her middle name, what would I name my daughter? | Simone Katie | Arianne Kathleen | Arianne Kathleen |
| As of 2023, how many more employees does the company alphabetically first by ticker symbol in the S&P500 have than the company alphabetically 2nd to last by ticker symbol in the S&P500? | 151,250 (Apple - Zebra) | 8,150 (Agilent - Zebra) | 8,150 (Agilent - Zebra) |
| The state, whose motto was adopted March 26, 1928, has 0.94% of the population in 2024 speaking a language that is native to which country? | Philippines | Philippines | Philippines |
| As of 2024, at the time of his birth, what was the middle name of the U.S. president who won Alaska, graduated from Yale University, and had a son named Michael? | No president found | Lynch | Lynch |
Failure Mode Analysis
Our analysis identified four primary failure modes in traditional RAG systems:
1. Extraction Errors in Processing
The following question failed on Agentic RAG because of errors in processing the data in context:
Question: "If my future wife has the same first name as the 15th first lady of the United States' mother and her surname is the same as the second assassinated president's mother's maiden name, what is my future wife's name?"
Correct answer: Jane Ballou
Agentic RAG final response:
Let me process all this information:
- The 15th First Lady was Mary Todd Lincoln, and I need to find her mother's first name
- The second assassinated president was James Garfield, and I need to find his mother's maiden name
From searching through the results:
- Mary Todd Lincoln's mother's name Eliza Parker
- James Garfield's mother was Eliza Ballou
Therefore, if your future wife has the first name "Eliza" (like Mary Todd Lincoln's mother) and the surname "Ballou" (like James Garfield's mother's maiden name), her name would be Eliza Ballou.
So your future wife's name would be Eliza Ballou.
The problem here is the incorrect extraction of the 15th First Lady from the Wikipedia article. Mary Todd Lincoln was actually the 16th First Lady, while Harriet Lane was the 15th. Extraction errors of this kind are common when large amounts of data are processed in-context.
2. Context Size Limitations
The following question failed on Agentic RAG due to context size limits:
Question: "I have an element in mind and would like you to identify the person it was named after. Here's a clue: The element's atomic number is 9 higher than that of an element discovered by the scientist who discovered Zirconium in the same year."
Correct answer: Mendelevium is named after Dmitri Mendeleev.
Agentic RAG final response:
Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'prompt is too long: 208600 tokens > 200000 maximum'}}
Since the top five articles are retrieved per tool call, and multiple tool calls can be issued in the course of a single question, context size limit errors are not uncommon.
3. Inexhaustive Computation
The following question failed on Agentic RAG because it responded prematurely with a plausible answer instead of exhaustively verifying it.
Question: "A general motors vehicle is named after the largest ward in the country of Monaco. How many people had walked on the moon as of the first model year of the vehicle? Note: the model year is not the same as the year the model was first produced."
Correct answer: 4
Agentic RAG response:
By July 1969, when the first 1970 model year Monte Carlo would have been introduced, two people had walked on the Moon - Neil Armstrong and Buzz Aldrin, who landed on July 20, 1969 as part of the Apollo 11 mission. They were the first and only people to have walked on the Moon when the first 1970 model year Monte Carlos went on sale.
The Apollo 12 mission also took place in 1969 and brought the total to four, but in-context processing simply skipped over it.
4. Errors in Computational Reasoning
The following question failed on Agentic RAG because it couldn't reason and compute the user intent correctly:
Question: "As of July 1, 2024, if I wanted to give my daughter the middle name of the American woman who is the most decorated female in the history of American gymnastics as her first name and the full first name of the American woman who holds the world record in the 800-meter freestyle as her middle name, what would I name my daughter?"
Correct answer: Arianne Kathleen
Agentic RAG answer:
Based on the search results, I can determine that as of July 1, 2024, the name would be:
The first name would be "Simone", as Simone Biles is clearly the most decorated American female gymnast in history, with 32 Olympic and World Championship gold medals among her many accomplishments.
The middle name would be "Katie", as Katie Ledecky is the world record holder in the 800m freestyle with a time of 8:04.79, which she set at the 2016 Rio Olympics and still stands as of 2024.
Therefore, the name would be "Simone Katie".
Although the athletes identified are correct, in-context processing fails to construct the right computational plan for the final answer: the question asks for Simone Biles's middle name and Katie Ledecky's full first name, not their first names.
The PromptQL Approach
Architectural Overview
PromptQL fundamentally reimagines RAG by separating the planning phase from the execution phase:
1. Planning Phase: The LLM generates a detailed query plan in PromptQL's domain-specific language, focusing solely on the logical steps needed to answer the question.
2. Execution Phase: The plan is executed programmatically in a controlled environment, with each step producing structured outputs stored as referenceable artifacts.
Key Innovations
Structured Memory Management
Unlike traditional RAG where all information must fit within a single context window, PromptQL maintains structured memory artifacts that can be referenced across unlimited reasoning steps. This enables:
- Processing of arbitrarily large datasets
- Systematic verification of intermediate results
- Precise tracking of information provenance
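A toy illustration of such an artifact store is sketched below; the `Artifact` and `ArtifactStore` classes are hypothetical and are not PromptQL internals.

```python
from dataclasses import dataclass, field

# Toy illustration of structured memory (hypothetical, not PromptQL internals):
# each intermediate result lives outside the LLM context and carries its
# provenance, so later steps can reference and verify it.

@dataclass
class Artifact:
    name: str
    data: object
    source: str  # e.g. the search query or article the data came from

@dataclass
class ArtifactStore:
    _items: dict = field(default_factory=dict)

    def put(self, name: str, data: object, source: str) -> None:
        self._items[name] = Artifact(name, data, source)

    def get(self, name: str) -> object:
        return self._items[name].data

    def provenance(self, name: str) -> str:
        return self._items[name].source
```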
Programmatic Execution Environment
By executing plans outside the LLM context, PromptQL achieves:
- Deterministic computational operations
- Elimination of arithmetic and logical errors
- Consistent handling of structured data (tables, lists, etc.)
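For example, once the qualifying cities for the Oklahoma census question from the results table have been extracted into an artifact, the aggregation runs as ordinary code rather than as in-context arithmetic. The helper below is a hypothetical illustration of that step.

```python
# Hypothetical illustration: the aggregation for the Oklahoma census question
# runs as ordinary code over data already extracted into artifacts, so the
# arithmetic is exact rather than approximated inside the LLM's context.

def total_1990_population(population_1990: dict[str, int], qualifying_cities: list[str]) -> int:
    # population_1990 maps city name -> 1990 census population (retrieved earlier);
    # qualifying_cities lists the Oklahoma cities with at least 100,000 residents
    # in the 2020 census.
    return sum(population_1990[city] for city in qualifying_cities)
```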
User-Guided Planning
The system allows users to provide planning guidance, ensuring the approach aligns with domain-specific requirements or preferences.
Example: Solving a Complex FRAMES Question
Here's how PromptQL's planner breaks down the retrieval and reasoning for one of the benchmark questions:
Question: "If my future wife has the same first name as the 15th first lady of the United States' mother and her surname is the same as the second assassinated president's mother's maiden name, what is my future wife's name?"
PromptQL's Approach:
1. Initial Planning:
   - Find the 15th First Lady of the United States
   - Search for articles about the 15th First Lady
   - Extract her identity from the articles
2. First Execution:
   - Search query: "15th First Lady"
   - Result: Harriet Lane was the 15th First Lady
3. Second Planning:
   - Search for articles specifically about Harriet Lane's mother
   - Extract her mother's first name from the articles
4. Second Execution:
   - Search query: "Harriet Lane mother"
   - Result: Harriet Lane's mother's first name was Jane
5. Third Planning:
   - Search for articles about assassinated US presidents to identify the second one
   - Once identified, search for information about his mother's maiden name
6. Third Execution:
   - Search query: "assassinated US presidents"
   - Result: James A. Garfield was the second assassinated president
7. Fourth Planning:
   - Search for articles about James Garfield's mother
   - Extract his mother's maiden name from the articles
8. Final Execution:
   - Search query: "James Garfield mother"
   - Result: James Garfield's mother's maiden name was Ballou
9. Answer Construction:
   - First name: Jane (from Harriet Lane's mother)
   - Surname: Ballou (from James Garfield's mother's maiden name)
   - Final Answer: Jane Ballou
You can see the PromptQL thread here.
This programmatic approach eliminates the extraction and reasoning errors that plague in-context processing.
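For illustration, the same steps can be written as a static plan in the hypothetical step format sketched in the Methodology section (illustrative pseudo-syntax, not PromptQL's actual DSL). In the actual run the planning was iterative: each later search query was formulated only after the previous execution returned its result.

```python
# The steps above written as a static plan for the hypothetical executor sketched
# earlier (illustrative pseudo-syntax, not PromptQL's DSL).

plan = [
    {"tool": "search",  "args": {"query": "15th First Lady"}, "store_as": "fl_articles"},
    {"tool": "extract", "args": {"articles": "$fl_articles", "fact": "Who was the 15th First Lady?"}, "store_as": "first_lady"},
    {"tool": "search",  "args": {"query": "Harriet Lane mother"}, "store_as": "mother_articles"},
    {"tool": "extract", "args": {"articles": "$mother_articles", "fact": "First name of Harriet Lane's mother"}, "store_as": "first_name"},
    {"tool": "search",  "args": {"query": "assassinated US presidents"}, "store_as": "president_articles"},
    {"tool": "extract", "args": {"articles": "$president_articles", "fact": "Second assassinated US president"}, "store_as": "president"},
    {"tool": "search",  "args": {"query": "James Garfield mother"}, "store_as": "garfield_articles"},
    {"tool": "extract", "args": {"articles": "$garfield_articles", "fact": "Maiden name of James Garfield's mother"}, "store_as": "surname"},
    {"tool": "concat",  "args": {"first": "$first_name", "second": "$surname"}, "store_as": "answer"},  # -> "Jane Ballou"
]
```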
Discussion
Implications for Enterprise AI
The near-perfect accuracy achieved by PromptQL, compared to roughly 40% for traditional RAG, represents a paradigm shift for enterprise applications. The accuracy attainable with traditional approaches is insufficient for:
- Financial analysis where errors have monetary consequences
- Healthcare applications where patient safety is paramount
- Legal research where precision is mandatory
- Supply chain optimization where small errors compound
Scalability Advantages
By moving data processing outside the LLM's context window, PromptQL can handle:
- Datasets with billions of records
- Complex queries requiring dozens of retrieval steps
- Real-time processing of streaming data
- Integration with multiple heterogeneous data sources
Theoretical Insights
Our results suggest that the fundamental limitation of traditional RAG lies not in retrieval quality but in the conflation of planning and execution within a probabilistic context. Separating these concerns allows each component to operate optimally:
- LLMs excel at high-level reasoning and plan generation
- Programmatic environments excel at precise execution and data manipulation
Conclusion
This comprehensive evaluation using the complete FRAMES benchmark demonstrates that achieving near-perfect accuracy in RAG systems is possible through architectural innovation. PromptQL's approach of separating plan generation from execution addresses fundamental limitations in traditional RAG systems:
- Extraction errors are eliminated through programmatic processing
- Context limitations are overcome with structured memory management
- Inexhaustive computation is prevented by systematic execution
- Reasoning errors are avoided through deterministic operations
For organizations deploying AI systems where accuracy is non-negotiable, the plan-based execution paradigm offers a path to reliable, enterprise-grade AI. As we move toward increasingly complex AI applications, the principles demonstrated here—separation of concerns, structured memory, and programmatic execution—will become essential for trustworthy AI systems.
The near-perfect accuracy achieved on FRAMES is not merely a benchmark result but a demonstration that reliable AI is achievable when we fundamentally rethink how LLMs interact with external knowledge. The future of enterprise AI lies not in larger context windows or more sophisticated prompting, but in architectural approaches that leverage the complementary strengths of neural and symbolic computation.
References
[1] Google Research. (2024). FRAMES: Factuality, Retrieval, And reasoning MEasurement Set. Hugging Face Datasets. https://huggingface.co/datasets/google/frames-benchmark
[2] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
[3] Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv preprint arXiv:2302.04761.
[4] Anthropic. (2024). Claude 3.5 Sonnet: Technical Specifications. Anthropic Documentation.
[5] Hasura. (2025). PromptQL: A Platform for Reliable AI on Enterprise Data. https://promptql.hasura.io/