Abstract
LLM tool calling has become a popular architectural approach as agents move from demos to full-scale deployment. However, practitioners report unpredictable failures in production. To understand these failure patterns, we analyzed a sample of GAIA benchmark questions using publicly available tool-calling AI agents (Manus AI, H2O AI) and compared them against PromptQL's code-first approach. Our evaluation reveals that tool calling often fails due to i) inconsistent problem-solving, ii) interpretation errors, and iii) logical mistakes in mathematical operations. While limited in scope, these examples illustrate why deterministic code execution may offer advantages over probabilistic tool selection for enterprise AI deployment.
Analysis
To understand why tool-calling architectures systematically fail, we analyzed three GAIA questions and one realistic business question that test AI agent capabilities: precise instruction following, logical verification, and computational simulation.
What we tested: GAIA questions provide an isolated environment for testing whether systems can execute specific tasks consistently. Our business question demonstrates how these patterns manifest in real enterprise workflows.
How we tested: For each question, we evaluated conventional tool-calling systems (Manus AI, H2O AI) against PromptQL under identical conditions. We simply asked the questions on their consumer-facing apps. This apples-to-apples comparison reveals architectural differences rather than implementation variations.
Why this matters: The results expose consistent failure patterns in tool-calling architectures while demonstrating how code-first approaches like PromptQL's deliver superior reliability. If systems struggle with these controlled reasoning tasks, how can they be expected to handle business-critical questions where ambiguous queries, conflicting data, and imprecise context reign supreme?
Limitations: While GAIA questions test core capabilities in isolation, real enterprise workflows involve conflicting data sources, ambiguous requirements, and messy contexts. We're collaborating with Prof. Aditya Parameswaran's lab at UC Berkeley to develop benchmarks that capture this full complexity. Until then, GAIA provides a solid foundation for understanding fundamental architectural differences.
Understanding Code-First Architecture
We define "code-first" architectures as systems that treat executable code as the primary problem-solving interface. In this approach, code is a first-class citizen—planning, reasoning, and delegation all happen through explicit programming constructs rather than probabilistic natural language decisions.
Under the hood, code-first systems may still call external tools, APIs, or even delegate to other AI models. The key difference is that these operations happen through deterministic code paths rather than LLM-driven tool selection. Where traditional systems ask an LLM to decide "which tool should I use and how?", code-first systems embed those decisions in explicit control flow.
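The contrast is easiest to see in pseudocode. The sketch below is illustrative only: the llm, tools, and sandbox objects are hypothetical stand-ins, not any vendor's actual API.

```python
# Illustrative sketch only; `llm`, `tools`, and `sandbox` are hypothetical stand-ins.

def tool_calling_agent(question, llm, tools):
    """Tool-calling style: the LLM picks the next step at runtime."""
    decision = llm.complete(f"Which tool should I use for: {question}, and with what arguments?")
    tool = tools[decision.tool_name]      # probabilistic selection
    return tool(**decision.arguments)     # may differ from run to run

def code_first_agent(question, llm, sandbox):
    """Code-first style: the control flow is fixed; the LLM produces an inspectable plan and program."""
    plan = llm.complete(f"Write a query plan for: {question}")
    program = llm.complete(f"Write Python that executes this plan:\n{plan}")
    return sandbox.run(program)           # deterministic, traceable execution path
```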
Recent research identifies this as "code-enhanced reasoning," where code serves as "an abstract, modular, and logic-driven structure that supports reasoning". Programming languages inherently enforce systematic, step-by-step thinking that complex problems require—but which LLMs struggle to maintain consistently through natural language alone.
Q1: "Pick That Ping-Pong" Riddle
A complex probabilistic puzzle describing a game where ping-pong balls are ejected from a three-position platform, with specific rules for ball movement and ejection patterns.
GAIA Question: ec09fa32-d03f-4bf8-84b0-1f16922c3ae4
Experimental Setup: We tested this question on Manus AI (configured to use "agent mode") and PromptQL.
|   | Manus AI | PromptQL |
| --- | --- | --- |
| Method | Natural language reasoning to select tools | Code as primary problem-solving interface |
| Approach | Attempted qualitative analysis of game mechanics | Immediate computational simulation |
| Output | Ball 100 | Ball 3 |
| Result | ❌ Incorrect | ✅ Correct |
| Evidence | Execution thread | Execution thread |
Why Tool-Calling Failed
Manus AI has access to the right tools for this kind of question: its agent can write code, execute it, and run simulations. The failure wasn't due to a lack of capabilities; it occurred at the orchestration level, where the LLM decided to "reason" qualitatively rather than computationally.
This is the core issue with tool-calling architectures: tool selection is probabilistic, which leads to inconsistent problem-solving approaches. Sourcing and designing the right tools is a challenge, but even when the right tools are available, the system cannot reliably delegate to them. There is no guarantee that similar problems will be approached the same way twice.
Why Code-First Approach Succeeded
In the execution we observed, PromptQL solved this problem through executable code rather than qualitative reasoning. This code-first approach delivered two key advantages:
- Modular problem decomposition: PromptQL wrote discrete functions (advance_balls(), calculate_probabilities(), update_platform()), creating a systematic approach to the probabilistic analysis.
- Verifiable execution paths: every step of the ball ejection logic can be traced in the actual code, e.g. if piston_fires == 1: eject_ball(position_1). Unlike natural language reasoning, each logical step is explicit and checkable.
If any solution step fails in the code-first approach, it's easier to fix that building block than to retrace the logic trail of a natural language analysis.
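To make that structure concrete, here is a minimal skeleton of what such a simulation can look like. This is not PromptQL's actual code, and the rule marked as a placeholder stands in for the specific ejection rules in the GAIA question; the point is the shape of the solution: small named functions composed by explicit, traceable control flow.

```python
import random

# Structural sketch only: the real ejection rules live in the GAIA question, so the
# rule marked "placeholder" below stands in for them.

def advance_balls(platform, queue, size=3):
    """Refill empty platform positions from the front of the waiting queue."""
    while len(platform) < size and queue:
        platform.append(queue.pop(0))
    return platform, queue

def update_platform(platform, fired_index):
    """Remove the ejected ball; the question's ball-movement rules would go here."""
    platform.pop(fired_index)
    return platform

def calculate_probabilities(num_balls=100, trials=10_000):
    """Monte Carlo estimate of how often each ball is the first one ejected."""
    counts = {}
    for _ in range(trials):
        platform, queue = advance_balls([], list(range(1, num_balls + 1)))
        fired_index = random.randrange(len(platform))   # placeholder firing rule
        ejected = platform[fired_index]
        counts[ejected] = counts.get(ejected, 0) + 1
        update_platform(platform, fired_index)
    return {ball: n / trials for ball, n in counts.items()}

if __name__ == "__main__":
    print(calculate_probabilities())
```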
Q2: Sentence Extraction Puzzle
Extract a sentence from a 5x7 block of text by reading from left to right and using all letters in order.
GAIA Question: 50ad0280-0819-4bd9-b275-5de32d3b5bcb
Experimental Setup: We tested this question on Manus AI (configured to use "agent mode") and PromptQL.
|   | Manus AI | PromptQL |
| --- | --- | --- |
| Method | Natural language interpretation of instructions | Literal string manipulation following exact instructions |
| Process | Treated each line as separate words with spaces | Concatenated all characters, then identified sentence structure |
| Output | "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR" | "THE SEAGULL GLIDED PEACEFULLY TO MY CHAIR" |
| Result | ❌ Incorrect | ✅ Correct |
| Key Point | Added implicit word boundaries not in original text | Followed "use all letters in order" literally |
| Evidence | Thread | Thread |
Why Tool-Calling Failed
This failure reveals a prominent failure surface of tool-calling architectures: opaque reasoning. Manus AI produced an incorrect result, but we couldn't see the reasoning pathway until we specifically asked for clarification.
On closer inspection, we found that Manus AI made a critical interpretation error: it treated each line as a separate word. That flawed assumption was buried within the tool-calling execution. Based on it, Manus inserted spaces that weren't in the original text and ultimately provided a wrong solution.
Why PromptQL Succeeded
In contrast, PromptQL created a query plan (in natural language) and then wrote a Python program to execute that plan. Both of these artifacts are visible, explicit, and verifiable. If the query plan was flawed, PromptQL provides a mechanism for users to edit that plan. And if the Python program fails to compile or has bugs, PromptQL iterates and tries to fix them in real time. This approach yields several benefits: real-time verification, reproducible reasoning, and systematic debugging.
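Below is a minimal sketch of the literal, code-first reading of the instructions. The grid is a reconstruction consistent with the known answer (the authoritative block is in the GAIA question), and the candidate word list is supplied only to verify the segmentation.

```python
# Hypothetical reconstruction of the 5x7 block, read left to right, top to bottom.
grid = [
    "THESE",
    "AGULL",
    "GLIDE",
    "DPEAC",
    "EFULL",
    "YTOMY",
    "CHAIR",
]

# "Use all letters in order": concatenate every character before looking for words.
letters = "".join(grid)

# Only then recover word boundaries, here checked against a candidate segmentation.
words = ["THE", "SEAGULL", "GLIDED", "PEACEFULLY", "TO", "MY", "CHAIR"]
sentence, rest = [], letters
for w in words:
    assert rest.startswith(w), f"unexpected segmentation at {rest[:10]}"
    sentence.append(w)
    rest = rest[len(w):]

print(" ".join(sentence))  # THE SEAGULL GLIDED PEACEFULLY TO MY CHAIR
```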
Q3: Mathematical Operation Analysis
Given a table of mathematical operations, identify which elements are involved in counter-examples that prove the operation is not commutative.
GAIA Question: 6f37996b-2ac7-44b0-8e68-6d28256631b4
Experimental Setup: We tested this question on both PromptQL and H2O AI (configured with default LLM settings).
|   | H2O AI | PromptQL |
| --- | --- | --- |
| Method | Manual table inspection with natural language reasoning | Verification via code |
| Approach | Step-by-step manual checking of pairs | Automated iteration through all combinations |
| Output | "b,c,e" | "b,e" |
| Result | ❌ Incorrect | ✅ Correct |
| Evidence | Thread | Thread |
Why Tool-Calling Failed
H2O AI correctly identified the counter-example: b * e ≠ e * b (according to the table, b * e = c but e * b = b). This proves the pair (b, e) does not commute. However, instead of concluding that "b, e" were the non-commuting elements, it included "c" in its answer.
This is a logical error. The question asks which elements are involved in counter-examples, meaning which elements were tested for commutativity. The result of the operation, "c", has nothing to do with which elements were being compared.
Why Code-First Approach Worked
PromptQL systematically checked every pair (x,y) against (y,x) and added only the compared elements to the counter-example set. The code implemented the question's logic without ambiguity about what "involved" means.
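The check itself is only a few lines of code. The table below is reconstructed for illustration and matches the counter-example quoted above (the authoritative table is in the GAIA question); the key point is that only the operands of a failing pair are collected, never the result of the operation.

```python
from itertools import combinations

# Operation table reconstructed for illustration; it agrees with the quoted facts
# (b * e = c but e * b = b). The authoritative table is in the GAIA question.
elements = ["a", "b", "c", "d", "e"]
table = {
    "a": {"a": "a", "b": "b", "c": "c", "d": "b", "e": "d"},
    "b": {"a": "b", "b": "c", "c": "a", "d": "e", "e": "c"},
    "c": {"a": "c", "b": "a", "c": "b", "d": "b", "e": "a"},
    "d": {"a": "b", "b": "e", "c": "b", "d": "e", "e": "d"},
    "e": {"a": "d", "b": "b", "c": "a", "d": "d", "e": "c"},
}

# Collect only the elements that appear in a failing pair (x, y): the operands,
# never the result of the operation.
involved = set()
for x, y in combinations(elements, 2):
    if table[x][y] != table[y][x]:
        involved.update({x, y})

print(",".join(sorted(involved)))  # b,e
```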
Real-World Business Complexity
The failure patterns we've observed in these GAIA questions become even more critical when applied to real business operations. Consider this typical customer success workflow that many companies run daily:
Give me a sorted list of the top 5 support tickets I should prioritize among the 30 most recent open tickets.
For each ticket, attach:
- project_id (extract from ticket description)
- project plan
- issue criticality
- monthly average revenue for the project
- recent ticket_ids for that project (last 6 months)
Issue criticality (descending priority):
- Production downtime
- Instability in production
- Performance degradation in production
- Bug
- Feature request
- How-to
Prioritization rules: Production issues first, then advanced plan > base plan > free plan. Non-production issues ordered by monthly revenue. Break ties by: time since opened, then recent negative experience.
This query might look straightforward. But answering it requires:
- 6 different data requests (tickets, projects, plans, revenue, comments, history)
- Semantic analysis of ticket descriptions and comments
- Multi-criteria ranking with nested business rules
- Tie-breaking logic across multiple dimensions
- Consistent output format across all results
Here's the core challenge: How do you answer this question with confidence? If you ask this question every week, how many times will you get the right answer?
With tool-calling approaches, each execution involves a chain of probabilistic decisions:
- Will tie-breaking logic be applied uniformly?
- Will intermediate results be preserved correctly between steps?
- Will the LLM retrieve data in the same order?
The total surface area of failure is enormous. Semantic misclassification could break the chain. Context corruption could cloud the reasoning. Prioritization rules could be applied inconsistently. One probabilistic decision goes wrong anywhere in the chain, and your business analyst receives the wrong priority list.
This illustrates the fundamental challenge with tool-calling architectures: they promise sophisticated business intelligence but struggle with consistency when reliability matters most.
Code-first approaches offer an alternative paradigm. Instead of relying on probabilistic tool selection, they enforce determinism through explicit programming contracts. Query plans become editable and verified. Business logic gets encoded in testable and auditable functions. Intermediate results can be cached and reused.
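As a sketch of what "business logic as code" can look like, the snippet below encodes the prioritization rules from the query above as a deterministic sort key. The Ticket fields and the exact tie-breaking interpretation are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical sketch: the prioritization rules expressed as a testable sort key.
CRITICALITY_ORDER = [
    "Production downtime",
    "Instability in production",
    "Performance degradation in production",
    "Bug",
    "Feature request",
    "How-to",
]
PLAN_ORDER = ["advanced", "base", "free"]
PRODUCTION_ISSUES = set(CRITICALITY_ORDER[:3])

@dataclass
class Ticket:
    ticket_id: str
    criticality: str
    plan: str
    monthly_revenue: float
    hours_open: float
    recent_negative_experience: bool

def priority_key(t: Ticket):
    """Lower tuples sort first: production issues by plan tier, then non-production
    issues by monthly revenue, with ties broken by time open and negative experience."""
    is_production = t.criticality in PRODUCTION_ISSUES
    return (
        0 if is_production else 1,
        PLAN_ORDER.index(t.plan) if is_production else 0,
        -t.monthly_revenue if not is_production else 0,
        -t.hours_open,                                  # longer-open tickets first
        0 if t.recent_negative_experience else 1,       # negative experiences first
    )

def top_five(open_tickets):
    """Deterministic ranking of the 30 most recent open tickets."""
    return sorted(open_tickets, key=priority_key)[:5]
```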
While our analysis is limited to a few representative cases, the pattern is clear: deterministic execution paths offer advantages over probabilistic decision-making for enterprise-critical workflows. As AI agents move from impressive demos to business-critical deployment, architectural choices that prioritize consistency over flexibility may prove essential for building user trust.