Abstract
LLM tool calling has become a popular architectural approach as agents move from demos to full-scale deployment. However, practitioners report unpredictable failures in production. To understand these failure patterns, we analyzed a sample of GAIA benchmark questions using publicly available tool-calling AI agents (Manus AI, H2O AI) and compared them against PromptQL's code-first approach. Our evaluation reveals that tool calling often fails due to i) inconsistent problem-solving, ii) interpretation errors, and iii) logical mistakes in mathematical operations. While limited in scope, these examples illustrate why deterministic code execution may offer advantages over probabilistic tool selection for enterprise AI deployment.
Analysis
To understand why tool-calling architectures systematically fail, we analyzed three GAIA questions and one realistic business question that test AI agent capabilities: precise instruction following, logical verification, and computational simulation.
What we tested: GAIA questions provide an isolated environment for testing whether systems can execute specific tasks consistently. Our business question demonstrates how these patterns manifest in real enterprise workflows.
How we tested: For each question, we evaluated conventional tool-calling systems (Manus AI, H2O AI) against PromptQL under identical conditions. We simply asked the questions on their consumer-facing apps. This apples-to-apples comparison reveals architectural differences rather than implementation variations.
Why this matters: The results expose consistent failure patterns in tool-calling architectures while demonstrating how code-first approaches like PromptQL's deliver superior reliability. If systems struggle with these controlled reasoning tasks, how can they be expected to handle business-critical questions where ambiguous queries, conflicting data, and imprecise context reign supreme?
Limitations: While GAIA questions test core capabilities in isolation, real enterprise workflows involve conflicting data sources, ambiguous requirements, and messy contexts. We're collaborating with Prof. Aditya Parameswaran's lab at UC Berkeley to develop benchmarks that capture this full complexity. Until then, GAIA provides a solid foundation for understanding fundamental architectural differences.
Understanding Code-First Architecture
We define "code-first" architectures as systems that treat executable code as the primary problem-solving interface. In this approach, code is a first-class citizen—planning, reasoning, and delegation all happen through explicit programming constructs rather than probabilistic natural language decisions.
Under the hood, code-first systems may still call external tools, APIs, or even delegate to other AI models. The key difference is that these operations happen through deterministic code paths rather than LLM-driven tool selection. Where traditional systems ask an LLM to decide "which tool should I use and how?", code-first systems embed those decisions in explicit control flow.
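The contrast is easiest to see in pseudocode. The sketch below is illustrative only: the llm, tools, and sandbox objects are hypothetical stand-ins, not any vendor's actual API.

```python
# Illustrative sketch only; `llm`, `tools`, and `sandbox` are hypothetical stand-ins.

def tool_calling_agent(question, llm, tools):
    """Tool-calling style: the LLM picks the next step at runtime."""
    decision = llm.complete(f"Which tool should I use for: {question}, and with what arguments?")
    tool = tools[decision.tool_name]      # probabilistic selection
    return tool(**decision.arguments)     # may differ from run to run

def code_first_agent(question, llm, sandbox):
    """Code-first style: the control flow is fixed; the LLM produces an inspectable plan and program."""
    plan = llm.complete(f"Write a query plan for: {question}")
    program = llm.complete(f"Write Python that executes this plan:\n{plan}")
    return sandbox.run(program)           # deterministic, traceable execution path
```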
Recent research identifies this as "code-enhanced reasoning," where code serves as "an abstract, modular, and logic-driven structure that supports reasoning". Programming languages inherently enforce systematic, step-by-step thinking that complex problems require—but which LLMs struggle to maintain consistently through natural language alone.
Q1: "Pick That Ping-Pong" Riddle
A complex probabilistic puzzle describing a game where ping-pong balls are ejected from a three-position platform, with specific rules for ball movement and ejection patterns.
GAIA Question: ec09fa32-d03f-4bf8-84b0-1f16922c3ae4
Experimental Setup: We tested this question on Manus AI (configured to use "agent mode") and PromptQL.
|   | Manus AI | PromptQL |
| --- | --- | --- |
| Method | Natural language reasoning to select tools | Code as primary problem-solving interface |
| Approach | Attempted qualitative analysis of game mechanics | Immediate computational simulation |
| Output | Ball 100 | Ball 3 |
| Result | ❌ Incorrect | ✅ Correct |
| Evidence | Execution thread | Execution thread |
Why Tool-Calling Failed
Manus AI has access to the right tools for this kind of question: its agent can write code, execute it, and run simulations. The failure wasn't due to a lack of capabilities; it occurred at the orchestration level, where the LLM decided to "reason" qualitatively rather than computationally.
This is the core issue with tool-calling architectures: tool selection is probabilistic, which leads to inconsistent problem-solving approaches. Sourcing and designing the right tools is a challenge, but even when the right tools are available, the system cannot reliably delegate to them. There is no guarantee that similar problems will be approached the same way twice.
Why Code-First Approach Succeeded
In the execution we observed, PromptQL solved this problem through executable code rather than qualitative reasoning. This code-first approach delivered two key advantages:
- Modular problem decomposition: PromptQL wrote discrete functions (advance_balls(), calculate_probabilities(), update_platform()), creating a systematic approach to the probabilistic analysis.
- Verifiable execution paths: every step of the ball ejection logic can be traced in the actual code, e.g. if piston_fires == 1: eject_ball(position_1). Unlike natural language reasoning, each logical step is explicit and checkable.
If any solution step fails in the code-first approach, it's easier to fix that building block than to retrace the logic trail of a natural language analysis.
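To make that structure concrete, here is a minimal skeleton of what such a simulation can look like. This is not PromptQL's actual code, and the rule marked as a placeholder stands in for the specific ejection rules in the GAIA question; the point is the shape of the solution: small named functions composed by explicit, traceable control flow.

```python
import random

# Structural sketch only: the real ejection rules live in the GAIA question, so the
# rule marked "placeholder" below stands in for them.

def advance_balls(platform, queue, size=3):
    """Refill empty platform positions from the front of the waiting queue."""
    while len(platform) < size and queue:
        platform.append(queue.pop(0))
    return platform, queue

def update_platform(platform, fired_index):
    """Remove the ejected ball; the question's ball-movement rules would go here."""
    platform.pop(fired_index)
    return platform

def calculate_probabilities(num_balls=100, trials=10_000):
    """Monte Carlo estimate of how often each ball is the first one ejected."""
    counts = {}
    for _ in range(trials):
        platform, queue = advance_balls([], list(range(1, num_balls + 1)))
        fired_index = random.randrange(len(platform))   # placeholder firing rule
        ejected = platform[fired_index]
        counts[ejected] = counts.get(ejected, 0) + 1
        update_platform(platform, fired_index)
    return {ball: n / trials for ball, n in counts.items()}

if __name__ == "__main__":
    print(calculate_probabilities())
```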
Q2: Sentence Extraction Puzzle
Extract a sentence from a 5x7 block of text by reading from left to right and using all letters in order.
GAIA Question: 50ad0280-0819-4bd9-b275-5de32d3b5bcb
Experimental Setup: We tested this question on Manus AI (configured to use "agent mode") and PromptQL.
|   | Manus AI | PromptQL |
| --- | --- | --- |
| Method | Natural language interpretation of instructions | Literal string manipulation following exact instructions |
| Process | Treated each line as separate words with spaces | Concatenated all characters, then identified sentence structure |
| Output | "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR" | "THE SEAGULL GLIDED PEACEFULLY TO MY CHAIR" |
| Result | ❌ Incorrect | ✅ Correct |
| Key Point | Added implicit word boundaries not in original text | Followed "use all letters in order" literally |
| Evidence | Thread | Thread |
Why Tool-Calling Failed
This failure reveals a prominent failure surface of tool-calling architectures: opaque reasoning. Manus AI produced an incorrect result, but we couldn't see the reasoning pathway until we specifically asked for clarification.
On closer inspection, we found that Manus AI made a critical interpretation error: it treated each line as a separate word. That flawed assumption was buried within the tool-calling execution. Based on it, Manus inserted spaces that weren't in the original text and ultimately provided a wrong solution.
Why PromptQL Succeeded
In contrast, PromptQL created a query plan (in natural language) and then wrote a Python program to execute that plan. Both of these artifacts are visible, explicit, and verifiable. If the query plan was flawed, PromptQL provides a mechanism for users to edit that plan. And if the Python program fails to compile or has bugs, PromptQL iterates and tries to fix them in real time. This approach yields several benefits: real-time verification, reproducible reasoning, and systematic debugging.
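Below is a minimal sketch of the literal, code-first reading of the instructions. The grid is a reconstruction consistent with the known answer (the authoritative block is in the GAIA question), and the candidate word list is supplied only to verify the segmentation.

```python
# Hypothetical reconstruction of the 5x7 block, read left to right, top to bottom.
grid = [
    "THESE",
    "AGULL",
    "GLIDE",
    "DPEAC",
    "EFULL",
    "YTOMY",
    "CHAIR",
]

# "Use all letters in order": concatenate every character before looking for words.
letters = "".join(grid)

# Only then recover word boundaries, here checked against a candidate segmentation.
words = ["THE", "SEAGULL", "GLIDED", "PEACEFULLY", "TO", "MY", "CHAIR"]
sentence, rest = [], letters
for w in words:
    assert rest.startswith(w), f"unexpected segmentation at {rest[:10]}"
    sentence.append(w)
    rest = rest[len(w):]

print(" ".join(sentence))  # THE SEAGULL GLIDED PEACEFULLY TO MY CHAIR
```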
Q3: Mathematical Operation Analysis
Given a table of mathematical operations, identify which elements are involved in counter-examples that prove the operation is not commutative.
GAIA Question: 6f37996b-2ac7-44b0-8e68-6d28256631b4
Experimental Setup: We tested this question on both PromptQL and H2O AI (configured with default LLM settings).
|   | H2O AI | PromptQL |
| --- | --- | --- |
| Method | Manual table inspection with natural language reasoning | Verification via code |
| Approach | Step-by-step manual checking of pairs | Automated iteration through all combinations |
| Output | "b,c,e" | "b,e" |
| Result | ❌ Incorrect | ✅ Correct |
| Evidence | Thread | Thread |
Why Tool-Calling Failed
H2O AI correctly identified the counter-example: b * e ≠ e * b (according to the table, b * e = c but e * b = b). This proves the pair (b, e) does not commute. However, instead of concluding that "b, e" were the non-commuting elements, it included "c" in its answer.
This is a logical error. The question asks which elements are involved in counter-examples, meaning which elements were tested for commutativity. The result of the operation, "c", has nothing to do with which elements were being compared.
Why Code-First Approach Worked
PromptQL systematically checked every pair (x,y) against (y,x) and added only the compared elements to the counter-example set. The code implemented the question's logic without ambiguity about what "involved" means.
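The check itself is only a few lines of code. The table below is reconstructed for illustration and matches the counter-example quoted above (the authoritative table is in the GAIA question); the key point is that only the operands of a failing pair are collected, never the result of the operation.

```python
from itertools import combinations

# Operation table reconstructed for illustration; it agrees with the quoted facts
# (b * e = c but e * b = b). The authoritative table is in the GAIA question.
elements = ["a", "b", "c", "d", "e"]
table = {
    "a": {"a": "a", "b": "b", "c": "c", "d": "b", "e": "d"},
    "b": {"a": "b", "b": "c", "c": "a", "d": "e", "e": "c"},
    "c": {"a": "c", "b": "a", "c": "b", "d": "b", "e": "a"},
    "d": {"a": "b", "b": "e", "c": "b", "d": "e", "e": "d"},
    "e": {"a": "d", "b": "b", "c": "a", "d": "d", "e": "c"},
}

# Collect only the elements that appear in a failing pair (x, y): the operands,
# never the result of the operation.
involved = set()
for x, y in combinations(elements, 2):
    if table[x][y] != table[y][x]:
        involved.update({x, y})

print(",".join(sorted(involved)))  # b,e
```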
Real-World Business Complexity
The failure patterns we've observed in these GAIA questions become even more critical when applied to real business operations. Consider this typical customer success workflow that many companies run daily:
Give me a sorted list of the top 5 support tickets I should prioritize among the 30 most recent open tickets.
For each ticket, attach:
- project_id (extract from ticket description)
- project plan
- issue criticality
- monthly average revenue for the project
- recent ticket_ids for that project (last 6 months)
Issue criticality (descending priority):
- Production downtime
- Instability in production
- Performance degradation in production
- Bug
- Feature request
- How-to
Prioritization rules: Production issues first, then advanced plan > base plan > free plan. Non-production issues ordered by monthly revenue. Break ties by: time since opened, then recent negative experience.
This query might look straightforward. But answering it requires:
- 6 different data requests (tickets, projects, plans, revenue, comments, history)
- Semantic analysis of ticket descriptions and comments
- Multi-criteria ranking with nested business rules
- Tie-breaking logic across multiple dimensions
- Consistent output format across all results
Here's the core challenge: How do you answer this question with confidence? If you ask this question every week, how many times will you get the right answer?
With tool-calling approaches, each execution involves a chain of probabilistic decisions:
- Will tie-breaking logic be applied uniformly?
- Will intermediate results be preserved correctly between steps?
- Will the LLM retrieve data in the same order?
The total surface area of failure is enormous. Semantic misclassification could break the chain. Context corruption could cloud the reasoning. Prioritization rules could be applied inconsistently. One probabilistic decision goes wrong anywhere in the chain, and your business analyst receives the wrong priority list.
This illustrates the fundamental challenge with tool-calling architectures: they promise sophisticated business intelligence but struggle with consistency when reliability matters most.
Code-first approaches offer an alternative paradigm. Instead of relying on probabilistic tool selection, they enforce determinism through explicit programming contracts. Query plans become editable and verified. Business logic gets encoded in testable and auditable functions. Intermediate results can be cached and reused.
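As a sketch of what "business logic as code" can look like, the snippet below encodes the prioritization rules from the query above as a deterministic sort key. The Ticket fields and the exact tie-breaking interpretation are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical sketch: the prioritization rules expressed as a testable sort key.
CRITICALITY_ORDER = [
    "Production downtime",
    "Instability in production",
    "Performance degradation in production",
    "Bug",
    "Feature request",
    "How-to",
]
PLAN_ORDER = ["advanced", "base", "free"]
PRODUCTION_ISSUES = set(CRITICALITY_ORDER[:3])

@dataclass
class Ticket:
    ticket_id: str
    criticality: str
    plan: str
    monthly_revenue: float
    hours_open: float
    recent_negative_experience: bool

def priority_key(t: Ticket):
    """Lower tuples sort first: production issues by plan tier, then non-production
    issues by monthly revenue, with ties broken by time open and negative experience."""
    is_production = t.criticality in PRODUCTION_ISSUES
    return (
        0 if is_production else 1,
        PLAN_ORDER.index(t.plan) if is_production else 0,
        -t.monthly_revenue if not is_production else 0,
        -t.hours_open,                                  # longer-open tickets first
        0 if t.recent_negative_experience else 1,       # negative experiences first
    )

def top_five(open_tickets):
    """Deterministic ranking of the 30 most recent open tickets."""
    return sorted(open_tickets, key=priority_key)[:5]
```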
While our analysis is limited to a few representative cases, the pattern is clear: deterministic execution paths offer advantages over probabilistic decision-making for enterprise-critical workflows. As AI agents move from impressive demos to business-critical deployment, architectural choices that prioritize consistency over flexibility may prove essential for building user trust.