I ran a small experiment comparing two AI agent designs for a practical task. No formal evaluations - just token counts, timestamps, and spot checks on output quality. Here’s what the numbers showed.
What I was testing
Picture a support engineer wrapping up a call. They need to log it: the ticket ID, the affected user, the system involved, what the issue was, the root cause, steps taken to resolve it, whether it was fully resolved, and any attachments.
It’s a structured data problem dressed up in casual conversation.
I built a C# .NET console app using the Azure AI Foundry C# SDK to test two fundamentally different ways of solving this with AI. Both agents hold the same 8-turn conversation with a support engineer. Both use the same Azure AI Foundry models. The only difference is how much of the work the LLM does.
The two approaches
Agent 1: Tool-based (short prompt + C# logic)
- Short system prompt (~150 words): "You are an incident logging assistant. When the support engineer gives you information, call the `JobNotes_UpdateFields` tool."
- Each turn, the LLM does one thing: understand what the engineer said → construct a tool call with the extracted field values.
- C# does the rest: patches the in-memory job record, then runs `GetNextFieldToCapture()` - a deterministic ordered checklist - to decide what to ask next.
LLM handles: natural language comprehension + structured extraction
C# handles: state tracking, field ordering, completion detection
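Concretely, each turn's extraction is just a small JSON payload that C# folds into the job record. A minimal sketch - the field names and payload here are invented for illustration, and the real tool schema lives in the agent definition:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical arguments payload for a JobNotes_UpdateFields call; in the
// real app this string comes from the SDK's tool-call object.
var toolCallArgs = "{\"TicketId\":\"INC-4821\",\"AffectedUser\":\"j.smith\",\"System\":\"Payroll API\"}";

// C# applies whatever fields the model extracted to the in-memory record.
var job = new Dictionary<string, string?>();
foreach (var prop in JsonDocument.Parse(toolCallArgs).RootElement.EnumerateObject())
    job[prop.Name] = prop.Value.GetString();

Console.WriteLine($"Captured {job.Count} fields");
```

The LLM never sees or maintains the record - it only produces the delta for the current turn.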
Engineer → LLM (extract fields) → Tool Call → C# Logic (next field?) → Response

Agent 2: No-tools conversational (large prompt, LLM does everything)
- Large system prompt (~500 words): defines 5 conversation stages, an AAA (Acknowledge–Assess–Action) response pattern, scratchpad instructions, extraction rules, and closing behaviour.
- Every turn, the LLM must extract data, maintain state, reason about what’s missing, select the next question, and format a multi-section response.
- Each response includes `<scratchpad>`, `<extracted_data>`, and `<response>` XML blocks.
LLM handles: everything
C# handles: nothing except display output
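A single no-tools turn therefore emits something like this (contents invented for illustration):

```xml
<scratchpad>Turn 3. Have: ticket, user, system. Still missing: root cause, resolution steps...</scratchpad>
<extracted_data>
  TicketId: INC-4821
  AffectedUser: j.smith
  System: Payroll API
</extracted_data>
<response>Thanks - noted the affected system. What was the root cause?</response>
```

Only the `<response>` block is meant for the engineer; the rest is overhead the model re-generates every turn.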
Engineer → LLM (extract + track state + decide next question + format) → Response

The test
I ran both agents against the same 8 conversation turns on two Azure AI Foundry models:
- GPT-5-nano (smaller, faster, cheaper)
- GPT-5-mini (more capable, more expensive)
Every token was tracked per turn and logged. I did spot checks on the output - both approaches captured the incident data correctly in the cases I reviewed. I haven’t run any formal evaluations comparing output quality head-to-head; this experiment is focused on token efficiency and speed.
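The totals are just sums over the logged turns - for example, the tool-based run on the smaller model (figures from the table in the next section):

```csharp
using System;
using System.Linq;

// Per-turn (input, output) token counts for the tool-based run on the
// smaller model, exactly as logged.
var turns = new (int Input, int Output)[]
{
    (1638, 1056), (1882, 843), (2105, 575), (2352, 507),
    (2638, 1004), (2888, 73), (3120, 899), (3420, 520),
};

int totalInput = turns.Sum(t => t.Input);
int totalOutput = turns.Sum(t => t.Output);
Console.WriteLine($"{totalInput} in + {totalOutput} out = {totalInput + totalOutput} total");
```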
The numbers
GPT-5-nano results
Tool-Based Agent - Total: 25,520 tokens | Time: 81.88s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,638 | 1,056 | 2,694 |
| 2 | 1,882 | 843 | 2,725 |
| 3 | 2,105 | 575 | 2,680 |
| 4 | 2,352 | 507 | 2,859 |
| 5 | 2,638 | 1,004 | 3,642 |
| 6 | 2,888 | 73 | 2,961 |
| 7 | 3,120 | 899 | 4,019 |
| 8 | 3,420 | 520 | 3,940 |
| Total | 20,043 | 5,477 | 25,520 |
No-Tools Conversational Agent - Total: 39,452 tokens | Time: 139.39s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,780 | 1,796 | 3,576 |
| 2 | 2,132 | 2,376 | 4,508 |
| 3 | 2,511 | 1,889 | 4,400 |
| 4 | 2,875 | 1,126 | 4,001 |
| 5 | 3,280 | 1,811 | 5,091 |
| 6 | 3,692 | 1,826 | 5,518 |
| 7 | 4,104 | 1,410 | 5,514 |
| 8 | 4,491 | 2,353 | 6,844 |
| Total | 24,865 | 14,587 | 39,452 |
With GPT-5-nano: Tool-Based used 35% fewer tokens and ran 41% faster.
GPT-5-mini results
Tool-Based Agent - Total: 31,518 tokens | Time: 99.12s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,638 | 363 | 2,001 |
| 2 | 4,058 | 620 | 4,678 |
| 3 | 2,386 | 865 | 3,251 |
| 4 | 2,604 | 1,069 | 3,673 |
| 5 | 4,441 | 847 | 5,288 |
| 6 | 3,284 | 722 | 4,006 |
| 7 | 3,499 | 714 | 4,213 |
| 8 | 3,742 | 666 | 4,408 |
| Total | 25,652 | 5,866 | 31,518 |
No-Tools Conversational Agent - Total: 34,151 tokens | Time: 102.27s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,780 | 1,111 | 2,891 |
| 2 | 2,120 | 642 | 2,762 |
| 3 | 2,505 | 782 | 3,287 |
| 4 | 2,930 | 1,041 | 3,971 |
| 5 | 3,427 | 1,175 | 4,602 |
| 6 | 4,056 | 1,305 | 5,361 |
| 7 | 4,565 | 889 | 5,454 |
| 8 | 5,080 | 743 | 5,823 |
| Total | 26,463 | 7,688 | 34,151 |
With GPT-5-mini: Tool-Based used 8% fewer tokens, at nearly identical speed.
Head-to-head summary
| Metric | Nano Tool-based | Nano No-tools | Mini Tool-based | Mini No-tools |
|---|---|---|---|---|
| Total tokens | 25,520 | 39,452 | 31,518 | 34,151 |
| Input tokens | 20,043 | 24,865 | 25,652 | 26,463 |
| Output tokens | 5,477 | 14,587 | 5,866 | 7,688 |
| Output ratio (output ÷ total) | 21.5% | 37.0% | 18.6% | 22.5% |
| Input growth T1→T8 | +109% | +152% | +128% | +185% |
| Response time | 81.88s | 139.39s | 99.12s | 102.27s |
| Structured data persisted | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
What the numbers actually mean
Output ratio: how hard is the LLM working?
The output ratio - what percentage of each response is generated text - is the clearest signal here.
- Nano No-Tools: 37% output ratio. Every turn, the model writes a `<scratchpad>` (internal reasoning, ~200–400 tokens), a full `<extracted_data>` block (repeated every turn, even for unchanged fields), and a `<response>`. That's a lot of tokens the user never sees.
- Nano Tool-Based: 21.5% output ratio. Most turns are a compact JSON tool call. Turn 6 produced just 73 output tokens - a brief acknowledgement, because C# already knew what field came next.
- Mini narrows the gap considerably. GPT-5-mini’s no-tools output ratio is 22.5% - it writes tighter, more purposeful scratchpads. A more capable model is more “token-disciplined.”
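Those ratios fall straight out of the summary figures:

```csharp
using System;

// (output tokens, total tokens) for each run, from the head-to-head table.
var runs = new (string Name, int Output, int Total)[]
{
    ("Nano tool-based", 5477, 25520),
    ("Nano no-tools", 14587, 39452),
    ("Mini tool-based", 5866, 31518),
    ("Mini no-tools", 7688, 34151),
};

foreach (var r in runs)
    Console.WriteLine($"{r.Name}: {100.0 * r.Output / r.Total:F1}% of tokens were output");
```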
Small models feel this more
With GPT-5-nano, the no-tools approach consumed ~14,000 extra tokens over the full run - 54% overhead, paid on every single conversation.
Why so much? The no-tools agent has to re-render its entire understanding of the job record in prose, every turn. With a smaller model that isn’t as good at compression, those blocks get verbose.
With the tool-based approach, the “state” lives in the C# JobDetailsRepository. The LLM doesn’t need to remember anything between turns - it just extracts what’s in the current message.
Context grows faster without tools
In a multi-turn conversation, input tokens grow because you feed the full history each turn. But how fast they grow matters:
- Nano No-Tools input grew 152% from Turn 1 to Turn 8 (1,780 → 4,491 tokens)
- Nano Tool-Based input grew only 109% (1,638 → 3,420 tokens)
No-tools history grows faster because previous responses are long (scratchpad + extracted_data). Tool-based responses are compact JSON. Compact history = slower context growth = lower input cost in later turns.
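The growth percentages are simple turn-1-to-turn-8 deltas:

```csharp
using System;

// Turn-1 and turn-8 input token counts, per the smaller model's tables above.
static double Growth(int first, int last) => 100.0 * (last - first) / first;

Console.WriteLine($"No-tools input growth:   +{Growth(1780, 4491):F0}%");
Console.WriteLine($"Tool-based input growth: +{Growth(1638, 3420):F0}%");
```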
Speed
With GPT-5-nano: 82s vs 139s - 70% more time for the no-tools agent. This isn't just token count; it's inference time for generating those extra ~9,000 output tokens.
With Mini, times are nearly identical (99s vs 102s). Mini generates output faster, so the higher volume barely registers.
The gap the token count doesn’t capture
Only one of these approaches actually saved structured data.
The tool-based agent wrote every captured field to a JobDetailsRepository - queryable, auditable, ready to sync to a backend. The no-tools agent produced a well-formatted conversational summary that exists only in terminal output.
For a production system, that’s the difference between an AI assistant and an AI filing clerk.
Worth noting: the no-tools agent did ask richer follow-up questions - probing for error codes, service dependencies, escalation history. That conversational depth is real. But it came at significant token cost, and the data still wasn’t structured.
Why the gap varies so much by model
With GPT-5-nano the no-tools approach costs 54% more tokens (equivalently, tool-based uses 35% fewer). With Mini it's only 8%. Why?
Smaller models produce more verbose internal reasoning. GPT-5-nano's scratchpads are wordy - the model seems to need more "space" to work through things. GPT-5-mini's scratchpads are terser. The same prompt structure produces tighter output at higher capability levels.
The practical implication: If you’re building on a small, fast, cheap model to control costs, the no-tools approach carries a real penalty. If you’re already using a capable model that writes tight reasoning, the gap shrinks - but the structural advantages of tool-based design (persistence, deterministic ordering, auditability) remain regardless.
The code that replaces hundreds of prompt tokens
The logic that drives the entire tool-based conversation is just this:
```csharp
public static string? GetNextFieldToCapture(this JobDetails job)
{
    foreach (var field in JobFieldsChecklist.OrderedFields)
    {
        if (!job.IsFieldCaptured(field))
            return field;
    }
    return null; // All fields captured
}
```

This replaces hundreds of tokens of prompt instructions about "conversation stages" and "what to ask next." It's deterministic, free to run, and impossible to hallucinate.
The LLM’s job shrinks to: “What did the support engineer just tell me? Fill in those fields.”
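One full turn of that loop, with the model call stubbed out, looks roughly like this - `ExtractFieldsViaToolCall` is a stand-in for the real Azure AI Foundry chat completion plus tool-call parsing, and the field names are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The ordered checklist that C# owns (field names based on the post's scenario).
string[] orderedFields =
    { "TicketId", "AffectedUser", "System", "Issue", "RootCause",
      "ResolutionSteps", "Resolved", "Attachments" };

var job = new Dictionary<string, string>();

// Stub for the real model call: the LLM's only job is turning the
// engineer's message into field values.
Dictionary<string, string> ExtractFieldsViaToolCall(string message) =>
    new() { ["TicketId"] = "INC-4821", ["AffectedUser"] = "j.smith" };

// One turn: apply the extracted delta, then let code pick the next question.
foreach (var (field, value) in ExtractFieldsViaToolCall("Ticket INC-4821 for j.smith"))
    job[field] = value;

string? next = orderedFields.FirstOrDefault(f => !job.ContainsKey(f));
string reply = next is null
    ? "All fields captured - logging the incident."
    : $"Got it. What about the {next}?";
Console.WriteLine(reply);
```

Everything after the stub is deterministic: same extracted fields in, same next question out.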
The real trade-offs
This isn’t a universal win for tool-based approaches:
| Aspect | Tool-based | No-tools |
|---|---|---|
| Token cost | ✅ Lower | ❌ Higher |
| Speed | ✅ Faster (small models) | ❌ Slower |
| Data persistence | ✅ Structured DB record | ❌ Conversational only |
| Conversational richness | ⚠️ Follows fixed field order | ✅ Can probe nuanced details |
| Field coverage | ✅ Guaranteed (checklist) | ⚠️ Depends on LLM judgment |
| Implementation complexity | ⚠️ Requires tool + C# logic | ✅ Just a prompt |
| Flexibility to change flow | ⚠️ Requires code changes | ✅ Just edit the prompt |
Key takeaways
- Offload state to code, not tokens. A C# dictionary is a perfect memory store for structured data capture. Don't make the LLM carry that weight in its context window.
- Deterministic logic doesn't need to be learned. "What field should I ask about next?" is a solved problem with a 10-line method. Using LLM reasoning for this wastes tokens and introduces failure modes (hallucinated field names, skipped fields).
- Model size amplifies architectural decisions. The efficiency gap is 4.5× larger with Nano than Mini. If you're deploying on small models, architecture choice has serious cost impact.
- The scratchpad is a symptom, not a solution. The no-tools approach needs internal reasoning tokens because the LLM has no other place to "think." Give it tools and structured state, and those tokens are freed.
- "Doing everything in the prompt" trades upfront simplicity for ongoing cost. The no-tools prompt took minutes to write. The tool-based approach took hours. In production with thousands of conversations, the difference compounds.
Final thought
The best AI systems I’ve seen don’t try to make the LLM smart. They make the LLM focused - narrow, purposeful prompts for the things only language models can do, and deterministic code for everything else.
When the agent does less, the system does more.
And, as always, don't forget: keep your keyboard ready for action and your mind open to learning.
Happy Coding! 🎉
Built with C# .NET 8, Spectre.Console, and the Azure AI Foundry C# SDK.