I ran a small experiment comparing two AI agent designs for a practical task. No formal evaluations - just token counts, timestamps, and spot checks on output quality. Here’s what the numbers showed.
What I was testing
Picture a support engineer wrapping up a call. They need to log it: the ticket ID, the affected user, the system involved, what the issue was, the root cause, steps taken to resolve it, whether it was fully resolved, and any attachments.
It’s a structured data problem dressed up in casual conversation.
I built a C# .NET console app using the Azure AI Foundry C# SDK to test two fundamentally different ways of solving this with AI. Both agents hold the same 8-turn conversation with a support engineer. Both use the same Azure AI Foundry models. The only difference is how much of the work the LLM does.
The two approaches
Agent 1: Tool-based (short prompt + C# logic)
- Short system prompt (~150 words): "You are an incident logging assistant. When the support engineer gives you information, call the `JobNotes_UpdateFields` tool."
- Each turn, the LLM does one thing: understand what the engineer said → construct a tool call with the extracted field values.
- C# does the rest: patches the in-memory job record, then runs `GetNextFieldToCapture()` - a deterministic ordered checklist - to decide what to ask next.
LLM handles: natural language comprehension + structured extraction
C# handles: state tracking, field ordering, completion detection
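Concretely, each turn's extraction is just a small JSON payload that C# folds into the job record. A minimal sketch - the field names and payload here are invented for illustration, and the real tool schema lives in the agent definition:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical arguments payload for a JobNotes_UpdateFields call; in the
// real app this string comes from the SDK's tool-call object.
var toolCallArgs = "{\"TicketId\":\"INC-4821\",\"AffectedUser\":\"j.smith\",\"System\":\"Payroll API\"}";

// C# applies whatever fields the model extracted to the in-memory record.
var job = new Dictionary<string, string?>();
foreach (var prop in JsonDocument.Parse(toolCallArgs).RootElement.EnumerateObject())
    job[prop.Name] = prop.Value.GetString();

Console.WriteLine($"Captured {job.Count} fields");
```

The LLM never sees or maintains the record - it only produces the delta for the current turn.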
Engineer → LLM (extract fields) → Tool Call → C# Logic (next field?) → Response

Agent 2: No-tools conversational (large prompt, LLM does everything)
- Large system prompt (~500 words): defines 5 conversation stages, an AAA (Acknowledge–Assess–Action) response pattern, scratchpad instructions, extraction rules, and closing behaviour.
- Every turn, the LLM must extract data, maintain state, reason about what’s missing, select the next question, and format a multi-section response.
- Each response includes `<scratchpad>`, `<extracted_data>`, and `<response>` XML blocks.
LLM handles: everything
C# handles: nothing except display output
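A single no-tools turn therefore emits something like this (contents invented for illustration):

```xml
<scratchpad>Turn 3. Have: ticket, user, system. Still missing: root cause, resolution steps...</scratchpad>
<extracted_data>
  TicketId: INC-4821
  AffectedUser: j.smith
  System: Payroll API
</extracted_data>
<response>Thanks - noted the affected system. What was the root cause?</response>
```

Only the `<response>` block is meant for the engineer; the rest is overhead the model re-generates every turn.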
Engineer → LLM (extract + track state + decide next question + format) → Response

The test
I ran both agents against the same 8 conversation turns on two Azure AI Foundry models:
- GPT-5-nano (smaller, faster, cheaper)
- GPT-5-mini (more capable, more expensive)
Every token was tracked per turn and logged. I did spot checks on the output - both approaches captured the incident data correctly in the cases I reviewed. I haven’t run any formal evaluations comparing output quality head-to-head; this experiment is focused on token efficiency and speed.
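The totals are just sums over the logged turns - for example, the tool-based run on the smaller model (figures from the table in the next section):

```csharp
using System;
using System.Linq;

// Per-turn (input, output) token counts for the tool-based run on the
// smaller model, exactly as logged.
var turns = new (int Input, int Output)[]
{
    (1638, 1056), (1882, 843), (2105, 575), (2352, 507),
    (2638, 1004), (2888, 73), (3120, 899), (3420, 520),
};

int totalInput = turns.Sum(t => t.Input);
int totalOutput = turns.Sum(t => t.Output);
Console.WriteLine($"{totalInput} in + {totalOutput} out = {totalInput + totalOutput} total");
```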
The numbers
GPT-5-nano results
Tool-Based Agent - Total: 25,520 tokens | Time: 81.88s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,638 | 1,056 | 2,694 |
| 2 | 1,882 | 843 | 2,725 |
| 3 | 2,105 | 575 | 2,680 |
| 4 | 2,352 | 507 | 2,859 |
| 5 | 2,638 | 1,004 | 3,642 |
| 6 | 2,888 | 73 | 2,961 |
| 7 | 3,120 | 899 | 4,019 |
| 8 | 3,420 | 520 | 3,940 |
| Total | 20,043 | 5,477 | 25,520 |
No-Tools Conversational Agent - Total: 39,452 tokens | Time: 139.39s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,780 | 1,796 | 3,576 |
| 2 | 2,132 | 2,376 | 4,508 |
| 3 | 2,511 | 1,889 | 4,400 |
| 4 | 2,875 | 1,126 | 4,001 |
| 5 | 3,280 | 1,811 | 5,091 |
| 6 | 3,692 | 1,826 | 5,518 |
| 7 | 4,104 | 1,410 | 5,514 |
| 8 | 4,491 | 2,353 | 6,844 |
| Total | 24,865 | 14,587 | 39,452 |
With GPT-5-nano: Tool-Based used 35% fewer tokens and ran 41% faster.
GPT-5-mini results
Tool-Based Agent - Total: 31,518 tokens | Time: 99.12s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,638 | 363 | 2,001 |
| 2 | 4,058 | 620 | 4,678 |
| 3 | 2,386 | 865 | 3,251 |
| 4 | 2,604 | 1,069 | 3,673 |
| 5 | 4,441 | 847 | 5,288 |
| 6 | 3,284 | 722 | 4,006 |
| 7 | 3,499 | 714 | 4,213 |
| 8 | 3,742 | 666 | 4,408 |
| Total | 25,652 | 5,866 | 31,518 |
No-Tools Conversational Agent - Total: 34,151 tokens | Time: 102.27s
| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,780 | 1,111 | 2,891 |
| 2 | 2,120 | 642 | 2,762 |
| 3 | 2,505 | 782 | 3,287 |
| 4 | 2,930 | 1,041 | 3,971 |
| 5 | 3,427 | 1,175 | 4,602 |
| 6 | 4,056 | 1,305 | 5,361 |
| 7 | 4,565 | 889 | 5,454 |
| 8 | 5,080 | 743 | 5,823 |
| Total | 26,463 | 7,688 | 34,151 |
With GPT-5-mini: Tool-Based used 8% fewer tokens, at nearly identical speed.
Head-to-head summary
| Metric | Nano Tool-based | Nano No-tools | Mini Tool-based | Mini No-tools |
|---|---|---|---|---|
| Total tokens | 25,520 | 39,452 | 31,518 | 34,151 |
| Input tokens | 20,043 | 24,865 | 25,652 | 26,463 |
| Output tokens | 5,477 | 14,587 | 5,866 | 7,688 |
| Output ratio (output ÷ total) | 21.5% | 37.0% | 18.6% | 22.5% |
| Input growth T1→T8 | +109% | +152% | +128% | +185% |
| Response time | 81.88s | 139.39s | 99.12s | 102.27s |
| Structured data persisted | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
What the numbers actually mean
Output ratio: how hard is the LLM working?
The output ratio - what percentage of each response is generated text - is the clearest signal here.
- Nano No-Tools: 37% output ratio. Every turn, the model writes a `<scratchpad>` (internal reasoning, ~200–400 tokens), a full `<extracted_data>` block (repeated every turn, even for unchanged fields), and a `<response>`. That's a lot of tokens the user never sees.
- Nano Tool-Based: 21.5% output ratio. Most turns are a compact JSON tool call. Turn 6 produced just 73 output tokens - a brief acknowledgement, because C# already knew what field came next.
- Mini narrows the gap considerably. GPT-5-mini’s no-tools output ratio is 22.5% - it writes tighter, more purposeful scratchpads. A more capable model is more “token-disciplined.”
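Those ratios fall straight out of the summary figures:

```csharp
using System;

// (output tokens, total tokens) for each run, from the head-to-head table.
var runs = new (string Name, int Output, int Total)[]
{
    ("Nano tool-based", 5477, 25520),
    ("Nano no-tools", 14587, 39452),
    ("Mini tool-based", 5866, 31518),
    ("Mini no-tools", 7688, 34151),
};

foreach (var r in runs)
    Console.WriteLine($"{r.Name}: {100.0 * r.Output / r.Total:F1}% of tokens were output");
```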
Small models feel this more
With GPT-5-nano, the no-tools approach consumed ~14,000 extra tokens over the full run - 54% overhead, paid on every single conversation.
Why so much? The no-tools agent has to re-render its entire understanding of the job record in prose, every turn. With a smaller model that isn’t as good at compression, those blocks get verbose.
With the tool-based approach, the “state” lives in the C# JobDetailsRepository. The LLM doesn’t need to remember anything between turns - it just extracts what’s in the current message.
Context grows faster without tools
In a multi-turn conversation, input tokens grow because you feed the full history each turn. But how fast they grow matters:
- Nano No-Tools input grew 152% from Turn 1 to Turn 8 (1,780 → 4,491 tokens)
- Nano Tool-Based input grew only 109% (1,638 → 3,420 tokens)
No-tools history grows faster because previous responses are long (scratchpad + extracted_data). Tool-based responses are compact JSON. Compact history = slower context growth = lower input cost in later turns.
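The growth percentages are simple turn-1-to-turn-8 deltas:

```csharp
using System;

// Turn-1 and turn-8 input token counts, per the smaller model's tables above.
static double Growth(int first, int last) => 100.0 * (last - first) / first;

Console.WriteLine($"No-tools input growth:   +{Growth(1780, 4491):F0}%");
Console.WriteLine($"Tool-based input growth: +{Growth(1638, 3420):F0}%");
```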
Speed
With GPT-5-nano: 82s vs 139s - 70% more time for the no-tools agent. This isn't just token count; it's inference time for generating those extra ~9,000 output tokens.
With Mini, times are nearly identical (99s vs 102s). Mini generates output faster, so the higher volume barely registers.
The gap the token count doesn’t capture
Only one of these approaches actually saved structured data.
The tool-based agent wrote every captured field to a JobDetailsRepository - queryable, auditable, ready to sync to a backend. The no-tools agent produced a well-formatted conversational summary that exists only in terminal output.
For a production system, that’s the difference between an AI assistant and an AI filing clerk.
Worth noting: the no-tools agent did ask richer follow-up questions - probing for error codes, service dependencies, escalation history. That conversational depth is real. But it came at significant token cost, and the data still wasn’t structured.
Why the gap varies so much by model
With GPT-5-nano the no-tools approach costs 54% more tokens (equivalently, tool-based uses 35% fewer). With Mini it's only 8%. Why?
Smaller models produce more verbose internal reasoning. GPT-5-nano's scratchpads are wordy - the model seems to need more "space" to work through things. GPT-5-mini's scratchpads are terser. The same prompt structure produces tighter output at higher capability levels.
The practical implication: If you’re building on a small, fast, cheap model to control costs, the no-tools approach carries a real penalty. If you’re already using a capable model that writes tight reasoning, the gap shrinks - but the structural advantages of tool-based design (persistence, deterministic ordering, auditability) remain regardless.
The code that replaces hundreds of prompt tokens
The logic that drives the entire tool-based conversation is just this:
```csharp
public static string? GetNextFieldToCapture(this JobDetails job)
{
    foreach (var field in JobFieldsChecklist.OrderedFields)
    {
        if (!job.IsFieldCaptured(field))
            return field;
    }
    return null; // All fields captured
}
```

This replaces hundreds of tokens of prompt instructions about "conversation stages" and "what to ask next." It's deterministic, free to run, and impossible to hallucinate.
The LLM’s job shrinks to: “What did the support engineer just tell me? Fill in those fields.”
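One full turn of that loop, with the model call stubbed out, looks roughly like this - `ExtractFieldsViaToolCall` is a stand-in for the real Azure AI Foundry chat completion plus tool-call parsing, and the field names are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The ordered checklist that C# owns (field names based on the post's scenario).
string[] orderedFields =
    { "TicketId", "AffectedUser", "System", "Issue", "RootCause",
      "ResolutionSteps", "Resolved", "Attachments" };

var job = new Dictionary<string, string>();

// Stub for the real model call: the LLM's only job is turning the
// engineer's message into field values.
Dictionary<string, string> ExtractFieldsViaToolCall(string message) =>
    new() { ["TicketId"] = "INC-4821", ["AffectedUser"] = "j.smith" };

// One turn: apply the extracted delta, then let code pick the next question.
foreach (var (field, value) in ExtractFieldsViaToolCall("Ticket INC-4821 for j.smith"))
    job[field] = value;

string? next = orderedFields.FirstOrDefault(f => !job.ContainsKey(f));
string reply = next is null
    ? "All fields captured - logging the incident."
    : $"Got it. What about the {next}?";
Console.WriteLine(reply);
```

Everything after the stub is deterministic: same extracted fields in, same next question out.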
The real trade-offs
This isn’t a universal win for tool-based approaches:
| Aspect | Tool-based | No-tools |
|---|---|---|
| Token cost | ✅ Lower | ❌ Higher |
| Speed | ✅ Faster (small models) | ❌ Slower |
| Data persistence | ✅ Structured DB record | ❌ Conversational only |
| Conversational richness | ⚠️ Follows fixed field order | ✅ Can probe nuanced details |
| Field coverage | ✅ Guaranteed (checklist) | ⚠️ Depends on LLM judgment |
| Implementation complexity | ⚠️ Requires tool + C# logic | ✅ Just a prompt |
| Flexibility to change flow | ⚠️ Requires code changes | ✅ Just edit the prompt |
Key takeaways
- Offload state to code, not tokens. A C# dictionary is a perfect memory store for structured data capture. Don't make the LLM carry that weight in its context window.
- Deterministic logic doesn't need to be learned. "What field should I ask about next?" is a solved problem with a 10-line method. Using LLM reasoning for this wastes tokens and introduces failure modes (hallucinated field names, skipped fields).
- Model size amplifies architectural decisions. The efficiency gap is 4.5× larger with Nano than Mini. If you're deploying on small models, architecture choice has serious cost impact.
- The scratchpad is a symptom, not a solution. The no-tools approach needs internal reasoning tokens because the LLM has no other place to "think." Give it tools and structured state, and those tokens are freed.
- "Doing everything in the prompt" trades upfront simplicity for ongoing cost. The no-tools prompt took minutes to write. The tool-based approach took hours. In production with thousands of conversations, the difference compounds.
Final thought
The best AI systems I’ve seen don’t try to make the LLM smart. They make the LLM focused - narrow, purposeful prompts for the things only language models can do, and deterministic code for everything else.
When the agent does less, the system does more.
And, as always, don't forget: keep your keyboard ready for action and your mind open to learning.
Happy Coding! 🎉
Built with C# .NET 8, Spectre.Console, and the Azure AI Foundry C# SDK.