I ran a small experiment comparing two AI agent designs for a practical task. No formal evaluations: just token counts, timestamps, and spot checks on output quality. Here’s what the numbers showed.


📖 Key terms (plain English)

Not everyone lives and breathes AI jargon - here’s a quick cheat sheet before we dive in:

| Term | What it means in plain words |
|---|---|
| Token | A small chunk of text - roughly ¾ of a word on average. “Cat” = 1 token. “Unbelievable” = 3 tokens. AI models charge by tokens (like paying per puzzle piece), so fewer tokens = faster and cheaper. |
| LLM (Large Language Model) | The AI brain. A program trained on huge amounts of text that can read, write, and understand language. GPT, Claude, and Gemini are all LLMs. |
| System Prompt | The instructions you give the AI before the conversation starts - like briefing a new employee: “You work at a help desk. Always ask for the ticket ID first.” |
| Input Tokens | The tokens you send to the AI - your message plus the full conversation history so far. The longer the chat, the more input tokens each turn costs. |
| Output Tokens | The tokens the AI sends back - its reply. Generating output is usually slower and pricier than reading input. |
| Context Window | The AI’s short-term memory. Every turn, it re-reads the entire conversation history. The bigger that history grows, the more tokens you pay per turn. |
| Tool Call | When the AI asks a piece of code to do something on its behalf - like saving data to a database. The AI fills in the details; the code does the actual work. |
| Scratchpad | A hidden section in the AI’s response where it “thinks out loud” before giving its final answer. Helpful for accuracy, but those thinking-tokens still cost money. |

What I was testing

Picture a support engineer wrapping up a call. They need to log it: the ticket ID, the affected user, the system involved, what the issue was, the root cause, steps taken to resolve it, whether it was fully resolved, and any attachments.

It’s a structured data problem dressed up in casual conversation.

I built a C# .NET console app using the Azure AI Foundry C# SDK to test two fundamentally different ways of solving this with AI. Both agents hold the same 8-turn conversation with a support engineer. Both use the same Azure AI Foundry models. The only difference is how much thinking the LLM does.


The two approaches

Agent 1: Tool-based (short prompt + C# logic)

  • Short system prompt (~150 words): “You are an incident logging assistant. When the support engineer gives you information, call the JobNotes_UpdateFields tool.”
  • Each turn, the LLM does one thing: understand what the engineer said → construct a tool call with the extracted field values.
  • C# does the rest: patches the in-memory job record, then runs GetNextFieldToCapture() - a deterministic ordered checklist - to decide what to ask next.

LLM handles: natural language comprehension + structured extraction
C# handles: state tracking, field ordering, completion detection

Engineer → LLM (extract fields) → Tool Call → C# Logic (next field?) → Response
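
To make that concrete, here’s a minimal sketch of the tool-call side. The property names are illustrative - they mirror the incident fields listed earlier, not the app’s actual JobNotes_UpdateFields schema - and the parsing helper is my own stand-in rather than the project’s code:

using System.Text.Json;

// Illustrative sketch - property names mirror the incident fields described
// above, not the project's actual tool schema.
public sealed record JobNotesUpdateFieldsArgs(
    string? TicketId,
    string? AffectedUser,
    string? SystemInvolved,
    string? IssueDescription,
    string? RootCause,
    string? ResolutionSteps,
    bool? FullyResolved,
    string? Attachments);

public static class ToolCallParsing
{
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNameCaseInsensitive = true };

    // The LLM's only job per turn: emit this JSON. Fields it didn't hear
    // about stay null, so C# can patch just what changed.
    public static JobNotesUpdateFieldsArgs? Parse(string toolArgumentsJson) =>
        JsonSerializer.Deserialize<JobNotesUpdateFieldsArgs>(toolArgumentsJson, Options);
}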

Agent 2: No-tools conversational (large prompt, LLM does everything)

  • Large system prompt (~500 words): defines 5 conversation stages, an AAA (Acknowledge–Assess–Action) response pattern, scratchpad instructions, extraction rules, and closing behaviour.
  • Every turn, the LLM must extract data, maintain state, reason about what’s missing, select the next question, and format a multi-section response.
  • Each response includes <scratchpad>, <extracted_data>, and <response> XML blocks.

LLM handles: everything
C# handles: nothing except display output

Engineer → LLM (extract + track state + decide next question + format) → Response
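
On this side there is almost no C# to show. If you want to hide the scratchpad from the engineer, the only post-processing the console app needs is pulling out the visible block before display - a minimal sketch (the tag names come from the prompt above; the regex helper is my illustration, not the project’s code):

using System.Text.RegularExpressions;

// The no-tools reply arrives as <scratchpad>, <extracted_data>, and <response>
// blocks; only the <response> section is shown to the engineer.
public static class NoToolsDisplay
{
    public static string ExtractVisibleResponse(string modelReply)
    {
        var match = Regex.Match(modelReply, "<response>(.*?)</response>", RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value.Trim() : modelReply;
    }
}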

The test

I ran both agents against the same 8 conversation turns on two Azure AI Foundry models:

  • GPT-Nano-5 (smaller, faster, cheaper)
  • GPT-5-mini (more capable, more expensive)

Every token was tracked per turn and logged. I did spot checks on the output - both approaches captured the incident data correctly in the cases I reviewed. I haven’t run any formal evaluations comparing output quality head-to-head; this experiment is focused on token efficiency and speed.
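
The per-turn bookkeeping behind the tables below doesn’t need anything fancy. A sketch of one way to track it - the record and property names are illustrative, not the app’s actual logging code; the input/output counts come from each model response’s usage data:

using System.Collections.Generic;
using System.Linq;

// Illustrative per-turn usage log: totals and output ratio are simple sums
// over the recorded turns.
public sealed record TurnUsage(int Turn, int InputTokens, int OutputTokens)
{
    public int TotalTokens => InputTokens + OutputTokens;
}

public sealed class UsageLog
{
    private readonly List<TurnUsage> _turns = new();

    public void Record(int turn, int inputTokens, int outputTokens) =>
        _turns.Add(new TurnUsage(turn, inputTokens, outputTokens));

    public int TotalInput => _turns.Sum(t => t.InputTokens);
    public int TotalOutput => _turns.Sum(t => t.OutputTokens);
    public double OutputRatio => (double)TotalOutput / (TotalInput + TotalOutput);
}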


The numbers

GPT-Nano-5 results

Tool-Based Agent - Total: 25,520 tokens | Time: 81.88s

| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,638 | 1,056 | 2,694 |
| 2 | 1,882 | 843 | 2,725 |
| 3 | 2,105 | 575 | 2,680 |
| 4 | 2,352 | 507 | 2,859 |
| 5 | 2,638 | 1,004 | 3,642 |
| 6 | 2,888 | 73 | 2,961 |
| 7 | 3,120 | 899 | 4,019 |
| 8 | 3,420 | 520 | 3,940 |
| Total | 20,043 | 5,477 | 25,520 |

No-Tools Conversational Agent - Total: 39,452 tokens | Time: 139.39s

| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,780 | 1,796 | 3,576 |
| 2 | 2,132 | 2,376 | 4,508 |
| 3 | 2,511 | 1,889 | 4,400 |
| 4 | 2,875 | 1,126 | 4,001 |
| 5 | 3,280 | 1,811 | 5,091 |
| 6 | 3,692 | 1,826 | 5,518 |
| 7 | 4,104 | 1,410 | 5,514 |
| 8 | 4,491 | 2,353 | 6,844 |
| Total | 24,865 | 14,587 | 39,452 |

With GPT-Nano-5: the tool-based agent consumed 35% fewer tokens (25,520 vs. 39,452) and completed the same 8-turn session 41% faster (81.9s vs. 139.4s) - a material efficiency gap at this model tier.


GPT-5-mini results

Tool-Based Agent - Total: 31,518 tokens | Time: 99.12s

| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,638 | 363 | 2,001 |
| 2 | 4,058 | 620 | 4,678 |
| 3 | 2,386 | 865 | 3,251 |
| 4 | 2,604 | 1,069 | 3,673 |
| 5 | 4,441 | 847 | 5,288 |
| 6 | 3,284 | 722 | 4,006 |
| 7 | 3,499 | 714 | 4,213 |
| 8 | 3,742 | 666 | 4,408 |
| Total | 25,652 | 5,866 | 31,518 |

No-Tools Conversational Agent - Total: 34,151 tokens | Time: 102.27s

| Turn | Input | Output | Total |
|---|---|---|---|
| 1 | 1,780 | 1,111 | 2,891 |
| 2 | 2,120 | 642 | 2,762 |
| 3 | 2,505 | 782 | 3,287 |
| 4 | 2,930 | 1,041 | 3,971 |
| 5 | 3,427 | 1,175 | 4,602 |
| 6 | 4,056 | 1,305 | 5,361 |
| 7 | 4,565 | 889 | 5,454 |
| 8 | 5,080 | 743 | 5,823 |
| Total | 26,463 | 7,688 | 34,151 |

With GPT-5-mini: the token gap compresses to 8% (31,518 vs. 34,151), with execution times within 3% of each other (99.1s vs. 102.3s). The structural advantage persists - but the magnitude shrinks on a more capable model.


Head-to-head summary

| Metric | Nano Tool-based | Nano No-tools | Mini Tool-based | Mini No-tools |
|---|---|---|---|---|
| Total tokens | 25,520 | 39,452 | 31,518 | 34,151 |
| Input tokens | 20,043 | 24,865 | 25,652 | 26,463 |
| Output tokens | 5,477 | 14,587 | 5,866 | 7,688 |
| Output ratio | 21.5% | 37.0% | 18.6% | 22.5% |
| Input growth T1→T8 | +109% | +152% | +128% | +185% |
| Response time | 81.88s | 139.39s | 99.12s | 102.27s |
| Data written to DB | ✅ Yes | ❌ No | ✅ Yes | ❌ No |

What the numbers actually mean

Output ratio: how hard is the LLM working?

The output ratio - output tokens as a share of total tokens (for the Nano tool-based agent, 5,477 of 25,520, or 21.5%) - is the clearest signal here.

  • Nano No-Tools: 37% output ratio. Each turn includes a <scratchpad> for internal reasoning (~200-400 tokens), a full <extracted_data> block repeated in its entirety even for unchanged fields, and a <response>. Roughly a third of total spend is inference the end user never sees.
  • Nano Tool-Based: 21.5% output ratio. Output is predominantly compact JSON tool calls. Turn 6 produced just 73 output tokens - a brief acknowledgement - because field-ordering logic was handled deterministically in C#, not re-inferred by the model each turn.
  • GPT-5-mini closes the gap. Its no-tools output ratio drops to 22.5% - tighter scratchpads, more purposeful reasoning chains. Higher model capability directly reduces token waste on internal scaffolding.

Small models feel this more

With GPT-Nano-5, the no-tools approach consumed approximately 13,900 additional tokens across the 8-turn session - a 54% overhead on total token spend, recurring on every conversation. At scale, that overhead compounds directly into cost.

Why so much? The no-tools agent has to re-render its entire understanding of the job record in prose, every turn. With a smaller model that isn’t as good at compression, those blocks get verbose.

With the tool-based approach, the “state” lives in the C# JobDetailsRepository. The LLM doesn’t need to remember anything between turns - it just extracts what’s in the current message.
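
A minimal sketch of what that means in practice - a dictionary keyed by field name is enough. The class below is illustrative rather than the project’s actual repository:

using System.Collections.Generic;

// Illustrative stand-in for the JobDetailsRepository: state lives in ordinary
// C# memory, so the LLM never has to restate it inside the conversation.
public sealed class JobDetailsRepository
{
    private readonly Dictionary<string, string> _fields = new();

    public void SaveField(string name, string value) => _fields[name] = value;

    public bool IsFieldCaptured(string name) => _fields.ContainsKey(name);

    public IReadOnlyDictionary<string, string> CapturedFields => _fields;
}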

Context grows faster without tools

In a multi-turn conversation, input tokens grow because you feed the full history each turn. But how fast they grow matters:

  • Nano No-Tools: input grew 152% from Turn 1 to Turn 8 (1,780 to 4,491 tokens), driven by increasingly verbose prior responses being re-fed into context each turn
  • Nano Tool-Based: input grew 109% over the same span (1,638 to 3,420 tokens) - a measurably shallower growth curve, because compact tool call responses contribute far less to accumulated context

No-tools history grows faster because previous responses are long (scratchpad + extracted_data). Tool-based responses are compact JSON. Compact history = slower context growth = lower input cost in later turns.

Speed

With Nano-5: 81.9s vs 139.4s - the no-tools agent took 70% longer to complete the same conversation. That gap is not just a token count artefact; it reflects the inference latency of generating approximately 9,100 additional output tokens on a smaller model. In latency-sensitive workflows, that cost is user-visible.

With Mini, times are nearly identical (99s vs 102s). Mini generates output faster, so the higher volume barely registers.

The gap the token count doesn’t capture

Only one of these approaches actually saved structured data.

The tool-based agent wrote every captured field to a JobDetailsRepository - queryable, auditable, ready to sync to a backend. The no-tools agent produced a well-formatted conversational summary that exists only in terminal output.

For a production system, that’s the difference between an AI assistant and an AI filing clerk.

Worth noting: the no-tools agent did ask richer follow-up questions - probing for error codes, service dependencies, escalation history. That conversational depth is real. But it came at significant token cost, and the data still wasn’t structured.


Why the gap varies so much by model

With Nano-5, the tool-based approach used 35% fewer tokens than no-tools. With Mini, that differential drops to 8%. The architecture is identical in both cases - the model tier changes the magnitude, not the direction.

Smaller models produce more verbose internal reasoning. GPT-Nano-5’s scratchpads are wordy - the model seems to need more “space” to work through things. GPT-5-mini’s scratchpads are terser. The same prompt structure produces tighter output at higher capability levels.

The practical implication: If you’re building on a small, fast, cheap model to control costs, the no-tools approach carries a real penalty. If you’re already using a capable model that writes tight reasoning, the gap shrinks - but the structural advantages of tool-based design (persistence, deterministic ordering, auditability) remain regardless.


The code that replaces hundreds of prompt tokens

The logic that drives the entire tool-based conversation is just this:

JobChecklistExtensions.cs
public static class JobChecklistExtensions
{
    // Walks the fixed, ordered checklist and returns the first field that
    // hasn't been captured yet - or null when the job record is complete.
    public static string? GetNextFieldToCapture(this JobDetails job)
    {
        foreach (var field in JobFieldsChecklist.OrderedFields)
        {
            if (!job.IsFieldCaptured(field))
                return field;
        }

        return null; // All fields captured
    }
}

This replaces hundreds of tokens of prompt instructions about “conversation stages” and “what to ask next.” It’s deterministic, free to run, and impossible to hallucinate.

The LLM’s job shrinks to: “What did the support engineer just tell me? Fill in those fields.”
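
In context, each turn of the tool-based loop ends with that method deciding the reply. Roughly like this - leaning on the JobDetails type from the snippet above; the question phrasing is placeholder text, not the app’s actual wording:

// Sketch: after C# has patched the job record with this turn's tool call,
// the checklist decides whether to ask another question or wrap up.
public static string NextReply(JobDetails job)
{
    var nextField = job.GetNextFieldToCapture();

    return nextField is null
        ? "That's everything I need - the incident has been logged."  // completion detection
        : $"Thanks. Could you give me the {nextField}?";              // deterministic next question
}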


The real trade-offs

This isn’t a universal win for tool-based approaches:

| Aspect | Tool-based | No-tools |
|---|---|---|
| Token cost | ✅ Lower | ❌ Higher |
| Speed | ✅ Faster (small models) | ❌ Slower |
| Data persistence | ✅ Structured DB record | ❌ Conversational only |
| Conversational richness | ⚠️ Follows fixed field order | ✅ Can probe nuanced details |
| Field coverage | ✅ Guaranteed (checklist) | ⚠️ Depends on LLM judgment |
| Implementation complexity | ⚠️ Requires tool + C# logic | ✅ Just a prompt |
| Flexibility to change flow | ⚠️ Requires code changes | ✅ Just edit the prompt |

Key takeaways

  1. Offload state to code, not tokens. A C# dictionary is a perfect memory store for structured data capture. Don’t make the LLM carry that weight in its context window.

  2. Deterministic logic doesn’t need to be learned. “What field should I ask about next?” is a solved problem with a 10-line method. Using LLM reasoning for this wastes tokens and introduces failure modes (hallucinated field names, skipped fields).

  3. Model size amplifies architectural decisions. The token efficiency gap between approaches is 4.4x larger on Nano than on Mini (35% vs. 8%). On small, cost-optimised models, architectural choices carry measurable cost consequences at scale.

  4. The scratchpad is a symptom, not a solution. The no-tools approach needs internal reasoning tokens because the LLM has no other place to “think.” Give it tools and structured state, and those tokens are freed.

  5. “Doing everything in the prompt” trades upfront simplicity for ongoing cost. The no-tools prompt took minutes to write. The tool-based approach took hours. In production with thousands of conversations, the difference compounds.


Final thought

The best AI systems I’ve seen don’t try to make the LLM smart. They make the LLM focused - narrow, purposeful prompts for the things only language models can do, and deterministic code for everything else.

When the agent does less, the system does more.

And as always, don’t forget: keep your keyboard ready for action and your mind open to learning.

Happy Coding! 🎉


Built with C# .NET 8, Spectre.Console, and the Azure AI Foundry C# SDK.