Most AI stories sound too polished.
You ask a question. The model gives an answer. You accept it.
That is not how this went.
This was messier, but better.
I was setting up production smoke testing for an API service that spanned multiple components: Entra authentication, usage credits, image upload, Azure Blob SAS, image recognition, reporting, audit logs, and deployment workflows.
The goal was not:
Can AI generate a test script?
That would have been too small.
The real goal was:
Can I build enough confidence after production deployment that the service is actually usable?
A smoke test is not just a script. It is a trust exercise.
Why this mattered
This exercise was not about proving that AI can write tests.
The benefit was in minimising uncertainty after a production deployment.
Before this, it was too easy to rely on shallow signals: the deployment completed, the container was running, the health endpoint responded, and the logs looked quiet. Those signals were useful, but they did not prove that the service was usable.
The smoke tests gave me a stronger signal. They checked whether the important production path actually worked across auth, credits, upload, image recognition, audit, and reporting.
The biggest win was not the script itself. It was reducing the gap between:
Deployment succeeded.
And:
Production is actually usable.
It was no longer:
I used AI to create a test script.
It became:
I used AI to help turn a risky, manual, easy-to-forget production verification process into something repeatable, evidence-based, and repo-owned.
That is the part worth sharing.
The first trap: “just write a smoke test”
At the start, my prompt was simple:
Write a script that calls the health endpoint and maybe one protected API.
That would have been easy to do, but it would also have been weak.
The service included multiple components: authentication, configuration, credit management, uploading, image recognition, auditing, reporting, and deployment. A health check would only confirm that the container was running, not that the system was functioning properly.
That distinction mattered.
Alive is not ready.
Ready is not useful.
Useful is not verified.
Where AI helped
AI was useful because it could hold many moving parts in its head at once. It could read the repo, inspect docs, find endpoint contracts, compare test expectations against production behaviour, and help turn a messy verification path into repeatable smoke tests.
But that was only part of the story. The other part was steering.
There were several points at which the initial answer was insufficient. One of the first things I asked was:
Why are we changing this? Is this actually a production bug?
That question shifted the work back to evidence.
At another point, I pushed back with:
Shouldn’t the ASP.NET framework already handle this?
That forced a check against how migrations and connection auth actually worked.
When the answer started drifting into assumptions, I asked:
Does the deployment framework actually wire this up?
That forced me to read the app host (Aspire) and the docs, instead of guessing.
And when the smoke test needed authentication, I asked the obvious question:
Why not just log in using Playwright in Chrome?
That forced the difference between delegated user tokens and app-only workflow identity into the open.
Those questions were not interruptions. They were the work.
The method: S.M.O.K.E.
If I had to name the workflow I ended up using, I would call it S.M.O.K.E. Not because acronyms magically improve engineering, but because a good acronym can make the shape of the work easier to remember.

S - Start With The Real Failure
I did not start from an imaginary test plan. I started from the actual production path.
The deployment had previously failed. The smoke test had identified confusing behaviour before. A small image fixture caused the image recognition path to fail in a way that appeared to be a production bug, but was actually invalid test input.
That was important. Bad smoke tests create false fear. A good smoke test should fail loudly when production is broken, but not because the test is unrealistic.
I updated the harness to generate a valid PNG at runtime. This detail is more important than it seems because it allows the smoke test to exercise the product path rather than a broken fixture.
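A minimal sketch of generating a valid PNG at runtime, using only the Python standard library. The function names, the 64x64 size, and the colour are assumptions for illustration; the post does not say what the real harness generates or what the recognition path requires.

```python
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_png(width: int = 64, height: int = 64, rgb=(200, 30, 30)) -> bytes:
    """Return the bytes of a solid-colour, 8-bit RGB PNG (sizes are assumptions)."""
    signature = b"\x89PNG\r\n\x1a\n"
    # IHDR: width, height, bit depth 8, colour type 2 (truecolour),
    # compression 0, filter 0, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline is a filter byte (0 = none) followed by RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (
        signature
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", idat)
        + _chunk(b"IEND", b"")
    )
```

Generating the bytes in the harness means there is no binary fixture in the repo to go stale or get truncated.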
M - Map The Whole Workflow
A useful smoke test follows the user journey. In this case, the “user” is not only a person clicking a UI. It is the full API workflow:
- Get authorised.
- Confirm protected endpoints reject anonymous access.
- Confirm configuration exists.
- Reserve an upload.
- Upload a real image.
- Confirm the upload.
- Run image recognition.
- Verify audit and reporting records exist.
- Check edge cases.
That is much stronger than “call endpoint X and expect 200.”
The smoke tests now prove that the key components work together. Not perfectly. Not exhaustively. But enough to catch the kind of production breakages that matter after deployment.
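One way to sketch that journey is an ordered list of steps that share a context, so later steps can use what earlier steps produced. Everything named here (`run_steps`, `StepResult`, the stand-in steps) is hypothetical and not taken from the actual harness. Skipping the remaining steps after the first failure is also an assumption, chosen because each step depends on the one before it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Step = Tuple[str, Callable[[Context], None]]


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str = ""


def run_steps(steps: List[Step]) -> List[StepResult]:
    """Run steps in order; after the first failure, mark the rest as skipped."""
    ctx: Context = {}
    results: List[StepResult] = []
    failed = False
    for name, step in steps:
        if failed:
            results.append(StepResult(name, False, "skipped: earlier step failed"))
            continue
        try:
            step(ctx)  # a step signals failure by raising
            results.append(StepResult(name, True))
        except Exception as exc:
            failed = True
            results.append(StepResult(name, False, f"{type(exc).__name__}: {exc}"))
    return results


# Hypothetical stand-ins; real steps would call the deployed API.
def reserve_upload(ctx: Context) -> None:
    ctx["upload_id"] = "placeholder-id"


def confirm_upload(ctx: Context) -> None:
    if "upload_id" not in ctx:
        raise RuntimeError("no upload was reserved")


STEPS: List[Step] = [
    ("reserve upload", reserve_upload),
    ("confirm upload", confirm_upload),
]
```

Calling `run_steps(STEPS)` returns one result per step, which keeps the report in the same order as the journey.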
O - Oppose The AI
This was probably the most useful part.
AI is fast. Fast can be dangerous when the answer sounds plausible. My job was not to accept the first confident answer. My job was to push back.
Some of the best progress came from asking:
- Why do we need that?
- Shouldn’t the ASP.NET framework already handle this?
- Is this a real bug?
- Why didn’t tests catch it?
- Does this break local development?
- Should this live in the server instead?
- Should this be automatic or manual?
Those questions changed the outcome. They prevented unnecessary changes, separated product bugs from environment issues, and kept the smoke tests from becoming a pile of clever calls with unclear value.
AI supplied the speed. I had to supply the direction. That was the partnership.
K - Keep Evidence Close
A lot of AI-assisted work goes wrong because the model floats away from the repo. So I kept pulling it back.
Check the docs. Check the Aspire app host. Check the endpoint tests. Check the GitHub Actions logs. Check Azure CLI. Check the production route. Check the actual response shape.
That mattered because small assumptions were wrong in ways that could have made the smoke tests flaky. For example, the production health route was not plain:
/health
It was:
/api/v1/health
That is a small detail. But smoke tests live or die on small details.
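A small sketch of keeping route details in one place, so a check cannot quietly fall back to a guessed path. The `ROUTES` table and `url_for` helper are hypothetical, and the table holds only the one route named here. Other routes would be added after checking them against the endpoint tests.

```python
# Routes verified against production; only the health route comes from the post.
ROUTES = {
    "health": "/api/v1/health",
}


def url_for(base_url: str, name: str) -> str:
    """Join the base URL and a known route; unknown names raise KeyError."""
    return base_url.rstrip("/") + ROUTES[name]
```

Failing on an unknown route name is deliberate: a typo should stop the run instead of producing a request to a path nobody verified.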
Another example was auth. The service had different authorisation shapes in different areas:
- image recognition expected delegated scope-style claims
- admin-style operations expected role-based authorisation
That means a deployment identity cannot magically run the whole smoke suite just because it can log into Azure. Azure resource permissions are not API permissions. That distinction is exactly the kind of thing production smoke tests should make visible.
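A diagnostic sketch of making that difference visible before the run starts. It relies on the Entra convention that delegated tokens carry scopes in the `scp` claim (a space-separated string) and app roles appear in the `roles` claim (an array). The helper names are hypothetical, the scope and role values in the usage note are placeholders, and the payload is decoded without verifying the signature, so this is for reporting only and never for authorisation.

```python
import base64
import json
from typing import List, Optional, Tuple


def decode_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (diagnostics only)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def scopes_and_roles(claims: dict) -> Tuple[List[str], List[str]]:
    """Return (delegated scopes, app roles) found in the claims."""
    scopes = str(claims.get("scp", "")).split()
    roles = list(claims.get("roles", []))
    return scopes, roles


def missing_permission(
    claims: dict,
    required_scope: Optional[str] = None,
    required_role: Optional[str] = None,
) -> str:
    """Return a human-readable reason the token would be rejected, or ''."""
    scopes, roles = scopes_and_roles(claims)
    if required_scope and required_scope not in scopes:
        return f"token lacks scope '{required_scope}' (scp={scopes})"
    if required_role and required_role not in roles:
        return f"token lacks role '{required_role}' (roles={roles})"
    return ""
```

For example, `missing_permission(decode_claims(token), required_scope="Images.Recognise")` would explain up front why an app-only token cannot exercise a scope-protected endpoint, instead of leaving a bare 403 to interpret later.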
E - Encode The Runbook
The end state was not a chat transcript. That would have been useless a week later.
The end state was repo-owned:
- a production smoke script
- a GitHub Actions workflow
- production smoke tests documentation
- deployment documentation updates
That is the important part. AI helped me explore, but the result became boring.
Boring is good here.
A future production deploy should not depend on me remembering a sequence from a long chat. It should be one clear command. That is the difference between “AI helped once” and “the system got better.”
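A sketch of what "one clear command" could look like as an entry point. The `--base-url` flag, the `main` signature, and the exit codes are assumptions, not the real script's interface. The one firm fact it leans on is that a non-zero exit code fails a GitHub Actions step.

```python
import argparse
from typing import Callable, Dict, List, Optional

Check = Callable[[str], bool]  # takes the base URL, returns True on success


def main(argv: Optional[List[str]] = None,
         checks: Optional[Dict[str, Check]] = None) -> int:
    """Run every check and return an exit code: 0 pass, 1 failures, 2 no checks."""
    parser = argparse.ArgumentParser(description="Production smoke tests (sketch)")
    parser.add_argument("--base-url", required=True,
                        help="root URL of the deployed service")
    args = parser.parse_args(argv)

    if not checks:
        print("no checks configured")
        return 2  # an empty run must not look like a pass

    failures: List[str] = []
    for name, check in checks.items():
        try:
            ok = check(args.base_url)
        except Exception as exc:
            ok = False
            print(f"ERROR {name}: {exc}")
        print(f"{'PASS' if ok else 'FAIL'} {name}")
        if not ok:
            failures.append(name)
    return 1 if failures else 0
```

A real script would end with `raise SystemExit(main(checks=...))`, so the workflow step goes red whenever a check fails.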
What AI did not do
AI did not know which risks mattered most. It did not know when a production change was acceptable. It did not know when to stop.
That judgment still had to come from me.
The advantage was not that AI took over judgment. It helped me apply judgment faster, across more paths, and turn the useful parts into lasting tools.
The real output
The output was not just a set of smoke tests. It was a stronger production signal.
More specifically:
- confidence that auth rejects anonymous calls
- confidence that credits can be created and debited
- confidence that upload and image recognition work together
- confidence that audit and reporting can see the activity
- confidence that edge cases behave predictably
- confidence that the runbook is now repeatable
The value was not that AI gave me fewer questions to ask. The value was that it helped me ask better questions, test more paths, and turn uncertainty into something I could run again.
Final takeaway
AI is most useful when I do not treat it like an oracle. I treat it like a fast pair programmer: useful for reading, drafting, connecting dots, and proposing paths, but still needing direction.
The questions still matter:
- Why?
- Where is the evidence?
- Does this match the repo?
- What breaks locally?
- What are we proving?
- What are we not proving yet?
That is the version of AI-assisted engineering I trust more.
Not blind trust. Faster proof.