Setting up alerting for a REST API used to take a lot of manual effort. Most of that time went into writing baseline queries, reading results, and making judgment calls about thresholds. I built a workflow that leverages AI to compress that into hours. Here’s how it worked for me, so you can try it yourself.
One caveat up front: AI will get things wrong. Resource names, thresholds, root-cause explanations. This workflow works because it builds in validation at every step. Skip the validation and you’ll ship broken alerts.
The prompt snippets below are intentionally generalized to avoid exposing internal service names, traffic patterns, or production baseline data, but they reflect the real workflow I used.
The workflow at a glance
- Generate baseline queries across the four Golden Signals
- Run and validate every query against your real infrastructure
- Feed results back and ask AI for threshold reasoning
- Treat every AI explanation as a hypothesis and verify it
- Keep tuning after every incident and service change
The rest of this post walks through each step with NRQL-style examples you can adapt for your platform.
Step 1: Generate baseline queries with AI
You need to understand what “normal” looks like before you can define what’s wrong. That means writing exploratory queries across the four Golden Signals: latency, errors, traffic, and saturation.
These aren’t queries you write every day. Each one usually means digging through platform docs or hunting down a teammate’s old query to adapt. This is where I got the most value from AI.
I did not start with one perfect prompt. I used multiple passes.
Pass 1: Get a first draft
```
I'm using New Relic NRQL for a Kubernetes-hosted REST API.

Generate baseline queries for:
- latency
- errors
- traffic
- saturation
```
Why this pass mattered
- It gave me a fast first draft across all four signals
- It showed me the model’s default assumptions
- It helped me spot what operational constraints were still missing
Pass 2: Add production constraints
```
I'm using New Relic NRQL for a Kubernetes-hosted REST API.

Requirements:
- Exclude health check traffic
- Exclude synthetic or probe traffic
- Use the last 14 days of data
- Use hourly granularity where useful
- Provide reasoning for why each query is written this way

Generate baseline queries for:
- latency
- errors
- traffic
- saturation
```
What improved after Pass 2
- The queries became closer to real customer traffic by removing noisy health checks and probes
- The time window became useful for baseline analysis
- The hourly breakdown made trend review easier
- The reasoning exposed assumptions I could validate instead of just accepting syntax
Why two weeks? Two weeks captures at least two full weekday/weekend cycles, which is usually enough to see your service’s natural traffic pattern. Your mileage may vary. Services with monthly billing cycles, seasonal spikes, or recent architecture changes may need a longer or shorter window. Pick a window that reflects your service’s real-world rhythm.
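One way to test whether two weeks really captures your rhythm is to overlay week over week and see if the pattern repeats. A minimal sketch using NRQL's COMPARE WITH clause, reusing the same filters as the queries below:

```
-- Overlay this week's traffic on last week's to check cycle stability
FROM Transaction
SELECT rate(count(*), 1 minute) as 'RPM'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 1 week ago COMPARE WITH 1 week ago
TIMESERIES 1 hour
```

If the two curves roughly overlap, the weekly cycle is stable and two weeks is a fair baseline. If they diverge badly, widen the window.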
Here are the queries I ended up with after the second pass, validated and adjusted against my actual environment:
Latency
```
-- Latency percentiles by endpoint
FROM Transaction
SELECT percentile(duration, 50, 90, 95, 99), count(*), average(duration)
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
FACET name
SINCE 2 weeks ago
```
Two useful suggestions from the AI output that I validated and kept: excluding health checks (Kubernetes probes flood your data with near-zero-latency hits that skew your percentiles) and using hourly resolution to stay within platform bucket limits.
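If you want to see how much noise the health check exclusion actually removes, invert the filter. A quick sketch, assuming your probes show up with 'HealthCheck' in the transaction name as above:

```
-- How much traffic are the probes generating?
FROM Transaction
SELECT count(*), average(duration)
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name LIKE '%HealthCheck%'
SINCE 2 weeks ago
```

A large count with near-zero average duration confirms the probes would have dragged down your percentiles.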
Errors
Run both an aggregate and a time-series variant. The aggregate gives you the overall rate. The time-series shows whether errors cluster around deploys or peak hours.
```
-- Error rate: aggregate
FROM Transaction
SELECT percentage(count(*), WHERE error IS TRUE) as 'Error %', count(*) as 'Total'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 2 weeks ago

-- Error rate: over time
FROM Transaction
SELECT percentage(count(*), WHERE error IS TRUE) as 'Error %'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 2 weeks ago
TIMESERIES 1 hour
```
Tip from experience: Error percentage is almost always more useful than raw count for API alerts. A fixed count threshold fires differently at 3am versus midday because traffic changes. Percentage normalizes automatically.
Traffic
```
-- Requests per minute over time
FROM Transaction
SELECT rate(count(*), 1 minute) as 'RPM'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 2 weeks ago
TIMESERIES 1 hour
```
Look for daily cycles in the results. If traffic has a strong periodic pattern, that matters for Step 3 when you pick your alert type.
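To make that cycle explicit instead of eyeballing the time series, facet by hour of day. A sketch using NRQL's hourOf() bucket function:

```
-- Average RPM by hour of day exposes the daily cycle
FROM Transaction
SELECT rate(count(*), 1 minute) as 'RPM'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
FACET hourOf(timestamp)
SINCE 2 weeks ago
```

A large peak-to-trough ratio here is your cue that anomaly detection will beat a static floor in Step 3.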
Saturation
For a first pass, I focused on CPU utilization and pod availability as the most actionable saturation signals.
```
-- CPU utilization as percentage of limit
FROM K8sContainerSample
SELECT average(cpuUsedCores / cpuLimitCores) * 100 as 'CPU %'
WHERE containerName LIKE '%your-container%'
SINCE 2 weeks ago
TIMESERIES 1 hour

-- Pod availability over time
FROM K8sDeploymentSample
SELECT latest(podsDesired), latest(podsAvailable)
WHERE deploymentName LIKE '%your-deployment%'
SINCE 2 weeks ago
TIMESERIES 1 hour
```
Note: the CPU query only works if your containers have CPU limits set. Containers without limits are silently excluded from the average.
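Before trusting those CPU numbers, it's worth checking which containers are being silently dropped. A sketch, assuming the standard K8sContainerSample attributes:

```
-- Containers with no CPU limit are excluded from the average above
FROM K8sContainerSample
SELECT uniques(containerName)
WHERE cpuLimitCores IS NULL
SINCE 1 day ago
```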
Step 2: Run and validate every query
This is the step most people skip, and it’s the one that matters most. AI-generated queries look right but often aren’t.
The most common failure: AI infers resource names from context and gets them wrong.
What AI got wrong
The first query draft assumed a resource naming pattern that did not match my deployed service. My app name and my Kubernetes deployment name were different strings. The query looked correct but returned nothing.
What I changed
I looked up the actual deployment and application names in the environment and updated the filter criteria to match the real resource identifiers. I only caught the mismatch by checking the manifest.
Lesson: AI was useful for scaffolding the query shape. It was not reliable for environment-specific names. This is exactly why Step 2 exists.
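The cheapest defense is to pull the real identifiers from the telemetry itself before accepting any generated filter. Two discovery queries worth running first (a sketch; the event types match the queries above):

```
-- What application names does APM actually report?
FROM Transaction SELECT uniques(appName) SINCE 1 day ago

-- What deployment names does the Kubernetes integration actually report?
FROM K8sDeploymentSample SELECT uniques(deploymentName) SINCE 1 day ago
```

Whatever these return is what goes in your WHERE clauses, not whatever the AI inferred.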
Before moving on, check each query against this list:
| Check | What to look for |
|---|---|
| Resource names | Cross-reference with your manifest files or deployment config |
| Data comes back | Run the query. No data = broken query. |
| Exclusions work | Health checks and internal traffic are filtered out |
| Bucket limits | Resolution × window length stays within platform limits |
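As a worked example of the bucket-limit row: 14 days at 1-hour resolution is 14 × 24 = 336 data points, which fits within New Relic's 366-bucket TIMESERIES limit. The same window at 5-minute resolution would need 4,032 buckets and fail.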
NOTE: If you’re pasting production data into AI, scrub sensitive information first: customer IDs, internal IPs, anything you wouldn’t put in a public channel.
Step 3: Feed results back and ask for thresholds
Once I had the baselines, I used AI again for threshold reasoning. I did not paste sensitive production numbers into this article, but the workflow looked like this:
Pass 3: Ask for threshold suggestions
```
Based on the baseline patterns from these queries, suggest warning and critical thresholds for:
- latency
- error rate
- traffic drops or spikes
- saturation

For each one:
- explain the reasoning
- suggest where anomaly detection is better than static thresholds
- highlight where evaluation windows matter
- avoid generic defaults if the pattern looks cyclical
```
How to share baseline data safely
Vague descriptions get you textbook thresholds. Actual patterns get you useful ones, but you do not need to paste raw production data. Relative values and ratios are enough:
```
My API's baseline patterns from the last 14 days:
- Latency: p95 is roughly 3x higher during peak hours than overnight minimums
- Errors: stays below 0.5% most of the time, brief spikes coincide with deployments
- Traffic: strong daily cycle, peak RPM is about 8x the overnight minimum
- CPU: averages around 40%, peaks near 75% during traffic highs

Suggest warning and critical thresholds using these relative patterns.
Do not ask me for raw numbers.
```
Ranges and ratios give the model enough signal to reason about thresholds without exposing internal identifiers, exact traffic volumes, or SLA-sensitive numbers.
I did not use those thresholds blindly. I treated them as a starting point, then adjusted them based on alert noise, business impact, and how the service actually behaves during normal peaks.
The pattern that worked for me:
- Warning at ~2x your observed baseline. Catches early degradation before users notice.
- Critical at ~4-5x. Something is meaningfully broken.
- Anomaly detection for cyclical signals. If your traffic has a strong daily pattern, a static floor will fire every night. Anomaly detection handles that automatically.
- Static thresholds for stable signals. Latency and error rate usually don’t have strong daily cycles, so static works.
Check your own baseline data to decide which applies. Don’t assume. Your mileage may vary based on your use cases.
One thing I caught while reviewing the AI’s suggestions: the evaluation window matters as much as the threshold itself. A p95 spike for 30 seconds is noise. The same spike sustained for 5 minutes is an incident. Ask the AI to recommend evaluation windows alongside thresholds.
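Putting thresholds and evaluation windows together, here is a sketch of what a static latency condition could look like. The numbers are hypothetical, derived from the ~2x/~4-5x pattern above; substitute your own baseline:

```
-- NRQL alert condition query (no SINCE or TIMESERIES; the alerting
-- engine supplies the aggregation and evaluation window)
FROM Transaction
SELECT percentile(duration, 95)
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'

-- Hypothetical settings, assuming a 200 ms baseline p95:
--   Warning:  above 400 ms for at least 5 minutes   (~2x baseline)
--   Critical: above 1000 ms for at least 5 minutes  (~5x baseline)
```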
Step 4: Treat every AI explanation as a hypothesis
This applies to threshold reasoning, root-cause analysis, and anything the AI tells you about why something is happening.
For incident interpretation, I used AI more like a hypothesis generator than a source of truth.
Pass 4: Ask AI to interpret a spike
```
I observed a sudden increase in error rate while availability remained stable.

Given this pattern:
- list the most likely explanations
- separate infrastructure-level causes from application-level causes
- explain what evidence would support each explanation
- do not assume causation from correlation alone
```
I learned the value of this approach during an incident investigation. Error rates had spiked sharply while pod availability still looked healthy. Everything looked fine from an infrastructure perspective, but users were getting errors. The AI gave me a plausible diagnosis: pods were passing shallow health checks but failing on real requests because they could not reach a downstream dependency. That turned out to be directionally correct, but the specific mechanism it described was wrong. I only caught it by validating the explanation against the actual implementation.
The response was useful as a hypothesis generator, but I still had to verify the explanation against the code and the runtime behavior. AI helped narrow the search space. It did not replace the investigation.
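In practice, narrowing the search space meant looking at how the errors actually broke down. A sketch of the kind of query that tests the downstream-dependency hypothesis, assuming New Relic's TransactionError events:

```
-- Which error classes dominate the spike?
FROM TransactionError
SELECT count(*)
WHERE appName = 'your-service-name'
FACET error.class, error.message
SINCE 3 hours ago
```

If one timeout or connection error class dominates, the dependency hypothesis gains support; a grab bag of unrelated errors points elsewhere.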
The rule: Cross-reference every AI explanation against your code, config, and event timeline before acting on it. Plausible explanations are the hardest to question, which makes them the most dangerous when they’re wrong.
Step 5: Keep tuning
Alerts are a feedback loop, not a one-time config.
- Alert fires and gets closed immediately? Review the threshold.
- Real incident went undetected? Review why the alert didn’t catch it.
- Service changed (new endpoints, new dependencies, traffic growth)? Thresholds need to change with it.
AI can help with the initial setup. The ongoing discipline of tuning is still on you.
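If your platform stores alert incidents as queryable events, the tuning loop can start from data instead of memory. A sketch using New Relic's NrAiIncident events (assuming that event type and these attributes are available in your account):

```
-- Which conditions fire most often? The noisiest are tuning candidates
FROM NrAiIncident
SELECT count(*)
WHERE event = 'open'
FACET conditionName
SINCE 1 month ago
```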
What to watch out for
A few failure modes I hit that you’ll want to avoid:
- AI guesses resource names. Always validate against your actual manifests and deployment config.
- Generic thresholds without data. If you skip the baseline step, you’ll get “industry standard” numbers that don’t fit your service. Baseline first, thresholds second.
- Confidently wrong explanations. The AI will give you a logical chain that sounds right but isn’t. Treat it as a starting point for investigation, not a conclusion.
- Sensitive data in prompts. Scrub PII and internal identifiers before pasting production data into AI tools.
Quick reference
The workflow:
- Generate baseline queries (four Golden Signals)
- Run and validate every query
- Feed real numbers back, ask for thresholds
- Verify every explanation against code and config
- Keep tuning after incidents and changes
Validation checklist:
| Check | How |
|---|---|
| Resource names match | Compare with manifests/deploy config |
| Query returns data | Run it. No data = broken. |
| Exclusions applied | Health checks and internal traffic filtered |
| Bucket limits respected | Resolution × window length within platform limits |
| Threshold is data-derived | Based on your baseline, not a blog post |
| Alert type fits the signal | Static for stable baselines, anomaly for periodic patterns |
| Evaluation window set | Long enough to filter noise, short enough to catch incidents |
Conclusion
AI made me faster. The workflow made it repeatable. The validation made it safe.