Setting up alerting for a REST API used to take a lot of manual effort. Most of that time went into writing baseline queries, reading results, and making judgment calls about thresholds. I built a workflow that leverages AI to compress that into hours. Here’s how it worked for me, so you can try it yourself.
One caveat up front: AI will get things wrong. Resource names, thresholds, root-cause explanations. This workflow works because it builds in validation at every step. Skip the validation and you’ll ship broken alerts.
The prompt snippets below are intentionally generalized to avoid exposing internal service names, traffic patterns, or production baseline data, but they reflect the real workflow I used.
The workflow at a glance
- Generate baseline queries across the four Golden Signals
- Run and validate every query against your real infrastructure
- Feed results back and ask AI for threshold reasoning
- Treat every AI explanation as a hypothesis and verify it
- Keep tuning after every incident and service change
The rest of this post walks through each step with NRQL-style examples you can adapt for your platform.
Step 1: Generate baseline queries with AI
You need to understand what “normal” looks like before you can define what’s wrong. That means writing exploratory queries across the four Golden Signals: latency, errors, traffic, and saturation.
These aren’t queries you write every day. Each one usually means digging through platform docs or hunting down a teammate’s old query to adapt. This is where I got the most value from AI.
I did not start with one perfect prompt. I used multiple passes.
Pass 1: Get a first draft
```
I'm using New Relic NRQL for a Kubernetes-hosted REST API.

Generate baseline queries for:
- latency
- errors
- traffic
- saturation
```
Why this pass mattered
- It gave me a fast first draft across all four signals
- It showed me the model’s default assumptions
- It helped me spot what operational constraints were still missing
Pass 2: Add production constraints
```
I'm using New Relic NRQL for a Kubernetes-hosted REST API.

Requirements:
- Exclude health check traffic
- Exclude synthetic or probe traffic
- Use the last 14 days of data
- Use hourly granularity where useful
- Provide reasoning for why each query is written this way

Generate baseline queries for:
- latency
- errors
- traffic
- saturation
```
What improved after Pass 2
- The queries became closer to real customer traffic by removing noisy health checks and probes
- The time window became useful for baseline analysis
- The hourly breakdown made trend review easier
- The reasoning exposed assumptions I could validate instead of just accepting syntax
Why two weeks? Two weeks captures at least two full weekday/weekend cycles, which is usually enough to see your service’s natural traffic pattern. Your mileage may vary. Services with monthly billing cycles, seasonal spikes, or recent architecture changes may need a longer or shorter window. Pick a window that reflects your service’s real-world rhythm.
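One way to test whether two weeks really captures your rhythm is to overlay week over week and see if the pattern repeats. A minimal sketch using NRQL's COMPARE WITH clause, reusing the same filters as the queries below:

```
-- Overlay this week's traffic on last week's to check cycle stability
FROM Transaction
SELECT rate(count(*), 1 minute) as 'RPM'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 1 week ago COMPARE WITH 1 week ago
TIMESERIES 1 hour
```

If the two curves roughly overlap, the weekly cycle is stable and two weeks is a fair baseline. If they diverge badly, widen the window.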
Here are the queries I ended up with after the second pass, validated and adjusted against my actual environment:
Latency
```
-- Latency percentiles by endpoint
FROM Transaction
SELECT percentile(duration, 50, 90, 95, 99), count(*), average(duration)
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
FACET name
SINCE 2 weeks ago
```
Two useful suggestions from the AI output that I validated and kept: excluding health checks (Kubernetes probes flood your data with near-zero-latency hits that skew your percentiles) and using hourly resolution to stay within platform bucket limits.
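If you want to see how much noise the health check exclusion actually removes, invert the filter. A quick sketch, assuming your probes show up with 'HealthCheck' in the transaction name as above:

```
-- How much traffic are the probes generating?
FROM Transaction
SELECT count(*), average(duration)
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name LIKE '%HealthCheck%'
SINCE 2 weeks ago
```

A large count with near-zero average duration confirms the probes would have dragged down your percentiles.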
Errors
Run both an aggregate and a time-series variant. The aggregate gives you the overall rate. The time-series shows whether errors cluster around deploys or peak hours.
```
-- Error rate: aggregate
FROM Transaction
SELECT percentage(count(*), WHERE error IS TRUE) as 'Error %', count(*) as 'Total'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 2 weeks ago

-- Error rate: over time
FROM Transaction
SELECT percentage(count(*), WHERE error IS TRUE) as 'Error %'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 2 weeks ago
TIMESERIES 1 hour
```
Tip from experience: Error percentage is almost always more useful than raw count for API alerts. A fixed count threshold fires differently at 3am versus midday because traffic changes. Percentage normalizes automatically.
Traffic
```
-- Requests per minute over time
FROM Transaction
SELECT rate(count(*), 1 minute) as 'RPM'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
SINCE 2 weeks ago
TIMESERIES 1 hour
```
Look for daily cycles in the results. If traffic has a strong periodic pattern, that matters for Step 3 when you pick your alert type.
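To make that cycle explicit instead of eyeballing the time series, facet by hour of day. A sketch using NRQL's hourOf() bucket function:

```
-- Average RPM by hour of day exposes the daily cycle
FROM Transaction
SELECT rate(count(*), 1 minute) as 'RPM'
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'
FACET hourOf(timestamp)
SINCE 2 weeks ago
```

A large peak-to-trough ratio here is your cue that anomaly detection will beat a static floor in Step 3.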
Saturation
For a first pass, I focused on CPU utilization and pod availability as the most actionable saturation signals.
```
-- CPU utilization as percentage of limit
FROM K8sContainerSample
SELECT average(cpuUsedCores / cpuLimitCores) * 100 as 'CPU %'
WHERE containerName LIKE '%your-container%'
SINCE 2 weeks ago
TIMESERIES 1 hour

-- Pod availability over time
FROM K8sDeploymentSample
SELECT latest(podsDesired), latest(podsAvailable)
WHERE deploymentName LIKE '%your-deployment%'
SINCE 2 weeks ago
TIMESERIES 1 hour
```
Note: the CPU query only works if your containers have CPU limits set. Containers without limits are silently excluded from the average.
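Before trusting those CPU numbers, it's worth checking which containers are being silently dropped. A sketch, assuming the standard K8sContainerSample attributes:

```
-- Containers with no CPU limit are excluded from the average above
FROM K8sContainerSample
SELECT uniques(containerName)
WHERE cpuLimitCores IS NULL
SINCE 1 day ago
```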
Step 2: Run and validate every query
This is the step most people skip, and it’s the one that matters most. AI-generated queries look right but often aren’t.
The most common failure: AI infers resource names from context and gets them wrong.
What AI got wrong
The first query draft assumed a resource naming pattern that did not match my deployed service. My app name and my Kubernetes deployment name were different strings. The query looked correct but returned nothing.
What I changed
I looked up the actual deployment and application names in the environment and updated the filter criteria to match the real resource identifiers. I only caught the mismatch by checking the manifest.
Lesson: AI was useful for scaffolding the query shape. It was not reliable for environment-specific names. This is exactly why Step 2 exists.
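The cheapest defense is to pull the real identifiers from the telemetry itself before accepting any generated filter. Two discovery queries worth running first (a sketch; the event types match the queries above):

```
-- What application names does APM actually report?
FROM Transaction SELECT uniques(appName) SINCE 1 day ago

-- What deployment names does the Kubernetes integration actually report?
FROM K8sDeploymentSample SELECT uniques(deploymentName) SINCE 1 day ago
```

Whatever these return is what goes in your WHERE clauses, not whatever the AI inferred.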
Before moving on, check each query against this list:
| Check | What to look for |
|---|---|
| Resource names | Cross-reference with your manifest files or deployment config |
| Data comes back | Run the query. No data = broken query. |
| Exclusions work | Health checks and internal traffic are filtered out |
| Bucket limits | Resolution × window length stays within platform limits |
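As a worked example of the bucket-limit row: 14 days at 1-hour resolution is 14 × 24 = 336 data points, which fits within New Relic's 366-bucket TIMESERIES limit. The same window at 5-minute resolution would need 4,032 buckets and fail.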
NOTE: If you’re pasting production data into AI, scrub sensitive information first: customer IDs, internal IPs, anything you wouldn’t put in a public channel.
Step 3: Feed results back and ask for thresholds
Once I had the baselines, I used AI again for threshold reasoning. I did not paste sensitive production numbers into this article, but the workflow looked like this:
Pass 3: Ask for threshold suggestions
```
Based on the baseline patterns from these queries, suggest warning and critical thresholds for:
- latency
- error rate
- traffic drops or spikes
- saturation

For each one:
- explain the reasoning
- suggest where anomaly detection is better than static thresholds
- highlight where evaluation windows matter
- avoid generic defaults if the pattern looks cyclical
```
How to share baseline data safely
Vague descriptions get you textbook thresholds. Actual patterns get you useful ones, but you do not need to paste raw production data. Relative values and ratios are enough:
```
My API's baseline patterns from the last 14 days:
- Latency: p95 is roughly 3x higher during peak hours than overnight minimums
- Errors: stays below 0.5% most of the time, brief spikes coincide with deployments
- Traffic: strong daily cycle, peak RPM is about 8x the overnight minimum
- CPU: averages around 40%, peaks near 75% during traffic highs

Suggest warning and critical thresholds using these relative patterns.
Do not ask me for raw numbers.
```
Ranges and ratios give the model enough signal to reason about thresholds without exposing internal identifiers, exact traffic volumes, or SLA-sensitive numbers.
I did not use those thresholds blindly. I treated them as a starting point, then adjusted them based on alert noise, business impact, and how the service actually behaves during normal peaks.
The pattern that worked for me:
- Warning at ~2x your observed baseline. Catches early degradation before users notice.
- Critical at ~4-5x. Something is meaningfully broken.
- Anomaly detection for cyclical signals. If your traffic has a strong daily pattern, a static floor will fire every night. Anomaly detection handles that automatically.
- Static thresholds for stable signals. Latency and error rate usually don’t have strong daily cycles, so static works.
Check your own baseline data to decide which applies. Don’t assume. Your mileage may vary based on your use cases.
One thing I caught while reviewing the AI’s suggestions: the evaluation window matters as much as the threshold itself. A p95 spike for 30 seconds is noise. The same spike sustained for 5 minutes is an incident. Ask the AI to recommend evaluation windows alongside thresholds.
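Putting thresholds and evaluation windows together, here is a sketch of what a static latency condition could look like. The numbers are hypothetical, derived from the ~2x/~4-5x pattern above; substitute your own baseline:

```
-- NRQL alert condition query (no SINCE or TIMESERIES; the alerting
-- engine supplies the aggregation and evaluation window)
FROM Transaction
SELECT percentile(duration, 95)
WHERE appName = 'your-service-name'
  AND transactionType = 'Web'
  AND name NOT LIKE '%HealthCheck%'

-- Hypothetical settings, assuming a 200 ms baseline p95:
--   Warning:  above 400 ms for at least 5 minutes   (~2x baseline)
--   Critical: above 1000 ms for at least 5 minutes  (~5x baseline)
```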
Step 4: Treat every AI explanation as a hypothesis
This applies to threshold reasoning, root-cause analysis, and anything the AI tells you about why something is happening.
For incident interpretation, I used AI more like a hypothesis generator than a source of truth.
Pass 4: Ask AI to interpret a spike
```
I observed a sudden increase in error rate while availability remained stable.

Given this pattern:
- list the most likely explanations
- separate infrastructure-level causes from application-level causes
- explain what evidence would support each explanation
- do not assume causation from correlation alone
```
I learned the value of this approach during an incident investigation. Error rates had spiked sharply while pod availability still looked healthy. Everything looked fine from an infrastructure perspective, but users were getting errors. The AI gave me a plausible diagnosis: pods were passing shallow health checks but failing on real requests because they could not reach a downstream dependency. That turned out to be directionally correct, but the specific mechanism it described was wrong. I only caught it by validating the explanation against the actual implementation.
The response was useful as a hypothesis generator, but I still had to verify the explanation against the code and the runtime behavior. AI helped narrow the search space. It did not replace the investigation.
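In practice, narrowing the search space meant looking at how the errors actually broke down. A sketch of the kind of query that tests the downstream-dependency hypothesis, assuming New Relic's TransactionError events:

```
-- Which error classes dominate the spike?
FROM TransactionError
SELECT count(*)
WHERE appName = 'your-service-name'
FACET error.class, error.message
SINCE 3 hours ago
```

If one timeout or connection error class dominates, the dependency hypothesis gains support; a grab bag of unrelated errors points elsewhere.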
The rule: Cross-reference every AI explanation against your code, config, and event timeline before acting on it. Plausible explanations are the hardest to question, which makes them the most dangerous when they’re wrong.
Step 5: Keep tuning
Alerts are a feedback loop, not a one-time config.
- Alert fires and gets closed immediately? Review the threshold.
- Real incident went undetected? Review why the alert didn’t catch it.
- Service changed (new endpoints, new dependencies, traffic growth)? Thresholds need to change with it.
AI can help with the initial setup. The ongoing discipline of tuning is still on you.
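If your platform stores alert incidents as queryable events, the tuning loop can start from data instead of memory. A sketch using New Relic's NrAiIncident events (assuming that event type and these attributes are available in your account):

```
-- Which conditions fire most often? The noisiest are tuning candidates
FROM NrAiIncident
SELECT count(*)
WHERE event = 'open'
FACET conditionName
SINCE 1 month ago
```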
What to watch out for
A few failure modes I hit that you’ll want to avoid:
- AI guesses resource names. Always validate against your actual manifests and deployment config.
- Generic thresholds without data. If you skip the baseline step, you’ll get “industry standard” numbers that don’t fit your service. Baseline first, thresholds second.
- Confidently wrong explanations. The AI will give you a logical chain that sounds right but isn’t. Treat it as a starting point for investigation, not a conclusion.
- Sensitive data in prompts. Scrub PII and internal identifiers before pasting production data into AI tools.
Quick reference
The workflow:
- Generate baseline queries (four Golden Signals)
- Run and validate every query
- Feed real numbers back, ask for thresholds
- Verify every explanation against code and config
- Keep tuning after incidents and changes
Validation checklist:
| Check | How |
|---|---|
| Resource names match | Compare with manifests/deploy config |
| Query returns data | Run it. No data = broken. |
| Exclusions applied | Health checks and internal traffic filtered |
| Bucket limits respected | Resolution × window length within platform limits |
| Threshold is data-derived | Based on your baseline, not a blog post |
| Alert type fits the signal | Static for stable baselines, anomaly for periodic patterns |
| Evaluation window set | Long enough to filter noise, short enough to catch incidents |
Conclusion
AI made me faster. The workflow made it repeatable. The validation made it safe.