SRE Agent

The SRE Agent is Atmosly's AI-powered reliability assistant for your clusters. It continuously watches your infrastructure and workloads, surfaces problems as they happen, explains the likely root cause, and recommends concrete fixes — so you can resolve issues without manually digging through kubectl output.

Open it from the SRE Agent tab on any cluster's detail page. When active issues are present, the tab shows a red badge with the issue count.

What the Agent Detects

The agent monitors both node-level (cluster) problems and workload-level (pod/service) problems.

Cluster / node issues

Node not ready
Disk pressure
Memory pressure
PID pressure
Network unavailable

Workload issues

CrashLoopBackOff
ImagePullBackOff / ErrImagePull
Failed pods or jobs
Container config errors
Pods stuck in Pending
Out-of-memory kills (OOMKilled)
Services with no endpoints or no selector
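
To make the categories above concrete, here is a minimal sketch of how several of them correspond to standard Kubernetes status fields. It is not Atmosly's detection logic: the function names and issue labels are illustrative assumptions, and the inputs are plain dictionaries shaped like the JSON that the Kubernetes API returns for Node and Pod objects. Failed jobs and service checks are left out for brevity.

```python
# Illustrative only: maps standard Kubernetes status fields to the issue
# names listed above. This is NOT the SRE Agent's actual detection code.

# Node condition types that indicate a problem when their status is "True".
PRESSURE_CONDITIONS = {
    "DiskPressure": "Disk pressure",
    "MemoryPressure": "Memory pressure",
    "PIDPressure": "PID pressure",
    "NetworkUnavailable": "Network unavailable",
}

# Container "waiting" reasons reported in a pod's containerStatuses.
WAITING_REASONS = {
    "CrashLoopBackOff": "CrashLoopBackOff",
    "ImagePullBackOff": "ImagePullBackOff / ErrImagePull",
    "ErrImagePull": "ImagePullBackOff / ErrImagePull",
    "CreateContainerConfigError": "Container config error",
}

OOM_LABEL = "Out-of-memory kill (OOMKilled)"


def node_issues(node: dict) -> list[str]:
    """Return issue names for one Node object (dict form of the API JSON)."""
    issues = []
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready" and cond["status"] != "True":
            issues.append("Node not ready")
        elif cond["type"] in PRESSURE_CONDITIONS and cond["status"] == "True":
            issues.append(PRESSURE_CONDITIONS[cond["type"]])
    return issues


def pod_issues(pod: dict) -> list[str]:
    """Return issue names for one Pod object (dict form of the API JSON)."""
    status = pod.get("status", {})
    issues = []
    # Assumption: a real check would also consider how long the pod has
    # been Pending before calling it "stuck".
    if status.get("phase") == "Pending":
        issues.append("Pod stuck in Pending")
    elif status.get("phase") == "Failed":
        issues.append("Failed pod")
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        label = WAITING_REASONS.get(waiting.get("reason"))
        if label and label not in issues:
            issues.append(label)
        # OOMKilled can appear on the current or the previous termination.
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled" and OOM_LABEL not in issues:
                issues.append(OOM_LABEL)
    return issues
```

You could feed these functions the items from `kubectl get nodes -o json` or `kubectl get pods -A -o json` to see roughly which categories apply to your own cluster.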

Each issue is assigned a severity — Critical, High, Warning/Medium, Info, or Low — shown as a color-coded badge.

The SRE Agent Tab

Summary Cards

Four cards at the top summarize cluster health and act as quick filters when clicked:

Card	Description
Active Issues	Total open issues, with a per-severity breakdown
Critical / High	Count of the most urgent issues
Resolved (24h)	Issues resolved in the last 24 hours
Top Recurring	The issue type that's been reported most often
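
As a rough illustration of what the four cards represent, the sketch below derives the same figures from a list of issue records. The `Issue` record, its field names, and the `summarize` function are hypothetical, and the choice to compute Top Recurring across both active and resolved issues is an assumption, not documented behavior.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Issue:
    # Hypothetical record shape; every field name here is an assumption.
    issue_type: str
    severity: str
    status: str                       # "active" or "resolved"
    report_count: int = 1             # how many times it has been reported
    resolved_at: Optional[datetime] = None


def summarize(issues: list[Issue], now: datetime) -> dict:
    """Compute the values shown on the four summary cards."""
    active = [i for i in issues if i.status == "active"]
    day_ago = now - timedelta(hours=24)

    # Total reports per issue type (assumption: counted over all issues).
    reports = Counter()
    for issue in issues:
        reports[issue.issue_type] += issue.report_count

    return {
        "active_issues": len(active),
        "by_severity": dict(Counter(i.severity for i in active)),
        "critical_high": sum(1 for i in active if i.severity in ("Critical", "High")),
        "resolved_24h": sum(
            1 for i in issues
            if i.status == "resolved"
            and i.resolved_at is not None
            and i.resolved_at >= day_ago
        ),
        "top_recurring": reports.most_common(1)[0][0] if reports else None,
    }
```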

Status Tabs

Switch between Active (current problems), Resolved (history), and All (both, for triage).

Issue List

Each row shows the severity, issue title, affected resource, namespace, when it was last seen, a recurrence trend, and how many times it's been reported. Use the search bar to find issues by title, description, resource, or namespace, and the column filters to narrow by issue type, namespace, or severity. Use the recheck control to fetch the latest status immediately after you've made a fix.
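
The search bar and column filters combine as successive narrowing steps. A minimal sketch of that behavior follows, assuming case-insensitive substring matching for search and exact matching for column filters. The `IssueRow` record and `filter_rows` function are hypothetical and do not describe how the product implements search.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IssueRow:
    # Hypothetical row shape; field names are assumptions.
    title: str
    description: str
    resource: str
    namespace: str
    issue_type: str
    severity: str


def filter_rows(
    rows: list[IssueRow],
    query: str = "",
    issue_type: Optional[str] = None,
    namespace: Optional[str] = None,
    severity: Optional[str] = None,
) -> list[IssueRow]:
    """Apply the free-text search, then each column filter that is set."""
    needle = query.strip().lower()
    result = []
    for row in rows:
        # Search covers title, description, resource, and namespace.
        searchable = (row.title, row.description, row.resource, row.namespace)
        if needle and not any(needle in text.lower() for text in searchable):
            continue
        if issue_type is not None and row.issue_type != issue_type:
            continue
        if namespace is not None and row.namespace != namespace:
            continue
        if severity is not None and row.severity != severity:
            continue
        result.append(row)
    return result
```

For example, `filter_rows(rows, query="checkout", severity="Critical")` would keep only critical issues whose title, description, resource, or namespace mentions "checkout".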

Investigating an Issue

Click a row to expand it. You'll see:

An AI analysis summarizing the issue and the contributing factors, with a short numbered checklist of recommended steps.
The full description and metadata tags.
A frequency strip (for recurring issues) showing when it was first and last detected and how often.
A "What the agent saw" diagnostics table with the specific signals and metrics behind the detection.

For a deeper view, click View Fix Recommendation to open the remediation panel:

Diagnosis — a plain-language summary, with optional AI reasoning and supporting evidence.
Remediation — ranked fix proposals (the recommended one expanded first), each with pre-flight checks, step-by-step instructions, a risk level, and a confidence score.
History — recent similar fixes applied in this cluster, when available.

Resolving Issues

Mark Resolved — moves an active issue to the Resolved tab, recording who resolved it and when.
Re-open — returns a resolved issue to the Active list if the problem wasn't actually fixed.
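
The two actions above amount to a small state machine with two states. The sketch below models it under stated assumptions: the `TrackedIssue` record and both functions are hypothetical, and clearing the resolver details on re-open is a guess about behavior the text does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TrackedIssue:
    # Hypothetical record; field names are assumptions.
    title: str
    status: str = "active"            # "active" or "resolved"
    resolved_by: Optional[str] = None
    resolved_at: Optional[datetime] = None


def mark_resolved(issue: TrackedIssue, user: str, now: Optional[datetime] = None) -> None:
    """Move an active issue to resolved, recording who resolved it and when."""
    if issue.status != "active":
        raise ValueError("only active issues can be marked resolved")
    issue.status = "resolved"
    issue.resolved_by = user
    issue.resolved_at = now or datetime.now(timezone.utc)


def reopen(issue: TrackedIssue) -> None:
    """Return a resolved issue to the active list."""
    if issue.status != "resolved":
        raise ValueError("only resolved issues can be re-opened")
    issue.status = "active"
    # Assumption: the resolution record is cleared when the issue re-opens.
    issue.resolved_by = None
    issue.resolved_at = None
```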

Note: The agent automatically retires an old issue row when it produces a more precise diagnosis, so you won't see duplicate or stale entries for the same underlying problem.

Infrastructure Issues Banner

When node-level (cluster) issues are active, a banner also appears at the top of the Nodegroups & Pods page, summarizing the problem and linking straight to the SRE Agent tab. The banner covers only cluster-level infrastructure issues; workload issues appear only in the SRE Agent tab.

Agent Connectivity

The SRE Agent runs inside your cluster. If it loses connection, a notification banner appears in the cluster view so you know the data may be stale until connectivity is restored.
