
SRE Agent

The SRE Agent is Atmosly's AI-powered reliability assistant for your clusters. It continuously watches your infrastructure and workloads, surfaces problems as they happen, explains the likely root cause, and recommends concrete fixes — so you can resolve issues without manually digging through kubectl output.

Open it from the SRE Agent tab on any cluster's detail page. When active issues are present, the tab shows a red badge with the issue count.


What the Agent Detects

The agent monitors both node-level (cluster) problems and workload-level (pod/service) problems.

Cluster / node issues

  • Node not ready
  • Disk pressure
  • Memory pressure
  • PID pressure
  • Network unavailable
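
For orientation, the node issues listed above line up with the standard Kubernetes node condition types (`Ready`, `DiskPressure`, `MemoryPressure`, `PIDPressure`, `NetworkUnavailable`). The following is a minimal Python sketch, not Atmosly's implementation, of how those conditions could be read from a node object such as the parsed output of `kubectl get node <name> -o json`. The function name `node_issues` and the returned labels are assumptions made for illustration.

```python
# Illustrative sketch only: maps standard Kubernetes node conditions to the
# issue names used on this page. This is NOT the SRE Agent's detection logic.

# Condition types for which status "True" signals a problem.
PROBLEM_WHEN_TRUE = {
    "DiskPressure": "Disk pressure",
    "MemoryPressure": "Memory pressure",
    "PIDPressure": "PID pressure",
    "NetworkUnavailable": "Network unavailable",
}


def node_issues(node: dict) -> list[str]:
    """Return issue labels for a node object parsed from JSON."""
    issues = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        # Ready is healthy only when "True"; "False" and "Unknown" both count
        # as not ready.
        if ctype == "Ready" and status != "True":
            issues.append("Node not ready")
        elif ctype in PROBLEM_WHEN_TRUE and status == "True":
            issues.append(PROBLEM_WHEN_TRUE[ctype])
    return issues
```

Passing each item of `kubectl get nodes -o json` (the `items` array) through `node_issues` gives a rough manual cross-check of what the agent reports for a node.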

Workload issues

  • CrashLoopBackOff
  • ImagePullBackOff / ErrImagePull
  • Failed pods or jobs
  • Container config errors
  • Pods stuck in Pending
  • Out-of-memory kills (OOMKilled)
  • Services with no endpoints or no selector
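
Most of the pod-level issues above surface in a pod's `status` as container waiting or termination reasons. As a rough sketch under stated assumptions (this is not how the SRE Agent is known to work internally), the Python below derives those labels from a pod object such as the parsed output of `kubectl get pod <name> -o json`. The function name `pod_issues` and the label strings are hypothetical. The reason strings (`CrashLoopBackOff`, `ImagePullBackOff`, `ErrImagePull`, `CreateContainerConfigError`, `OOMKilled`) are the ones Kubernetes itself reports.

```python
# Illustrative sketch only: derives the workload issue names used on this page
# from a pod object. This is NOT the SRE Agent's detection logic.

# Kubernetes container "waiting" reasons mapped to this page's issue names.
WAITING_REASONS = {
    "CrashLoopBackOff": "CrashLoopBackOff",
    "ImagePullBackOff": "ImagePullBackOff / ErrImagePull",
    "ErrImagePull": "ImagePullBackOff / ErrImagePull",
    "CreateContainerConfigError": "Container config error",
}


def pod_issues(pod: dict) -> set[str]:
    """Return issue labels for a pod object parsed from JSON."""
    status = pod.get("status", {})
    issues = set()
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in WAITING_REASONS:
            issues.add(WAITING_REASONS[waiting["reason"]])
        # An OOM kill can appear in the current or the previous container state.
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                issues.add("OOMKilled")
    phase = status.get("phase")
    if phase == "Failed":
        issues.add("Failed pod")
    elif phase == "Pending" and not issues:
        # Simplification: a real check would also require the pod to have been
        # pending for some time before calling it "stuck".
        issues.add("Pod stuck in Pending")
    return issues
```

Failed jobs and services without endpoints or selectors are left out of this sketch because they live on other resource types (Jobs, Services, Endpoints), not on the pod object.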

Each issue is assigned a severity — Critical, High, Warning/Medium, Info, or Low — shown as a color-coded badge.


The SRE Agent Tab

Summary Cards

Four cards at the top summarize cluster health and act as quick filters when clicked:

Card              Description
Active Issues     Total open issues, with a per-severity breakdown
Critical / High   Count of the most urgent issues
Resolved (24h)    Issues resolved in the last 24 hours
Top Recurring     The issue type that's been reported most often
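
To make the card definitions concrete, here is a small Python sketch of how the four values could be computed from a list of issue records. Everything about the record shape (the `Issue` class and its fields) is hypothetical, since this page does not describe the agent's data model. The sketch also assumes that the Critical / High card counts active issues only.

```python
# Illustrative sketch only: the Issue record and its fields are hypothetical,
# not Atmosly's data model.
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Issue:
    issue_type: str                       # e.g. "CrashLoopBackOff"
    severity: str                         # e.g. "Critical", "High", "Medium"
    status: str                           # "active" or "resolved"
    report_count: int = 1                 # times the issue has been reported
    resolved_at: Optional[datetime] = None


def summarize(issues: list[Issue], now: datetime) -> dict:
    """Compute the four summary-card values for a list of issues."""
    active = [i for i in issues if i.status == "active"]
    by_severity = Counter(i.severity for i in active)

    cutoff = now - timedelta(hours=24)
    resolved_24h = sum(
        1 for i in issues
        if i.status == "resolved" and i.resolved_at and i.resolved_at >= cutoff
    )

    # "Top Recurring": the issue type with the most reports overall.
    reports = Counter()
    for i in issues:
        reports[i.issue_type] += i.report_count
    top = reports.most_common(1)

    return {
        "active_issues": len(active),
        "by_severity": dict(by_severity),
        "critical_high": by_severity["Critical"] + by_severity["High"],
        "resolved_24h": resolved_24h,
        "top_recurring": top[0][0] if top else None,
    }
```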

Status Tabs

Switch between Active (current problems), Resolved (history), and All (both, for triage).

Issue List

Each row shows the severity, issue title, affected resource, namespace, when it was last seen, a recurrence trend, and how many times it's been reported. Use the search bar to find issues by title, description, resource, or namespace, and the column filters to narrow by issue type, namespace, or severity. Use the recheck control to fetch the latest status immediately after you've made a fix.


Investigating an Issue

Click a row to expand it. You'll see:

  • An AI analysis summarizing the issue and the contributing factors, with a short numbered checklist of recommended steps.
  • The full description and metadata tags.
  • A frequency strip (for recurring issues) showing when it was first and last detected and how often.
  • A "What the agent saw" diagnostics table with the specific signals and metrics behind the detection.

For a deeper view, click View Fix Recommendation to open the remediation panel:

  • Diagnosis — a plain-language summary, with optional AI reasoning and supporting evidence.
  • Remediation — ranked fix proposals (the recommended one expanded first), each with pre-flight checks, step-by-step instructions, a risk level, and a confidence score.
  • History — recent similar fixes applied in this cluster, when available.

Resolving Issues

  • Mark Resolved — moves an active issue to the Resolved tab, recording who resolved it and when.
  • Re-open — returns a resolved issue to the Active list if the problem wasn't actually fixed.

Note: The agent automatically retires an old issue row when it produces a more precise diagnosis, so you won't see duplicate or stale entries for the same underlying problem.


Infrastructure Issues Banner

When node-level (cluster) issues are active, a banner also appears at the top of the Nodegroups & Pods page summarizing the problem and linking straight to the SRE Agent tab. The banner only covers cluster-level infrastructure issues — workload issues live in the SRE Agent tab itself.


Agent Connectivity

The SRE Agent runs inside your cluster. If it loses connection, a notification banner appears in the cluster view so you know the data may be stale until connectivity is restored.