Building an Autonomous AI Sysadmin Agent: Orchestrating LangGraph, Ansible, and Gemini on RHEL 9
By Michael Elias on March 2, 2026
We've all been there: waking up to a pager alert, logging into a server, tailing `journalctl`, and staring at `sar` data to figure out why a service crashed at 3:00 AM. Traditional monitoring tools are great at telling you *what* broke, but they rely on human intuition to figure out why and how to fix it.
It’s time to stop reacting to generic alerts and start bridging LLMs with our infrastructure.
In this build, I’m going to walk you through how to construct a fully autonomous "AI Sysadmin," similar to the rapid development methodologies I used when Vibe Coding Infrastructure in 3 Hours. We are going to orchestrate Ansible to gather telemetry, Python and SQLite to crunch the performance math, LangGraph to manage the stateful workflow, and Google Gemini (Flash) to provide expert-level Root Cause Analysis (RCA) and auto-remediation.
This isn't just a script; it's a stateful AIOps orchestration flow.
The Architecture Blueprint
Before we dive into the code, let’s establish the mental model. We are separating the "Hands", the "Calculator", and the "Brain" to keep the system fast, cost-effective, and hallucination-free. Much like orchestrating GCP Packet Mirroring, separating the data ingestion from the analysis is key to a clean architecture.

Architecture diagram showing Ansible fetching RHEL 9 logs, Python and SQLite processing SAR metrics, LangGraph routing decisions, and Gemini providing Root Cause Analysis
The Tech Stack
- The Hands (Ansible & `ansible-runner`): Securely connects to our fleet via SSH to fetch `sar`, `sshd`, and `journalctl` logs in JSON, and executes remediation playbooks.
- The Calculator (Python & SQLite): Ingests raw telemetry, builds a historical baseline, and calculates Standard Deviations ($\sigma$) to find true anomalies (no LLM math required).
- The Orchestrator (LangGraph): The stateful engine that controls the flow. It decides if we need to auto-heal a service, call Gemini for analysis, or just go to sleep.
- The Brain (Google Gemini 3 Flash): Receives a targeted "Time-Slice" of the system state and correlates performance spikes with system logs to provide RCA.
- The Notifier (Postfix): Sends the final Markdown report directly to our inbox.
Step 1: The Telemetry Playbook (The Hands)
First, we need structured data. AI models thrive on JSON. We use Ansible to fetch our telemetry natively from RHEL 9 without installing any extra agents on the remote boxes.
Create `fetch_logs.yml`:
```yaml
---
- name: Remote Log Retrieval for AI Agent
  hosts: "{{ target | default('localhost') }}"
  become: yes
  gather_facts: no

  tasks:
    - name: 1. Collect SAR (Performance) JSON
      ansible.builtin.command: sadf -j -- -A
      register: sar_raw
      changed_when: false

    - name: 2. Collect System Logs (Recent 100)
      ansible.builtin.command: journalctl -n 100 -o json
      register: journal_raw
      changed_when: false

    - name: 3. Collect Auth Logs (Recent 100)
      ansible.builtin.command: journalctl -u sshd -n 100 -o json
      register: sshd_raw
      changed_when: false
```

But we don't just want to watch; we want to act. When our LangGraph orchestrator identifies a crashed service, it fires off this targeted remediation playbook, `restart_service.yml`:
```yaml
---
- name: Auto-Remediate Failed Service
  hosts: "{{ target }}"
  become: yes
  gather_facts: no

  tasks:
    - name: Restart the failed service
      ansible.builtin.systemd:
        name: "{{ service_name }}"
        state: restarted
      register: restart_result
      ignore_errors: yes

    - name: Output status
      ansible.builtin.debug:
        msg: "Service {{ service_name }} restart result: {{ restart_result.state | default('failed to restart') }}"
```
Step 2: Using LangGraph for Stateful AI Orchestration
If we just wrote a linear Python script, it would be brittle. What if a service crashes? We don't just want an alert; we want the system to try and fix it before calling the AI for an RCA.
This is where LangGraph shines. It allows us to build a cyclical, stateful machine. We define our `AgentState` (a Python TypedDict) that holds our logs, mathematical anomalies, and AI responses. LangGraph passes this state between functional "Nodes".
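A minimal sketch of such a state — `anomalies` and `services_to_remediate` are the keys the router checks, while the other field names here are illustrative:

```python
from typing import TypedDict, List, Dict, Any

class AgentState(TypedDict, total=False):
    raw_logs: Dict[str, Any]           # parsed journalctl / sshd output from Ansible
    anomalies: List[str]               # math triggers from the SQLite baseline
    services_to_remediate: List[str]   # units that flipped from active to failed
    rca_report: str                    # Gemini's Markdown Root Cause Analysis

state: AgentState = {"services_to_remediate": ["postfix.service"]}
print(bool(state.get("services_to_remediate")))  # True
```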
Crucially, LangGraph allows for Conditional Routing. Look at this logic flow:
```python
# Advanced Conditional Routing in LangGraph
def route_after_math(state: AgentState):
    if state.get('services_to_remediate'):
        return "remediate"  # If a service crashed, try to fix it first
    elif state.get('anomalies'):
        return "ai"         # If just a math anomaly, analyze it
    return END              # If totally healthy, go to sleep
```
This ensures we aren't wasting Gemini API tokens on a healthy system.
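You can exercise that routing logic in isolation, without standing up the whole graph. In this self-contained sketch, `END` is stood in with LangGraph's sentinel string so no dependency is needed:

```python
END = "__end__"  # stand-in for langgraph.graph.END so this sketch runs alone

def route_after_math(state: dict) -> str:
    if state.get("services_to_remediate"):
        return "remediate"  # crashed service: heal first, analyze after
    elif state.get("anomalies"):
        return "ai"         # pure math anomaly: ship the time-slice to Gemini
    return END              # healthy system: no API tokens spent

print(route_after_math({"services_to_remediate": ["postfix.service"]}))  # remediate
print(route_after_math({"anomalies": ["iowait spike"]}))                 # ai
print(route_after_math({}) == END)                                       # True
```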
Step 3: The Calculator (Python + SQLite)
LLMs shouldn't do math. They are reasoning engines. If you feed an LLM 20,000 lines of `sar` data, it will hallucinate the average.
Instead, we use a local SQLite database on our control node. Every time Ansible fetches data, Python flattens the JSON and `INSERT`s it into SQLite. We then query the last 1,000 runs to build a true historical baseline and calculate the Standard Deviation.
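The flattening step can be sketched as follows. The nesting mirrors `sadf -j`'s layout (a `sysstat` → `hosts` → `statistics` tree), and the generated keys match the `cpu-load_all_iowait` naming the triggers use — but treat the exact schema handling as an assumption to verify against your sysstat version:

```python
import json

def flatten_sar(sadf_output: str) -> dict:
    """Flatten `sadf -j -- -A` output into {metric_key: value} pairs
    ready for INSERT into SQLite, e.g. 'cpu-load_all_iowait' -> 0.12."""
    flat = {}
    for host in json.loads(sadf_output)["sysstat"]["hosts"]:
        for stat in host["statistics"]:
            for section, entries in stat.items():
                if section == "timestamp" or not isinstance(entries, list):
                    continue
                for entry in entries:
                    # Use the per-device label (CPU id, interface, ...) if present
                    label = entry.get("cpu") or entry.get("iface") or "all"
                    for field, value in entry.items():
                        if isinstance(value, (int, float)):
                            flat[f"{section}_{label}_{field}"] = value
    return flat

sample = {"sysstat": {"hosts": [{"nodename": "web01", "statistics": [
    {"timestamp": {"time": "03:00:01"},
     "cpu-load": [{"cpu": "all", "user": 1.5, "iowait": 0.12}]}]}]}}
print(flatten_sar(json.dumps(sample)))
# {'cpu-load_all_user': 1.5, 'cpu-load_all_iowait': 0.12}
```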

Terminal view of SQLite database storing thousands of system metrics for RHEL 9 historical baseline
If the current CPU `%iowait` spikes more than $2\sigma$ above the historical mean, our Python node flags it as a `TRIGGER` and creates a "Time-Slice"—a snapshot of every metric on the server at that exact second.
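The baseline math needs nothing beyond the standard library. Here is a sketch with a deliberately simplified two-column table (the production schema would also carry timestamps and hostnames):

```python
import sqlite3
import statistics

def check_anomaly(conn, metric: str, current: float, sigma: float = 2.0):
    """Flag `current` if it sits more than `sigma` standard deviations
    above the mean of the last 1,000 stored samples for `metric`."""
    rows = [r[0] for r in conn.execute(
        "SELECT value FROM sar_metrics WHERE metric = ? "
        "ORDER BY rowid DESC LIMIT 1000", (metric,))]
    if len(rows) < 2:
        return None  # not enough history to form a baseline yet
    mean, stdev = statistics.mean(rows), statistics.stdev(rows)
    if current > mean + sigma * stdev:
        return f"TRIGGER: {metric} spiked to {current}% (Mean: {round(mean, 2)}%)"
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sar_metrics (metric TEXT, value REAL)")
conn.executemany("INSERT INTO sar_metrics VALUES (?, ?)",
                 [("cpu-load_all_iowait", v) for v in (0.10, 0.12, 0.14, 0.12)])
print(check_anomaly(conn, "cpu-load_all_iowait", 45.2))
# TRIGGER: cpu-load_all_iowait spiked to 45.2% (Mean: 0.12%)
```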
Step 4: Bringing it Together (The Core Engine)
Here is a look at the core orchestration script, `ai_log_agent.py`, utilizing the modern `google.genai` SDK.
ai_log_agent.py

```python
import os
import json
import statistics
import sqlite3
import smtplib
from email.message import EmailMessage
from datetime import datetime
from typing import TypedDict, List, Dict, Any

import ansible_runner
from google import genai
from dotenv import load_dotenv
from langgraph.graph import StateGraph, START, END

# --- Database Configuration ---
DB_FILE = "system_metrics.db"

def init_db():
    conn = sqlite3.connect(DB_FILE)
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS sar_metrics (
```

The Result: Auto-Healing in Action
To test the system, I manually killed the `postfix` service on a remote target (`kill -9`).
Within 5 minutes, the cron job fired. The Ansible payload returned the state, and Python recognized that `postfix.service` had flipped from "active" in the SQLite DB to "failed".
LangGraph instantly bypassed the AI, routed to the `remediate` node, and fired an Ansible playbook that restarted the service. Then, it handed the entire package (the trigger, the successful restart confirmation, and the `journalctl` logs) over to Gemini.
Moments later, I received this email via the local Postfix relay:

Email screenshot showing Gemini 3 Flash identifying a SIGKILL event and confirming the successful auto-remediation of the Postfix service
Gemini accurately identified the `SIGKILL` in the systemd logs, correlated it with the service outage, and provided a human-readable Root Cause Analysis. Just as we've seen when analyzing packet captures with tshark and Gemini, the LLM excels at translating raw machine events into actionable human insights.
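For completeness, the Notifier leg is plain `smtplib` against the local Postfix relay. A minimal sketch — addresses, the subject line, and the function names are placeholders:

```python
import smtplib
from email.message import EmailMessage

def build_report_email(report_md: str, subject: str,
                       sender: str = "ai-agent@localhost",
                       recipient: str = "root@localhost") -> EmailMessage:
    """Wrap the Markdown RCA in a plain-text email."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(report_md)
    return msg

def send_report(msg: EmailMessage) -> None:
    # Postfix listens on localhost:25 on the control node; no auth needed.
    with smtplib.SMTP("localhost", 25) as smtp:
        smtp.send_message(msg)

report = build_report_email("# RCA: postfix.service\n\nSIGKILL detected...",
                            "[AI Sysadmin] postfix.service auto-remediated")
# send_report(report)  # uncomment on a host with a local relay
```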
The Result: Mathematical Anomaly Detection
What happens when it isn't a hard crash, but a performance bottleneck? To test our SQLite-backed math node, I fired up `stress-ng` on the remote box to hammer the disk:
```shell
stress-ng --hdd 2 --hdd-bytes 2G --timeout 300s &
```

The system immediately began struggling. On the next cron interval, our Python calculator kicked in. It queried SQLite for the last 1,000 runs, calculated the baseline standard deviation for `%iowait`, and caught the spike dead to rights.
The script generated the mathematical trigger:
`TRIGGER: cpu-load_all_iowait spiked to 45.2% (Mean: 0.12%)`
It instantly grabbed a "Time-Slice" of every single metric on the server from that exact second and handed it to Gemini. Gemini correlated the high `iowait` with the massive disk write sectors shown in the time-slice and properly diagnosed a severe disk I/O bottleneck rather than a runaway CPU process, sending the Root Cause Analysis directly to my inbox.
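The prompt handed to Gemini is deliberately small: the trigger line, the Time-Slice, and the recent journal entries. A sketch of that assembly — the wording, the metric field names, and the commented-out `google-genai` call are illustrative:

```python
import json

def build_rca_prompt(trigger: str, time_slice: dict, journal_lines: list) -> str:
    """Assemble the targeted Time-Slice prompt. The math is already done,
    so the model only has to correlate and explain."""
    return (
        "You are a senior RHEL 9 sysadmin. A statistical trigger fired:\n"
        f"{trigger}\n\n"
        "Every metric captured at that second:\n"
        f"{json.dumps(time_slice, indent=2)}\n\n"
        "Recent journalctl entries:\n" + "\n".join(journal_lines) +
        "\n\nProvide a Markdown Root Cause Analysis and a recommended fix."
    )

prompt = build_rca_prompt(
    "TRIGGER: cpu-load_all_iowait spiked to 45.2% (Mean: 0.12%)",
    {"cpu-load_all_iowait": 45.2, "disk_sda_wr_sec": 81234.0},
    ["kernel: I/O pressure rising", "systemd: unit still active"])

# Hand it to Gemini with the google-genai SDK (sketch):
# client = genai.Client()  # reads GEMINI_API_KEY from the environment
# rca = client.models.generate_content(
#     model="gemini-flash-latest",  # or whichever Flash variant you target
#     contents=prompt).text
```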
Wrapping Up
By combining the execution power of Ansible, the strict mathematical processing of Python, and the stateful orchestration of LangGraph, we've built an AI agent that doesn't just alert—it acts, heals, and explains.
We’ve effectively replaced generic dashboards with a targeted, AIOps-driven engineer that scales across the entire fleet. If you are looking to extend this logic further, you could easily adapt these principles into your workflows and trigger responses via Webhooks.
Next up, I'll be looking into moving this state into LangGraph's persistent Checkpointers for long-term "Memory" across reboots. Stay tuned.
Michael Elias is a Senior Principal Operations Engineer at Dun & Bradstreet with a history of entrepreneurship in the ISP and consulting spaces. A veteran of the dot-com era with certifications from Cisco, Red Hat, and Fortinet, Michael specializes in high-compliance infrastructure and enterprise architecture.