Building an AI-Powered Incident Response Pipeline

Introduction
The Architecture
1. Prometheus & Alertmanager
2. The Incident Orchestrator
3. LLM Log Summarizer
4. Slack Integration
5. Auto-Ticket Creation
Deployment Considerations
Benefits
Future Enhancements
Conclusion

In the fast-paced world of technology, incidents are inevitable. When they strike, every second counts. Traditional incident response can be a manual, time-consuming process, often leading to delays and increased stress for on-call teams.

This blog post will guide you through building an AI-powered incident response pipeline that integrates Prometheus for alerting, a Large Language Model (LLM) for summarizing logs, Slack for communication, and an automated system for ticket creation. By the end, you'll have a robust, intelligent system that significantly reduces Mean Time To Resolution (MTTR).

The Architecture

Prometheus

Alertmanager

Orchestrator + LLM

Slack

Jira

Let's break down each component in detail.

1. Prometheus & Alertmanager: The Foundation of Monitoring

Prometheus is an open-source monitoring system with a powerful data model and query language (PromQL). Alertmanager handles alerts sent by Prometheus, deduplicating, grouping, and routing them to the correct receiver.

Prometheus Alert Rule

yaml rules.yml

groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5xx"}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "HTTP error rate exceeded 5% for 5+ minutes."

Alertmanager Configuration

yaml alertmanager.yml

route:
  receiver: 'incident-orchestrator'
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

receivers:
  - name: 'incident-orchestrator'
    webhook_configs:
      - url: 'http://<your-orchestrator-url>/webhook'
        send_resolved: true

2. The Incident Orchestrator

This is the brain of our operation. It receives alerts from Alertmanager, fetches relevant logs, calls the LLM, posts to Slack, and creates tickets — all automatically.

Key Responsibilities: Receive webhook payloads → Fetch logs → Call LLM → Post to Slack → Create ticket.

Flask Orchestrator (Python)

python orchestrator.py

from flask import Flask, request, jsonify
import requests, os, json

app = Flask(__name__)

SLACK_WEBHOOK_URL  = os.getenv("SLACK_WEBHOOK_URL")
JIRA_API_URL       = os.getenv("JIRA_API_URL")
JIRA_USERNAME      = os.getenv("JIRA_USERNAME")
JIRA_API_TOKEN     = os.getenv("JIRA_API_TOKEN")
LLM_API_ENDPOINT   = os.getenv("LLM_API_ENDPOINT")
LOG_FETCH_ENDPOINT = os.getenv("LOG_FETCH_ENDPOINT")

@app.route('/webhook', methods=['POST'])
def handle_alert():
    alert_data = request.get_json()
    for alert in alert_data.get('alerts', []):
        if alert['status'] == 'firing':
            alert_name  = alert['labels'].get('alertname', 'Unknown')
            job         = alert['labels'].get('job', 'Unknown')
            severity    = alert['labels'].get('severity', 'info')
            summary     = alert['annotations'].get('summary', '')
            description = alert['annotations'].get('description', '')

            logs        = fetch_logs(job, alert['startsAt'])
            log_summary = summarize_logs_with_llm(logs, description)

            send_slack_message({
                "text": (f"🚨 *INCIDENT: {alert_name}*\n"
                         f"Severity: `{severity}` | Job: `{job}`\n"
                         f"Summary: {summary}\n\n"
                         f"🤖 *LLM Summary:* {log_summary}")
            })
            create_jira_ticket(alert_name, description, log_summary, severity)

    return jsonify({"status": "success"}), 200

# --- Helper functions defined below ---

python orchestrator.py — helpers

def fetch_logs(job_name, start_time):
    try:
        r = requests.get(LOG_FETCH_ENDPOINT,
                         params={'job': job_name, 'start': start_time},
                         timeout=10)
        r.raise_for_status()
        return r.text
    except requests.exceptions.RequestException as e:
        return f"Failed to fetch logs: {e}"

def summarize_logs_with_llm(logs, description):
    prompt = (f"Incident: '{description}'\n\nLogs:\n{logs}\n\n"
              f"Summarize the key issues and root causes concisely.")
    try:
        r = requests.post(LLM_API_ENDPOINT, json={'prompt': prompt}, timeout=30)
        r.raise_for_status()
        return r.json().get('summary', 'No summary from LLM.')
    except requests.exceptions.RequestException as e:
        return f"LLM error: {e}"

def send_slack_message(message):
    requests.post(SLACK_WEBHOOK_URL, json=message,
                  headers={'Content-Type': 'application/json'})

def create_jira_ticket(alert_name, description, log_summary, severity):
    payload = {
        "fields": {
            "project": {"key": "OPS"},
            "summary": f"[INCIDENT] {alert_name}",
            "description": f"{description}\n\nLLM Summary: {log_summary}",
            "issuetype": {"name": "Bug"},
            "priority": {"name": "High" if severity == "critical" else "Medium"}
        }
    }
    r = requests.post(JIRA_API_URL,
                      headers={"Content-Type": "application/json"},
                      auth=(JIRA_USERNAME, JIRA_API_TOKEN),
                      json=payload)
    r.raise_for_status()
    print(f"Jira ticket: {r.json()['key']}")

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)

3. LLM Log Summarizer

Instead of sifting through thousands of log lines, an LLM extracts critical information, identifies patterns, and suggests potential root causes in seconds.

Choices: Use a self-hosted model (Llama 2, Mixtral) via Ollama, or a cloud service like OpenAI GPT-4, Anthropic Claude, or Google Gemini.

OpenAI Integration

python llm_integration.py

import openai, os

openai.api_key = os.getenv("OPENAI_API_KEY")

def summarize_logs_with_llm(logs, description):
    messages = [
        {
            "role": "system",
            "content": ("You are an expert SRE assistant. Analyze incident logs "
                        "and provide a concise summary of the problem and root causes.")
        },
        {
            "role": "user",
            "content": (f"Alert: '{description}'\n\nLogs:\n{logs}\n\n"
                        "Summarize the incident and suggest possible causes.")
        }
    ]
    try:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=messages,
            temperature=0.7,
            max_tokens=500
        )
        return resp.choices[0].message['content'].strip()
    except Exception as e:
        return f"LLM error: {e}"

4. Slack Integration

Slack serves as the central communication hub. The orchestrator posts detailed alerts, LLM summaries, and ticket links into a designated #incidents channel instantly.

Generate a Slack Incoming Webhook URL from your workspace settings and pass it as the SLACK_WEBHOOK_URL environment variable.

Sample Slack Notification Format

text #incidents channel

🚨 INCIDENT ALERT: HighErrorRate 🚨
Severity: `critical`
Job: `api-server`
Summary: High error rate detected on api-server
Description: HTTP error rate exceeded 5% for 5+ minutes.

🤖 LLM Log Summary:
The logs show a spike in 503 errors starting at 14:32 UTC,
correlating with a database connection timeout. Root cause
likely: max connection pool reached due to traffic surge.

🔗 View in Prometheus  |  🎫 Jira: OPS-1234

5. Auto-Ticket Creation (Jira)

Automating Jira ticket creation ensures every incident is tracked, assigned, and can be reviewed post-mortem — with rich AI-generated context baked right in.

Security: Never hardcode your JIRA_API_TOKEN. Use AWS Secrets Manager, HashiCorp Vault, or Kubernetes Secrets.

Deployment Considerations

Serverless

Deploy the orchestrator as AWS Lambda or Google Cloud Function for zero-ops scaling.

Containerized

Run on Kubernetes with Docker for more control over resources and networking.

Log Aggregation

Centralize logs with Loki, ELK, Splunk, or Datadog for efficient querying.

Secrets Management

Use AWS Secrets Manager or Vault. Never hardcode sensitive credentials.

Benefits of an AI-Powered Pipeline

Faster MTTR: Automated log summarization and instant notifications cut resolution time significantly.
Less Cognitive Load: On-call engineers get pre-digested information, not raw log dumps.
Consistent Process: Standardized alert handling ensures every incident is treated equally.
Richer Post-Mortems: Auto-linked tickets and summaries provide full context for root cause analysis.

Future Enhancements

Automated Remediation: The LLM suggests and triggers runbooks for known incident types.
Incident Correlation: Cluster multiple related alerts into one parent incident.
Knowledge Base Integration: Query internal wikis to surface relevant runbooks automatically.
Predictive Alerting: Use time-series anomaly detection to fire alerts before the problem is user-visible.

Conclusion

Building an AI-powered incident response pipeline is a significant step toward a more resilient operational environment. By combining Prometheus for monitoring, LLMs for intelligent analysis, and integrations with Slack and Jira, you empower your teams to respond with unprecedented speed and clarity.

The result: happier engineers and more stable systems.

Kiren Jayaprakash Associate DevOps Engineer @ 4Labs Technologies

LinkedIn Contact

Back to Blog

Building an AI-Powered Incident Response Pipeline

Table of Contents

The Architecture

1. Prometheus & Alertmanager: The Foundation of Monitoring

Prometheus Alert Rule

Alertmanager Configuration

2. The Incident Orchestrator

Flask Orchestrator (Python)

3. LLM Log Summarizer

OpenAI Integration

4. Slack Integration

Sample Slack Notification Format

5. Auto-Ticket Creation (Jira)

Deployment Considerations

Serverless

Containerized

Log Aggregation

Secrets Management

Benefits of an AI-Powered Pipeline

Future Enhancements

Conclusion