AI + DevOps

Using ML for Anomaly Detection in Infrastructure Monitoring

By Himanshu Gupta · June 2026 · 10 min read

Static alert thresholds are broken. "Alert when CPU > 80%" sounds reasonable until your service legitimately spikes to 85% every day at 2 PM during batch processing. The result? Alert fatigue — your team ignores the pager because 90% of alerts are noise.

Machine learning can fix this by learning what "normal" looks like for your infrastructure and alerting only on genuine anomalies. Here's how I've been building this into our monitoring stack.

Where ML Fits in the Monitoring Stack

📈 Metric Anomaly Detection: Detect unusual patterns in CPU, memory, latency, error rates without static thresholds
📋 Log Anomaly Detection: Surface unusual log patterns and error clusters automatically
🔮 Predictive Scaling: Forecast resource needs and pre-scale before demand spikes
🔧 Auto-Remediation: Trigger automated responses based on learned patterns
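To make the predictive scaling idea concrete, here is a minimal sketch. It assumes a seasonal-naive forecast (each future slot is predicted as the average of the same slot in earlier cycles) and a made-up capacity figure per replica. The names seasonal_naive_forecast and replicas_needed, the headroom factor, and the minimum replica count are all illustrative and not part of any AWS API.

```python
import math
from statistics import mean


def seasonal_naive_forecast(history, season_length, horizon):
    """Predict each future slot as the mean of the same slot in past seasons."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    forecast = []
    for step in range(horizon):
        # Position of this future sample within its season (e.g. slot of the day)
        slot = (len(history) + step) % season_length
        forecast.append(mean(history[slot::season_length]))
    return forecast


def replicas_needed(forecast, capacity_per_replica, headroom=1.2, min_replicas=2):
    """Size for the forecast peak plus headroom, never below min_replicas."""
    peak = max(forecast)
    return max(min_replicas, math.ceil(peak * headroom / capacity_per_replica))


# Hypothetical data: three "days" of four samples, with a recurring spike in slot 2
history = [100, 400, 100, 100] * 3
next_day = seasonal_naive_forecast(history, season_length=4, horizon=4)
print(next_day)
print(replicas_needed(next_day, capacity_per_replica=100))
```

With this toy history the forecast repeats the spike of 400, and the sizing function asks for 5 replicas. A real setup would feed the result to whatever scaler you use and would likely want a forecaster that also handles trend.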

Architecture: CloudWatch + Lambda + ML

This approach uses AWS-native services, so you don't need a separate ML platform. CloudWatch collects the metrics. A Lambda function on a five-minute schedule pulls the last seven days of data, fits a small model on it, and publishes an SNS notification when the most recent data point is flagged as anomalous.

import boto3
import numpy as np
from datetime import datetime, timedelta
from sklearn.ensemble import IsolationForest
import json

cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')

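def get_metric_data(namespace, metric, dimension_name, dimension_value, hours=168):
    """Fetch `hours` of 5-minute averages (default: 7 days) for training."""
    end_time = datetime.utcnow()
    # 7 days at 5-minute resolution is 2,016 points, more than the 1,440-point
    # limit of get_metric_statistics, so use the GetMetricData API instead
    # (the Lambda role needs the cloudwatch:GetMetricData permission).
    paginator = cloudwatch.get_paginator('get_metric_data')
    pages = paginator.paginate(
        MetricDataQueries=[{
            'Id': 'm1',
            'MetricStat': {
                'Metric': {
                    'Namespace': namespace,
                    'MetricName': metric,
                    'Dimensions': [{'Name': dimension_name, 'Value': dimension_value}],
                },
                'Period': 300,  # 5-minute intervals
                'Stat': 'Average',
            },
        }],
        StartTime=end_time - timedelta(hours=hours),
        EndTime=end_time,
        ScanBy='TimestampAscending',  # Oldest first, so the last value is the latest
    )
    return [v for page in pages for r in page['MetricDataResults'] for v in r['Values']]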

def detect_anomalies(metric_data, contamination=0.05):
    """Use Isolation Forest to detect anomalies."""
    if len(metric_data) < 100:
        return []  # Not enough data to train

    # Reshape for sklearn
    X = np.array(metric_data).reshape(-1, 1)

    # Train Isolation Forest
    model = IsolationForest(
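        # Fraction of points to flag (0.05 = 5%); fit_predict on the training
        # window labels roughly this share as anomalous, so tune it per metric
        contamination=contamination,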
        random_state=42,
        n_estimators=100
    )
    predictions = model.fit_predict(X)

    # Return indices of anomalies (-1 = anomaly, 1 = normal)
    anomalies = [i for i, p in enumerate(predictions) if p == -1]
    return anomalies

def lambda_handler(event, context):
    """Main Lambda function — runs every 5 minutes."""
    metrics_to_monitor = [
        {
            'namespace': 'ContainerInsights',  # pod_cpu_utilization is a Container Insights metric
            'metric': 'pod_cpu_utilization',
            'dimension_name': 'ClusterName',
            'dimension_value': 'production-cluster',
            'alert_name': 'EKS CPU Anomaly'
        },
        {
            'namespace': 'AWS/ApplicationELB',
            'metric': 'TargetResponseTime',
            'dimension_name': 'LoadBalancer',
            'dimension_value': 'app/prod-alb/abc123',
            'alert_name': 'ALB Latency Anomaly'
        }
    ]

    for metric_config in metrics_to_monitor:
        data = get_metric_data(
            metric_config['namespace'],
            metric_config['metric'],
            metric_config['dimension_name'],
            metric_config['dimension_value']
        )

        if not data:
            continue

        anomalies = detect_anomalies(data)

        # Check if the LATEST data point is anomalous
        if len(data) - 1 in anomalies:
            current_value = data[-1]
            mean_value = np.mean(data)
            std_value = np.std(data)
            # Guard against a flat series, where the standard deviation is 0
            deviation = abs(current_value - mean_value) / std_value if std_value > 0 else 0.0

            sns.publish(
                # Placeholder ARN; in a real deployment, read the SNS_TOPIC_ARN
                # environment variable that the Terraform config below sets
                TopicArn='arn:aws:sns:us-east-1:123456789:anomaly-alerts',
                Subject=f"🚨 {metric_config['alert_name']} Detected",
                Message=json.dumps({
                    'alert': metric_config['alert_name'],
                    'current_value': round(current_value, 2),
                    'mean': round(mean_value, 2),
                    'std_deviation': round(std_value, 2),
                    'deviation_factor': round(deviation, 1),
                    'metric': metric_config['metric'],
                    # Cluster name or load balancer ID, depending on the metric
                    'resource': metric_config['dimension_value'],
                    'timestamp': datetime.utcnow().isoformat()
                }, indent=2)
            )

    return {'statusCode': 200, 'body': 'Anomaly check complete'}

Deploying with Terraform

# Schedule Lambda to run every 5 minutes
resource "aws_cloudwatch_event_rule" "anomaly_check" {
  name                = "anomaly-detection-schedule"
  description         = "Trigger anomaly detection every 5 minutes"
  schedule_expression = "rate(5 minutes)"
}

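resource "aws_cloudwatch_event_target" "anomaly_lambda" {
  rule = aws_cloudwatch_event_rule.anomaly_check.name
  arn  = aws_lambda_function.anomaly_detector.arn
}

# Without this permission, the scheduled rule cannot invoke the function
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.anomaly_detector.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.anomaly_check.arn
}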

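# Assumes aws_iam_role.anomaly_lambda_role and aws_sns_topic.anomaly_alerts are
# defined elsewhere; the role must be allowed to read CloudWatch metrics and
# publish to the SNS topic.
resource "aws_lambda_function" "anomaly_detector" {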
  filename         = "anomaly_detector.zip"
  function_name    = "infrastructure-anomaly-detector"
  role             = aws_iam_role.anomaly_lambda_role.arn
  handler          = "main.lambda_handler"
  runtime          = "python3.11"
  timeout          = 60
  memory_size      = 256

  layers = [
    "arn:aws:lambda:us-east-1:xxx:layer:sklearn-layer:1"
  ]

  environment {
    variables = {
      SNS_TOPIC_ARN = aws_sns_topic.anomaly_alerts.arn
    }
  }
}

Why Isolation Forest?

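It's unsupervised (no labeled data needed), handles high-dimensional data well, and is fast enough to retrain inside a Lambda function on every run. Two caveats apply to the code above. First, the contamination parameter fixes the share of points that get flagged, so at 0.05 roughly 5% of the training window is always labeled anomalous; lower it if alerts are too frequent. Second, the model sees only the raw metric value, so it has no notion of time of day. For seasonality and trends, consider Prophet (originally from Facebook) or CloudWatch's built-in anomaly detection.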
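To see what seasonality awareness buys you before adopting a heavier tool, here is a rough sketch of a per-hour baseline. This is a plain statistical stand-in, not Prophet and not Isolation Forest: it learns a mean and standard deviation for each hour of the day and judges a new value against its own hour. The function names, the threshold of 3, and the min_std floor are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def hourly_baseline(timestamps, values):
    """Return {hour_of_day: (mean, std)} learned from historical samples."""
    buckets = defaultdict(list)
    for ts, value in zip(timestamps, values):
        buckets[ts.hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in buckets.items()}


def is_seasonal_anomaly(baseline, ts, value, threshold=3.0, min_std=1.0):
    """Flag a value that deviates from what is typical for its hour of day."""
    if ts.hour not in baseline:
        return False  # No history for this hour, so no basis to judge
    mu, sigma = baseline[ts.hour]
    # Floor the spread so a perfectly flat hour does not divide by zero
    return abs(value - mu) / max(sigma, min_std) > threshold


# Hypothetical week of hourly CPU samples: 40% baseline, 85% during the 2 PM batch job
start = datetime(2026, 6, 1)
timestamps = [start + timedelta(hours=h) for h in range(7 * 24)]
values = [85.0 if ts.hour == 14 else 40.0 for ts in timestamps]
baseline = hourly_baseline(timestamps, values)

print(is_seasonal_anomaly(baseline, datetime(2026, 6, 8, 14), 85.0))  # False: expected spike
print(is_seasonal_anomaly(baseline, datetime(2026, 6, 8, 3), 85.0))   # True: same value at 3 AM
```

The same 85% reading is treated as normal at 2 PM and anomalous at 3 AM, which is the behavior the static threshold in the introduction cannot give you. The same idea carries over to Isolation Forest by adding hour of day as a second feature.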

Beyond Basic Detection: The Maturity Model

Level 1 — Static Thresholds: CPU > 80% → alert. Where most teams are today.
Level 2 — Statistical Anomaly Detection: Z-score or IQR-based detection. Better, but doesn't handle seasonality.
Level 3 — ML Anomaly Detection: Isolation Forest, autoencoders. Learns "normal" patterns. This article covers this level.
Level 4 — Predictive + Automated: Forecasts issues before they happen and auto-remediates (scales pods, rolls back deployments).
Level 5 — AIOps: Full correlation across metrics, logs, and traces. Root cause analysis. This is where LLMs start to shine — summarizing incidents and suggesting fixes.
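For comparison with the ML approach, Level 2 can be sketched in a few lines of standard-library Python. The function names and default thresholds below are illustrative assumptions (3 standard deviations and Tukey's 1.5 × IQR are common conventions, not values from this article).

```python
from statistics import mean, pstdev, quantiles


def zscore_anomalies(values, threshold=3.0):
    """Indices whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # A flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def iqr_anomalies(values, k=1.5):
    """Indices outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical week of hourly CPU samples: 40% baseline, 85% during the 2 PM batch job
week = [85.0 if hour % 24 == 14 else 40.0 for hour in range(7 * 24)]
print(len(zscore_anomalies(week)))  # 7: every legitimate daily spike is flagged
print(len(iqr_anomalies(week)))     # 7
```

Both methods flag all seven daily batch spikes, which illustrates the seasonality gap noted for Level 2: a global statistic has no way to know that the spike is routine.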

What's Next: LLMs for Incident Response

The next frontier is using large language models to correlate alerts, summarize incidents, and suggest runbook actions. Imagine getting a Slack message that says:

"API latency spiked 3x in the last 10 minutes. This correlates with a
deployment to the payment-service at 14:32. The deployment changed
database connection pooling settings. Suggested action: rollback
payment-service to v2.3.1 (previous known-good version)."

This is achievable today by feeding CloudWatch anomalies + deployment logs + git history into an LLM via Lambda. The technology is here — it's a matter of building the integration.
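Most of that integration is plumbing: deciding which events are relevant and packing them into a prompt. The sketch below shows one way that step might look. The record shapes, field names, the 30-minute window, and the prompt wording are all hypothetical, and the actual model call is left out because it depends on which LLM provider you use.

```python
import json
from datetime import datetime, timedelta


def recent_deployments(deployments, anomaly_time, window_minutes=30):
    """Keep deployments that landed within the window before the anomaly."""
    window_start = anomaly_time - timedelta(minutes=window_minutes)
    return [d for d in deployments if window_start <= d["deployed_at"] <= anomaly_time]


def build_incident_prompt(anomaly, deployments, window_minutes=30):
    """Assemble anomaly details and candidate causes into a single prompt string."""
    anomaly_time = anomaly["timestamp"]
    candidates = recent_deployments(deployments, anomaly_time, window_minutes)
    context = {
        "anomaly": {**anomaly, "timestamp": anomaly_time.isoformat()},
        "recent_deployments": [
            {**d, "deployed_at": d["deployed_at"].isoformat()} for d in candidates
        ],
    }
    return (
        "You are assisting an on-call engineer. Using only the JSON context "
        "below, summarize the incident, state whether a recent deployment is a "
        "plausible cause, and suggest one runbook action.\n\n"
        + json.dumps(context, indent=2)
    )


# Hypothetical inputs: one anomaly and two deployments, only one of them recent
anomaly = {"metric": "TargetResponseTime", "current_value": 1.8, "mean": 0.6,
           "timestamp": datetime(2026, 6, 1, 14, 40)}
deployments = [
    {"service": "payment-service", "version": "v2.4.0",
     "deployed_at": datetime(2026, 6, 1, 14, 32)},
    {"service": "search-service", "version": "v1.9.2",
     "deployed_at": datetime(2026, 6, 1, 9, 0)},
]
print(build_incident_prompt(anomaly, deployments))
```

Filtering before prompting keeps the context small and makes the model's suggestion easier to audit, since the engineer can see exactly which deployments it was shown.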
