Using ML for Anomaly Detection in Infrastructure Monitoring
Static alert thresholds are broken. "Alert when CPU > 80%" sounds reasonable until your service legitimately spikes to 85% every day at 2 PM during batch processing. The result? Alert fatigue — your team ignores the pager because 90% of alerts are noise.
Machine learning can fix this by learning what "normal" looks like for your infrastructure and alerting only on genuine anomalies. Here's how I've been building this into our monitoring stack.
Where ML Fits in the Monitoring Stack
- Metric Anomaly Detection: Detect unusual patterns in CPU, memory, latency, and error rates without static thresholds.
- Log Anomaly Detection: Surface unusual log patterns and error clusters automatically.
- Predictive Scaling: Forecast resource needs and pre-scale before demand spikes.
- Auto-Remediation: Trigger automated responses based on learned patterns.
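To make the predictive-scaling idea a little more concrete, here is a rough sketch that forecasts the next interval with a seasonal-naive rule (reuse the value seen exactly one day earlier) and turns it into a replica count. The function names, the 288-samples-per-day assumption (one sample every 5 minutes), and the per-replica capacity figure are all hypothetical; a real setup would likely use a proper forecasting model.

```python
import math

POINTS_PER_DAY = 288  # assumption: one sample every 5 minutes


def seasonal_naive_forecast(history, steps_ahead=1, season=POINTS_PER_DAY):
    """Predict a future value as the value observed one season (day) earlier."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    if not 1 <= steps_ahead <= season:
        raise ValueError("steps_ahead must be between 1 and season")
    # The point `steps_ahead` into the future maps to this index one season back
    return history[len(history) - season + steps_ahead - 1]


def replicas_needed(forecast_load, capacity_per_replica, headroom=0.2, minimum=1):
    """Translate a forecast load into a replica count with spare headroom."""
    target = forecast_load * (1 + headroom)
    return max(minimum, math.ceil(target / capacity_per_replica))
```

With a forecast in hand, `replicas_needed(forecast, capacity_per_replica=50)` gives a number that could be handed to whatever scaling mechanism is in use, ahead of the expected spike.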
Architecture: CloudWatch + Lambda + ML
This approach uses AWS-native services so you don't need a separate ML platform. CloudWatch collects metrics, a scheduled Lambda trains a simple model on historical data, and anomalies trigger SNS notifications.
import json
import os
from datetime import datetime, timedelta

import boto3
import numpy as np
from sklearn.ensemble import IsolationForest

cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')


def get_metric_data(namespace, metric, dimension_name, dimension_value, hours=168):
    """Fetch 7 days of 5-minute averages for training, oldest first."""
    end = datetime.utcnow()
    start = end - timedelta(hours=hours)
    datapoints = []
    # get_metric_statistics rejects requests for more than 1,440 datapoints,
    # and 7 days at 5-minute resolution is 2,016, so fetch in 4-day chunks.
    while start < end:
        chunk_end = min(start + timedelta(hours=96), end)
        response = cloudwatch.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric,
            Dimensions=[{'Name': dimension_name, 'Value': dimension_value}],
            StartTime=start,
            EndTime=chunk_end,
            Period=300,  # 5-minute intervals
            Statistics=['Average']
        )
        datapoints.extend(response['Datapoints'])
        start = chunk_end
    datapoints.sort(key=lambda dp: dp['Timestamp'])
    return [dp['Average'] for dp in datapoints]


def detect_anomalies(metric_data, contamination=0.05):
    """Use Isolation Forest to detect anomalies."""
    if len(metric_data) < 100:
        return []  # Not enough data to train

    # Reshape for sklearn: one row per datapoint, one feature (the value)
    X = np.array(metric_data).reshape(-1, 1)

    # Train Isolation Forest
    model = IsolationForest(
        contamination=contamination,  # Expected fraction of anomalies
        random_state=42,
        n_estimators=100
    )
    predictions = model.fit_predict(X)

    # Return indices of anomalies (-1 = anomaly, 1 = normal)
    anomalies = [i for i, p in enumerate(predictions) if p == -1]
    return anomalies


def lambda_handler(event, context):
    """Main Lambda function; runs every 5 minutes."""
    metrics_to_monitor = [
        {
            # Container Insights publishes pod_cpu_utilization under the
            # ContainerInsights namespace, not AWS/EKS
            'namespace': 'ContainerInsights',
            'metric': 'pod_cpu_utilization',
            'dimension_name': 'ClusterName',
            'dimension_value': 'production-cluster',
            'alert_name': 'EKS CPU Anomaly'
        },
        {
            'namespace': 'AWS/ApplicationELB',
            'metric': 'TargetResponseTime',
            'dimension_name': 'LoadBalancer',
            'dimension_value': 'app/prod-alb/abc123',
            'alert_name': 'ALB Latency Anomaly'
        }
    ]

    for metric_config in metrics_to_monitor:
        data = get_metric_data(
            metric_config['namespace'],
            metric_config['metric'],
            metric_config['dimension_name'],
            metric_config['dimension_value']
        )
        if not data:
            continue

        anomalies = detect_anomalies(data)

        # Check if the LATEST data point is anomalous
        if len(data) - 1 in anomalies:
            current_value = data[-1]
            mean_value = float(np.mean(data))
            std_value = float(np.std(data))
            # Guard against a flat series, where the standard deviation is zero
            deviation = abs(current_value - mean_value) / std_value if std_value else 0.0

            sns.publish(
                # Topic ARN comes from the SNS_TOPIC_ARN variable set in Terraform
                TopicArn=os.environ['SNS_TOPIC_ARN'],
                # SNS subjects must be ASCII, so no emoji here
                Subject=f"{metric_config['alert_name']} Detected",
                Message=json.dumps({
                    'alert': metric_config['alert_name'],
                    'current_value': round(current_value, 2),
                    'mean': round(mean_value, 2),
                    'std_deviation': round(std_value, 2),
                    'deviation_factor': round(deviation, 1),
                    'metric': metric_config['metric'],
                    'cluster': metric_config['dimension_value'],
                    'timestamp': datetime.utcnow().isoformat()
                }, indent=2)
            )

    return {'statusCode': 200, 'body': 'Anomaly check complete'}
Deploying with Terraform
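# Schedule Lambda to run every 5 minutes
resource "aws_cloudwatch_event_rule" "anomaly_check" {
  name                = "anomaly-detection-schedule"
  description         = "Trigger anomaly detection every 5 minutes"
  schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "anomaly_lambda" {
  rule = aws_cloudwatch_event_rule.anomaly_check.name
  arn  = aws_lambda_function.anomaly_detector.arn
}

# Allow EventBridge to invoke the function; without this the schedule never triggers it
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.anomaly_detector.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.anomaly_check.arn
}

# The IAM role and SNS topic referenced below are defined elsewhere; the role
# needs cloudwatch:GetMetricStatistics and sns:Publish
resource "aws_lambda_function" "anomaly_detector" {
  filename      = "anomaly_detector.zip"
  function_name = "infrastructure-anomaly-detector"
  role          = aws_iam_role.anomaly_lambda_role.arn
  handler       = "main.lambda_handler"
  runtime       = "python3.11"
  timeout       = 60
  memory_size   = 256

  layers = [
    "arn:aws:lambda:us-east-1:xxx:layer:sklearn-layer:1"
  ]

  environment {
    variables = {
      SNS_TOPIC_ARN = aws_sns_topic.anomaly_alerts.arn
    }
  }
}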
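Isolation Forest is a good fit here: it's unsupervised (no labeled data needed), handles high-dimensional data well, and is fast enough to run in a Lambda function. One caveat: trained on a single value per data point, as in the code above, the model has no notion of time of day, so a legitimate daily spike can still look rare. For more complex patterns (seasonality, trends), add time-based features or consider Facebook's Prophet or AWS's built-in anomaly detection in CloudWatch.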
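To let an Isolation Forest account for daily seasonality, one option is to train on a small feature vector instead of the raw value alone. The sketch below is an assumption, not part of the pipeline above: `build_features` and `build_feature_matrix` are hypothetical helpers that encode time of day as a point on a circle (sine and cosine), so 23:55 and 00:05 end up close together.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def build_features(timestamp, value):
    """Return [value, sin(time of day), cos(time of day)] for one datapoint."""
    seconds = timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second
    angle = 2 * math.pi * seconds / SECONDS_PER_DAY
    return [value, math.sin(angle), math.cos(angle)]


def build_feature_matrix(datapoints):
    """Convert CloudWatch-style datapoints into time-ordered feature rows."""
    ordered = sorted(datapoints, key=lambda dp: dp['Timestamp'])
    return [build_features(dp['Timestamp'], dp['Average']) for dp in ordered]
```

The rows can be passed to `IsolationForest.fit_predict` in place of the single reshaped column. This requires keeping the `Timestamp` of each datapoint rather than discarding it after sorting.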
Beyond Basic Detection: The Maturity Model
- Level 1 — Static Thresholds: CPU > 80% → alert. Where most teams are today.
- Level 2 — Statistical Anomaly Detection: Z-score or IQR-based detection. Better, but doesn't handle seasonality.
- Level 3 — ML Anomaly Detection: Isolation Forest, autoencoders. Learns "normal" patterns. This article covers this level.
- Level 4 — Predictive + Automated: Forecasts issues before they happen and auto-remediates (scales pods, rolls back deployments).
- Level 5 — AIOps: Full correlation across metrics, logs, and traces. Root cause analysis. This is where LLMs start to shine — summarizing incidents and suggesting fixes.
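For comparison with the ML approach, Level 2 can be sketched with the standard library alone. The two functions below are illustrative (the names and default thresholds are my own choices): one flags points more than a given number of standard deviations from the mean, the other flags points outside the usual 1.5 × IQR fences. Neither knows anything about time of day, which is why this level struggles with seasonality.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # A flat series has no outliers by this measure
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]
```

Both take a plain list of floats, so they can be run against the same series the Lambda fetches from CloudWatch.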
What's Next: LLMs for Incident Response
The next frontier is using large language models to correlate alerts, summarize incidents, and suggest runbook actions. Imagine getting a Slack message that says:
"API latency spiked 3x in the last 10 minutes. This correlates with a
deployment to the payment-service at 14:32. The deployment changed
database connection pooling settings. Suggested action: rollback
payment-service to v2.3.1 (previous known-good version)."
This is achievable today by feeding CloudWatch anomalies + deployment logs + git history into an LLM via Lambda. The technology is here — it's a matter of building the integration.
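As a starting point for that integration, the sketch below only assembles the prompt; the call to the model itself is left out because it depends on the provider you choose. Everything here is hypothetical: the function name, the field names on the deployment and commit records (`time`, `service`, `version`, `sha`, `message`), and the wording of the instructions.

```python
import json


def build_incident_prompt(anomaly, deployments, commits):
    """Assemble a plain-text prompt from an anomaly and recent change context."""
    lines = [
        "You are assisting an on-call engineer.",
        "Summarize the likely cause of the anomaly below and suggest one action.",
        "",
        "Anomaly:",
        json.dumps(anomaly, indent=2, sort_keys=True),
        "",
        "Recent deployments:",
    ]
    # Fall back to an explicit "none" so the model is not left guessing
    lines += [
        f"- {d['time']} {d['service']} -> {d['version']}" for d in deployments
    ] or ["- none"]
    lines += ["", "Recent commits:"]
    lines += [f"- {c['sha'][:7]} {c['message']}" for c in commits] or ["- none"]
    return "\n".join(lines)
```

The `anomaly` argument can be the same dictionary the Lambda already publishes to SNS, so the prompt builder slots in after detection without changing the detection code.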