AWS CloudWatch Logs Insights: Query, Analyze and Alert on Logs (2026)
AWS CloudWatch Logs Insights is a fully managed, interactive query service that lets you search and analyze log data stored in CloudWatch Logs. With a purpose-built query language, you can extract fields, aggregate data, calculate percentiles, detect anomalies, and build real-time metric filters — all without moving data to a separate analytics platform. This guide covers everything from the query language fundamentals to production automation with Python boto3 and Terraform.
Table of Contents
- 1. Logs Insights vs CloudWatch Metrics vs Athena
- 2. Query Syntax Deep Dive
- 3. Essential Queries Library
- 4. The parse Command — Unstructured Log Extraction
- 5. Aggregation and Visualization
- 6. Cross-Log-Group Queries
- 7. Metric Filters and Alarms
- 8. Contributor Insights
- 9. Log Anomaly Detection
- 10. Automation with boto3 and Terraform
- FAQ
1. Logs Insights vs CloudWatch Metrics vs Athena — When to Use Which
AWS gives you three main tools for log and metric analysis, and choosing the right one avoids paying 10x for the wrong tool. Each excels in a different scenario.
CloudWatch Metrics
Pre-aggregated numerical time series. Best for real-time alerting and dashboards. AWS services automatically publish metrics (Lambda duration, RDS connections, ALB 5xx). You cannot run ad-hoc queries against raw events — metrics are already summarised. Cost: $0.30 per custom metric per month (after 10 free); standard AWS service metrics are free.
CloudWatch Logs Insights
Ad-hoc SQL-like query engine over raw log events stored in CloudWatch Logs. Best for: debugging incidents, investigating errors, analyzing specific time windows, extracting fields from unstructured logs. Queries are fast (seconds to minutes depending on volume) and charged per GB scanned at $0.005/GB. There is no infrastructure to manage — you query and pay only for what you scan.
Amazon Athena
Serverless SQL query engine over S3. Best for: long-term historical analysis, joining logs with other datasets, complex SQL (window functions, CTEs, JOINs across tables). Logs must be exported from CloudWatch to S3 first (via export task or Kinesis Firehose). Athena charges $5 per TB scanned. For logs archived to S3 in Parquet format with partitioning, Athena is 10-100x cheaper than Logs Insights for large-scale queries.
| Criteria | Logs Insights | Athena | CloudWatch Metrics |
|---|---|---|---|
| Data source | CloudWatch Logs | S3 | CloudWatch Metrics store |
| Query latency | Seconds–minutes | Seconds–minutes | Sub-second (pre-agg) |
| Cost per GB | $0.005 | $0.005 (Parquet: ~$0.0005) | Per metric/month |
| Real-time alerting | No (query-based) | No | Yes |
| SQL JOINs | No | Yes | No |
| Setup required | None | Export pipeline | None (AWS metrics) |
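To make the trade-off concrete, here is a back-of-the-envelope sketch using only the list prices quoted above. The function names and the `reduction_factor` parameter are hypothetical; the factor stands in for whatever saving compression and partition pruning give you, which varies widely by dataset.

```python
# Prices quoted in this guide; check current AWS pricing for your Region.
LOGS_INSIGHTS_USD_PER_GB = 0.005
ATHENA_USD_PER_TB = 5.0


def logs_insights_cost(gb_scanned: float) -> float:
    """Estimated cost of one Logs Insights query that scans gb_scanned GB."""
    return gb_scanned * LOGS_INSIGHTS_USD_PER_GB


def athena_cost(gb_scanned: float, reduction_factor: float = 1.0) -> float:
    """Estimated Athena cost for the same logical data.

    reduction_factor is an assumption: how many times less data Athena
    reads thanks to columnar storage and partition pruning (1.0 = none).
    """
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    effective_gb = gb_scanned / reduction_factor
    return effective_gb / 1000 * ATHENA_USD_PER_TB


if __name__ == "__main__":
    for gb in (10, 500, 5000):
        print(f"{gb:>5} GB  insights=${logs_insights_cost(gb):.2f}  "
              f"athena(raw)=${athena_cost(gb):.2f}  "
              f"athena(10x smaller)=${athena_cost(gb, 10):.2f}")
```

At identical per-GB prices, the difference comes entirely from how much data each engine has to read, which is why the storage format matters more than the service.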
2. Query Syntax Deep Dive
Logs Insights has its own query language; this section covers six core commands. Commands are evaluated in order from top to bottom, and each command passes its results to the next through the pipe character (|), conventionally written at the start of a new line.
fields — Select Columns
Specify which fields to display. CloudWatch auto-discovers fields from JSON logs; for non-JSON logs you use parse to extract fields first.
fields @timestamp, @message, @logStream
| sort @timestamp desc
| limit 50
filter — WHERE Clause
Filter events by field values. Supports comparison operators, logical operators (and, or, not), and the like operator for regex or substring matching.
fields @timestamp, @message
| filter @message like /ERROR/
| filter @logStream not like /test/
| sort @timestamp desc
| limit 100
stats — GROUP BY Aggregation
Aggregate data with functions: count(), count_distinct(), sum(), avg(), min(), max(), pct(field, percentile), stddev(). Use by to group, and bin() to bucket timestamps.
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort @timestamp asc
sort — ORDER BY
fields @timestamp, statusCode, requestId
| filter statusCode >= 500
| sort statusCode desc, @timestamp desc
limit — TOP N
Caps the number of rows returned. Default is 1000; max is 10,000. Note that limit bounds the result size, not the data scanned: to keep cost and response time predictable, also narrow the time range and the set of log groups.
parse — Extract Fields from Text
Extracts fields from unstructured log text using glob patterns or regex. Detailed in Section 4.
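When queries are generated from code, it can help to assemble them from individual commands rather than hand-editing long strings. The sketch below assumes nothing beyond the pipe-separated layout used in the examples above; `build_query` is a hypothetical helper, not part of any AWS SDK.

```python
def build_query(*commands: str) -> str:
    """Join Logs Insights commands into one query string, one command per line."""
    cleaned = [c.strip() for c in commands if c.strip()]
    if not cleaned:
        raise ValueError("at least one command is required")
    # Each command after the first is introduced by a pipe on its own line.
    return "\n| ".join(cleaned)


query = build_query(
    "fields @timestamp, @message",
    "filter @message like /ERROR/",
    "sort @timestamp desc",
    "limit 50",
)
print(query)
```

The resulting string can be pasted into the console or passed as the query string to the API.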
3. Essential Queries Library
The following queries cover the most common production debugging and monitoring scenarios. Copy them directly into the Logs Insights console or embed them in automation scripts.
Lambda: Error Rate and Top Error Messages
# Lambda errors in the last 1 hour — count by error type
filter @message like /ERROR/
| parse @message "* [*] *" as level, requestId, errorMsg
| stats count(*) as errorCount by errorMsg
| sort errorCount desc
| limit 20
Lambda: P99 Cold Start and Duration
filter @type = "REPORT"
| fields @duration, @billedDuration, @initDuration, @memorySize, @maxMemoryUsed
| stats
count(*) as invocations,
avg(@duration) as avgDuration,
pct(@duration, 95) as p95Duration,
pct(@duration, 99) as p99Duration,
avg(@initDuration) as avgColdStart,
count(@initDuration) as coldStarts
by bin(5m)
Lambda: Throttles and Timeout Detection
filter @message like /Task timed out/ or @message like /Throttling/
| stats count(*) as incidents by bin(1m)
| sort @timestamp desc
API Gateway: 5xx Errors by Endpoint
fields @timestamp, status, resourcePath, httpMethod, responseLatency
| filter status >= 500
| stats
count(*) as errorCount,
avg(responseLatency) as avgLatency
by resourcePath, httpMethod
| sort errorCount desc
| limit 25
API Gateway: Slowest Endpoints (P95)
fields responseLatency, resourcePath, httpMethod
| filter ispresent(responseLatency)
| stats
pct(responseLatency, 95) as p95Latency,
pct(responseLatency, 99) as p99Latency,
count(*) as requestCount
by resourcePath, httpMethod
| sort p95Latency desc
| limit 20
ECS/EKS: Container OOM and Crash Detection
fields @timestamp, @message, @logStream
| filter @message like /OOMKilled/ or @message like /CrashLoopBackOff/ or @message like /exit code 137/
| stats count(*) as crashes by @logStream, bin(10m)
| sort crashes desc
ECS Task: Application Errors with Container Name
fields @timestamp, @message, @logStream
| filter @message like /FATAL/ or @message like /Exception/ or @message like /ERROR/
| parse @logStream "*/container/*" as taskId, containerName
| stats count(*) as errorCount by containerName, bin(5m)
| sort @timestamp desc
RDS: Slow Queries (requires slow query log enabled)
fields @timestamp, @message
| filter @message like /Query_time/
| parse @message "Query_time: * Lock_time: * Rows_examined: *" as queryTime, lockTime, rowsExamined
| filter queryTime > 1
| stats
count(*) as slowQueryCount,
avg(queryTime) as avgQueryTime,
max(queryTime) as maxQueryTime
by bin(5m)
| sort @timestamp desc
VPC Flow Logs: Top Talkers (Bytes)
fields srcAddr, dstAddr, bytes, packets, action
| filter action = "ACCEPT"
| stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 20
VPC Flow Logs: Rejected Traffic by Port
fields srcAddr, dstAddr, dstPort, protocol, action
| filter action = "REJECT"
| stats count(*) as rejectCount by dstPort, protocol
| sort rejectCount desc
| limit 25
CloudTrail: Unauthorized API Calls
fields @timestamp, userIdentity.arn, eventName, sourceIPAddress, errorCode
| filter errorCode like /UnauthorizedAccess/ or errorCode like /AccessDenied/
| stats count(*) as deniedCount by userIdentity.arn, eventName
| sort deniedCount desc
| limit 20
4. The parse Command — Extracting Fields from Unstructured Logs
Not all logs are JSON. The parse command extracts fields from free-text log lines using either glob patterns (with * as wildcard) or regular expressions. Extracted fields become queryable just like native JSON fields.
Glob Pattern Parsing — Apache Access Logs
Apache combined log format: 192.168.1.1 - - [10/Jun/2026:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234 "-" "curl/7.68"
parse @message '* - - [*] "* * *" * * "-" "*"' as clientIp, requestTime, httpMethod, uriPath, httpVersion, statusCode, responseBytes, userAgent
| filter statusCode >= 400
| stats count(*) as errorCount by statusCode, uriPath
| sort errorCount desc
| limit 20
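To build intuition for how a glob pattern maps onto a log line, the following sketch emulates the idea in plain Python: every * becomes a capture group and everything else is matched literally. It is an illustration only, not how Logs Insights is implemented, and `glob_parse` is a hypothetical name.

```python
import re


def glob_parse(message, pattern, names):
    """Emulate a glob-style parse: each * captures text, the rest is literal."""
    literals = pattern.split("*")
    if len(literals) - 1 != len(names):
        raise ValueError("pattern must contain exactly one * per field name")
    # Escape the literal pieces and put a lazy capture group where each * was.
    regex = "(.*?)".join(re.escape(part) for part in literals)
    match = re.fullmatch(regex, message)
    if match is None:
        return None
    return dict(zip(names, match.groups()))


line = ('192.168.1.1 - - [10/Jun/2026:12:00:00 +0000] '
        '"GET /api/users HTTP/1.1" 200 1234 "-" "curl/7.68"')
fields = glob_parse(
    line,
    '* - - [*] "* * *" * * "-" "*"',
    ["clientIp", "requestTime", "httpMethod", "uriPath",
     "httpVersion", "statusCode", "responseBytes", "userAgent"],
)
print(fields)
```

Running a sample line through a helper like this before saving a query is a cheap way to check that the number of wildcards matches the number of field names.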
Regex Parsing — Custom Application Logs
For logs like: 2026-06-09T10:15:30Z WARN [OrderService] processOrder latency=342ms orderId=ORD-8821 userId=U4491
parse @message /(?<level>\w+)\s+\[(?<service>\w+)\]\s+\w+\s+latency=(?<latencyMs>\d+)ms\s+orderId=(?<orderId>\S+)\s+userId=(?<userId>\S+)/
| filter level = "WARN" or level = "ERROR"
| stats avg(latencyMs) as avgLatency, max(latencyMs) as maxLatency, count(*) as count by service
| sort avgLatency desc
Parsing Nested JSON Fields
CloudWatch auto-discovers fields in JSON logs, including nested ones. Address nested fields with dot notation, or parse the string manually when the JSON is embedded in plain text:
# For logs with nested structure like: {"event": {"type": "order", "amount": 99.99}, "userId": "U123"}
fields @timestamp, event.type, event.amount, userId
| filter event.type = "order"
| stats sum(event.amount) as totalRevenue, count(*) as orderCount by bin(1h)
| sort @timestamp asc
Extracting Multiple Fields with One Parse
# Extract from ELB access log format
parse @message "* * * * * * * * * * * * \"*\" \"*\" * *" as type, time, elb, client, target, requestProcessingTime, targetProcessingTime, responseProcessingTime, elbStatusCode, targetStatusCode, receivedBytes, sentBytes, request, userAgent, sslCipher, sslProtocol
| filter elbStatusCode like /5/
| stats count(*) as errorCount, avg(targetProcessingTime) as avgTargetTime by elbStatusCode
| sort errorCount desc
Glob parsing (parse with *) is typically simpler and faster than regex parsing. Because Logs Insights bills per GB scanned, the benefit is shorter query time and easier maintenance rather than a lower bill. Use regex only when the log format requires it, for example when fields are in variable order or separated by variable whitespace.
5. Aggregation and Visualization
Logs Insights provides rich aggregation functions that go beyond simple counts. Combined with the time-series visualization in the console, you can build meaningful operational charts directly from log data.
Percentile Aggregations
Percentiles are essential for latency analysis. The pct(field, N) function returns the Nth percentile value. Always report P99 alongside average — averages hide tail latency problems.
filter @type = "REPORT"
| stats
count(*) as requests,
min(@duration) as minMs,
avg(@duration) as avgMs,
pct(@duration, 50) as p50Ms,
pct(@duration, 90) as p90Ms,
pct(@duration, 95) as p95Ms,
pct(@duration, 99) as p99Ms,
max(@duration) as maxMs
by bin(10m)
| sort @timestamp asc
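The point about averages hiding tail latency is easy to see with a small worked example. The sketch below uses a nearest-rank percentile, which is an assumption made for illustration; Logs Insights may compute pct() with a different method, so treat the numbers as indicative only.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank definition."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones (durations in milliseconds).
durations_ms = [100] * 98 + [5000] * 2

avg_ms = mean(durations_ms)
p50_ms = nearest_rank_percentile(durations_ms, 50)
p99_ms = nearest_rank_percentile(durations_ms, 99)
print(f"avg={avg_ms} p50={p50_ms} p99={p99_ms}")
```

The average (198 ms) looks healthy, while P99 (5000 ms) shows that some users wait 50 times longer than the median.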
Time Series Graphs
When your query uses stats ... by bin(N), where N is a time interval, the Logs Insights console automatically renders a time series chart. Open the Visualization tab after running the query. bin() takes a number followed by a time unit; common choices are 1s, 1m, 5m, 10m, 30m, 1h, 6h, and 1d.
# Error rate as percentage over time
fields @timestamp, @message
| filter @message like /ERROR/ or @message like /INFO/
| stats
sum(strcontains(@message, "ERROR")) as errors,
count(*) as total
by bin(5m)
| fields errors / total * 100 as errorRatePct
| sort @timestamp asc
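As a rough mental model of what grouping by bin(5m) does, the sketch below floors each event timestamp to the start of its 5-minute interval and counts events per bucket. It works on epoch seconds for simplicity; the function names are hypothetical and the service's exact rounding rules are not reproduced here.

```python
from collections import Counter


def bin_timestamp(epoch_seconds: int, period_seconds: int) -> int:
    """Floor a timestamp to the start of its bucket."""
    return epoch_seconds - (epoch_seconds % period_seconds)


def count_by_bin(timestamps, period_seconds=300):
    """Count events per bucket, ordered by bucket start time."""
    counts = Counter(bin_timestamp(ts, period_seconds) for ts in timestamps)
    return dict(sorted(counts.items()))


events = [1000, 1100, 1199, 1200, 1499, 1500]
print(count_by_bin(events))
```

Each key in the result corresponds to one point on the time series chart, which is why a smaller bin gives a more detailed but noisier graph.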
Saving Queries to the Console
After writing a useful query, click Save in the top-right of the query editor. Queries are saved by name to your account and Region, optionally together with the log groups they target, so anyone with access to the account can rerun them.
To save a query via CLI:
aws logs put-query-definition \
--name "Lambda-P99-Duration" \
--log-group-names "/aws/lambda/myapp-processor" \
--query-string "filter @type = \"REPORT\" | stats pct(@duration, 99) as p99 by bin(5m) | sort @timestamp asc"
count_distinct for Unique Analysis
# Unique users hitting errors per hour
filter statusCode >= 500
| stats count_distinct(userId) as uniqueAffectedUsers, count(*) as totalErrors by bin(1h)
| sort @timestamp desc
6. Cross-Log-Group Queries
One of Logs Insights' most powerful features is the ability to query multiple log groups in a single query, eliminating the need to run separate queries and manually correlate results. This is invaluable for microservices architectures where a single user request spans multiple services.
Selecting Multiple Log Groups
In the console, hold Ctrl (Windows) or Cmd (Mac) and click multiple log groups in the selector. Alternatively, use the API to specify an array of log group names:
aws logs start-query \
--log-group-names "/aws/lambda/order-service" "/aws/lambda/payment-service" "/aws/lambda/notification-service" \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string "fields @timestamp, @message, @logStream | filter @message like /ERROR/ | sort @timestamp desc | limit 100"
Correlating Requests Across Services by Trace ID
# Run against /aws/lambda/order-service and /aws/lambda/payment-service simultaneously
fields @timestamp, @message, @logStream
| filter @message like /traceId=abc123xyz/
| sort @timestamp asc
Cross-Account Queries (Observability Access Manager)
With AWS Observability Access Manager (OAM), you can query log groups from linked source accounts from a central monitoring account. Set up the link once:
# In the monitoring account — create a sink
aws oam create-sink --name "central-monitoring-sink"
# In each source account — create a link to the sink
aws oam create-link \
--label-template '$AccountName' \
--resource-types "AWS::Logs::LogGroup" \
--sink-identifier "arn:aws:oam:us-east-1:MONITOR_ACCOUNT_ID:sink/SINK_ID"
Account-Level Query Limits
Logs Insights has the following service quotas to be aware of:
- Maximum log groups per query: 50
- Concurrent queries per account per region: 30 (requestable increase)
- Maximum query duration: 60 minutes (queries that run longer time out)
- Maximum results returned: 10,000 events
- Maximum query string length: 10,000 characters
7. Metric Filters and Alarms
Metric filters bridge the gap between raw log data and CloudWatch alarms. They scan incoming log events in real time (not retroactively), extract or count matching patterns, and publish a custom CloudWatch metric. You can then create alarms on that metric, build dashboards, and trigger automated responses.
Creating a Metric Filter via CLI
# Create a metric filter to count Lambda errors
aws logs put-metric-filter \
--log-group-name "/aws/lambda/myapp-processor" \
--filter-name "LambdaErrorCount" \
--filter-pattern "ERROR" \
--metric-transformations \
metricName=LambdaErrors,metricNamespace=MyApp/Lambda,metricValue=1,defaultValue=0,unit=Count
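The same filter can be created from Python. This is a minimal sketch, assuming the boto3 `put_metric_filter` call on the CloudWatch Logs client; `metric_filter_params` is a hypothetical helper that only builds the request, so nothing is sent to AWS until you pass the result to the client.

```python
def metric_filter_params(log_group, filter_name, pattern, metric_name,
                         namespace, value="1", default_value=0.0, unit="Count"):
    """Build keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": value,
            "defaultValue": default_value,
            "unit": unit,
        }],
    }


params = metric_filter_params(
    log_group="/aws/lambda/myapp-processor",
    filter_name="LambdaErrorCount",
    pattern="ERROR",
    metric_name="LambdaErrors",
    namespace="MyApp/Lambda",
)
print(params)

# To apply it (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("logs").put_metric_filter(**params)
```

Keeping the request as plain data makes it easy to review or unit-test filter definitions before they reach an account.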
Creating an Alarm on the Metric Filter
aws cloudwatch put-metric-alarm \
--alarm-name "lambda-error-rate-high" \
--alarm-description "Lambda error count exceeds 10 in 5 minutes" \
--namespace "MyApp/Lambda" \
--metric-name "LambdaErrors" \
--statistic "Sum" \
--period 300 \
--evaluation-periods 1 \
--threshold 10 \
--comparison-operator "GreaterThanOrEqualToThreshold" \
--treat-missing-data "notBreaching" \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:on-call-alerts"
Advanced Filter Patterns
# Match JSON field values
aws logs put-metric-filter \
--log-group-name "/aws/lambda/api-handler" \
--filter-name "Http5xxErrors" \
--filter-pattern '{ $.statusCode >= 500 }' \
--metric-transformations \
metricName=Http5xxCount,metricNamespace=MyApp/API,metricValue=1,defaultValue=0
# Extract a numeric value from JSON logs (e.g., response latency)
# The JSON pattern matches events that contain a latency field; metricValue then publishes that field's value
aws logs put-metric-filter \
--log-group-name "/aws/lambda/api-handler" \
--filter-name "ResponseLatency" \
--filter-pattern '{ $.latency = * }' \
--metric-transformations \
metricName=ApiLatencyMs,metricNamespace=MyApp/API,metricValue='$.latency',defaultValue=0,unit=Milliseconds
Terraform: Metric Filter + Alarm
resource "aws_cloudwatch_log_metric_filter" "lambda_errors" {
name = "LambdaErrorCount"
log_group_name = "/aws/lambda/myapp-processor"
pattern = "ERROR"
metric_transformation {
name = "LambdaErrors"
namespace = "MyApp/Lambda"
value = "1"
default_value = "0"
unit = "Count"
}
}
resource "aws_cloudwatch_metric_alarm" "lambda_error_alarm" {
alarm_name = "lambda-error-rate-high"
alarm_description = "Lambda errors exceed threshold"
namespace = "MyApp/Lambda"
metric_name = "LambdaErrors"
statistic = "Sum"
period = 300
evaluation_periods = 1
threshold = 10
comparison_operator = "GreaterThanOrEqualToThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
tags = {
Environment = "production"
Team = "platform"
}
}
resource "aws_sns_topic" "alerts" {
name = "on-call-alerts"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = "oncall@example.com"
}
put_metric_data in a script.
8. Contributor Insights — Top-N Analysis
Contributor Insights analyses log events and identifies the top contributors to a metric — for example, the top 10 IP addresses making the most requests, or the top Lambda functions consuming the most memory. Unlike Logs Insights, Contributor Insights runs continuously and generates real-time CloudWatch metrics, so you can set alarms on them.
Use Cases
- API throttling: which client IDs or IP addresses are hitting rate limits most often
- Top error sources: which services or functions produce the most errors
- Heavy Lambda callers: which upstream services invoke a Lambda most frequently
- DynamoDB hot keys: which partition keys receive the most read/write traffic
- VPC top talkers: which source IPs generate the most bytes
Creating a Contributor Insights Rule via CLI
aws cloudwatch put-insight-rule \
--rule-name "TopApiCallersByIp" \
--rule-state ENABLED \
--rule-definition '{
"Schema": {
"Name": "CloudWatchLogRule",
"Version": 1
},
"LogGroupNames": ["/aws/apigateway/access-logs"],
"LogFormat": "JSON",
"Fields": {
"1": "$.sourceIp",
"2": "$.status"
},
"Contribution": {
"Keys": ["$.sourceIp"],
"ValueOf": "$.status",
"Filters": [
{
"Match": "$.status",
"GreaterThan": 499
}
]
},
"AggregateOn": "Count"
}'
Viewing Contributor Insights Reports
# Get top contributors for the last hour
aws cloudwatch get-insight-rule-report \
--rule-name "TopApiCallersByIp" \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 3600 \
--max-contributor-count 10 \
--metrics UniqueContributors SampleCount Sum Maximum
Alarm on Top Contributor Count (DDoS Detection)
# Contributor Insights metrics are read through the INSIGHT_RULE_METRIC math function
aws cloudwatch put-metric-alarm \
--alarm-name "api-unique-ips-spike" \
--metrics '[{"Id":"e1","Expression":"INSIGHT_RULE_METRIC(\"TopApiCallersByIp\", \"UniqueContributors\")","Period":60,"ReturnData":true}]' \
--evaluation-periods 3 \
--threshold 500 \
--comparison-operator GreaterThanThreshold \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:security-alerts"
9. Log Anomaly Detection
CloudWatch Logs anomaly detection uses machine learning to automatically learn the normal patterns of your log data and alert when deviations occur. Unlike threshold-based alarms, anomaly detection adapts to seasonal patterns (daily/weekly cycles) without manual tuning.
How It Works
Anomaly detection works at two levels in CloudWatch Logs:
- Log pattern anomaly detection: Identifies new or unusual log patterns. CloudWatch groups log events into patterns and alerts when an existing pattern appears at an abnormal rate or a new pattern emerges.
- Metric anomaly detection: Applied to CloudWatch metrics derived from logs (via metric filters). Creates a dynamic band around the expected metric value.
Enabling Log Anomaly Detection
# Enable anomaly detector on a log group
aws logs create-log-anomaly-detector \
--log-group-arn-list "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/myapp-processor" \
--detector-name "myapp-processor-anomaly" \
--evaluation-frequency "FIVE_MIN" \
--filter-pattern "" \
--anomaly-visibility-time 14
# List anomaly detectors
aws logs list-log-anomaly-detectors
# List unsuppressed anomalies found by a detector
aws logs list-anomalies \
--anomaly-detector-arn "arn:aws:logs:us-east-1:123456789012:anomaly-detector:DETECTOR_ID" \
--suppression-state UNSUPPRESSED
Metric Anomaly Detection (for Log-Derived Metrics)
# Enable anomaly detection on a metric filter-derived metric
aws cloudwatch put-anomaly-detector \
--namespace "MyApp/Lambda" \
--metric-name "LambdaErrors" \
--stat "Sum" \
--configuration '{"ExcludedTimeRanges":[], "MetricTimezone":"UTC"}'
# Create an alarm using anomaly detection band
# The metric, statistic, and period are defined inside --metrics; they cannot also be passed as top-level options
aws cloudwatch put-metric-alarm \
--alarm-name "lambda-errors-anomaly" \
--alarm-description "Lambda errors deviate from expected baseline" \
--evaluation-periods 3 \
--comparison-operator "GreaterThanUpperThreshold" \
--threshold-metric-id "ad1" \
--metrics '[
{"Id":"m1","ReturnData":true,"MetricStat":{"Metric":{"Namespace":"MyApp/Lambda","MetricName":"LambdaErrors"},"Period":300,"Stat":"Sum"}},
{"Id":"ad1","ReturnData":true,"Expression":"ANOMALY_DETECTION_BAND(m1, 2)","Label":"Lambda Errors (expected)"}
]' \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:on-call-alerts"
Suppressing Known Maintenance Windows
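# Suppress an anomaly (mark as expected) for 24 hours
aws logs update-anomaly \
--anomaly-detector-arn "arn:aws:logs:us-east-1:123456789012:anomaly-detector:DETECTOR_ID" \
--anomaly-id "ANOMALY_ID" \
--suppression-type LIMITED \
--suppression-period value=24,suppressionUnit=HOURS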
10. Automation with Python boto3 and Terraform
The real power of Logs Insights comes when you embed it in automated workflows — scheduled reports, CI/CD pipeline checks, post-deployment validation, and on-call runbooks. The API is asynchronous: you start a query, poll for completion, and retrieve results.
Python boto3: Run a Query and Poll for Results
import boto3
import time
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs', region_name='us-east-1')


def run_logs_insights_query(log_group, query_string, hours_back=1):
    """Run a Logs Insights query and wait for results."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=hours_back)

    # Start the query (the call returns immediately with a query ID)
    response = logs.start_query(
        logGroupName=log_group,
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString=query_string,
        limit=500
    )
    query_id = response['queryId']
    print(f"Query started: {query_id}")

    # Poll until the query reaches a terminal state
    while True:
        status = logs.get_query_results(queryId=query_id)
        query_status = status['status']
        if query_status == 'Complete':
            results = status['results']
            stats = status.get('statistics', {})
            print(f"Scanned {stats.get('bytesScanned', 0)/1e6:.2f} MB, "
                  f"returned {len(results)} rows")
            return results
        elif query_status in ('Failed', 'Cancelled', 'Timeout'):
            raise RuntimeError(f"Query {query_id} ended with status: {query_status}")
        else:
            # Running or Scheduled
            time.sleep(2)


# Example: find top 10 Lambda error messages in the last hour
results = run_logs_insights_query(
    log_group='/aws/lambda/myapp-processor',
    query_string="""
        filter @message like /ERROR/
        | parse @message "[*] *" as requestId, errorMsg
        | stats count(*) as errorCount by errorMsg
        | sort errorCount desc
        | limit 10
    """,
    hours_back=1
)

# Each row is a list of {'field': ..., 'value': ...} pairs; flatten it to a dict
for row in results:
    row_dict = {field['field']: field['value'] for field in row}
    print(f" {row_dict.get('errorCount', 'N/A'):>6} | {row_dict.get('errorMsg', 'N/A')[:80]}")
Querying Multiple Log Groups Simultaneously
def run_multi_group_query(log_groups: list, query_string: str, hours_back: int = 1):
    """Query multiple log groups at once (reuses the logs client and imports above)."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=hours_back)

    response = logs.start_query(
        logGroupNames=log_groups,  # List of up to 50 log groups
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString=query_string,
        limit=1000
    )
    query_id = response['queryId']

    while True:
        status = logs.get_query_results(queryId=query_id)
        if status['status'] == 'Complete':
            return status['results']
        # Treat Timeout as terminal too; otherwise the loop would never exit
        elif status['status'] in ('Failed', 'Cancelled', 'Timeout'):
            raise RuntimeError(f"Query failed: {status['status']}")
        time.sleep(2)


# Find errors across all microservices
service_log_groups = [
    '/aws/lambda/order-service',
    '/aws/lambda/payment-service',
    '/aws/lambda/inventory-service',
    '/aws/lambda/notification-service',
]

results = run_multi_group_query(
    log_groups=service_log_groups,
    query_string="""
        fields @timestamp, @message, @logStream
        | filter @message like /ERROR/ or @message like /FATAL/
        | stats count(*) as errorCount by @logStream
        | sort errorCount desc
        | limit 20
    """,
    hours_back=6
)
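The polling loops above wait indefinitely while a query is still running. One possible refinement is a small helper with an overall deadline, sketched below. It assumes only the `get_query_results` response shape used in this section (a `status` string and a `results` list); `wait_for_query` and `rows_to_dicts` are hypothetical names, and the `sleep` parameter exists so the helper can be exercised without real delays.

```python
import time

TERMINAL_FAILURES = {"Failed", "Cancelled", "Timeout"}


def wait_for_query(client, query_id, timeout_s=60.0, poll_s=2.0, sleep=time.sleep):
    """Poll get_query_results until the query completes, fails, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return response["results"]
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Query {query_id} ended with status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Gave up waiting for query {query_id}")
        sleep(poll_s)


def rows_to_dicts(results):
    """Convert rows of {'field': ..., 'value': ...} pairs into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]
```

Because the client is passed in, the same helper works with a real `boto3.client('logs')` or with a stub in unit tests.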
Scheduled Daily Error Report via SNS
import boto3
import json
import time
from datetime import datetime, timedelta, timezone


def lambda_handler(event, context):
    """Run nightly error summary and publish to SNS."""
    logs = boto3.client('logs')
    sns = boto3.client('sns')

    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=24)

    response = logs.start_query(
        logGroupNames=[
            '/aws/lambda/order-service',
            '/aws/lambda/payment-service',
        ],
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString="""
            filter @message like /ERROR/
            | stats count(*) as errors by @logStream
            | sort errors desc
        """,
        limit=50
    )
    query_id = response['queryId']

    while True:
        result = logs.get_query_results(queryId=query_id)
        if result['status'] == 'Complete':
            break
        # Fail fast instead of polling until the Lambda itself times out
        if result['status'] in ('Failed', 'Cancelled', 'Timeout'):
            raise RuntimeError(f"Query {query_id} ended with status: {result['status']}")
        time.sleep(3)

    lines = ["Daily Error Summary - " + end_time.strftime("%Y-%m-%d"), ""]
    for row in result['results']:
        d = {f['field']: f['value'] for f in row}
        lines.append(f" {d.get('errors','?'):>6} errors | {d.get('@logStream','?')}")

    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:daily-reports',
        Subject='Daily Error Summary',
        Message='\n'.join(lines)
    )
    return {'statusCode': 200, 'body': json.dumps({'rowsReported': len(result['results'])})}
Terraform: Full Observability Stack (Metric Filter + Alarm + SNS)
variable "lambda_function_name" {
description = "Name of the Lambda function to monitor"
type = string
default = "myapp-processor"
}
variable "alert_email" {
description = "Email address for alerts"
type = string
}
locals {
log_group = "/aws/lambda/${var.lambda_function_name}"
}
# SNS Topic
resource "aws_sns_topic" "lambda_alerts" {
name = "${var.lambda_function_name}-alerts"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.lambda_alerts.arn
protocol = "email"
endpoint = var.alert_email
}
# Metric Filter: Count errors
resource "aws_cloudwatch_log_metric_filter" "errors" {
name = "${var.lambda_function_name}-ErrorCount"
log_group_name = local.log_group
pattern = "[level=ERROR, ...]"
metric_transformation {
name = "ErrorCount"
namespace = "MyApp/Lambda"
value = "1"
default_value = "0"
unit = "Count"
}
}
# Alarm: Error count threshold
resource "aws_cloudwatch_metric_alarm" "error_alarm" {
alarm_name = "${var.lambda_function_name}-errors-high"
alarm_description = "Lambda error count exceeds 10 in 5 minutes"
namespace = "MyApp/Lambda"
metric_name = "ErrorCount"
dimensions = {}
statistic = "Sum"
period = 300
evaluation_periods = 1
threshold = 10
comparison_operator = "GreaterThanOrEqualToThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.lambda_alerts.arn]
}
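# Metric Filter: Track timeouts
resource "aws_cloudwatch_log_metric_filter" "timeouts" {
  name           = "${var.lambda_function_name}-TimeoutCount"
  log_group_name = local.log_group
  # Inner quotes make the filter match the exact phrase rather than three separate terms
  pattern        = "\"Task timed out\""
  metric_transformation {
    name          = "TimeoutCount"
    namespace     = "MyApp/Lambda"
    value         = "1"
    default_value = "0"
    unit          = "Count"
  }
}
# Alarm: Timeout count threshold (the threshold of 1 is an assumed value; tune it for your workload)
resource "aws_cloudwatch_metric_alarm" "timeout_alarm" {
  alarm_name          = "${var.lambda_function_name}-timeouts-high"
  alarm_description   = "Lambda timeouts reach 1 or more in 5 minutes"
  namespace           = "MyApp/Lambda"
  metric_name         = "TimeoutCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
}
# Composite Alarm: fire only when both errors AND timeouts are high
resource "aws_cloudwatch_composite_alarm" "combined" {
  alarm_name        = "${var.lambda_function_name}-critical"
  alarm_description = "Errors AND timeouts are both elevated"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.error_alarm.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.timeout_alarm.alarm_name})"
  alarm_actions     = [aws_sns_topic.lambda_alerts.arn]
}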
Frequently Asked Questions
How long is log data retained in CloudWatch Logs?
By default, log groups have indefinite retention — you pay for storage indefinitely. Always set explicit retention policies. Common settings: 7 days for debug logs, 30 days for application logs, 90 days for access logs, 1 year for audit/compliance logs. Set via console or: aws logs put-retention-policy --log-group-name "/aws/lambda/myapp" --retention-in-days 30
Can I query logs from deleted log streams?
Not if the log stream itself was deleted: deleting a log stream permanently removes its events. What does survive is data from resources that no longer exist. As long as the log group and its streams exist and the data is within the retention period, you can query it regardless of whether the originating resource (Lambda function, EC2 instance) still exists.
Why is my Logs Insights query returning 0 results for JSON logs?
CloudWatch only auto-parses JSON if the entire log event is a valid JSON object starting with {. If your application wraps JSON in a prefix (e.g., INFO {"key": "value"}), CloudWatch treats it as plain text. Use parse @message "* *" as level, jsonPart to split off the prefix, then run a second parse (glob or regex) against jsonPart to pull out the values you need. Alternatively, configure your application to emit pure JSON log events.
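For applications written in Python, emitting pure JSON events can be as simple as a custom logging formatter. The following is a minimal sketch using only the standard library; the `JsonFormatter` class, the `fields` key passed through `extra`, and the logger name are all assumptions for illustration, and some runtimes (Lambda among them) may add their own prefixes or handlers that need separate configuration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object with no text prefix."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via extra={"fields": {...}}
        extra = getattr(record, "fields", None)
        if isinstance(extra, dict):
            payload.update(extra)
        return json.dumps(payload)


def make_logger(name="myapp"):
    """Create a logger whose only handler writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = make_logger()
logger.info("order placed", extra={"fields": {"orderId": "ORD-8821", "statusCode": 200}})
```

With events in this shape, fields such as statusCode and orderId become directly queryable without a parse step.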
What is the maximum time range for a Logs Insights query?
There is no hard limit on time range, but queries time out after 60 minutes. For very large log groups, a 7-day query may time out. Best practice: stay within 24 hours for high-volume log groups, and use 7–30 days only for low-volume groups.
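When a long range is unavoidable on a busy log group, one option is to split it into shorter windows and run one query per window. The helper below is a sketch of that idea; `split_time_range` is a hypothetical name, and it only produces the epoch-second pairs you would pass as start and end times. How events that fall exactly on a window boundary are treated is worth verifying before summing results across windows.

```python
from datetime import datetime, timedelta, timezone


def split_time_range(start, end, window=timedelta(hours=24)):
    """Split [start, end] into consecutive windows of at most `window` length.

    Returns a list of (start_epoch, end_epoch) tuples in seconds.
    """
    if end <= start:
        raise ValueError("end must be after start")
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + window, end)
        windows.append((int(cursor.timestamp()), int(upper.timestamp())))
        cursor = upper
    return windows


range_start = datetime(2026, 6, 1, tzinfo=timezone.utc)
range_end = range_start + timedelta(hours=60)
for window_start, window_end in split_time_range(range_start, range_end):
    print(window_start, window_end)
```

Smaller windows also make it easier to stop early once you have found the incident you were looking for.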
How do metric filters interact with log subscriptions?
Metric filters and log subscriptions (Kinesis, Lambda) are independent: both process the same incoming log events in parallel. A metric filter counts or extracts values; a subscription destination receives the full log event. You can have up to 2 subscription filters and up to 100 metric filters per log group.