AWS CloudWatch Logs Insights: Query, Analyze and Alert on Logs (2026)

AWS CloudWatch Logs Insights is a fully managed, interactive query service that lets you search and analyze log data stored in CloudWatch Logs. With a purpose-built query language, you can extract fields, aggregate data, calculate percentiles, detect anomalies, and build real-time metric filters — all without moving data to a separate analytics platform. This guide covers everything from the query language fundamentals to production automation with Python boto3 and Terraform.

1. Logs Insights vs CloudWatch Metrics vs Athena — When to Use Which

AWS gives you three main tools for log and metric analysis, and choosing the right one avoids paying 10x for the wrong tool. Each excels in a different scenario.

CloudWatch Metrics

Pre-aggregated numerical time series. Best for real-time alerting and dashboards. AWS services automatically publish metrics (Lambda duration, RDS connections, ALB 5xx). You cannot run ad-hoc queries against raw events — metrics are already summarised. Cost: $0.30 per custom metric per month (after 10 free); standard AWS service metrics are free.

CloudWatch Logs Insights

Ad-hoc SQL-like query engine over raw log events stored in CloudWatch Logs. Best for: debugging incidents, investigating errors, analyzing specific time windows, extracting fields from unstructured logs. Queries are fast (seconds to minutes depending on volume) and charged per GB scanned at $0.005/GB. There is no infrastructure to manage — you query and pay only for what you scan.

Amazon Athena

Serverless SQL query engine over S3. Best for: long-term historical analysis, joining logs with other datasets, complex SQL (window functions, CTEs, JOINs across tables). Logs must be exported from CloudWatch to S3 first (via export task or Kinesis Firehose). Athena charges $5 per TB scanned. For logs archived to S3 in Parquet format with partitioning, Athena is 10-100x cheaper than Logs Insights for large-scale queries.

Decision rule: Use Logs Insights for incident debugging and queries against recent logs (last 7–30 days). Use Athena for compliance audits, trend analysis over months of data, or when you need SQL JOINs. Use metrics for real-time alerting.

| Criteria | Logs Insights | Athena | CloudWatch Metrics |
|---|---|---|---|
| Data source | CloudWatch Logs | S3 | CloudWatch Metrics store |
| Query latency | Seconds–minutes | Seconds–minutes | Sub-second (pre-aggregated) |
| Cost per GB scanned | $0.005 | $0.005 (Parquet: ~$0.0005) | Per metric/month |
| Real-time alerting | No (query-based) | No | Yes |
| SQL JOINs | No | Yes | No |
| Setup required | None | Export pipeline | None (AWS metrics) |
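
To make the cost comparison concrete, here is a minimal sketch of the arithmetic using the list prices quoted above. The `scan_fraction` parameter is a hypothetical knob standing in for whatever reduction Parquet compression and partition pruning give you, and 1 TB is treated as 1,000 GB for simplicity.

```python
# Prices quoted in this guide (list prices; check your own Region).
LOGS_INSIGHTS_USD_PER_GB = 0.005
ATHENA_USD_PER_TB = 5.0


def logs_insights_cost(gb_scanned: float) -> float:
    """Cost of one Logs Insights query that scans `gb_scanned` GB."""
    return gb_scanned * LOGS_INSIGHTS_USD_PER_GB


def athena_cost(gb_raw: float, scan_fraction: float = 1.0) -> float:
    """Cost of one Athena query over `gb_raw` GB of logs.

    `scan_fraction` is an assumed share of the raw bytes that Athena
    actually reads after compression and partition pruning.
    """
    return gb_raw * scan_fraction * ATHENA_USD_PER_TB / 1000


if __name__ == "__main__":
    raw_gb = 2000  # illustrative volume: roughly 2 TB of logs
    print(f"Logs Insights:        ${logs_insights_cost(raw_gb):.2f}")
    print(f"Athena, raw text:     ${athena_cost(raw_gb):.2f}")
    print(f"Athena, 10% scanned:  ${athena_cost(raw_gb, 0.10):.2f}")
```

On raw text the two services cost about the same per query; the savings come entirely from how much less data Athena has to read once the logs are columnar and partitioned.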

2. Query Syntax Deep Dive

Logs Insights has its own query language with six core commands. Commands are evaluated in order from top to bottom, and each command passes its results to the next through the pipe character (|). By convention, each command goes on its own line.

fields — Select Columns

Specify which fields to display. CloudWatch auto-discovers fields from JSON logs; for non-JSON logs you use parse to extract fields first.

fields @timestamp, @message, @logStream
| sort @timestamp desc
| limit 50

filter — WHERE Clause

Filter events by field values. Supports comparison operators, logical operators (and, or, not), and the like operator for regex or substring matching.

fields @timestamp, @message
| filter @message like /ERROR/
| filter @logStream not like /test/
| sort @timestamp desc
| limit 100

stats — GROUP BY Aggregation

Aggregate data with functions: count(), count_distinct(), sum(), avg(), min(), max(), pct(field, percentile), stddev(). Use by to group, and bin() to bucket timestamps.

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort @timestamp asc
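
As a rough illustration of what bin() does, the sketch below floors timestamps to 5-minute buckets and counts events per bucket in plain Python. It is a local emulation of the idea, not the service's implementation, and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def bin_timestamp(ts: datetime, minutes: int) -> datetime:
    """Floor `ts` to the start of its `minutes`-wide bucket, like bin(5m)."""
    epoch = int(ts.timestamp())
    width = minutes * 60
    return datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)


def count_by_bin(timestamps, minutes=5):
    """Local equivalent of `stats count(*) by bin(5m)`."""
    return Counter(bin_timestamp(ts, minutes) for ts in timestamps)


events = [
    datetime(2026, 6, 9, 10, 1, 12, tzinfo=timezone.utc),
    datetime(2026, 6, 9, 10, 4, 59, tzinfo=timezone.utc),
    datetime(2026, 6, 9, 10, 5, 0, tzinfo=timezone.utc),
]
for bucket, n in sorted(count_by_bin(events).items()):
    print(bucket.isoformat(), n)
```

The first two events fall into the 10:00 bucket and the third opens the 10:05 bucket, which is why a spike can appear split across two adjacent bins.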

sort — ORDER BY

fields @timestamp, statusCode, requestId
| filter statusCode >= 500
| sort statusCode desc, @timestamp desc

limit — TOP N

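Caps the number of results returned. The maximum is 10,000 events; current AWS documentation says a query without limit returns up to that many. Note that limit trims the result set, while billing is based on data scanned, so it keeps responses small and fast but narrowing the time range is the more dependable way to control cost.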

parse — Extract Fields from Text

Extracts fields from unstructured log text using glob patterns or regex. Detailed in Section 4.

Tip: Logs Insights queries time out after 60 minutes of runtime. For very large log groups, narrow the time range first, then broaden only if needed. The console reports the amount of data scanned as a query runs and when it finishes, so test on a short window to estimate cost before widening it.
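
When queries are assembled in scripts, it can help to build them from individual commands rather than hand-editing one long string. The helper below is a small sketch of that idea; `build_query` is a hypothetical name, not part of any AWS SDK.

```python
def build_query(*commands: str) -> str:
    """Join Logs Insights commands with the pipe separator, one per line."""
    cleaned = [c.strip() for c in commands if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one command is required")
    return "\n| ".join(cleaned)


query = build_query(
    "fields @timestamp, @message",
    "filter @message like /ERROR/",
    "sort @timestamp desc",
    "limit 50",
)
print(query)
```

The resulting string has the same shape as the examples in this section and can be passed as the query string to the API or CLI.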

3. Essential Queries Library

The following queries cover the most common production debugging and monitoring scenarios. Copy them directly into the Logs Insights console or embed them in automation scripts.

Lambda: Error Rate and Top Error Messages

# Lambda errors in the last 1 hour — count by error type
filter @message like /ERROR/
| parse @message "* [*] *" as level, requestId, errorMsg
| stats count(*) as errorCount by errorMsg
| sort errorCount desc
| limit 20

Lambda: P99 Cold Start and Duration

filter @type = "REPORT"
| fields @duration, @billedDuration, @initDuration, @memorySize, @maxMemoryUsed
| stats
    count(*) as invocations,
    avg(@duration) as avgDuration,
    pct(@duration, 95) as p95Duration,
    pct(@duration, 99) as p99Duration,
    avg(@initDuration) as avgColdStart,
    count(@initDuration) as coldStarts
  by bin(5m)

Lambda: Throttles and Timeout Detection

filter @message like /Task timed out/ or @message like /Throttling/
| stats count(*) as incidents by bin(1m)
| sort @timestamp desc

API Gateway: 5xx Errors by Endpoint

fields @timestamp, status, resourcePath, httpMethod, responseLatency
| filter status >= 500
| stats
    count(*) as errorCount,
    avg(responseLatency) as avgLatency
  by resourcePath, httpMethod
| sort errorCount desc
| limit 25

API Gateway: Slowest Endpoints (P95)

fields responseLatency, resourcePath, httpMethod
| filter ispresent(responseLatency)
| stats
    pct(responseLatency, 95) as p95Latency,
    pct(responseLatency, 99) as p99Latency,
    count(*) as requestCount
  by resourcePath, httpMethod
| sort p95Latency desc
| limit 20

ECS/EKS: Container OOM and Crash Detection

fields @timestamp, @message, @logStream
| filter @message like /OOMKilled/ or @message like /CrashLoopBackOff/ or @message like /exit code 137/
| stats count(*) as crashes by @logStream, bin(10m)
| sort crashes desc

ECS Task: Application Errors with Container Name

fields @timestamp, @message, @logStream
| filter @message like /FATAL/ or @message like /Exception/ or @message like /ERROR/
| parse @logStream "*/*/*" as streamPrefix, containerName, taskId
| stats count(*) as errorCount by containerName, bin(5m)
| sort @timestamp desc

RDS: Slow Queries (requires slow query log enabled)

fields @timestamp, @message
| filter @message like /Query_time/
| parse @message "Query_time: * Lock_time: * Rows_examined: *" as queryTime, lockTime, rowsExamined
| filter queryTime > 1
| stats
    count(*) as slowQueryCount,
    avg(queryTime) as avgQueryTime,
    max(queryTime) as maxQueryTime
  by bin(5m)
| sort @timestamp desc

VPC Flow Logs: Top Talkers (Bytes)

fields srcAddr, dstAddr, bytes, packets, action
| filter action = "ACCEPT"
| stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 20

VPC Flow Logs: Rejected Traffic by Port

fields srcAddr, dstAddr, dstPort, protocol, action
| filter action = "REJECT"
| stats count(*) as rejectCount by dstPort, protocol
| sort rejectCount desc
| limit 25

CloudTrail: Unauthorized API Calls

fields @timestamp, userIdentity.arn, eventName, sourceIPAddress, errorCode
| filter errorCode like /Unauthorized/ or errorCode like /AccessDenied/
| stats count(*) as deniedCount by userIdentity.arn, eventName
| sort deniedCount desc
| limit 20

Pro tip: Save frequently used queries in the Logs Insights console under Queries > Saved queries. Saved queries are stored per account and Region, so teammates with access to the same account can open them by name.

4. The parse Command — Extracting Fields from Unstructured Logs

Not all logs are JSON. The parse command extracts fields from free-text log lines using either glob patterns (with * as wildcard) or regular expressions. Extracted fields become queryable just like native JSON fields.

Glob Pattern Parsing — Apache Access Logs

Apache combined log format: 192.168.1.1 - - [10/Jun/2026:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234 "-" "curl/7.68"

parse @message '* - - [*] "* * *" * * "-" "*"' as clientIp, requestTime, httpMethod, uriPath, httpVersion, statusCode, responseBytes, userAgent
| filter statusCode >= 400
| stats count(*) as errorCount by statusCode, uriPath
| sort errorCount desc
| limit 20
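
To see how a glob pattern maps onto fields, the sketch below emulates the idea locally by turning each * into a lazy capture group. This is an approximation for experimenting with patterns offline, not the matcher Logs Insights actually uses; `glob_parse` is a hypothetical helper.

```python
import re


def glob_parse(pattern: str, names: list[str], line: str):
    """Emulate a glob-style parse: each * captures one field."""
    parts = pattern.split("*")
    if len(parts) - 1 != len(names):
        raise ValueError("need exactly one name per * wildcard")
    # Literal text is escaped; every wildcard becomes a lazy capture group.
    regex = "(.*?)".join(re.escape(p) for p in parts)
    match = re.fullmatch(regex, line)
    return dict(zip(names, match.groups())) if match else None


LINE = ('192.168.1.1 - - [10/Jun/2026:12:00:00 +0000] '
        '"GET /api/users HTTP/1.1" 200 1234 "-" "curl/7.68"')
PATTERN = '* - - [*] "* * *" * * "-" "*"'
NAMES = ["clientIp", "requestTime", "httpMethod", "uriPath",
         "httpVersion", "statusCode", "responseBytes", "userAgent"]

fields = glob_parse(PATTERN, NAMES, LINE)
print(fields)
```

Note that the literal "-" in the pattern means only requests without a referer match; replace it with another * if you need both cases.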

Regex Parsing — Custom Application Logs

For logs like: 2026-06-09T10:15:30Z WARN [OrderService] processOrder latency=342ms orderId=ORD-8821 userId=U4491

parse @message /(?<level>\w+)\s+\[(?<service>\w+)\]\s+\w+\s+latency=(?<latencyMs>\d+)ms\s+orderId=(?<orderId>\S+)\s+userId=(?<userId>\S+)/
| filter level = "WARN" or level = "ERROR"
| stats avg(latencyMs) as avgLatency, max(latencyMs) as maxLatency, count(*) as count by service
| sort avgLatency desc
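
It is often quicker to debug a regex locally before pasting it into the console. The sketch below runs the same expression against the sample line using Python's re module, which spells named groups as (?P<name>...); the group names match the ones used in the query above.

```python
import re

LINE = ("2026-06-09T10:15:30Z WARN [OrderService] processOrder "
        "latency=342ms orderId=ORD-8821 userId=U4491")

PATTERN = re.compile(
    r"(?P<level>\w+)\s+\[(?P<service>\w+)\]\s+\w+\s+"
    r"latency=(?P<latencyMs>\d+)ms\s+"
    r"orderId=(?P<orderId>\S+)\s+userId=(?P<userId>\S+)"
)

match = PATTERN.search(LINE)
fields = match.groupdict() if match else {}
print(fields)
```

Every captured value is a string here; convert latencyMs with int() before doing arithmetic on it in your own scripts.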

Parsing Nested JSON Fields

CloudWatch auto-parses top-level JSON fields. For nested objects, use dot notation or parse the string manually:

# For logs with nested structure like: {"event": {"type": "order", "amount": 99.99}, "userId": "U123"}
fields @timestamp, event.type, event.amount, userId
| filter event.type = "order"
| stats sum(event.amount) as totalRevenue, count(*) as orderCount by bin(1h)
| sort @timestamp asc

Extracting Multiple Fields with One Parse

# Extract from ELB access log format
parse @message "* * * * * * * * * * * * \"*\" \"*\" * *" as type, time, elb, client, target, requestProcessingTime, targetProcessingTime, responseProcessingTime, elbStatusCode, targetStatusCode, receivedBytes, sentBytes, request, userAgent, sslCipher, sslProtocol
| filter elbStatusCode like /^5/
| stats count(*) as errorCount, avg(targetProcessingTime) as avgTargetTime by elbStatusCode
| sort errorCount desc

Note: Prefer the glob-based parse (with *) when the log format is fixed: it is easier to read and maintain than a regular expression. Both forms are billed the same way, since Logs Insights charges for data scanned rather than compute. Use regex only when the log format requires it, for example when fields are in variable order or separated by variable whitespace.

5. Aggregation and Visualization

Logs Insights provides rich aggregation functions that go beyond simple counts. Combined with the time-series visualization in the console, you can build meaningful operational charts directly from log data.

Percentile Aggregations

Percentiles are essential for latency analysis. The pct(field, N) function returns the Nth percentile value. Always report P99 alongside average — averages hide tail latency problems.

filter @type = "REPORT"
| stats
    count(*) as requests,
    min(@duration) as minMs,
    avg(@duration) as avgMs,
    pct(@duration, 50) as p50Ms,
    pct(@duration, 90) as p90Ms,
    pct(@duration, 95) as p95Ms,
    pct(@duration, 99) as p99Ms,
    max(@duration) as maxMs
  by bin(10m)
| sort @timestamp asc
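
The claim that averages hide tail latency is easy to check with a few lines of Python. The sketch below uses a simple nearest-rank percentile on made-up durations; CloudWatch may interpolate differently, so treat the numbers as illustrative only.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of `values` for 0 < pct <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 98 fast requests and 2 slow ones (durations in ms, made-up data)
durations = [20] * 98 + [2000] * 2

avg = sum(durations) / len(durations)
print(f"avg={avg:.1f} p50={percentile(durations, 50)} "
      f"p95={percentile(durations, 95)} p99={percentile(durations, 99)}")
```

The average lands near 60 ms and P95 still reads 20 ms, while P99 exposes the 2-second outliers that some users actually experienced.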

Time Series Graphs

When your query uses stats ... by bin(N), where N is a time interval, the Logs Insights console automatically renders a time series chart. Click the Visualization tab after running the query. bin() takes a number followed by a unit, for example 30s, 1m, 5m, 1h, or 1d.

# Error rate as percentage over time
# strcontains() returns 1 or 0, so summing it counts the ERROR lines
fields @timestamp, strcontains(@message, "ERROR") as isError
| filter @message like /ERROR/ or @message like /INFO/
| stats
    sum(isError) as errors,
    count(*) as total,
    sum(isError) / count(*) * 100 as errorRatePct
  by bin(5m)
| sort @timestamp asc

Saving Queries to the Console

After writing a useful query, click Save in the top-right of the query editor. Queries are saved by name, per account and Region, optionally together with the log groups you had selected. Anyone in the account with permission to view query definitions can then open the saved query.

To save a query via CLI:

aws logs put-query-definition \
  --name "Lambda-P99-Duration" \
  --log-group-names "/aws/lambda/myapp-processor" \
  --query-string "filter @type = \"REPORT\" | stats pct(@duration, 99) as p99 by bin(5m) | sort @timestamp asc"
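
The same operation is available from boto3 as put_query_definition. The wrapper below is a minimal sketch that takes the client as an argument so it can be exercised without AWS credentials; `save_query` is a hypothetical helper name.

```python
def save_query(logs_client, name, query_string, log_groups=None, definition_id=None):
    """Create or update a saved query and return its queryDefinitionId."""
    kwargs = {"name": name, "queryString": query_string}
    if log_groups:
        kwargs["logGroupNames"] = list(log_groups)
    if definition_id:
        # Passing an existing ID updates that definition instead of creating one.
        kwargs["queryDefinitionId"] = definition_id
    response = logs_client.put_query_definition(**kwargs)
    return response["queryDefinitionId"]


# Usage (requires AWS credentials):
#   import boto3
#   save_query(
#       boto3.client("logs"),
#       "Lambda-P99-Duration",
#       'filter @type = "REPORT" | stats pct(@duration, 99) as p99 by bin(5m)',
#       ["/aws/lambda/myapp-processor"],
#   )
```

Keeping the returned ID in version control alongside the query text makes later updates idempotent.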

count_distinct for Unique Analysis

# Unique users hitting errors per hour
filter statusCode >= 500
| stats count_distinct(userId) as uniqueAffectedUsers, count(*) as totalErrors by bin(1h)
| sort @timestamp desc

Visualization tip: Add the query result as a CloudWatch Dashboard widget by clicking Add to dashboard in the query results panel. The widget will re-run the query on each dashboard load and supports auto-refresh intervals down to 1 minute. Every run is billed for the data it scans, so pick the refresh interval with cost in mind.

6. Cross-Log-Group Queries

One of Logs Insights' most powerful features is the ability to query multiple log groups in a single query, eliminating the need to run separate queries and manually correlate results. This is invaluable for microservices architectures where a single user request spans multiple services.

Selecting Multiple Log Groups

In the console, hold Ctrl (Windows) or Cmd (Mac) and click multiple log groups in the selector. Alternatively, use the API to specify an array of log group names:

aws logs start-query \
  --log-group-names "/aws/lambda/order-service" "/aws/lambda/payment-service" "/aws/lambda/notification-service" \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string "fields @timestamp, @message, @logStream | filter @message like /ERROR/ | sort @timestamp desc | limit 100"

Correlating Requests Across Services by Trace ID

# Run against /aws/lambda/order-service and /aws/lambda/payment-service simultaneously
fields @timestamp, @message, @logStream
| filter @message like /traceId=abc123xyz/
| sort @timestamp asc

Cross-Account Queries (Observability Access Manager)

With AWS Observability Access Manager (OAM), you can query log groups from linked source accounts from a central monitoring account. Set up the link once:

# In the monitoring account — create a sink
aws oam create-sink --name "central-monitoring-sink"

# In each source account — create a link to the sink
aws oam create-link \
  --label-template '$AccountName' \
  --resource-types "AWS::Logs::LogGroup" \
  --sink-identifier "arn:aws:oam:us-east-1:MONITOR_ACCOUNT_ID:sink/SINK_ID"

Account-Level Query Limits

Logs Insights has the following service quotas to be aware of:

  • Maximum log groups per query: 50
  • Concurrent queries per account per region: 30 (requestable increase)
  • Maximum query duration: 60 minutes
  • Maximum results returned: 10,000 events
  • Maximum query string length: 10,000 characters

Note: When querying across many log groups, be especially careful about time range selection. Querying 50 log groups over 7 days can scan terabytes of data and generate significant cost. Always start with a narrow time window and expand only if needed.
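
If you need to cover more log groups than the per-query quota allows, one option is to split the list into batches and run one query per batch. The sketch below shows only the batching step; the log group names are invented for the example.

```python
MAX_LOG_GROUPS_PER_QUERY = 50  # per-query quota listed above


def batch_log_groups(log_groups, batch_size=MAX_LOG_GROUPS_PER_QUERY):
    """Yield consecutive slices of at most `batch_size` log group names."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    groups = list(log_groups)
    for start in range(0, len(groups), batch_size):
        yield groups[start:start + batch_size]


groups = [f"/aws/lambda/service-{i}" for i in range(120)]
batches = list(batch_log_groups(groups))
print([len(b) for b in batches])
```

Each batch counts against the concurrent-query quota as well, so large fleets are better processed a few batches at a time.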

7. Metric Filters and Alarms

Metric filters bridge the gap between raw log data and CloudWatch alarms. They scan incoming log events in real time (not retroactively), extract or count matching patterns, and publish a custom CloudWatch metric. You can then create alarms on that metric, build dashboards, and trigger automated responses.

Creating a Metric Filter via CLI

# Create a metric filter to count Lambda errors
aws logs put-metric-filter \
  --log-group-name "/aws/lambda/myapp-processor" \
  --filter-name "LambdaErrorCount" \
  --filter-pattern "ERROR" \
  --metric-transformations \
      metricName=LambdaErrors,metricNamespace=MyApp/Lambda,metricValue=1,defaultValue=0,unit=Count

Creating an Alarm on the Metric Filter

aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-error-rate-high" \
  --alarm-description "Lambda error count is 10 or more in 5 minutes" \
  --namespace "MyApp/Lambda" \
  --metric-name "LambdaErrors" \
  --statistic "Sum" \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --treat-missing-data "notBreaching" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:on-call-alerts"

Advanced Filter Patterns

# Match JSON field values
aws logs put-metric-filter \
  --log-group-name "/aws/lambda/api-handler" \
  --filter-name "Http5xxErrors" \
  --filter-pattern '{ $.statusCode >= 500 }' \
  --metric-transformations \
      metricName=Http5xxCount,metricNamespace=MyApp/API,metricValue=1,defaultValue=0

# Extract a numeric value from JSON logs (e.g., response latency)
aws logs put-metric-filter \
  --log-group-name "/aws/lambda/api-handler" \
  --filter-name "ResponseLatency" \
  --filter-pattern '{ $.latency = * }' \
  --metric-transformations \
      metricName=ApiLatencyMs,metricNamespace=MyApp/API,metricValue='$.latency',defaultValue=0,unit=Milliseconds

Terraform: Metric Filter + Alarm

resource "aws_cloudwatch_log_metric_filter" "lambda_errors" {
  name           = "LambdaErrorCount"
  log_group_name = "/aws/lambda/myapp-processor"
  pattern        = "ERROR"

  metric_transformation {
    name          = "LambdaErrors"
    namespace     = "MyApp/Lambda"
    value         = "1"
    default_value = "0"
    unit          = "Count"
  }
}

resource "aws_cloudwatch_metric_alarm" "lambda_error_alarm" {
  alarm_name          = "lambda-error-rate-high"
  alarm_description   = "Lambda errors exceed threshold"
  namespace           = "MyApp/Lambda"
  metric_name         = "LambdaErrors"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 10
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

resource "aws_sns_topic" "alerts" {
  name = "on-call-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@example.com"
}

Important: Metric filters only apply to log events ingested after the filter is created. They do not retroactively process historical log data. If you need historical data converted to metrics, use Logs Insights to query the history and push results via put_metric_data in a script.
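
A backfill script mostly comes down to reshaping query results into the structure put_metric_data expects. The sketch below does that conversion; it assumes the bucket column comes back as a "YYYY-MM-DD HH:MM:SS.mmm" string in UTC, which you should verify against your own query output, and `rows_to_metric_data` is a hypothetical helper.

```python
from datetime import datetime, timezone


def rows_to_metric_data(rows, metric_name, value_field, bin_field):
    """Convert Logs Insights result rows into MetricData entries.

    `rows` has the get_query_results shape: a list of rows, each a list of
    {"field": ..., "value": ...} dicts. `bin_field` is the column that holds
    the bucket timestamp, for example "bin(5m)".
    """
    metric_data = []
    for row in rows:
        cells = {cell["field"]: cell["value"] for cell in row}
        # Assumed timestamp format; adjust if your results differ.
        ts = datetime.strptime(cells[bin_field], "%Y-%m-%d %H:%M:%S.%f")
        metric_data.append({
            "MetricName": metric_name,
            "Timestamp": ts.replace(tzinfo=timezone.utc),
            "Value": float(cells[value_field]),
            "Unit": "Count",
        })
    return metric_data


# Usage (requires AWS credentials):
#   import boto3
#   data = rows_to_metric_data(results, "LambdaErrors", "errorCount", "bin(5m)")
#   boto3.client("cloudwatch").put_metric_data(Namespace="MyApp/Lambda", MetricData=data)
```

CloudWatch only accepts data points within a limited window in the past (about two weeks at the time of writing), so older history cannot be backfilled this way.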

8. Contributor Insights — Top-N Analysis

Contributor Insights analyses log events and identifies the top contributors to a metric — for example, the top 10 IP addresses making the most requests, or the top Lambda functions consuming the most memory. Unlike Logs Insights, Contributor Insights runs continuously and generates real-time CloudWatch metrics, so you can set alarms on them.

Use Cases

  • API throttling: which client IDs or IP addresses are hitting rate limits most often
  • Top error sources: which services or functions produce the most errors
  • Heavy Lambda callers: which upstream services invoke a Lambda most frequently
  • DynamoDB hot keys: which partition keys receive the most read/write traffic
  • VPC top talkers: which source IPs generate the most bytes

Creating a Contributor Insights Rule via CLI

aws cloudwatch put-insight-rule \
  --rule-name "TopApiCallersByIp" \
  --rule-state ENABLED \
  --rule-definition '{
    "Schema": {
      "Name": "CloudWatchLogRule",
      "Version": 1
    },
    "LogGroupNames": ["/aws/apigateway/access-logs"],
    "LogFormat": "JSON",
    "Contribution": {
      "Keys": ["$.sourceIp"],
      "Filters": [
        {
          "Match": "$.status",
          "GreaterThan": 499
        }
      ]
    },
    "AggregateOn": "Count"
  }'

Viewing Contributor Insights Reports

# Get top contributors for the last hour
aws cloudwatch get-insight-rule-report \
  --rule-name "TopApiCallersByIp" \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --max-contributor-count 10 \
  --metrics UniqueContributors SampleCount Sum Maximum
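
For scripted use, the same report can be fetched through boto3's get_insight_rule_report. The sketch below takes the CloudWatch client as a parameter and flattens the response into (keys, value) pairs; `top_contributors` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def top_contributors(cloudwatch_client, rule_name, hours_back=1, top_n=10):
    """Return [(keys, approximate_value), ...] for a rule's top contributors."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours_back)
    report = cloudwatch_client.get_insight_rule_report(
        RuleName=rule_name,
        StartTime=start,
        EndTime=end,
        Period=hours_back * 3600,  # one aggregation period covering the window
        MaxContributorCount=top_n,
    )
    return [
        (tuple(c["Keys"]), c["ApproximateAggregateValue"])
        for c in report.get("Contributors", [])
    ]


# Usage (requires AWS credentials):
#   import boto3
#   for keys, value in top_contributors(boto3.client("cloudwatch"), "TopApiCallersByIp"):
#       print(keys, value)
```

As the field name suggests, the aggregate values are approximate, so use them for ranking rather than exact accounting.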

Alarm on Top Contributor Count (DDoS Detection)

# Contributor Insights metrics are read through the INSIGHT_RULE_METRIC
# metric math function rather than a namespace/metric-name pair.
aws cloudwatch put-metric-alarm \
  --alarm-name "api-unique-ips-spike" \
  --metrics '[{"Id":"e1","Expression":"INSIGHT_RULE_METRIC(\"TopApiCallersByIp\", \"UniqueContributors\")","Period":60,"ReturnData":true}]' \
  --evaluation-periods 3 \
  --threshold 500 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:security-alerts"

Pricing: Contributor Insights rules cost $0.50 per rule per month plus $0.02 per million log events matched. Enable rules only on high-value log groups to control costs. Rules can be disabled and re-enabled without losing the rule definition.

9. Log Anomaly Detection

CloudWatch Logs anomaly detection uses machine learning to automatically learn the normal patterns of your log data and alert when deviations occur. Unlike threshold-based alarms, anomaly detection adapts to seasonal patterns (daily/weekly cycles) without manual tuning.

How It Works

Anomaly detection works at two levels in CloudWatch Logs:

  1. Log pattern anomaly detection: Identifies new or unusual log patterns. CloudWatch groups log events into patterns and alerts when an existing pattern appears at an abnormal rate or a new pattern emerges.
  2. Metric anomaly detection: Applied to CloudWatch metrics derived from logs (via metric filters). Creates a dynamic band around the expected metric value.
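
The "dynamic band" idea can be illustrated with a deliberately naive stand-in: a band of mean plus or minus a multiple of the standard deviation over recent history. CloudWatch's actual model is machine-learned and accounts for seasonality, so this sketch only conveys the concept; the sample counts are made up.

```python
from statistics import mean, stdev


def band(history, width=2.0):
    """Return (lower, upper) as mean +/- width * sample standard deviation."""
    if len(history) < 2:
        raise ValueError("need at least two data points")
    mu, sigma = mean(history), stdev(history)
    return mu - width * sigma, mu + width * sigma


def is_anomalous(value, history, width=2.0):
    """True when `value` falls outside the band computed from `history`."""
    lower, upper = band(history, width)
    return value < lower or value > upper


baseline = [4, 5, 6, 5, 4, 6, 5, 5]  # made-up error counts per 5 minutes
print(band(baseline))
print(is_anomalous(5, baseline), is_anomalous(20, baseline))
```

A wider band (larger `width`) trades sensitivity for fewer false positives, which is the same trade-off the band-width argument of ANOMALY_DETECTION_BAND controls.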

Enabling Log Anomaly Detection

# Enable anomaly detector on a log group
aws logs create-log-anomaly-detector \
  --log-group-arn-list "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/myapp-processor" \
  --detector-name "myapp-processor-anomaly" \
  --evaluation-frequency "FIVE_MIN" \
  --filter-pattern "" \
  --anomaly-visibility-time 14

# List anomaly detectors
aws logs list-log-anomaly-detectors

# List unsuppressed anomalies for a detector
aws logs list-anomalies \
  --anomaly-detector-arn "arn:aws:logs:us-east-1:123456789012:anomaly-detector:DETECTOR_ID" \
  --suppression-state UNSUPPRESSED

Metric Anomaly Detection (for Log-Derived Metrics)

# Enable anomaly detection on a metric filter-derived metric
aws cloudwatch put-anomaly-detector \
  --namespace "MyApp/Lambda" \
  --metric-name "LambdaErrors" \
  --stat "Sum" \
  --configuration '{"ExcludedTimeRanges":[], "MetricTimezone":"UTC"}'

# Create an alarm using anomaly detection band.
# The metric is defined inside --metrics, so --namespace, --metric-name,
# --statistic, and --period must not be passed alongside it.
aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-errors-anomaly" \
  --alarm-description "Lambda errors deviate from expected baseline" \
  --evaluation-periods 3 \
  --comparison-operator "GreaterThanUpperThreshold" \
  --threshold-metric-id "ad1" \
  --metrics '[
    {"Id":"m1","MetricStat":{"Metric":{"Namespace":"MyApp/Lambda","MetricName":"LambdaErrors"},"Period":300,"Stat":"Sum"}},
    {"Id":"ad1","Expression":"ANOMALY_DETECTION_BAND(m1, 2)","Label":"Lambda Errors (expected)"}
  ]' \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:on-call-alerts"

Baseline learning period: CloudWatch anomaly detection requires approximately 2 weeks of data to establish a reliable baseline that accounts for weekly seasonality. During the initial learning period, the detector may produce false positives. Consider suppressing alerts for the first 14 days after enabling a new detector.

Suppressing Known Maintenance Windows

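# Suppress an anomaly for 24 hours (mark as expected)
aws logs update-anomaly \
  --anomaly-detector-arn "arn:aws:logs:us-east-1:123456789012:anomaly-detector:DETECTOR_ID" \
  --anomaly-id "ANOMALY_ID" \
  --suppression-type LIMITED \
  --suppression-period value=24,suppressionUnit=HOURS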

10. Automation with Python boto3 and Terraform

The real power of Logs Insights comes when you embed it in automated workflows — scheduled reports, CI/CD pipeline checks, post-deployment validation, and on-call runbooks. The API is asynchronous: you start a query, poll for completion, and retrieve results.

Python boto3: Run a Query and Poll for Results

import boto3
import time
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs', region_name='us-east-1')

def run_logs_insights_query(log_group, query_string, hours_back=1):
    """Run a Logs Insights query and wait for results."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=hours_back)

    # Start the query
    response = logs.start_query(
        logGroupName=log_group,
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString=query_string,
        limit=500
    )
    query_id = response['queryId']
    print(f"Query started: {query_id}")

    # Poll until complete
    while True:
        status = logs.get_query_results(queryId=query_id)
        query_status = status['status']

        if query_status == 'Complete':
            results = status['results']
            stats = status.get('statistics', {})
            print(f"Scanned {stats.get('bytesScanned', 0)/1e6:.2f} MB, "
                  f"returned {len(results)} rows")
            return results

        elif query_status in ('Failed', 'Cancelled', 'Timeout'):
            raise RuntimeError(f"Query {query_id} ended with status: {query_status}")

        else:
            # Running or Scheduled
            time.sleep(2)

# Example: find top 10 Lambda error messages in the last hour
results = run_logs_insights_query(
    log_group='/aws/lambda/myapp-processor',
    query_string="""
        filter @message like /ERROR/
        | parse @message "[*] *" as requestId, errorMsg
        | stats count(*) as errorCount by errorMsg
        | sort errorCount desc
        | limit 10
    """,
    hours_back=1
)

for row in results:
    row_dict = {field['field']: field['value'] for field in row}
    print(f"  {row_dict.get('errorCount', 'N/A'):>6} | {row_dict.get('errorMsg', 'N/A')[:80]}")
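
The polling loop above waits indefinitely, which is fine interactively but risky in automation. The sketch below is one way to add a deadline and exponential backoff; `wait_for_query` and its parameters are hypothetical, and the `sleep` and `clock` arguments exist only so the function can be tested without real delays.

```python
import time

TERMINAL_FAILURES = {"Failed", "Cancelled", "Timeout", "Unknown"}


def wait_for_query(logs_client, query_id, max_wait_s=300, first_delay_s=1.0,
                   max_delay_s=10.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_query_results until the query completes or the deadline passes."""
    deadline = clock() + max_wait_s
    delay = first_delay_s
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return response["results"]
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"query {query_id} ended with status {status}")
        if clock() + delay > deadline:
            # Stop the server-side query so it does not keep scanning data.
            logs_client.stop_query(queryId=query_id)
            raise TimeoutError(f"query {query_id} still {status} after {max_wait_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off: 1s, 2s, 4s, ... capped
```

Stopping the query on timeout matters for cost as well as tidiness, since an abandoned query can keep scanning until the service-side timeout.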

Querying Multiple Log Groups Simultaneously

def run_multi_group_query(log_groups: list, query_string: str, hours_back: int = 1):
    """Query multiple log groups at once."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=hours_back)

    response = logs.start_query(
        logGroupNames=log_groups,  # List of up to 50 log groups
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString=query_string,
        limit=1000
    )
    query_id = response['queryId']

    while True:
        status = logs.get_query_results(queryId=query_id)
        if status['status'] == 'Complete':
            return status['results']
        elif status['status'] in ('Failed', 'Cancelled', 'Timeout'):
            raise RuntimeError(f"Query failed: {status['status']}")
        time.sleep(2)

# Find errors across all microservices
service_log_groups = [
    '/aws/lambda/order-service',
    '/aws/lambda/payment-service',
    '/aws/lambda/inventory-service',
    '/aws/lambda/notification-service',
]

results = run_multi_group_query(
    log_groups=service_log_groups,
    query_string="""
        fields @timestamp, @message, @logStream
        | filter @message like /ERROR/ or @message like /FATAL/
        | stats count(*) as errorCount by @logStream
        | sort errorCount desc
        | limit 20
    """,
    hours_back=6
)

Scheduled Daily Error Report via SNS

import boto3
import json
import time
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    """Run nightly error summary and publish to SNS."""
    logs = boto3.client('logs')
    sns = boto3.client('sns')

    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=24)

    response = logs.start_query(
        logGroupNames=[
            '/aws/lambda/order-service',
            '/aws/lambda/payment-service',
        ],
        startTime=int(start_time.timestamp()),
        endTime=int(end_time.timestamp()),
        queryString="""
            filter @message like /ERROR/
            | stats count(*) as errors by @logStream
            | sort errors desc
        """,
        limit=50
    )
    query_id = response['queryId']

    # Poll until the query finishes; fail fast on terminal error states
    # instead of looping until the Lambda function itself times out.
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result['status'] == 'Complete':
            break
        if result['status'] in ('Failed', 'Cancelled', 'Timeout'):
            raise RuntimeError(f"Query {query_id} ended with status: {result['status']}")
        time.sleep(3)

    lines = ["Daily Error Summary — " + end_time.strftime("%Y-%m-%d"), ""]
    for row in result['results']:
        d = {f['field']: f['value'] for f in row}
        lines.append(f"  {d.get('errors','?'):>6} errors  |  {d.get('@logStream','?')}")

    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:daily-reports',
        Subject='Daily Error Summary',
        Message='\n'.join(lines)
    )
    return {'statusCode': 200, 'body': json.dumps({'rowsReported': len(result['results'])})}

Terraform: Full Observability Stack (Metric Filter + Alarm + SNS)

variable "lambda_function_name" {
  description = "Name of the Lambda function to monitor"
  type        = string
  default     = "myapp-processor"
}

variable "alert_email" {
  description = "Email address for alerts"
  type        = string
}

locals {
  log_group = "/aws/lambda/${var.lambda_function_name}"
}

# SNS Topic
resource "aws_sns_topic" "lambda_alerts" {
  name = "${var.lambda_function_name}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.lambda_alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Metric Filter: Count errors
resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "${var.lambda_function_name}-ErrorCount"
  log_group_name = local.log_group
  pattern        = "[level=ERROR, ...]"

  metric_transformation {
    name          = "ErrorCount"
    namespace     = "MyApp/Lambda"
    value         = "1"
    default_value = "0"
    unit          = "Count"
  }
}

# Alarm: Error count threshold
resource "aws_cloudwatch_metric_alarm" "error_alarm" {
  alarm_name          = "${var.lambda_function_name}-errors-high"
  alarm_description   = "Lambda error count is 10 or more in 5 minutes"
  namespace           = "MyApp/Lambda"
  metric_name         = "ErrorCount"
  dimensions          = {}
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 10
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.lambda_alerts.arn]
}

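# Metric Filter: Track timeouts
resource "aws_cloudwatch_log_metric_filter" "timeouts" {
  name           = "${var.lambda_function_name}-TimeoutCount"
  log_group_name = local.log_group
  pattern        = "\"Task timed out\""

  metric_transformation {
    name          = "TimeoutCount"
    namespace     = "MyApp/Lambda"
    value         = "1"
    default_value = "0"
    unit          = "Count"
  }
}

# Alarm: Timeout count threshold (the threshold of 3 is an illustrative value)
resource "aws_cloudwatch_metric_alarm" "timeout_alarm" {
  alarm_name          = "${var.lambda_function_name}-timeouts-high"
  alarm_description   = "Lambda timeouts are 3 or more in 5 minutes"
  namespace           = "MyApp/Lambda"
  metric_name         = "TimeoutCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 3
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
}

# Composite Alarm: fire only when both errors AND timeouts are high
resource "aws_cloudwatch_composite_alarm" "combined" {
  alarm_name        = "${var.lambda_function_name}-critical"
  alarm_description = "Errors AND timeouts are both elevated"

  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.error_alarm.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.timeout_alarm.alarm_name})"

  alarm_actions = [aws_sns_topic.lambda_alerts.arn]
}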

Cost control tip: Logs Insights bills by data scanned, so the main levers are the time range and the number of log groups, not the time of day a query runs. Use narrow time windows (1–6 hours) for routine health checks. For weekly or monthly reports, export logs to S3 and use Athena instead; with partitioned Parquet it typically scans far less data, which is where the 10-100x savings cited in Section 1 come from.

Frequently Asked Questions

How long is log data retained in CloudWatch Logs?

By default, log groups have indefinite retention — you pay for storage indefinitely. Always set explicit retention policies. Common settings: 7 days for debug logs, 30 days for application logs, 90 days for access logs, 1 year for audit/compliance logs. Set via console or: aws logs put-retention-policy --log-group-name "/aws/lambda/myapp" --retention-in-days 30
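
To find and fix log groups that were left on the default, a short boto3 script can walk every group and apply a policy where none is set. The sketch below takes the client as a parameter and defaults to a dry run; `apply_default_retention` is a hypothetical helper name.

```python
def apply_default_retention(logs_client, retention_days=30, dry_run=True):
    """Set retention on every log group that currently never expires.

    Returns the names of the log groups that were (or would be) changed.
    """
    changed = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            if "retentionInDays" in group:
                continue  # already has an explicit policy
            changed.append(group["logGroupName"])
            if not dry_run:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=retention_days,
                )
    return changed


# Usage (requires AWS credentials):
#   import boto3
#   print(apply_default_retention(boto3.client("logs"), 30, dry_run=True))
```

Run it in dry-run mode first: shortening retention deletes older events, and that cannot be undone.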

Can I query logs from deleted log streams?

Not if the log stream itself was deleted: deleting a stream permanently removes its log events. What you can query is data from resources that no longer exist. Logs Insights reads the stored log data, so as long as the log group and its streams remain and the data is within the retention period, you can query it even if the originating resource (Lambda function, EC2 instance) is gone.

Why is my Logs Insights query returning 0 results for JSON logs?

CloudWatch only auto-parses JSON if the entire log event is a valid JSON object starting with {. If your application wraps JSON in a prefix (e.g., INFO {"key": "value"}), CloudWatch treats it as plain text. Use parse @message "* *" as level, jsonPart to split off the prefix; jsonPart is then a plain string, so pull individual values out of it with a further parse (glob or regex). Alternatively, configure your application to emit pure JSON log events.
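
A quick way to check what your application is really emitting is to test sample lines locally. The sketch below separates an optional text prefix from a JSON payload; it is a client-side diagnostic under the assumption that the payload starts at the first "{", and `split_prefixed_json` is a hypothetical helper.

```python
import json


def split_prefixed_json(message: str):
    """Return (prefix, parsed_json), or (message, None) if no JSON object parses."""
    start = message.find("{")
    if start == -1:
        return message, None
    try:
        return message[:start].strip(), json.loads(message[start:])
    except json.JSONDecodeError:
        return message, None


print(split_prefixed_json('INFO {"key": "value"}'))
print(split_prefixed_json('{"key": "value"}'))
print(split_prefixed_json("plain text line"))
```

An empty prefix means the event is pure JSON and should be auto-discovered; a non-empty prefix is the case that needs parse or a logging-format change.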

What is the maximum time range for a Logs Insights query?

There is no hard limit on time range, but queries time out after 60 minutes of runtime. For very large log groups, a 7-day query may time out or scan more data than you want to pay for. Best practice: stay within 24 hours for high-volume log groups, and use 7–30 days only for low-volume groups.
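
When a long range is unavoidable, one approach is to split it into shorter windows and run one query per window. The sketch below produces the windows only; `split_time_range` is a hypothetical helper, and the 24-hour default simply mirrors the guidance above.

```python
from datetime import datetime, timedelta, timezone


def split_time_range(start, end, window=timedelta(hours=24)):
    """Yield (window_start, window_end) pairs that together cover start..end."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + window, end)
        yield cursor, upper
        cursor = upper


windows = list(split_time_range(
    datetime(2026, 6, 1, tzinfo=timezone.utc),
    datetime(2026, 6, 3, 12, tzinfo=timezone.utc),
))
for lo, hi in windows:
    print(lo.isoformat(), "->", hi.isoformat())
```

Adjacent windows share a boundary instant, so if the query API treats both ends as inclusive, subtract one second from each window end or de-duplicate results when merging.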

How do metric filters interact with log subscriptions?

Metric filters and log subscriptions (Kinesis, Lambda) are independent: both process the same incoming log events in parallel. A metric filter counts or extracts values; a subscription destination receives the full log event. You can have up to 2 subscription filters and up to 100 metric filters per log group.