AWS Multi-Region Architecture: Active-Active and Disaster Recovery Patterns

A single AWS region, no matter how resilient, is not enough for applications that must remain online when an entire region fails. Multi-AZ deployments protect against the loss of one availability zone, but hurricanes, widespread power outages, submarine cable cuts, and software bugs that corrupt data across all AZs in one region are real events that have taken down major services. Multi-region architecture is how you protect against them.

This guide covers every layer of a production multi-region system: choosing between Active-Active and Active-Passive strategies based on your RTO/RPO requirements, configuring Route 53 routing policies for traffic control, replicating data with DynamoDB Global Tables and Aurora Global Database, accelerating global traffic with AWS Global Accelerator, validating your architecture with chaos engineering via AWS FIS, and maintaining observability across all regions from a single pane of glass. The sections include CLI commands, Terraform HCL, and JSON record definitions that you can adapt to your own account IDs, regions, and resource names.

1. DR Strategies: RTO/RPO Comparison

Before choosing an architecture, align with your business on two numbers: Recovery Time Objective (RTO) — how many minutes/hours of downtime are acceptable — and Recovery Point Objective (RPO) — how much data loss (measured in time) is acceptable. These two numbers drive your entire cost and complexity budget.

AWS defines four canonical disaster recovery patterns, in order of increasing cost and decreasing RTO/RPO:

| Strategy | RTO | RPO | Relative Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours – Days | Hours | $ | Automated backups (RDS snapshots, S3 versioning) restored on failover. No warm resources. Cheapest but slowest. |
| Pilot Light | 10–30 min | Minutes | $$ | Minimal core infrastructure (DB replicas, AMIs) running in DR region. Compute is off until failover, then auto-scaled up. |
| Warm Standby | Minutes | Seconds | $$$ | Scaled-down but fully functional system running in DR region. Scale up on failover. Traffic is live on primary only. |
| Active-Active | Seconds | Near-zero | $$$$ | Full production load in multiple regions simultaneously. No failover needed: load shifts instantly via DNS or Global Accelerator. |

Key Insight: Most teams over-engineer DR. If your SLA allows 4 hours downtime per year (99.95% uptime), Pilot Light is usually sufficient and 3–5x cheaper than Active-Active. Calculate the revenue impact of 1 hour of downtime, multiply by your annual failure probability, and compare that cost to the annual infrastructure cost of each strategy.
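
The comparison described in the Key Insight can be sketched as a small shell calculation. This is a minimal sketch rather than a pricing tool: the function names `allowed_downtime_hours` and `expected_annual_loss` and every input figure are hypothetical, and it assumes `awk` is available for floating-point arithmetic.

```shell
#!/bin/sh
# Hours of downtime per year permitted by an availability target (in percent).
allowed_downtime_hours() {
  awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 8760 }'
}

# Expected annual loss = hourly revenue loss x outage length (hours) x annual probability.
expected_annual_loss() {
  awk -v loss="$1" -v hours="$2" -v p="$3" 'BEGIN { printf "%.0f\n", loss * hours * p }'
}

allowed_downtime_hours 99.95        # prints 4.38
expected_annual_loss 50000 4 0.1    # prints 20000 (hypothetical inputs)
```

If the cheapest strategy that meets your RTO costs more per year than the expected loss it prevents, a lower tier is probably the better fit.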

For Backup & Restore, your primary tools are:

# Enable automated RDS backups with 7-day retention
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --backup-retention-period 7 \
  --preferred-backup-window "03:00-04:00" \
  --apply-immediately

# Copy snapshot cross-region for DR
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:prod-postgres-2026-06-09 \
  --target-db-snapshot-identifier prod-postgres-dr-2026-06-09 \
  --region eu-west-1

# List available snapshots in DR region
aws rds describe-db-snapshots \
  --region eu-west-1 \
  --query 'DBSnapshots[?Status==`available`].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table

For Pilot Light, you pre-create the networking layer (VPC, subnets, security groups) and maintain a running read replica or Aurora secondary. When disaster strikes, you promote the replica and scale up compute via Auto Scaling Groups, reducing recovery time from hours to 15–30 minutes.

2. Route 53 Routing Policies for Multi-Region

Route 53 is the traffic director for all multi-region patterns. It decides which regional endpoint a client reaches before any TCP connection is established. The four routing policies relevant to multi-region work are Latency, Failover, Geolocation, and Weighted — and they can be combined via Traffic Flow policy trees.

Latency-Based Routing (Active-Active)

Route 53 measures historical latency between AWS regions and the requesting IP address and routes to the lowest-latency region. This is the foundation of Active-Active global architectures:

# change-batch.json: latency records for 3 regions (omit this caption line from the file; JSON has no comment syntax)
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "us-east-1",
        "Region": "us-east-1",
        "HealthCheckId": "hc-us-east-1-id",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "prod-alb-us.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "eu-west-1",
        "Region": "eu-west-1",
        "HealthCheckId": "hc-eu-west-1-id",
        "AliasTarget": {
          "HostedZoneId": "Z32O12XQLNTSW2",
          "DNSName": "prod-alb-eu.eu-west-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "ap-southeast-1",
        "Region": "ap-southeast-1",
        "HealthCheckId": "hc-ap-southeast-1-id",
        "AliasTarget": {
          "HostedZoneId": "Z1LMS91P8CMLE5",
          "DNSName": "prod-alb-ap.ap-southeast-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}

# Apply the change batch
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABCDEF \
  --change-batch file://change-batch.json

Failover Routing (Active-Passive)

Failover routing designates one record as PRIMARY and another as SECONDARY. Route 53 only routes to the secondary when the primary's health check fails:

# Create health check for primary region endpoint
aws route53 create-health-check \
  --caller-reference $(date +%s) \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "us-east-1-health.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 2,
    "EnableSNI": true
  }'

# PRIMARY record
aws route53 change-resource-record-sets --hosted-zone-id Z1234 --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "A",
      "SetIdentifier": "primary-us-east-1",
      "Failover": "PRIMARY",
      "HealthCheckId": "HEALTH-CHECK-ID",
      "AliasTarget": {
        "HostedZoneId": "Z35SXDOTRQ7X7K",
        "DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}'

# SECONDARY record (no health check required — always serves as fallback)
aws route53 change-resource-record-sets --hosted-zone-id Z1234 --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "A",
      "SetIdentifier": "secondary-eu-west-1",
      "Failover": "SECONDARY",
      "AliasTarget": {
        "HostedZoneId": "Z32O12XQLNTSW2",
        "DNSName": "standby-alb.eu-west-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}'

TTL matters for failover speed: Route 53 itself detects failures in roughly 20–30 seconds (with a 10-second interval and a failure threshold of 2). But DNS TTL controls how long clients cache the old answer. Set TTL to 60 seconds for failover records so clients re-query within a minute of failure; alias records like the ones above cannot set a TTL and inherit the target's (60 seconds for ELB). A 300-second TTL means up to 5 additional minutes of traffic directed at a dead endpoint after Route 53 has already detected the failure.
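
The timing argument above can be made concrete with a rough upper bound. This is a simplified sketch: the function name `worst_case_failover_seconds` is hypothetical, and the model assumes detection takes interval times threshold and that resolvers honor the TTL exactly, which real resolvers do not always do.

```shell
#!/bin/sh
# Worst-case seconds a client may keep hitting a dead endpoint:
# health-check detection (interval x failure threshold) plus the cached DNS TTL.
worst_case_failover_seconds() {
  interval=$1; threshold=$2; ttl=$3
  echo $(( interval * threshold + ttl ))
}

worst_case_failover_seconds 10 2 60    # prints 80
worst_case_failover_seconds 10 2 300   # prints 320
```

Comparing the two calls shows that the TTL, not the health check, dominates the failover window once it exceeds about a minute.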

Geolocation Routing (Data Residency)

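Geolocation routing helps with data-residency requirements such as those that often accompany GDPR: European users are routed to EU infrastructure so their requests are served from within the EU. Routing alone does not guarantee residency, because Route 53 locates users by resolver or client-subnet IP and your replication settings decide where data is actually stored: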

aws route53 change-resource-record-sets --hosted-zone-id Z1234 --change-batch '{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "A",
        "SetIdentifier": "europe",
        "GeoLocation": {"ContinentCode": "EU"},
        "AliasTarget": {
          "HostedZoneId": "Z32O12XQLNTSW2",
          "DNSName": "eu-alb.eu-west-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "A",
        "SetIdentifier": "default",
        "GeoLocation": {"CountryCode": "*"},
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "us-alb.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}'

3. Active-Active Architecture

In an Active-Active architecture, two or more regions simultaneously serve production traffic. There is no "primary" region — every region is primary. When a region fails, Route 53 or Global Accelerator automatically stops directing traffic there, and the surviving regions absorb the load. Recovery time is measured in seconds (the time for DNS TTL to expire and health checks to detect failure).
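
For the surviving regions to absorb the load, each region needs spare capacity in steady state. The sketch below estimates that headroom under two stated assumptions: traffic is spread evenly and all regions have equal capacity. The function name `max_safe_utilization_pct` is hypothetical.

```shell
#!/bin/sh
# Highest steady-state utilization (percent, rounded down) each region can run at
# so that the remaining regions can absorb the traffic of one failed region.
max_safe_utilization_pct() {
  regions=$1
  echo $(( (regions - 1) * 100 / regions ))
}

max_safe_utilization_pct 2   # prints 50
max_safe_utilization_pct 3   # prints 66
```

With two regions each must idle at half capacity, which is one reason three-region Active-Active can be cheaper per unit of protected traffic than two.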

The fundamental challenge of Active-Active is data consistency: writes happening in us-east-1 must be visible to reads in eu-west-1 with minimal lag. AWS provides two managed solutions for this.

DynamoDB Global Tables

DynamoDB Global Tables is the simplest multi-region data solution on AWS. It automatically replicates your DynamoDB table across all specified regions with sub-second replication latency. Each region has a full read-write replica — there is no master.

# Create table (streams must be enabled with NEW_AND_OLD_IMAGES; use on-demand billing or provisioned capacity with auto scaling)
aws dynamodb create-table \
  --table-name UserSessions \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
  --region us-east-1

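# Add replicas (global tables version 2019.11.21): one update-table call per region.
# Wait until each new replica is ACTIVE before adding the next.
aws dynamodb update-table \
  --table-name UserSessions \
  --replica-updates '[{"Create":{"RegionName":"eu-west-1"}}]' \
  --region us-east-1

aws dynamodb update-table \
  --table-name UserSessions \
  --replica-updates '[{"Create":{"RegionName":"ap-southeast-1"}}]' \
  --region us-east-1

# Legacy alternative (version 2017.11.29): create-global-table only works if an
# identical empty table with streams enabled already exists in every listed region.
aws dynamodb create-global-table \
  --global-table-name UserSessions \
  --replication-group RegionName=us-east-1 RegionName=eu-west-1 RegionName=ap-southeast-1 \
  --region us-east-1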

As Terraform:

resource "aws_dynamodb_table" "user_sessions" {
  name             = "UserSessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "userId"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "userId"
    type = "S"
  }

  replica {
    region_name = "eu-west-1"
  }

  replica {
    region_name = "ap-southeast-1"
  }

  tags = {
    Environment = "production"
    Tier        = "global"
  }
}

Conflict resolution: DynamoDB Global Tables uses "last writer wins" based on wall-clock timestamp. If two regions write to the same item within the replication window (typically <1 second), the write with the higher timestamp wins. Design your access patterns to avoid concurrent cross-region writes to the same item: partition by region, user, or session when possible.
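
The practical consequence of last-writer-wins is that one of two conflicting writes disappears without an error. The sketch below only illustrates the rule; it is not DynamoDB's implementation. The function name `lww_resolve`, the millisecond timestamps, and the tie-break toward the first argument are all assumptions made for the example.

```shell
#!/bin/sh
# Last-writer-wins between two conflicting writes to the same item.
# Arguments: timestamp_a value_a timestamp_b value_b (timestamps in milliseconds).
# Breaking ties toward the first write is an arbitrary choice for this sketch.
lww_resolve() {
  if [ "$1" -ge "$3" ]; then
    echo "$2"
  else
    echo "$4"
  fi
}

# Two regions update the same cart 300 ms apart; the earlier write is discarded.
lww_resolve 1000 "us-east-1: qty=3" 1300 "eu-west-1: qty=1"   # prints eu-west-1: qty=1
```

The us-east-1 update is lost even though it succeeded locally, which is why partitioning writes by region, user, or session matters.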

Aurora Global Database

For relational workloads, Aurora Global Database replicates a single Aurora cluster to up to 5 secondary regions with <1 second replication lag using dedicated infrastructure (not the normal replication path):

# Create Aurora global cluster
aws rds create-global-cluster \
  --global-cluster-identifier prod-global \
  --engine aurora-postgresql \
  --engine-version 15.4 \
  --storage-encrypted \
  --region us-east-1

# Create primary cluster in us-east-1
# ("admin" is a reserved master username for PostgreSQL engines, so use another name)
aws rds create-db-cluster \
  --db-cluster-identifier prod-primary \
  --engine aurora-postgresql \
  --engine-version 15.4 \
  --global-cluster-identifier prod-global \
  --master-username dbadmin \
  --master-user-password "$DB_PASSWORD" \
  --db-subnet-group-name prod-subnet-group \
  --vpc-security-group-ids sg-0abc123 \
  --region us-east-1

# Add secondary cluster in eu-west-1 (read-only replica region)
aws rds create-db-cluster \
  --db-cluster-identifier prod-secondary-eu \
  --engine aurora-postgresql \
  --engine-version 15.4 \
  --global-cluster-identifier prod-global \
  --db-subnet-group-name prod-subnet-group-eu \
  --vpc-security-group-ids sg-0def456 \
  --region eu-west-1

Aurora Global Database supports two managed operations. A planned switchover promotes a secondary to primary with no data loss (RPO = 0). An unplanned failover, used when the primary region becomes unavailable, promotes a secondary in approximately 1 minute but can lose writes that had not yet replicated (RPO typically under 1 second). Each regional cluster has its own endpoints, so make sure your application is pointed at the new primary's writer endpoint after the promotion.

4. Active-Passive and Pilot Light

Active-Passive keeps a hot standby in a second region that receives replicated data but no live traffic. On failure, traffic shifts to the standby. Pilot Light is a cost-optimized variant where the standby's compute layer is stopped (or minimal), and only the data layer runs continuously.

Pilot Light Setup

The typical Pilot Light pattern uses RDS Read Replicas for data replication and pre-built AMIs/launch templates for fast compute scale-up:

# Create cross-region read replica in DR region
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-postgres-dr \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-postgres \
  --db-instance-class db.r6g.xlarge \
  --region eu-west-1 \
  --availability-zone eu-west-1a \
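  --no-publicly-accessible \
  --no-multi-az

# Verify replication state (expect "replicating"); StatusInfos does not report lag
aws rds describe-db-instances \
  --db-instance-identifier prod-postgres-dr \
  --region eu-west-1 \
  --query 'DBInstances[0].StatusInfos'

# Replication lag in seconds is the CloudWatch ReplicaLag metric (should be <30 for Pilot Light).
# The --start-time expression uses GNU date syntax.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=prod-postgres-dr \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Maximum \
  --region eu-west-1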

Automated Failover Runbook

When the primary region fails, execute this runbook in order. Automate it as an SSM Automation document for one-click execution:

#!/bin/bash
# failover-runbook.sh: Pilot Light failover to eu-west-1
set -euo pipefail  # stop on errors, unset variables, and failed pipeline stages

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
DB_REPLICA="prod-postgres-dr"
HOSTED_ZONE_ID="Z1234567890ABCDEF"
APP_DNS="app.example.com"
DR_ALB="dr-alb.eu-west-1.elb.amazonaws.com"

echo "=== STEP 1: Promote RDS read replica to standalone instance ==="
aws rds promote-read-replica \
  --db-instance-identifier $DB_REPLICA \
  --region $DR_REGION

echo "Waiting for promotion to complete..."
aws rds wait db-instance-available \
  --db-instance-identifier $DB_REPLICA \
  --region $DR_REGION

echo "=== STEP 2: Scale up Auto Scaling Group in DR region ==="
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name prod-asg-dr \
  --min-size 4 \
  --max-size 20 \
  --desired-capacity 8 \
  --region $DR_REGION

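# Assumes app.example.com is a simple CNAME record. If you use the failover alias
# pair from section 2, Route 53 shifts traffic by itself and this step is unnecessary
# (a CNAME cannot coexist with other records of the same name).
echo "=== STEP 3: Update Route 53 to point to DR region ==="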
aws route53 change-resource-record-sets \
  --hosted-zone-id $HOSTED_ZONE_ID \
  --change-batch "{
    \"Changes\": [{
      \"Action\": \"UPSERT\",
      \"ResourceRecordSet\": {
        \"Name\": \"$APP_DNS\",
        \"Type\": \"CNAME\",
        \"TTL\": 60,
        \"ResourceRecords\": [{\"Value\": \"$DR_ALB\"}]
      }
    }]
  }"

echo "=== STEP 4: Disable writes to primary (prevent split-brain) ==="
# Set primary DB to read-only to prevent dirty writes during recovery.
# --apply-immediately is required: with --no-apply-immediately the change would wait
# for the next maintenance window. A newly associated parameter group also takes
# effect only after the instance is rebooted.
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --apply-immediately \
  --db-parameter-group-name readonly-parameter-group \
  --region "$PRIMARY_REGION" || true  # May fail if region is down; that's OK

echo "=== Failover complete. DR region is now PRIMARY ==="
echo "Estimated completion time from script start: 15-25 minutes"

Test your runbook regularly. A runbook that has never been executed under realistic conditions is a hypothesis, not a plan. Schedule at least a quarterly DR drill where you execute the full failover in a staging environment and measure actual RTO. Real RTO is often 2–3x the theoretical estimate because of dependency chains nobody documented.
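
Measuring actual RTO during a drill is easier if every runbook step reports its own duration. The wrapper below is one possible sketch; the name `run_timed` is hypothetical and the `sleep 1` call stands in for a real step such as the replica promotion.

```shell
#!/bin/sh
# Run one runbook step and report how long it took, preserving its exit code.
run_timed() {
  label=$1
  shift
  start=$(date +%s)
  rc=0
  "$@" || rc=$?
  end=$(date +%s)
  echo "$label took $(( end - start ))s (exit $rc)"
  return "$rc"
}

run_timed "example step" sleep 1
```

Summing the per-step durations from a drill gives a measured RTO you can compare against the estimate, and the slowest step shows where to invest first.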

5. Data Replication Patterns

Every persistence layer in your architecture needs a cross-region replication strategy. Here are the three most common ones.

S3 Cross-Region Replication (CRR)

CRR asynchronously copies objects from a source bucket to a destination bucket in another region. It replicates new objects (and optionally existing objects via Batch Replication). Without further configuration most objects replicate within 15 minutes, but there is no guarantee. S3 Replication Time Control (S3 RTC) is designed to replicate 99.99% of new objects within 15 minutes and is backed by an SLA:

# Source bucket must have versioning enabled
aws s3api put-bucket-versioning \
  --bucket prod-assets-us-east-1 \
  --versioning-configuration Status=Enabled

# Destination bucket (in DR region) must also have versioning
aws s3api put-bucket-versioning \
  --bucket prod-assets-eu-west-1 \
  --region eu-west-1 \
  --versioning-configuration Status=Enabled

# Create replication configuration
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "ReplicateAll",
      "Priority": 1,
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Destination": {
        "Bucket": "arn:aws:s3:::prod-assets-eu-west-1",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": {"Minutes": 15}
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": {"Minutes": 15}
        },
        "StorageClass": "STANDARD_IA"
      },
      "DeleteMarkerReplication": {"Status": "Enabled"}
    }
  ]
}
EOF

aws s3api put-bucket-replication \
  --bucket prod-assets-us-east-1 \
  --replication-configuration file://replication.json

ElastiCache Global Datastore

For Redis caching across regions, ElastiCache Global Datastore replicates a Redis cluster to secondary regions, typically with under 1 second of lag. The secondary region serves reads locally, which reduces read latency for globally distributed users (writes still go to the primary region):

# Create global datastore from existing primary cluster.
# ElastiCache prepends an auto-generated prefix to the suffix, so capture the real ID.
GLOBAL_ID=$(aws elasticache create-global-replication-group \
  --global-replication-group-id-suffix prod-cache \
  --primary-replication-group-id prod-redis-primary \
  --global-replication-group-description "Production global cache" \
  --region us-east-1 \
  --query 'GlobalReplicationGroup.GlobalReplicationGroupId' \
  --output text)

# Add secondary region
aws elasticache create-replication-group \
  --replication-group-id prod-redis-eu \
  --replication-group-description "EU secondary cache" \
  --global-replication-group-id "$GLOBAL_ID" \
  --region eu-west-1

# Check replication status
aws elasticache describe-global-replication-groups \
  --global-replication-group-id "$GLOBAL_ID" \
  --show-member-info \
  --region us-east-1

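Terraform: S3 Cross-Region Replication

# S3 bucket with cross-region replication.
# Assumes the aws.primary provider alias, a versioned aws_s3_bucket.dr, and
# aws_iam_role.replication are defined elsewhere in the configuration.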
resource "aws_s3_bucket" "primary" {
  bucket   = "prod-assets-${var.primary_region}"
  provider = aws.primary
}

resource "aws_s3_bucket_versioning" "primary" {
  bucket   = aws_s3_bucket.primary.id
  provider = aws.primary
  versioning_configuration { status = "Enabled" }
}

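resource "aws_s3_bucket_replication_configuration" "crr" {
  # Versioning must be enabled on the source bucket before replication is configured
  depends_on = [aws_s3_bucket_versioning.primary]

  bucket   = aws_s3_bucket.primary.id
  role     = aws_iam_role.replication.arn
  provider = aws.primary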

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"
    }
  }
}

6. AWS Global Accelerator

AWS Global Accelerator is a networking service that routes traffic through AWS's private global network rather than the public internet. When a user in Tokyo connects to your application in us-east-1, normally their packets traverse 15–20 public internet hops. With Global Accelerator, they connect to the nearest AWS edge location in Tokyo (typically 1–2 hops) and then ride the AWS backbone to us-east-1 — reducing latency by 20–50% and eliminating most packet loss from internet congestion.

Global Accelerator provides two static anycast IP addresses that never change. This is a significant operational advantage over Route 53, which answers with different IP addresses depending on region and health and therefore depends on client DNS caches expiring. Global Accelerator clients always connect to the same two IPs regardless of which region is currently serving them.

# Create accelerator
aws globalaccelerator create-accelerator \
  --name prod-accelerator \
  --ip-address-type IPV4 \
  --enabled \
  --region us-west-2  # Global Accelerator API lives in us-west-2

# Create listener (TCP 443)
aws globalaccelerator create-listener \
  --accelerator-arn arn:aws:globalaccelerator::123456789012:accelerator/abc123 \
  --protocol TCP \
  --port-ranges '[{"FromPort":443,"ToPort":443}]' \
  --region us-west-2

# Create endpoint groups — one per region
aws globalaccelerator create-endpoint-group \
  --listener-arn arn:aws:globalaccelerator::123456789012:accelerator/abc123/listener/def456 \
  --endpoint-group-region us-east-1 \
  --traffic-dial-percentage 100 \
  --health-check-path "/health" \
  --health-check-protocol HTTPS \
  --health-check-interval-seconds 10 \
  --threshold-count 2 \
  --endpoints '[{"EndpointId":"arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/prod-alb/abc","Weight":100}]' \
  --region us-west-2

aws globalaccelerator create-endpoint-group \
  --listener-arn arn:aws:globalaccelerator::123456789012:accelerator/abc123/listener/def456 \
  --endpoint-group-region eu-west-1 \
  --traffic-dial-percentage 100 \
  --health-check-path "/health" \
  --health-check-protocol HTTPS \
  --health-check-interval-seconds 10 \
  --threshold-count 2 \
  --endpoints '[{"EndpointId":"arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/prod-alb-eu/def","Weight":100}]' \
  --region us-west-2

Global Accelerator vs CloudFront: CloudFront is a CDN. It caches content at edge locations and is optimal for static assets, HTML, and cacheable API responses. Global Accelerator is a network accelerator. It does not cache anything but improves latency and reliability for dynamic, non-cacheable traffic (API calls, WebSockets, real-time data). The two can run side by side: CloudFront for cacheable assets, Global Accelerator for dynamic traffic that needs static IPs or fast regional failover.

To instantly shift traffic away from a failing region without DNS changes, update the traffic dial percentage to zero:

# Emergency: dial down a failing region to 0% (instant, no DNS propagation wait)
aws globalaccelerator update-endpoint-group \
  --endpoint-group-arn arn:aws:globalaccelerator::123456789012:accelerator/abc123/listener/def456/endpoint-group/ghi789 \
  --traffic-dial-percentage 0 \
  --region us-west-2

# Gradually shift traffic back during recovery (canary restore)
for pct in 10 25 50 75 100; do
  echo "Restoring $pct% traffic to us-east-1..."
  aws globalaccelerator update-endpoint-group \
    --endpoint-group-arn arn:aws:globalaccelerator::123456789012:accelerator/abc123/listener/def456/endpoint-group/ghi789 \
    --traffic-dial-percentage $pct \
    --region us-west-2
  sleep 300  # 5-minute bake time at each traffic level
done

7. Chaos Engineering for Multi-Region

A multi-region architecture that has never failed over in production is a theory. Chaos engineering — deliberately injecting failures to verify your system's response — is the only way to validate that your DR design actually works. AWS Fault Injection Service (FIS) makes this safe and repeatable.

FIS Experiment: Simulate Regional Degradation with Network Latency

{
  "description": "Simulate us-east-1 partial degradation via injected network latency",
  "targets": {
    "east-ec2-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"Environment": "production", "Region": "us-east-1"},
      "selectionMode": "PERCENT(30)"
    }
  },
  "actions": {
    "inject-network-latency": {
      "actionId": "aws:ssm:send-command",
      "description": "Add 500ms latency to all outbound traffic",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
        "documentParameters": "{\"DelayMilliseconds\":\"500\",\"DurationSeconds\":\"300\"}",
        "duration": "PT5M"
      },
      "targets": {"Instances": "east-ec2-instances"}
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ErrorRateTooHigh"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  "tags": {"Purpose": "DR-validation", "ChaosLevel": "moderate"}
}

# Create FIS experiment template from JSON
aws fis create-experiment-template \
  --cli-input-json file://fis-experiment.json \
  --region us-east-1

# Start the experiment using the template ID returned by the create call
aws fis start-experiment \
  --experiment-template-id EXT1234567890 \
  --region us-east-1

# Monitor experiment status
aws fis get-experiment \
  --id EXP1234567890 \
  --region us-east-1 \
  --query 'experiment.{Status:state.status,Actions:actions}'

Failover Test Checklist

Run this checklist quarterly for each region in your Active-Active or Pilot Light setup:

#!/bin/bash
# dr-test-checklist.sh

echo "DR TEST CHECKLIST — $(date)"
echo "=================================="

echo "[1] Verify health check endpoint in DR region"
curl -sf https://dr.example.com/health && echo "PASS" || echo "FAIL"

echo "[2] Verify DB replica is replicating (lag itself is the CloudWatch ReplicaLag metric)"
STATUS=$(aws rds describe-db-instances \
  --db-instance-identifier prod-postgres-dr \
  --region eu-west-1 \
  --query 'DBInstances[0].StatusInfos[0].Status' \
  --output text)
echo "Replication status: $STATUS"

echo "[3] Verify Route 53 health checks are all healthy"
for hc in $(aws route53 list-health-checks --query 'HealthChecks[*].Id' --output text); do
  echo "Health check $hc:"
  aws route53 get-health-check-status \
    --health-check-id "$hc" \
    --query 'HealthCheckObservations[*].StatusReport.Status' \
    --output text
done

echo "[4] Verify Global Accelerator endpoints"
# LISTENER_ARN must be exported before running this script
aws globalaccelerator list-endpoint-groups \
  --listener-arn "$LISTENER_ARN" \
  --region us-west-2 \
  --query 'EndpointGroups[*].{Region:EndpointGroupRegion,Dial:TrafficDialPercentage}' \
  --output table

echo "[5] Measure DNS propagation time after health check failure injection..."
START=$(date +%s)
# (inject failure, then measure time until DNS resolves to DR endpoint;
#  until that logic is filled in, the number printed below is meaningless)
END=$(date +%s)
echo "Measured failover time: $((END-START)) seconds"

Game Day protocol: Run chaos experiments with the on-call team present during business hours, never at 2am and never without observability ready. Define explicit success criteria (e.g., "99% of requests succeed within 5 seconds during regional failure") and roll back immediately if stop conditions trigger. Document outcomes and update runbooks after every game day.
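
An explicit success criterion is easiest to enforce when it is checked mechanically. The sketch below tests a request-success target with integer arithmetic; the function name `meets_success_target` and the request counts are hypothetical, and the counts would come from your own load-test or access-log tooling.

```shell
#!/bin/sh
# Exit status 0 if the success rate meets the target percentage, 1 otherwise.
# Integer arithmetic avoids floating point: success * 100 >= target * total.
meets_success_target() {
  total=$1; failed=$2; target_pct=$3
  [ $(( (total - failed) * 100 )) -ge $(( target_pct * total )) ]
}

if meets_success_target 10000 80 99; then echo "PASS"; else echo "FAIL"; fi    # prints PASS
if meets_success_target 10000 150 99; then echo "PASS"; else echo "FAIL"; fi   # prints FAIL
```

Wiring a check like this into the game day script turns the success criterion into a recorded pass or fail instead of a judgment call made after the fact.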

8. Cross-Region Observability

Monitoring a multi-region system is harder than monitoring a single-region one because metrics, logs, and alarms are regional by default. You need to aggregate all signals into a single pane of glass to detect cross-region issues — like a replication lag spike that is fine in isolation but catastrophic during a failover attempt.

CloudWatch Cross-Account Cross-Region Observability

# In the MONITORING account: create the observability sink (sinks are regional, so repeat in each monitored region)
aws oam create-sink \
  --name prod-central-sink \
  --region us-east-1

# Share telemetry types into the sink
aws oam put-sink-policy \
  --sink-identifier arn:aws:oam:us-east-1:MONITOR_ACCT:sink/prod-central-sink \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::APP_ACCT:root"},
      "Action": ["oam:CreateLink"],
      "Resource": "*",
      "Condition": {
        "ForAllValues:StringEquals": {
          "oam:ResourceTypes": [
            "AWS::CloudWatch::Metric",
            "AWS::Logs::LogGroup",
            "AWS::XRay::Trace"
          ]
        }
      }
    }]
  }'

# In the APPLICATION account: link to the monitoring sink.
# Single quotes stop the shell from expanding the $AccountName template variable.
aws oam create-link \
  --label-template '$AccountName' \
  --resource-types AWS::CloudWatch::Metric AWS::Logs::LogGroup AWS::XRay::Trace \
  --sink-identifier arn:aws:oam:us-east-1:MONITOR_ACCT:sink/prod-central-sink \
  --region us-east-1  # Must match the sink's region; repeat per region with that region's sink

Centralized Log Aggregation to S3

Stream CloudWatch Logs from all regions to a central S3 bucket in the monitoring account for long-term retention and Athena querying:

# Create Kinesis Data Firehose delivery stream in each region.
# Parquet conversion is omitted here: it additionally requires a Glue schema
# (SchemaConfiguration) and an InputFormatConfiguration.
aws firehose create-delivery-stream \
  --delivery-stream-name prod-logs-to-s3 \
  --extended-s3-destination-configuration '{
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    "BucketARN": "arn:aws:s3:::central-logs-bucket",
    "Prefix": "region=eu-west-1/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
    "ErrorOutputPrefix": "errors/",
    "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
    "CompressionFormat": "GZIP"
  }' \
  --region eu-west-1

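# Subscription filter: send all ERROR logs to Firehose.
# --role-arn must allow CloudWatch Logs to write to the stream (the role name is a placeholder).
aws logs put-subscription-filter \
  --log-group-name /aws/ec2/prod-app \
  --filter-name ErrorsToFirehose \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:firehose:eu-west-1:123456789012:deliverystream/prod-logs-to-s3 \
  --role-arn arn:aws:iam::123456789012:role/cwl-to-firehose-role \
  --region eu-west-1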

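Cross-Region Error-Rate Alarms

CloudWatch alarms are regional, and a composite alarm can only reference alarms in its own account and region. To be paged when any region exceeds the error threshold, create one alarm per region and have each notify an SNS topic in its own region that feeds the same on-call endpoint:

# One 5XX error-rate alarm per region, computed with metric math (errors / requests).
# Assumes an ops-pagerduty SNS topic exists in every region.
for region in us-east-1 eu-west-1 ap-southeast-1; do
  # Placeholder: replace with this region's ALB dimension (app/<name>/<id>)
  LB_DIMENSION="app/prod-alb/REPLACE_ME"

  aws cloudwatch put-metric-alarm \
    --alarm-name "${region}-ErrorRate" \
    --alarm-description "${region} exceeds 1% error rate" \
    --metrics "[
      {\"Id\": \"errors\", \"ReturnData\": false, \"MetricStat\": {\"Period\": 60, \"Stat\": \"Sum\",
        \"Metric\": {\"Namespace\": \"AWS/ApplicationELB\", \"MetricName\": \"HTTPCode_Target_5XX_Count\",
          \"Dimensions\": [{\"Name\": \"LoadBalancer\", \"Value\": \"$LB_DIMENSION\"}]}}},
      {\"Id\": \"requests\", \"ReturnData\": false, \"MetricStat\": {\"Period\": 60, \"Stat\": \"Sum\",
        \"Metric\": {\"Namespace\": \"AWS/ApplicationELB\", \"MetricName\": \"RequestCount\",
          \"Dimensions\": [{\"Name\": \"LoadBalancer\", \"Value\": \"$LB_DIMENSION\"}]}}},
      {\"Id\": \"rate\", \"Expression\": \"errors / requests\", \"Label\": \"5XX error rate\", \"ReturnData\": true}
    ]" \
    --threshold 0.01 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 3 \
    --treat-missing-data notBreaching \
    --alarm-actions "arn:aws:sns:${region}:123456789012:ops-pagerduty" \
    --region "$region"
done

Replication lag alerting: Always create a CloudWatch alarm on the DynamoDB Global Tables ReplicationLatency metric and the Aurora Global Database AuroraGlobalDBReplicationLag metric (both reported in milliseconds). A lag spike above 5 seconds during normal operations is a warning sign; above 30 seconds before a planned failover means your RPO guarantee is at risk. Wire these alarms into your DR runbook go/no-go checks.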
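
A go/no-go check in the runbook can apply the two thresholds above directly. This is a minimal sketch with a hypothetical function name, `lag_verdict`, which takes the lag in whole seconds; metrics reported in milliseconds need converting first.

```shell
#!/bin/sh
# Map a replication lag reading (whole seconds) to a go/no-go verdict using the
# thresholds from the text: above 5 s is a warning, above 30 s blocks a planned failover.
lag_verdict() {
  lag=$1
  if [ "$lag" -gt 30 ]; then
    echo "NO-GO"
  elif [ "$lag" -gt 5 ]; then
    echo "WARN"
  else
    echo "GO"
  fi
}

lag_verdict 1    # prints GO
lag_verdict 12   # prints WARN
lag_verdict 45   # prints NO-GO
```

A runbook could refuse to start a planned failover unless the verdict is GO, and require explicit sign-off on WARN.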

Frequently Asked Questions

How much does multi-region add to my AWS bill?

Inter-region data transfer charges apply for cross-region replication: roughly $0.02/GB out of us-east-1 at the time of writing, compared with $0.09/GB for internet egress (check the current pricing page for your regions). For a typical application replicating 100GB/day, that is ~$60/month for data transfer alone. DynamoDB Global Tables bills replicated writes in each replica region, effectively doubling write costs for a two-region table. Aurora Global Database adds ~20% to the primary cluster cost for the replication infrastructure. Active-Active roughly doubles your compute costs. Always model the cost of the DR strategy against the cost of downtime before deciding.
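
The transfer arithmetic is simple enough to keep in a helper so you can rerun it as volumes and rates change. In this sketch the function name `monthly_transfer_cost_dollars` is hypothetical and the per-GB rate is a parameter in cents; the two rates used below are illustrative values only, so check current AWS pricing for your region pair.

```shell
#!/bin/sh
# Monthly transfer cost in whole dollars: GB/day x cents per GB x days, divided by 100.
# The third argument (days per month) defaults to 30.
monthly_transfer_cost_dollars() {
  gb_per_day=$1; cents_per_gb=$2; days=${3:-30}
  echo $(( gb_per_day * cents_per_gb * days / 100 ))
}

monthly_transfer_cost_dollars 100 9   # prints 270
monthly_transfer_cost_dollars 100 2   # prints 60
```

The gap between the two results shows how sensitive the estimate is to which transfer rate actually applies to your replication path.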

Should I use Global Accelerator or Route 53 for failover?

Use Global Accelerator when you need sub-30-second failover with no DNS propagation delays, when your clients pin or allowlist fixed IP addresses, or when you need the latency improvement of AWS's backbone routing. Route 53 failover works well when 60–90 second failover is acceptable and when you want to keep costs lower (Global Accelerator has a fixed fee of about $0.025/hour, roughly $18/month, plus data transfer premium fees). Many production architectures use both: Global Accelerator for the application tier, Route 53 health checks for monitoring and alerting.

What is the minimum viable multi-region setup?

The 80/20 answer: Pilot Light with automated Route 53 failover. Run an RDS Read Replica in a second region, pre-build AMIs monthly, create a Route 53 failover record pair with health checks, and automate the replica-promotion + ASG scale-up in a runbook. This gets you 15–30 minute RTO with an RPO equal to the replica lag (typically seconds to minutes), at roughly 15–20% of your primary region cost. It protects against the most common catastrophic failures (region-wide outage) without doubling your infrastructure bill.