AWS Well-Architected Framework: All 6 Pillars with Checklists (2026)

The AWS Well-Architected Framework is not a certification checklist — it is a structured way of thinking about architecture trade-offs across six dimensions that matter in production: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. This guide goes deep on every pillar with actionable checklists, real CLI commands, IAM policy JSON, Terraform examples, and a full walkthrough of the Well-Architected Tool.

1. Framework Overview & How Reviews Work

AWS published the first version of the Well-Architected Framework in 2015, based on patterns observed across thousands of customer architectures. The framework has since expanded to six pillars, a suite of Lenses, and a dedicated tool in the AWS Console that walks you through structured questions and produces a report with risk levels and improvement recommendations.

A Well-Architected Review (WAR) is a structured conversation between an architect and a workload owner. It is not an audit. The goal is to surface architectural risks early (ideally before launch, but also during operations) and produce a prioritised list of improvements. AWS Solutions Architects conduct formal WARs for customers, but you can run self-service reviews using the AWS Well-Architected Tool at any time, at no cost.

How the Well-Architected Tool Works

The tool organises reviews around workloads. A workload is a collection of resources and code that delivers business value: a microservice, a data pipeline, a SaaS application. For each workload you answer a set of questions per pillar. Answers map to one of three risk levels:

| Risk Level | Colour | Meaning |
| --- | --- | --- |
| High Risk Issue (HRI) | Red | A best practice not followed with significant potential impact |
| Medium Risk Issue (MRI) | Amber | A best practice partially followed or a known trade-off |
| No Issue | Green | Best practice fully met |

Key Principle: The Well-Architected Framework is not prescriptive — it acknowledges trade-offs. The right answer for a startup MVP is different from a regulated financial workload. The tool captures your reasoning via "notes" fields, so reviewers understand why a best practice was intentionally skipped.

After completing a review, the tool generates a report listing all HRIs and MRIs grouped by pillar. You can export the report as a PDF or JSON, track improvement plans over time, and mark issues as resolved as you remediate them. Subsequent reviews show progress over time — a powerful mechanism for communicating architecture maturity to leadership.

2. Pillar 1: Operational Excellence

Operational Excellence focuses on running and monitoring systems to deliver business value and continually improving supporting processes and procedures. The key insight: operations is code. Every runbook, alarm threshold, and deployment pipeline should be version-controlled, tested, and reviewed like application code.

Design Principles

  • Perform operations as code — use CloudFormation, CDK, or Terraform for all infrastructure. Store in Git. Use CodePipeline for deployment.
  • Make frequent, small, reversible changes — deploy feature flags, use blue/green deployments, keep changes small enough to roll back in under five minutes.
  • Refine operations procedures frequently — run game days quarterly. Simulate failure scenarios. Update runbooks after every incident.
  • Anticipate failure — pre-mortem every major change. Ask: what could go wrong? How would we detect it? How would we recover?
  • Learn from all operational failures — blameless post-mortems. Track all incidents in a central system. Trend data over quarters.

Infrastructure as Code with AWS CDK

All resources should be defined as code. The following CDK snippet creates an ECS service with a deployment circuit breaker and a CPU alarm, capturing the operational intent alongside the infrastructure definition:

import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

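// Assumes `cluster` and `taskDefinition` are defined earlier in the same stack.
const service = new ecs.FargateService(this, 'AppService', {
  cluster,
  taskDefinition,
  desiredCount: 3,
  minHealthyPercent: 100,   // Keep full capacity during deployments
  maxHealthyPercent: 200,
  circuitBreaker: { rollback: true },  // Auto-rollback on failed deployment
  enableExecuteCommand: false,         // Keep ECS Exec (interactive container access) off in prod
});

// Alarm on sustained high CPU
const cpuAlarm = new cloudwatch.Alarm(this, 'HighCpuAlarm', {
  // One-minute period so three evaluation periods equal three minutes
  metric: service.metricCpuUtilization({ period: cdk.Duration.minutes(1) }),
  threshold: 80,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
  evaluationPeriods: 3,
  alarmDescription: 'CPU > 80% for 3 consecutive minutes',
  actionsEnabled: true,
});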

Observability: The Three Pillars

Operational Excellence requires that you can observe your system's behaviour in production. Implement all three pillars of observability:

  • Metrics: CloudWatch custom metrics with EMF (Embedded Metric Format) for structured, queryable metrics from Lambda and ECS
  • Logs: structured JSON logging to CloudWatch Logs, queried with Logs Insights. Avoid free-text log lines; emit named fields so you can filter on them directly (for example, filter level = "ERROR") instead of pattern-matching @message
  • Traces: AWS X-Ray for distributed tracing across Lambda, ECS, API Gateway, and Step Functions
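
The structured-logging bullet can be illustrated with a minimal sketch. It assumes the service writes one JSON object per line to stdout, which Lambda and the ECS awslogs driver forward to CloudWatch Logs. The helper name formatLogLine and the field names (timestamp, level, message, orderId, attempt) are hypothetical, not part of any AWS SDK.

```typescript
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";
type LogFields = { [key: string]: string | number | boolean | null };

// Hypothetical helper: serialise one log event as a single JSON line.
function formatLogLine(
  level: LogLevel,
  message: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  // Reserved keys are spread last so caller-supplied fields cannot overwrite them.
  const reserved = { timestamp: now.toISOString(), level, message };
  return JSON.stringify({ ...fields, ...reserved });
}

// Example: emit an error event with two queryable fields.
console.log(formatLogLine("ERROR", "payment declined", { orderId: "o-123", attempt: 2 }));
```

Because each line is valid JSON, Logs Insights can discover the fields, so a query can filter on level or orderId rather than scanning message text.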

Game Days

A game day is a scheduled exercise where your team deliberately injects failures into a production or staging system to verify that runbooks work, alarms fire, and on-call engineers can recover without the pressure of an actual incident. Run game days at least quarterly. Scenarios to test: AZ failure simulation, database failover, deploy a broken version and verify rollback, throttle a downstream dependency, inject 5xx errors into an API.

Operational Excellence Checklist

| # | Practice | How to Verify |
| --- | --- | --- |
| 1 | All infrastructure defined as IaC (CDK/Terraform/CloudFormation) | No manually created resources in console |
| 2 | Deployments via CI/CD pipeline (CodePipeline, GitHub Actions) | Zero manual aws deploy commands in runbooks |
| 3 | Runbooks stored in Git, linked from PagerDuty/OpsGenie | Every alarm has a linked runbook URL |
| 4 | Structured JSON logging enabled on all services | CloudWatch Logs Insights query returns parsed fields |
| 5 | X-Ray tracing enabled on Lambda, API GW, ECS | Service map visible in X-Ray console |
| 6 | CloudWatch alarms on error rate, latency P99, and queue depth | Alarm inventory audit passes |
| 7 | Blue/green or canary deployments for all production changes | CodeDeploy configuration shows linear/canary strategy |
| 8 | Game day run in last 90 days | Game day report in wiki |
| 9 | Blameless post-mortem within 48h of every P1/P2 incident | Incident tracker shows post-mortem link on closed incidents |
| 10 | Deployment frequency and MTTR tracked as business KPIs | Dashboard shows DORA metrics |

3. Pillar 2: Security

The Security pillar covers protecting information, systems, and assets while delivering business value. Under the AWS Shared Responsibility Model, AWS secures the cloud infrastructure and you secure everything you run in the cloud. The pillar's seven areas are: Security Foundations, Identity and Access Management, Detection, Infrastructure Protection, Data Protection, Incident Response, and Application Security.

IAM Least Privilege — Real Policy Example

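Every Lambda function, ECS task, and EC2 instance should have a dedicated IAM role scoped to exactly the resources it needs. The following policy grants a Lambda function read access to one DynamoDB table, write access to one S3 prefix, and use of a single KMS key via DynamoDB. Because IAM denies anything that is not explicitly allowed, the Allow statements alone define the function's full permission set: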

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOrdersTable",
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:BatchGetItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders"
    },
    {
      "Sid": "WriteReportsBucket",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::company-reports/lambda-output/*"
    },
    {
      "Sid": "DecryptWithCMK",
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "dynamodb.us-east-1.amazonaws.com"
        }
      }
    }
  ]
}
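
A least-privilege policy like the one above can be checked mechanically before it is deployed. The following is a minimal sketch of such a check, not a substitute for IAM Access Analyzer: it only flags Allow statements that use a wildcard action or a wildcard resource. The function name findBroadAllows and the sample statements are illustrative assumptions.

```typescript
interface PolicyStatement {
  Sid?: string;
  Effect: "Allow" | "Deny";
  Action?: string | string[];
  Resource?: string | string[];
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

// IAM accepts either a single string or a list; normalise to a list.
function toArray(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

// Return the Sid (or index) of each Allow statement that grants "*",
// a service-wide action such as "s3:*", or the resource "*".
function findBroadAllows(policy: PolicyDocument): string[] {
  const findings: string[] = [];
  policy.Statement.forEach((stmt, index) => {
    if (stmt.Effect !== "Allow") return;
    const broadAction = toArray(stmt.Action).some((a) => a === "*" || a.slice(-2) === ":*");
    const broadResource = toArray(stmt.Resource).some((r) => r === "*");
    if (broadAction || broadResource) {
      findings.push(stmt.Sid !== undefined ? stmt.Sid : "statement[" + index + "]");
    }
  });
  return findings;
}

// Illustrative input: the second statement is deliberately too broad.
const examplePolicy: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [
    { Sid: "ReadOrdersTable", Effect: "Allow", Action: ["dynamodb:GetItem"], Resource: "arn:aws:dynamodb:us-east-1:123456789012:table/orders" },
    { Sid: "TooBroad", Effect: "Allow", Action: "s3:*", Resource: "*" },
  ],
};
console.log(findBroadAllows(examplePolicy));
```

A check of this kind fits naturally into a CI step that runs against policy JSON files in the repository.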

Enforce MFA with Service Control Policies

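Service Control Policies (SCPs) apply at the AWS Organizations level and restrict what can be done in member accounts, even by their root users. The SCP below denies every request made without MFA. Because workload and automation role sessions never carry MFA, the condition matches them too, so each such role must be listed in the aws:PrincipalArn exception (or the statement scoped to the roles your human users assume):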

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyWithoutMFA",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {
          "aws:MultiFactorAuthPresent": "false"
        },
        "StringNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/CodePipelineRole",
            "arn:aws:iam::*:role/TerraformDeployRole"
          ]
        }
      }
    }
  ]
}

Encryption at Rest and in Transit

Enable encryption on every storage service using AWS KMS Customer Managed Keys (CMKs) — not AWS-managed keys. CMKs give you control over key rotation, access policies, and CloudTrail audit logs of every decrypt operation.

# Create a CMK and capture its key ID from the response
KEY_ID=$(aws kms create-key \
  --description "Orders service data key" \
  --query 'KeyMetadata.KeyId' --output text)

# Turn on automatic yearly rotation (a separate call; create-key has no rotation flag)
aws kms enable-key-rotation --key-id "$KEY_ID"

# Create an alias
aws kms create-alias \
  --alias-name alias/orders-service \
  --target-key-id "$KEY_ID"

# Encrypt an S3 bucket with this CMK
aws s3api put-bucket-encryption \
  --bucket company-orders-data \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "alias/orders-service"
      },
      "BucketKeyEnabled": true
    }]
  }'

Detective Controls

Enable GuardDuty, Security Hub, and AWS Config in every account and every region — including regions you don't actively use. Attackers prefer unused regions precisely because they are unmonitored.

# Enable GuardDuty with S3, EKS audit log, and EBS malware protection
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES \
  --data-sources '{
    "S3Logs": {"Enable": true},
    "Kubernetes": {"AuditLogs": {"Enable": true}},
    "MalwareProtection": {"ScanEc2InstanceWithFindings": {"EbsVolumes": true}}
  }'

# Enable Security Hub with its default standards
REGION=$(aws configure get region)

aws securityhub enable-security-hub \
  --enable-default-standards

# Enable CIS AWS Foundations Benchmark v1.4
aws securityhub batch-enable-standards \
  --standards-subscription-requests \
    "StandardsArn=arn:aws:securityhub:${REGION}::standards/cis-aws-foundations-benchmark/v/1.4.0"

Security Checklist

| # | Control | Implementation |
| --- | --- | --- |
| 1 | MFA on all human IAM users and root account | SCP denies actions without MFA; root account uses hardware MFA |
| 2 | No long-lived access keys on IAM users | Rotate or remove; use IAM roles for all workloads |
| 3 | All S3 buckets block public access by default | aws s3control put-public-access-block --account-id |
| 4 | All data encrypted at rest with CMKs | KMS CMK per service; automatic rotation enabled |
| 5 | All data in transit uses TLS 1.2+ | ALB policy minimum TLS 1.2; API Gateway enforces HTTPS |
| 6 | GuardDuty enabled in all regions | AWS Organizations delegated admin; all protection plans on |
| 7 | Security Hub with CIS benchmark enabled | Auto-remediation for critical findings via Lambda |
| 8 | CloudTrail multi-region trail with log integrity | --enable-log-file-validation --kms-key-id |
| 9 | Incident response runbook documented and tested | Game day includes security incident scenario |
| 10 | VPC endpoints for all AWS services accessed from private subnets | No traffic to S3/DynamoDB/SQS crosses public internet |

4. Pillar 3: Reliability

The Reliability pillar focuses on ensuring a workload performs its intended function correctly and consistently when it is expected to. This means designing for failure — assuming components will fail and building the system so it recovers automatically. The key areas are: foundations (service quotas, network topology), workload architecture (distributed system design, fault isolation), change management, and failure management.

Multi-AZ Architecture

Every production workload must span at least two Availability Zones. An AZ failure is not a hypothetical — AWS has had documented AZ outages that affected single-AZ deployments. Use the following architecture pattern:

  • ALB across 2–3 AZs, target groups with health check interval of 10s
  • ECS or Auto Scaling Group with AZ rebalancing enabled
  • RDS Multi-AZ with automated failover (typically 60–120s for a Multi-AZ DB instance)
  • ElastiCache with Multi-AZ replica and automatic failover
  • S3 — inherently multi-AZ; 11 nines durability

Route 53 Health Checks and Failover Routing

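Implement Route 53 health checks to detect endpoint failures and trigger DNS failover. With a 10-second request interval and a failure threshold of 3, an endpoint is marked unhealthy after roughly 30 seconds; clients then switch over once their cached DNS answer expires: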

# Create a health check for the primary region and capture its ID from the response
HC_ID=$(aws route53 create-health-check \
  --caller-reference "$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "FullyQualifiedDomainName": "api.us-east-1.myapp.com",
    "Port": 443,
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "MeasureLatency": true,
    "Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
  }' \
  --query 'HealthCheck.Id' --output text)

# Create failover record set — Primary
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.myapp.com",
        "Type": "A",
        "SetIdentifier": "primary-us-east-1",
        "Failover": "PRIMARY",
        "HealthCheckId": "'$HC_ID'",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "alb-us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'
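
The failover timing can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: detection takes the request interval multiplied by the failure threshold, and resolvers honour the record TTL. The function name and the 60-second TTL in the example are assumptions; the interval and threshold come from the health check above.

```typescript
// Worst-case seconds until clients are sent to the secondary endpoint:
// time to detect the failure plus time for cached DNS answers to expire.
function worstCaseFailoverSeconds(
  requestIntervalSec: number,
  failureThreshold: number,
  recordTtlSec: number,
): number {
  const detectionSec = requestIntervalSec * failureThreshold;
  return detectionSec + recordTtlSec;
}

// RequestInterval 10 and FailureThreshold 3 from the health check, assumed TTL of 60s.
console.log(worstCaseFailoverSeconds(10, 3, 60));
```

The estimate shows why a short TTL matters: with these inputs the DNS cache lifetime, not detection, dominates the recovery time.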

Chaos Engineering with AWS Fault Injection Service

AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) lets you run controlled chaos experiments directly from the console or CLI. Define an experiment template once, then run it repeatedly on a schedule to build confidence in your recovery mechanisms:

# Create an FIS experiment template: terminate 30% of ECS tasks in one AZ
aws fis create-experiment-template \
  --description "Terminate 30% of ECS tasks in us-east-1a" \
  --targets '{
    "ecs-tasks": {
      "resourceType": "aws:ecs:task",
      "resourceArns": ["arn:aws:ecs:us-east-1:123456789012:cluster/prod"],
      "selectionMode": "PERCENT(30)",
      "filters": [{"path": "AvailabilityZone", "values": ["us-east-1a"]}]
    }
  }' \
  --actions '{
    "terminate-tasks": {
      "actionId": "aws:ecs:stop-task",
      "targets": {"Tasks": "ecs-tasks"}
    }
  }' \
  --stop-conditions '[{
    "source": "aws:cloudwatch:alarm",
    "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm/HighErrorRate"
  }]' \
  --role-arn arn:aws:iam::123456789012:role/FISExperimentRole

Service Quota Management

AWS imposes default quotas on almost every service. Running into a quota in production causes failures that look like infrastructure bugs. Proactively request quota increases before you need them:

# List the quotas currently applied to Lambda in this account and region
aws service-quotas list-service-quotas \
  --service-code lambda \
  --query 'Quotas[*].[QuotaName,Value,Unit]' \
  --output table

# Request a quota increase
aws service-quotas request-service-quota-increase \
  --service-code lambda \
  --quota-code L-B99A9384 \
  --desired-value 3000  # Concurrent executions per region
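
Deciding when to request an increase is a threshold check. The sketch below assumes you already have the applied limit and current usage for each quota (for example from the Service Quotas API and CloudWatch usage metrics). The type QuotaUsage, the function name, and the sample numbers are illustrative; the 80% default matches the reliability checklist.

```typescript
interface QuotaUsage {
  quotaName: string;
  limit: number; // applied quota value
  used: number;  // current peak usage
}

// Return the quotas whose utilisation has reached the threshold (default 80%).
function quotasNeedingIncrease(usages: QuotaUsage[], threshold: number = 0.8): QuotaUsage[] {
  return usages.filter((u) => u.limit > 0 && u.used / u.limit >= threshold);
}

// Illustrative numbers only.
const observed: QuotaUsage[] = [
  { quotaName: "Concurrent executions", limit: 1000, used: 850 },
  { quotaName: "Function and layer storage (GB)", limit: 75, used: 20 },
];
console.log(quotasNeedingIncrease(observed).map((u) => u.quotaName));
```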

Reliability Checklist

| # | Practice | How to Verify |
| --- | --- | --- |
| 1 | All critical services deployed across 2+ AZs | AWS Config rules ELBV2_MULTIPLE_AZ and AUTOSCALING_MULTIPLE_AZ + manual AZ count check |
| 2 | RDS Multi-AZ or Aurora Global Database enabled | RDS console shows "Multi-AZ: Yes" |
| 3 | Route 53 health checks with failover routing configured | Route 53 health check status shows healthy; simulate AZ failure |
| 4 | Backup policy: daily snapshots, tested restore quarterly | AWS Backup plan exists; last restore test < 90 days ago |
| 5 | RTO and RPO documented and tested for each workload | Documented in architecture runbook; FIS experiment validates RTO |
| 6 | Circuit breakers on all synchronous service calls | Code review; load test shows graceful degradation on dependency failure |
| 7 | Retry with exponential backoff and jitter on all AWS SDK calls | SDK config or middleware includes retry logic |
| 8 | Service quotas reviewed and increased before reaching 80% | CloudWatch metrics for quota utilisation; auto-request via Quotas API |
| 9 | Dead letter queues on all SQS queues and Lambda async invocations | aws sqs get-queue-attributes --attribute-names RedrivePolicy |
| 10 | Chaos engineering experiment run in last 90 days | FIS experiment history shows last run date |
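
Checklist item 7 mentions exponential backoff with jitter. The AWS SDKs ship their own retry logic, so the following is only a sketch of how the delay schedule can be computed for calls you retry yourself, using the "full jitter" variant (a random delay between zero and a capped exponential ceiling). The function name, the 100 ms base, and the 20 s cap are assumptions.

```typescript
// Full jitter: delay is uniformly random in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the schedule can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 20000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Print one possible schedule for five retries.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log("attempt " + attempt + ": wait " + backoffDelayMs(attempt) + " ms");
}
```

The jitter spreads retries from many clients over time, which avoids synchronised retry waves against a recovering dependency.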

5. Pillar 4: Performance Efficiency

Performance Efficiency is about using computing resources efficiently to meet system requirements and maintaining that efficiency as demand changes and technologies evolve. The current framework groups its best practices into five areas: architecture selection, compute and hardware, data management, networking and content delivery, and process and culture. Trade-offs such as caching, compression, and eventual consistency run through all of them.

Right Instance Selection and Graviton

AWS Graviton processors (Arm-based) offer up to 40% better price-performance than comparable x86 instances for many general-purpose workloads. For ECS Fargate, Lambda (arm64), and EC2, the infrastructure change is often a single property in your CDK or Terraform code, provided your container images and native dependencies are built for arm64:

import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// ECS Fargate task with Graviton (arm64)
const taskDefinition = new ecs.FargateTaskDefinition(this, 'Task', {
  cpu: 512,
  memoryLimitMiB: 1024,
  runtimePlatform: {
    cpuArchitecture: ecs.CpuArchitecture.ARM64,
    operatingSystemFamily: ecs.OperatingSystemFamily.LINUX,
  },
});

// Lambda on arm64: lower per-millisecond price than x86, and often faster for common runtimes
const fn = new lambda.Function(this, 'Fn', {
  runtime: lambda.Runtime.PYTHON_3_12,
  architecture: lambda.Architecture.ARM_64,  // arm64
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
});

Caching Strategy

Implement caching at multiple layers to reduce latency and database load. For read-heavy workloads, a well-designed caching strategy can cut database calls by as much as 90%:

  • CloudFront: CDN caching for static assets and cacheable API responses. Cache hit ratios above 80% are achievable for typical web workloads.
  • ElastiCache (Redis): application-level caching for session data, computed results, and frequently queried database records. Use cluster mode for horizontal scaling.
  • DAX (DynamoDB Accelerator): in-memory cache for DynamoDB. It is API-compatible with DynamoDB, so adopting it mostly means swapping in the DAX client; cached reads return in microseconds.
  • API Gateway caching: cache API responses at the gateway level for GET endpoints. With a TTL of 60–300 seconds, cached responses are served without invoking Lambda at all, so they never hit a cold start.
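
The application-level layer usually follows the cache-aside pattern: read from the cache, and on a miss load from the database and store the result with a TTL. Below is a minimal in-memory sketch of that pattern. It stands in for ElastiCache or DAX only conceptually; the class name TtlCache and the injectable clock are assumptions made so the behaviour can be tested.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAtMs: number;
}

class TtlCache<V> {
  private entries: { [key: string]: CacheEntry<V> } = {};
  private ttlMs: number;
  private nowMs: () => number;

  constructor(ttlMs: number, nowMs: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.nowMs = nowMs;
  }

  // Cache-aside read: return a fresh cached value, otherwise load and store it.
  getOrLoad(key: string, load: (key: string) => V): V {
    const hit = this.entries[key];
    if (hit !== undefined && hit.expiresAtMs > this.nowMs()) {
      return hit.value;
    }
    const value = load(key);
    this.entries[key] = { value, expiresAtMs: this.nowMs() + this.ttlMs };
    return value;
  }
}

// Usage: the loader stands in for a database query.
const orderCache = new TtlCache<string>(60000);
console.log(orderCache.getOrLoad("order-1", (id) => "row for " + id));
```

The TTL bounds how stale a cached record can be, which is the trade-off to tune per data type.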

Benchmarking and Load Testing

Never guess at performance — measure it. Use AWS Distributed Load Testing or k6 to run load tests from within your VPC before every major launch:

# Run a distributed load test using AWS Distributed Load Testing solution
# Deploy the CDK stack first, then trigger via API

# Or use k6 from an EC2 instance in the target region
k6 run --vus 1000 --duration 5m \
  --out json=results.json \
  script.js

# Analyse with CloudWatch Logs Insights
aws logs start-query \
  --log-group-name /aws/lambda/orders-api \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'filter @type = "REPORT"
    | stats avg(@duration), max(@duration),
            pct(@duration, 95), pct(@duration, 99)
      by bin(5m)'
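
For load-test results analysed outside CloudWatch (for example the k6 JSON output), percentiles can be computed directly. This sketch uses the nearest-rank definition; Logs Insights and k6 may interpolate differently, so small differences between tools are expected. The function name and sample latencies are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Illustrative latencies in milliseconds.
const latenciesMs = [120, 95, 110, 480, 105, 100, 98, 130, 102, 99];
console.log("p50:", percentile(latenciesMs, 50), "p99:", percentile(latenciesMs, 99));
```

Tail percentiles such as P99 expose the slow requests that an average hides, which is why the checklist below alerts on P99 rather than the mean.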

Performance Efficiency Checklist

| # | Practice | Action |
| --- | --- | --- |
| 1 | Graviton instances used for all new compute resources | arm64 architecture in ECS task defs and Lambda functions |
| 2 | CloudFront CDN in front of all public APIs and static assets | CloudFront cache hit ratio > 70% in CloudWatch metrics |
| 3 | ElastiCache or DAX for database read caching | Cache hit ratio monitored; miss rate alerts configured |
| 4 | Auto Scaling configured with predictive scaling for known traffic patterns | Target tracking policy on ECS service and ALB request count |
| 5 | Lambda memory right-sized using AWS Lambda Power Tuning | Lambda Power Tuning tool run on all production functions |
| 6 | RDS instance class right-sized monthly | CloudWatch CPU and connection utilisation < 70% average |
| 7 | P99 latency targets defined and alerted on | CloudWatch alarm on latency exceeding SLA threshold |
| 8 | Load test run before every major launch | Load test report attached to change request |
| 9 | Serverless-first for event-driven and batch workloads | Step Functions + Lambda for all batch pipelines |
| 10 | Spot Instances used for all fault-tolerant batch compute | EC2 Spot Fleet or Fargate Spot in ECS Capacity Providers |

6. Pillar 5: Cost Optimization

Cost Optimization is about delivering business value at the lowest price point. The five areas are: practise cloud financial management, expenditure and usage awareness, cost-effective resources, manage demand and supply, and optimise over time. Industry estimates often put wasted cloud spend at 30–40%, much of which is recoverable with the right practices.

Cost Visibility with AWS Cost Explorer

You cannot optimise what you cannot see. Start by tagging all resources and creating Cost Explorer reports grouped by service, tag, and region:

# Query cost by service for the last 30 days, highest first
# (a window that spans two calendar months yields one row per service per month)
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[].Groups[].[Keys[0],Metrics.BlendedCost.Amount]' \
  --output text | sort -t "$(printf '\t')" -k2,2 -rn

# Find spend on resources that are missing the cost-centre tag
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics BlendedCost \
  --filter '{
    "Tags": {
      "Key": "cost-centre",
      "MatchOptions": ["ABSENT"]
    }
  }' \
  --query 'ResultsByTime[*].Total.BlendedCost.Amount'

# Get Savings Plan recommendations
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS

Reserved Instances and Savings Plans

For steady-state workloads running more than 50% of the time, Compute Savings Plans save up to 66% over on-demand pricing with maximum flexibility (they apply automatically to EC2, Lambda, and Fargate usage). Reserved Instances save up to 72% for EC2 but are tied to an instance family and region:

| Commitment Type | Savings vs On-Demand | Flexibility | Best For |
| --- | --- | --- | --- |
| On-Demand | 0% | Maximum | Unpredictable spiky workloads |
| Spot Instances | 70–90% | Must handle interruption | Batch, ML training, stateless web |
| Compute Savings Plans | Up to 66% | High (any EC2, Lambda, Fargate) | Stable compute baseline |
| EC2 Reserved Instances | Up to 72% | Low (instance type locked) | Stable, known instance families |
| Savings Plans (1yr) | ~42% | High | Moderate savings with flexibility |
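
Whether a commitment pays off depends on how many hours the resource actually runs. The sketch below works through that arithmetic under a simplifying assumption: a commitment is paid for every hour of the month at the discounted rate, while on-demand is paid only for hours used. The function names, the hourly rate, and the hour counts are illustrative.

```typescript
// Fraction of hours a resource must run before a commitment beats on-demand.
function breakEvenUtilisation(discount: number): number {
  if (discount < 0 || discount >= 1) throw new Error("discount must be in [0, 1)");
  return 1 - discount;
}

interface MonthlyCost {
  onDemand: number;
  committed: number;
}

// Compare paying on-demand for hours used against a commitment for the whole month.
function compareMonthlyCost(
  onDemandHourly: number,
  hoursUsed: number,
  hoursInMonth: number,
  discount: number,
): MonthlyCost {
  return {
    onDemand: onDemandHourly * hoursUsed,
    committed: onDemandHourly * (1 - discount) * hoursInMonth,
  };
}

// Illustrative: $1.00/hour, 400 of 720 hours used, 50% discount.
console.log(breakEvenUtilisation(0.5), compareMonthlyCost(1.0, 400, 720, 0.5));
```

At the roughly 42% discount listed for a one-year plan, the break-even point is about 58% utilisation, which is consistent with the "more than 50% of the time" rule of thumb.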

S3 Lifecycle Policies

S3 storage is often the largest hidden cost. Implement lifecycle policies to transition objects to cheaper storage classes automatically:

aws s3api put-bucket-lifecycle-configuration \
  --bucket company-data-archive \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-and-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {"Days": 2555},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }]
  }'
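
To sanity-check a lifecycle rule before applying it, the transitions can be modelled as data and evaluated for a given object age. This sketch copies the day thresholds from the rule above; the function name and the "EXPIRED" marker are assumptions, and real transitions happen asynchronously, so actual timing is approximate.

```typescript
interface Transition {
  days: number;
  storageClass: string;
}

// Values copied from the lifecycle configuration above.
const transitions: Transition[] = [
  { days: 30, storageClass: "STANDARD_IA" },
  { days: 90, storageClass: "GLACIER_IR" },
  { days: 365, storageClass: "DEEP_ARCHIVE" },
];
const expirationDays = 2555;

// Return the storage class an object should be in at a given age in days.
function storageClassAtAge(ageDays: number): string {
  if (ageDays >= expirationDays) return "EXPIRED";
  let current = "STANDARD";
  for (const t of transitions) {
    if (ageDays >= t.days) current = t.storageClass;
  }
  return current;
}

console.log(storageClassAtAge(10), storageClassAtAge(120), storageClassAtAge(3000));
```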

Cost Optimization Checklist

| # | Practice | Tool / Command |
| --- | --- | --- |
| 1 | All resources tagged with cost-centre, environment, team, service | Tag policies in AWS Organizations; Config rule REQUIRED_TAGS |
| 2 | AWS Cost Anomaly Detection enabled with daily email alerts | Cost Explorer → Cost Anomaly Detection → Create monitor |
| 3 | Compute Savings Plans covering baseline EC2/Fargate/Lambda usage | aws ce get-savings-plans-purchase-recommendation |
| 4 | Spot Instances for all fault-tolerant batch and ML training workloads | Spot Fleet or ECS Capacity Provider with Fargate Spot |
| 5 | S3 lifecycle policies on all buckets older than 30 days | aws s3api get-bucket-lifecycle-configuration audit |
| 6 | RDS instance right-sizing reviewed monthly using CloudWatch metrics | AWS Compute Optimizer for RDS recommendations |
| 7 | Dev and staging environments shut down outside business hours | Instance Scheduler or AWS Systems Manager State Manager |
| 8 | Unattached EBS volumes, unused Elastic IPs, and idle load balancers cleaned up weekly | AWS Trusted Advisor low utilisation checks |
| 9 | Lambda functions right-sized using AWS Lambda Power Tuning | AWS Lambda Power Tuning open-source Step Functions tool |
| 10 | Monthly FinOps review meeting with engineering and finance | Cost Explorer custom dashboard shared with both teams |

7. Pillar 6: Sustainability

Added in 2021, the Sustainability pillar focuses on minimising the environmental impact of running cloud workloads. AWS reports that moving from on-premises infrastructure to AWS can reduce a workload's carbon footprint by up to 96%, but the choices you make within AWS still matter. The six areas are: region selection, user behaviour patterns, software and architecture patterns, data patterns, hardware and services patterns, and development and deployment process.

Carbon Footprint and Region Selection

AWS regions vary significantly in their energy mix. AWS's renewable energy commitments mean some regions run on 100% renewable energy already. For new workloads, check the AWS Customer Carbon Footprint Tool and prefer regions with high renewable energy percentages:

Regions with 100% renewable energy commitment (2026): eu-west-1 (Ireland), eu-north-1 (Stockholm), eu-central-1 (Frankfurt), ca-central-1 (Canada), us-west-2 (Oregon). For global deployments with flexibility on latency, prefer these regions as primary.

Efficient Code and Managed Services

Managed services such as DynamoDB on-demand, Aurora Serverless v2, and Fargate let capacity follow demand and drop to little or nothing when idle, whereas an always-on EC2 instance consumes resources whether or not it is doing work. Migrating intermittent batch workloads from always-on EC2 to Step Functions + Lambda can cut compute resource consumption by as much as 80%.

ARM Processors for Sustainability

Graviton processors consume significantly less power than x86 equivalents for the same amount of compute work. AWS states that Graviton3 instances use up to 60% less energy than comparable EC2 instances at the same performance, so moving a fleet from c5.2xlarge (Intel) to c7g.2xlarge (Graviton3) should show up as lower emissions in the AWS Customer Carbon Footprint Tool once its reporting reflects the migration.

Sustainability Checklist

| # | Practice | Benefit |
| --- | --- | --- |
| 1 | Deployed in a region with high renewable energy percentage | Direct carbon reduction proportional to workload size |
| 2 | Graviton (arm64) used for all compute where possible | ~60% less power per compute unit vs x86 |
| 3 | Serverless-first for event-driven workloads (scale to zero) | No idle compute resource consumption |
| 4 | EC2 instances right-sized (CPU < 70% average) | Avoids powering hardware that sits mostly idle |
| 5 | S3 Intelligent-Tiering enabled for unpredictable access patterns | Data stored at lowest-energy storage tier automatically |
| 6 | Data compression enabled (gzip, Snappy) on all data transfers | Reduces bytes transferred → less network energy |
| 7 | Spot Instances used for batch workloads | Uses spare capacity that would otherwise be idle |
| 8 | Carbon Footprint Tool reviewed quarterly | Tracks improvement over time; surfaces high-impact areas |

8. Well-Architected Tool: Running a Review via CLI

The Well-Architected Tool is available in the AWS Console and via API. Running reviews programmatically via CLI allows you to integrate them into your engineering workflow — for example, as part of a quarterly architecture governance process.

# Create a workload and capture its ID from the response
WORKLOAD_ID=$(aws wellarchitected create-workload \
  --workload-name "Orders Service" \
  --description "Core order processing microservice" \
  --environment PRODUCTION \
  --aws-regions us-east-1 eu-west-1 \
  --review-owner "platform-team@mycompany.com" \
  --lenses wellarchitected \
  --query 'WorkloadId' --output text)

# List all questions for the Security pillar
aws wellarchitected list-answers \
  --workload-id $WORKLOAD_ID \
  --lens-alias wellarchitected \
  --pillar-id security \
  --query 'AnswerSummaries[*].[QuestionId,QuestionTitle,Risk]' \
  --output table

# Update an answer (question and choice IDs are defined by the lens; confirm them with list-answers first)
aws wellarchitected update-answer \
  --workload-id $WORKLOAD_ID \
  --lens-alias wellarchitected \
  --question-id sec_securely_operate_multi_region \
  --selected-choices "sec_securely_operate_aws_account" "sec_securely_operate_control_objectives" \
  --notes "Using AWS Organizations with SCPs, GuardDuty delegated admin, Security Hub in all regions."

# Record a milestone (a point-in-time snapshot of the review)
aws wellarchitected create-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-name "Q2-2026-Baseline"

# Get risk summary
aws wellarchitected get-workload \
  --workload-id $WORKLOAD_ID \
  --query 'Workload.RiskCounts'

Interpreting the Report

The risk summary returns the workload's total counts of HRIs and MRIs; per-pillar counts are available from get-lens-review. A first review of a new workload typically surfaces 20–40 HRIs. This is normal: the goal is to reduce HRIs to zero and MRIs to acceptable levels over successive quarters.

Risk Level Guide: Prioritise HRIs in the Security and Reliability pillars first — these can cause outages and breaches. HRIs in Cost Optimization and Sustainability can wait. MRIs in Performance Efficiency are often acceptable trade-offs for early-stage products.

9. Lenses: SaaS, Serverless, Analytics

AWS Well-Architected Lenses extend the core framework with domain-specific questions and best practices. The lenses most relevant to modern cloud-native teams are:

Serverless Application Lens

Adds 58 additional questions covering Lambda function design, event-driven architectures, API Gateway patterns, Step Functions state machine design, and DynamoDB access patterns. Key additions over the core framework:

  • Function payload size and cold start optimisation
  • Event-driven error handling (DLQs, idempotency)
  • Step Functions vs SQS vs EventBridge for orchestration/choreography trade-offs
  • DynamoDB single-table design validation

SaaS Lens

Adds 45 questions specific to multi-tenant SaaS architectures: tenant isolation models (silo, pool, bridge), per-tenant metering, tenant-aware deployment pipelines, and onboarding automation. Particularly important if you are building a B2B product where one tenant's data or compute must never affect another.

Analytics Lens

Adds questions covering data lake architecture (S3, Glue, Athena, Redshift), streaming pipelines (Kinesis, MSK), data quality controls, and cost optimisation specific to data workloads (Redshift Reserved Nodes, Athena query optimisation, Glue DPU right-sizing).

# Associate additional lenses with a workload
# (aliases vary by lens; confirm them with list-lenses below)
aws wellarchitected associate-lenses \
  --workload-id $WORKLOAD_ID \
  --lens-aliases serverless softwareasaservice

# List available lenses
aws wellarchitected list-lenses \
  --lens-type AWS_OFFICIAL \
  --query 'LensSummaries[*].[LensAlias,LensName,LensVersion]' \
  --output table

10. Implementing Improvements: Prioritisation & IaC Remediation

A Well-Architected review produces a list of findings. Without a structured process for remediation, reports accumulate and nothing gets fixed. Use this prioritisation matrix to decide what to fix first:

Priority | Risk Level | Pillar | Typical Remediation Timeline
P1 | HRI | Security, Reliability | Within 2 weeks; assign dedicated engineer
P2 | HRI | Operational Excellence, Performance | Within 4 weeks; add to sprint backlog
P3 | MRI | Any pillar | Within next quarter; add to roadmap
P4 | HRI/MRI | Cost, Sustainability | Quarterly FinOps review cycle
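The matrix above can be applied mechanically when triaging a long findings list. Here is a minimal sketch, assuming findings are labelled with a risk level (`HRI` or `MRI`) and a lowercase pillar name. The function name and pillar labels are illustrative, not part of any AWS API. The matrix leaves one case ambiguous (an MRI in Cost or Sustainability matches both P3 and P4); this sketch assumes P4 wins.

```python
# Pillar groupings taken from the matrix above; the lowercase names are
# illustrative labels, not identifiers from any AWS API.
URGENT_PILLARS = {"security", "reliability"}
DEFERRED_PILLARS = {"cost", "sustainability"}


def remediation_priority(risk, pillar):
    """Return "P1".."P4" for a finding, or None when there is no issue.

    `risk` is "HRI" or "MRI"; `pillar` is a lowercase pillar label.
    """
    if risk not in {"HRI", "MRI"}:
        return None
    # Assumption: Cost/Sustainability findings always follow the FinOps
    # cycle (P4), even when they are MRIs that would also match P3.
    if pillar in DEFERRED_PILLARS:
        return "P4"
    if risk == "MRI":
        return "P3"
    return "P1" if pillar in URGENT_PILLARS else "P2"


# Hypothetical findings used only to exercise the function.
findings = [
    ("HRI", "security"),
    ("MRI", "reliability"),
    ("HRI", "performance"),
    ("HRI", "cost"),
]
for risk, pillar in findings:
    print(remediation_priority(risk, pillar), risk, pillar)
```

If your organisation treats Cost MRIs as roadmap items instead, move the `DEFERRED_PILLARS` check below the `MRI` check.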

Terraform Remediation Examples

Most architectural improvements can be delivered as Terraform or CDK changes. The following examples fix common HRIs:

# HRI Fix: Enable RDS Multi-AZ (Reliability pillar)
# Credentials (e.g. username, manage_master_user_password) and networking
# arguments are omitted here for brevity.
resource "aws_db_instance" "orders" {
  identifier        = "orders-db"
  engine            = "postgres"
  engine_version    = "16.2"
  instance_class    = "db.t4g.medium" # Graviton: cost + sustainability win
  allocated_storage = 100

  multi_az = true # Was: false (Reliability HRI fixed)
  # Was: false (Security HRI fixed). Changing this on an existing
  # instance forces replacement, so plan a snapshot-and-restore migration.
  storage_encrypted   = true
  kms_key_id          = aws_kms_key.rds.arn
  deletion_protection = true # Operational Excellence HRI fixed
  skip_final_snapshot = false
  # Static name: embedding timestamp() here would show a change on every plan.
  final_snapshot_identifier = "orders-db-final"

  backup_retention_period = 30 # Was: 7 (Reliability MRI improved)
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"] # OE improvement

  tags = {
    cost-centre = "platform"
    environment = "production"
    service     = "orders"
  }
}

# HRI Fix: Enable S3 versioning and encryption (Reliability + Security)
resource "aws_s3_bucket" "data" {
  bucket = "company-orders-data"
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.arn
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_public_access_block" "data" {
  bucket                  = aws_s3_bucket.data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# HRI Fix: Enable CloudTrail in all regions with integrity validation
resource "aws_cloudtrail" "org_trail" {
  name                          = "org-audit-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  is_multi_region_trail         = true         # Was: false — HRI fixed
  enable_log_file_validation    = true         # Integrity validation
  include_global_service_events = true
  kms_key_id                    = aws_kms_key.cloudtrail.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3:::company-orders-data/"]
    }
  }
}

Track Improvements Over Time

Create a milestone in the Well-Architected Tool after completing each remediation sprint. Milestones snapshot the current risk state — compare milestones quarter-over-quarter to demonstrate architecture maturity improvement to stakeholders.

# Create quarterly milestone after remediations
aws wellarchitected create-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-name "Q3-2026-After-Sprint-22"

# Compare risk counts between milestones
aws wellarchitected get-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-number 1 \
  --query 'Milestone.Workload.RiskCounts'

aws wellarchitected get-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-number 2 \
  --query 'Milestone.Workload.RiskCounts'
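
Reading two JSON blobs side by side gets tedious, so a small diff helper can make the trend explicit. This is a minimal sketch, assuming each `RiskCounts` value has been parsed into a dict keyed by risk level (for example `HIGH`, `MEDIUM`, `NONE`). The function name `risk_delta` and the sample numbers are hypothetical.

```python
def risk_delta(before, after):
    """Return after-minus-before for every risk level seen in either snapshot."""
    levels = sorted(set(before) | set(after))
    return {level: after.get(level, 0) - before.get(level, 0) for level in levels}


# Hypothetical RiskCounts from two milestones, not real review output.
milestone_1 = {"HIGH": 12, "MEDIUM": 9, "NONE": 30}
milestone_2 = {"HIGH": 5, "MEDIUM": 11, "NONE": 35}

delta = risk_delta(milestone_1, milestone_2)
for level, change in delta.items():
    print(f"{level}: {change:+d}")
```

A negative `HIGH` delta is the number to report to stakeholders; a rising `MEDIUM` count alongside it often just means HRIs were downgraded, not that new issues appeared.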

11. Frequently Asked Questions

Q: Is the Well-Architected Tool free?
Yes. The Well-Architected Tool itself is free. You pay for the AWS resources you use to remediate findings.
Q: How long does a Well-Architected Review take?
A self-service review of a single workload against all six pillars typically takes 2–4 hours for an experienced architect. A formal AWS-facilitated review with a Solutions Architect takes 4–8 hours spread across multiple sessions.
Q: Do I need to be Well-Architected to go live?
No. The framework describes best practices, not requirements. However, many enterprises require a WAR as part of their production readiness reviews. Run a baseline review before launch and address all Security and Reliability HRIs before go-live.
Q: Which pillar should I start with?
Start with Security and Reliability. A breach or outage causes far more business damage than inefficient cost or suboptimal performance. After those two pillars are green, focus on Operational Excellence, then Cost, then Performance and Sustainability.
Q: Can I create custom lenses?
Yes. You can create custom lenses via the Well-Architected Tool API or console. Custom lenses use the same JSON schema as AWS lenses and support custom questions, choices, and risk rules. This is useful for enforcing company-specific standards across all workloads.
