AWS Well-Architected Framework: All 6 Pillars with Checklists (2026)

The AWS Well-Architected Framework is not a certification checklist — it is a structured way of thinking about architecture trade-offs across six dimensions that matter in production: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. This guide goes deep on every pillar with actionable checklists, real CLI commands, IAM policy JSON, Terraform examples, and a full walkthrough of the Well-Architected Tool.

1. Framework Overview & How Reviews Work

AWS released the first version of the Well-Architected Framework in 2015 based on patterns observed across thousands of customer architectures. The framework has since expanded to six pillars, a suite of Lenses, and a dedicated tool in the AWS Console that walks you through structured questions and produces a scored report with risk levels and improvement recommendations.

A Well-Architected Review (WAR) is a structured conversation between an architect and a workload owner. It is not an audit. The goal is to surface architectural risks early (ideally before launch, but also during operations) and produce a prioritised list of improvements. AWS Solutions Architects conduct formal WARs for customers, but you can run self-service reviews using the AWS Well-Architected Tool at any time, at no cost.

How the Well-Architected Tool Works

The tool organises reviews around workloads. A workload is a collection of resources and code that delivers business value: a microservice, a data pipeline, a SaaS application. For each workload you answer a set of questions per pillar. Answers map to one of three risk levels (questions can also be left unanswered or marked not applicable):

Risk Level	Colour	Meaning
High Risk Issue (HRI)	Red	A best practice not followed with significant potential impact
Medium Risk Issue (MRI)	Amber	A best practice partially followed or a known trade-off
No Issue	Green	Best practice fully met

Key Principle: The Well-Architected Framework is not prescriptive — it acknowledges trade-offs. The right answer for a startup MVP is different from a regulated financial workload. The tool captures your reasoning via "notes" fields, so reviewers understand why a best practice was intentionally skipped.

After completing a review, the tool generates a report listing all HRIs and MRIs grouped by pillar. You can export the report as a PDF or JSON, track improvement plans over time, and mark issues as resolved as you remediate them. Subsequent reviews show progress over time — a powerful mechanism for communicating architecture maturity to leadership.
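
The per-pillar grouping in the report is easy to reproduce if you export answers and want to track them in your own tooling. The sketch below is illustrative only: the Answer shape and its field names are assumptions, not the tool's export schema.

```typescript
// Hypothetical shape for one answered question; field names are illustrative.
type Risk = "HIGH" | "MEDIUM" | "NONE";

interface Answer {
  pillar: string;
  questionId: string;
  risk: Risk;
}

interface PillarRiskCount {
  high: number;
  medium: number;
}

// Count HRIs and MRIs per pillar, mirroring how the report groups findings.
function riskCountsByPillar(answers: Answer[]): Map<string, PillarRiskCount> {
  const counts = new Map<string, PillarRiskCount>();
  for (const answer of answers) {
    const entry = counts.get(answer.pillar) ?? { high: 0, medium: 0 };
    if (answer.risk === "HIGH") entry.high += 1;
    if (answer.risk === "MEDIUM") entry.medium += 1;
    counts.set(answer.pillar, entry);
  }
  return counts;
}

// Made-up sample data for illustration.
const sampleAnswers: Answer[] = [
  { pillar: "Security", questionId: "q1", risk: "HIGH" },
  { pillar: "Security", questionId: "q2", risk: "HIGH" },
  { pillar: "Security", questionId: "q3", risk: "NONE" },
  { pillar: "Reliability", questionId: "q4", risk: "MEDIUM" },
];

const sampleCounts = riskCountsByPillar(sampleAnswers);
sampleCounts.forEach((count, pillar) => {
  console.log(`${pillar}: ${count.high} HRI, ${count.medium} MRI`);
});
```

Comparing these counts between two reviews gives a simple progress figure per pillar.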

2. Pillar 1: Operational Excellence

Operational Excellence focuses on running and monitoring systems to deliver business value and continually improving supporting processes and procedures. The key insight: operations is code. Every runbook, alarm threshold, and deployment pipeline should be version-controlled, tested, and reviewed like application code.

Design Principles

Perform operations as code — use CloudFormation, CDK, or Terraform for all infrastructure. Store in Git. Use CodePipeline for deployment.
Make frequent, small, reversible changes — deploy feature flags, use blue/green deployments, keep changes small enough to roll back in under five minutes.
Refine operations procedures frequently — run game days quarterly. Simulate failure scenarios. Update runbooks after every incident.
Anticipate failure — pre-mortem every major change. Ask: what could go wrong? How would we detect it? How would we recover?
Learn from all operational failures — blameless post-mortems. Track all incidents in a central system. Trend data over quarters.

Infrastructure as Code with AWS CDK

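All resources should be defined as code. The following CDK snippet creates an ECS Fargate service with a deployment circuit breaker and a CloudWatch alarm, capturing the operational intent alongside the infrastructure definition (cluster and taskDefinition are assumed to be defined elsewhere in the stack):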

import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

const service = new ecs.FargateService(this, 'AppService', {
  cluster,
  taskDefinition,
  desiredCount: 3,
  minHealthyPercent: 100,   // Zero-downtime deployments
  maxHealthyPercent: 200,
  circuitBreaker: { rollback: true },  // Auto-rollback on failure
  enableExecuteCommand: false,         // Keep ECS Exec (interactive shell access) off in prod unless needed for debugging
});

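// Alarm on sustained high CPU (swap in an error-rate metric, such as ALB 5xx counts, if you need an error alarm)
const cpuAlarm = new cloudwatch.Alarm(this, 'HighCpuAlarm', {
  metric: service.metricCpuUtilization({ period: cdk.Duration.minutes(1) }),
  threshold: 80,
  evaluationPeriods: 3,
  alarmDescription: 'CPU >= 80% for 3 consecutive 1-minute periods',
  actionsEnabled: true,
});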

Observability: The Three Pillars

Operational Excellence requires that you can observe your system's behaviour in production. Implement all three pillars of observability:

Metrics — CloudWatch custom metrics with EMF (Embedded Metric Format) for structured, queryable metrics from Lambda and ECS
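Logs — structured JSON logs sent to CloudWatch Logs and queried with Logs Insights. Avoid free-text log lines: with structured fields you can filter on a field directly, for example filter level = "ERROR" (assuming your logs carry a level field), instead of a regex match such as filter @message like /ERROR/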
Traces — AWS X-Ray for distributed tracing across Lambda, ECS, API Gateway, and Step Functions
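
To make the metrics item concrete, here is a minimal sketch of building one Embedded Metric Format log line by hand. The namespace, dimension name, and metric name are placeholders; in practice an EMF client library would normally produce this for you.

```typescript
interface EmfInput {
  namespace: string;    // CloudWatch namespace (placeholder value below)
  service: string;      // value of the "Service" dimension
  metricName: string;
  value: number;
  unit: string;         // a CloudWatch unit such as "Milliseconds" or "Count"
  timestampMs: number;  // epoch milliseconds
}

// Build a single EMF log line: the _aws block describes which top-level
// fields CloudWatch should extract as metrics and dimensions.
function emfLogLine(input: EmfInput): string {
  return JSON.stringify({
    _aws: {
      Timestamp: input.timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: input.namespace,
          Dimensions: [["Service"]],
          Metrics: [{ Name: input.metricName, Unit: input.unit }],
        },
      ],
    },
    Service: input.service,
    [input.metricName]: input.value,
  });
}

const exampleLine = emfLogLine({
  namespace: "OrdersService",
  service: "orders-api",
  metricName: "Latency",
  value: 123,
  unit: "Milliseconds",
  timestampMs: 1767225600000,
});

// Writing this line to stdout in Lambda or ECS (with a CloudWatch Logs driver)
// is what lets CloudWatch extract the metric.
console.log(exampleLine);
```

Because the line is ordinary JSON, the same record is also queryable as a structured log in Logs Insights.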

Game Days

A game day is a scheduled exercise where your team deliberately injects failures into a production or staging system to verify that runbooks work, alarms fire, and on-call engineers can recover without the pressure of an actual incident. Run game days at least quarterly. Scenarios to test: AZ failure simulation, database failover, deploy a broken version and verify rollback, throttle a downstream dependency, inject 5xx errors into an API.

Operational Excellence Checklist

#	Practice	How to Verify
1	All infrastructure defined as IaC (CDK/Terraform/CloudFormation)	No manually created resources in console
2	Deployments via CI/CD pipeline (CodePipeline, GitHub Actions)	Zero manual `aws deploy` commands in runbooks
3	Runbooks stored in Git, linked from PagerDuty/OpsGenie	Every alarm has a linked runbook URL
4	Structured JSON logging enabled on all services	CloudWatch Logs Insights query returns parsed fields
5	X-Ray tracing enabled on Lambda, API GW, ECS	Service map visible in X-Ray console
6	CloudWatch alarms on error rate, latency P99, and queue depth	Alarm inventory audit passes
7	Blue/green or canary deployments for all production changes	CodeDeploy configuration shows linear/canary strategy
8	Game day run in last 90 days	Game day report in wiki
9	Blameless post-mortem within 48h of every P1/P2 incident	Incident tracker shows post-mortem link on closed incidents
10	Deployment frequency and MTTR tracked as business KPIs	Dashboard shows DORA metrics
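
Item 10 asks for deployment frequency and MTTR as tracked KPIs. As a rough sketch of the arithmetic behind those two numbers, assuming you can export deployment timestamps and incident open/resolve times from your own tooling (the record shapes here are hypothetical):

```typescript
interface Incident {
  openedAt: Date;
  resolvedAt: Date;
}

// Mean time to restore, in minutes, across a set of resolved incidents.
function meanTimeToRestoreMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, incident) => sum + (incident.resolvedAt.getTime() - incident.openedAt.getTime()),
    0,
  );
  return totalMs / incidents.length / 60000;
}

// Average number of deployments per week over an observation window.
function deploymentsPerWeek(deploymentCount: number, windowDays: number): number {
  if (windowDays <= 0) throw new RangeError("windowDays must be positive");
  return deploymentCount / (windowDays / 7);
}

// Made-up sample data: one 30-minute and one 90-minute incident.
const sampleIncidents: Incident[] = [
  { openedAt: new Date("2026-01-05T10:00:00Z"), resolvedAt: new Date("2026-01-05T10:30:00Z") },
  { openedAt: new Date("2026-01-12T08:00:00Z"), resolvedAt: new Date("2026-01-12T09:30:00Z") },
];

console.log(`MTTR: ${meanTimeToRestoreMinutes(sampleIncidents)} minutes`);
console.log(`Deployments per week: ${deploymentsPerWeek(14, 28)}`);
```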

3. Pillar 2: Security

The Security pillar covers protecting information, systems, and assets while delivering business value. The AWS Shared Responsibility Model means AWS secures the cloud infrastructure; you secure everything in the cloud. The seven areas of the Security pillar are: Security Foundations, Identity and Access Management, Detection, Infrastructure Protection, Data Protection, Incident Response, and Application Security.

IAM Least Privilege — Real Policy Example

Every Lambda function, ECS task, and EC2 instance should have a dedicated IAM role scoped to exactly the resources it needs. The following policy grants a Lambda function read access to one DynamoDB table and write access to one specific S3 prefix — nothing else:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOrdersTable",
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:BatchGetItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders"
    },
    {
      "Sid": "WriteReportsBucket",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::company-reports/lambda-output/*"
    },
    {
      "Sid": "DecryptWithCMK",
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "dynamodb.us-east-1.amazonaws.com"
        }
      }
    }
  ]
}
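
A quick way to keep policies like this one tight is to lint them for wildcards before they are merged. The following is a minimal sketch, not a replacement for IAM Access Analyzer; the type definitions cover only the fields used here, and the sample policy is invented for the example.

```typescript
interface PolicyStatement {
  Sid?: string;
  Effect: "Allow" | "Deny";
  Action?: string | string[];
  Resource?: string | string[];
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

function toArray(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

// Flag Allow statements that use "*" or "service:*" actions, or "*" resources.
function findOverlyBroadStatements(policy: PolicyDocument): string[] {
  const findings: string[] = [];
  policy.Statement.forEach((statement, index) => {
    if (statement.Effect !== "Allow") return;
    const label = statement.Sid ?? `statement[${index}]`;
    if (toArray(statement.Action).some((a) => a === "*" || a.endsWith(":*"))) {
      findings.push(`${label}: wildcard action`);
    }
    if (toArray(statement.Resource).some((r) => r === "*")) {
      findings.push(`${label}: wildcard resource`);
    }
  });
  return findings;
}

// Invented sample: one scoped statement and one overly broad statement.
const samplePolicy: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [
    {
      Sid: "ReadOrdersTable",
      Effect: "Allow",
      Action: ["dynamodb:GetItem", "dynamodb:Query"],
      Resource: "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
    },
    { Sid: "TooBroad", Effect: "Allow", Action: "s3:*", Resource: "*" },
  ],
};

const policyFindings = findOverlyBroadStatements(samplePolicy);
policyFindings.forEach((finding) => console.log(finding));
```

A check like this fits naturally into a CI step that runs over every policy JSON file in the repository.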

Enforce MFA with Service Control Policies

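Service Control Policies (SCPs) apply at the AWS Organizations level and restrict what can be done in member accounts, even by their root users. The SCP below denies requests made without MFA. Because the condition matches any request whose MFA context is false or absent, it also blocks service and workload roles (for example Lambda execution roles) unless they are listed in the exemption, so test it on a sandbox OU and extend the exemption list before attaching it broadly: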

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyWithoutMFA",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {
          "aws:MultiFactorAuthPresent": "false"
        },
        "StringNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/CodePipelineRole",
            "arn:aws:iam::*:role/TerraformDeployRole"
          ]
        }
      }
    }
  ]
}

Encryption at Rest and in Transit

Enable encryption on every storage service, preferring AWS KMS customer managed keys (CMKs) over AWS managed keys for sensitive data. CMKs give you control over the key policy, the rotation schedule, and cross-account access, and every use of the key is recorded in CloudTrail.

# Create a CMK and capture its key ID from the response
KEY_ID=$(aws kms create-key \
  --description "Orders service data key" \
  --query 'KeyMetadata.KeyId' --output text)

# Turn on automatic rotation (yearly by default)
aws kms enable-key-rotation --key-id "$KEY_ID"

# Create an alias
aws kms create-alias \
  --alias-name alias/orders-service \
  --target-key-id "$KEY_ID"

# Encrypt an S3 bucket with this CMK
aws s3api put-bucket-encryption \
  --bucket company-orders-data \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "alias/orders-service"
      },
      "BucketKeyEnabled": true
    }]
  }'

Detective Controls

Enable GuardDuty, Security Hub, and AWS Config in every account and every region — including regions you don't actively use. Attackers prefer unused regions precisely because they are unmonitored.

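# Enable GuardDuty with S3, EKS audit log, and EBS malware protection
# (newer CLI versions favour --features over the legacy --data-sources option)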
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES \
  --data-sources '{
    "S3Logs": {"Enable": true},
    "Kubernetes": {"AuditLogs": {"Enable": true}},
    "MalwareProtection": {"ScanEc2InstanceWithFindings": {"EbsVolumes": true}}
  }'

# Enable Security Hub with CIS benchmark standard
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)

aws securityhub enable-security-hub \
  --enable-default-standards

# Enable CIS AWS Foundations Benchmark v1.4
aws securityhub batch-enable-standards \
  --standards-subscription-requests \
    "StandardsArn=arn:aws:securityhub:${REGION}::standards/cis-aws-foundations-benchmark/v/1.4.0"

Security Checklist

#	Control	Implementation
1	MFA on all human IAM users and root account	SCP denies actions without MFA; root account uses hardware MFA
2	No long-lived access keys on IAM users	Rotate or remove; use IAM roles for all workloads
3	All S3 buckets block public access by default	`aws s3control put-public-access-block --account-id`
4	All data encrypted at rest with CMKs	KMS CMK per service; automatic rotation enabled
5	All data in transit uses TLS 1.2+	ALB policy minimum TLS 1.2; API Gateway enforces HTTPS
6	GuardDuty enabled in all regions	AWS Organizations delegated admin; all protection plans on
7	Security Hub with CIS benchmark enabled	Auto-remediation for critical findings via Lambda
8	CloudTrail multi-region trail with log integrity	`--enable-log-file-validation --kms-key-id`
9	Incident response runbook documented and tested	Game day includes security incident scenario
10	VPC endpoints for all AWS services accessed from private subnets	No traffic to S3/DynamoDB/SQS crosses public internet

4. Pillar 3: Reliability

The Reliability pillar focuses on ensuring a workload performs its intended function correctly and consistently when it is expected to. This means designing for failure — assuming components will fail and building the system so it recovers automatically. The key areas are: foundations (service quotas, network topology), workload architecture (distributed system design, fault isolation), change management, and failure management.

Multi-AZ Architecture

Every production workload must span at least two Availability Zones. An AZ failure is not a hypothetical — AWS has had documented AZ outages that affected single-AZ deployments. Use the following architecture pattern:

ALB across 2–3 AZs, target groups with health check interval of 10s
ECS or Auto Scaling Group with AZ rebalancing enabled
RDS Multi-AZ with automated failover (typically 60–120 seconds for a Multi-AZ DB instance deployment)
ElastiCache with Multi-AZ replica and automatic failover
S3 — inherently multi-AZ; 11 nines durability
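
The case for spanning zones can be put in numbers. The sketch below uses the standard redundancy formula and rests on two assumptions that real systems only approximate: zones fail independently, and any single surviving zone can carry the full load. The 99% per-zone figure is purely illustrative.

```typescript
// Availability of a workload that stays up as long as at least one zone is up,
// assuming independent zone failures.
function availabilityAcrossZones(perZoneAvailability: number, zones: number): number {
  if (perZoneAvailability < 0 || perZoneAvailability > 1) {
    throw new RangeError("perZoneAvailability must be between 0 and 1");
  }
  if (!Number.isInteger(zones) || zones < 1) {
    throw new RangeError("zones must be a positive integer");
  }
  return 1 - Math.pow(1 - perZoneAvailability, zones);
}

// Illustrative input: a deployment that is 99% available within one zone.
for (const zones of [1, 2, 3]) {
  const availability = availabilityAcrossZones(0.99, zones);
  console.log(`${zones} zone(s): ${(availability * 100).toFixed(4)}%`);
}
```

Correlated failures (a bad deployment, a regional control-plane issue) break the independence assumption, which is why the other practices in this pillar still matter.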

Route 53 Health Checks and Failover Routing

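Implement Route 53 health checks to detect endpoint failures and trigger DNS failover. With a 10-second request interval and a failure threshold of 3, detection takes roughly 30 seconds; clients then switch over as their cached DNS answers expire: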

# Create a health check for the primary region and capture its ID from the response
HC_ID=$(aws route53 create-health-check \
  --caller-reference "$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "FullyQualifiedDomainName": "api.us-east-1.myapp.com",
    "Port": 443,
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "MeasureLatency": true,
    "Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
  }' \
  --query 'HealthCheck.Id' --output text)

# Create failover record set — Primary
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.myapp.com",
        "Type": "A",
        "SetIdentifier": "primary-us-east-1",
        "Failover": "PRIMARY",
        "HealthCheckId": "'$HC_ID'",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "alb-us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

Chaos Engineering with AWS Fault Injection Service

AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) lets you run controlled chaos experiments directly from the console or CLI. Define experiments once, then run them repeatedly on a schedule to build confidence in your recovery mechanisms:

# Create an FIS experiment template: stop 30% of a service's ECS tasks in one AZ
# (cluster and service names are placeholders; confirm the filter path against the FIS docs for aws:ecs:task)
aws fis create-experiment-template \
  --description "Stop 30% of ECS tasks in us-east-1a" \
  --targets '{
    "ecs-tasks": {
      "resourceType": "aws:ecs:task",
      "parameters": {"cluster": "prod", "service": "orders"},
      "selectionMode": "PERCENT(30)",
      "filters": [{"path": "AvailabilityZone", "values": ["us-east-1a"]}]
    }
  }' \
  --actions '{
    "terminate-tasks": {
      "actionId": "aws:ecs:stop-task",
      "targets": {"Tasks": "ecs-tasks"}
    }
  }' \
  --stop-conditions '[{
    "source": "aws:cloudwatch:alarm",
    "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm/HighErrorRate"
  }]' \
  --role-arn arn:aws:iam::123456789012:role/FISExperimentRole

Service Quota Management

AWS imposes default quotas on almost every service. Running into a quota in production causes failures that look like infrastructure bugs. Proactively request quota increases before you need them:

# List all applied quotas for Lambda
aws service-quotas list-applied-quotas-for-service \
  --service-code lambda \
  --query 'Quotas[?Value!=null].[QuotaName,Value,Unit]' \
  --output table

# Request a quota increase
aws service-quotas request-service-quota-increase \
  --service-code lambda \
  --quota-code L-B99A9384 \
  --desired-value 3000  # Concurrent executions per region
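
To decide which quotas to raise, compare current usage against the applied limit. The sketch below flags quotas at or above a utilisation threshold; the 80% default mirrors the checklist further down, and the QuotaUsage shape and sample figures are hypothetical (you would populate them from Service Quotas and CloudWatch usage metrics).

```typescript
interface QuotaUsage {
  quotaName: string;
  limit: number;    // applied quota value
  current: number;  // observed peak usage
}

// Return the quotas whose utilisation is at or above the threshold (0.8 = 80%).
function quotasNeedingIncrease(usages: QuotaUsage[], threshold: number = 0.8): QuotaUsage[] {
  return usages.filter((usage) => usage.limit > 0 && usage.current / usage.limit >= threshold);
}

// Invented sample figures.
const sampleUsages: QuotaUsage[] = [
  { quotaName: "Lambda concurrent executions", limit: 1000, current: 850 },
  { quotaName: "VPCs per Region", limit: 5, current: 2 },
  { quotaName: "Elastic IPs", limit: 5, current: 4 },
];

const quotasToRaise = quotasNeedingIncrease(sampleUsages);
quotasToRaise.forEach((usage) => {
  const pct = Math.round((usage.current / usage.limit) * 100);
  console.log(`${usage.quotaName}: ${pct}% used, request an increase`);
});
```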

Reliability Checklist

#	Practice	How to Verify
1	All critical services deployed across 2+ AZs	AWS Config rules such as `autoscaling-multiple-az` and `elbv2-multiple-az` + manual AZ count check
2	RDS Multi-AZ or Aurora Global Database enabled	RDS console shows "Multi-AZ: Yes"
3	Route 53 health checks with failover routing configured	Route 53 health check status shows healthy; simulate AZ failure
4	Backup policy: daily snapshots, tested restore quarterly	AWS Backup plan exists; last restore test < 90 days ago
5	RTO and RPO documented and tested for each workload	Documented in architecture runbook; FIS experiment validates RTO
6	Circuit breakers on all synchronous service calls	Code review; load test shows graceful degradation on dependency failure
7	Retry with exponential backoff and jitter on all AWS SDK calls	SDK config or middleware includes retry logic
8	Service quotas reviewed and increased before reaching 80%	CloudWatch metrics for quota utilisation; auto-request via Quotas API
9	Dead letter queues on all SQS queues and Lambda async invocations	`aws sqs get-queue-attributes --attribute-names RedrivePolicy`
10	Chaos engineering experiment run in last 90 days	FIS experiment history shows last run date
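
Item 7 is worth spelling out, since naive retries can amplify an outage. Below is a minimal sketch of exponential backoff with full jitter. The AWS SDKs ship their own retry handling, so treat this as an illustration for calls that are not covered by it; the function names and parameters are illustrative.

```typescript
// Delay before retry number `attempt` (0-based): a random value between 0 and
// min(capMs, baseMs * 2^attempt). The random source is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retry an async operation up to maxAttempts times, sleeping with jittered
// backoff between failures and rethrowing the last error if all attempts fail.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseMs: number,
  capMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, capMs));
      }
    }
  }
  throw lastError;
}

// Upper bounds of the delay for the first five retries with base 100 ms, cap 2 s.
const delayCeilings = [0, 1, 2, 3, 4].map((attempt) => backoffDelayMs(attempt, 100, 2000, () => 0.999999));
console.log(delayCeilings.join(", "));
```

The jitter spreads retries from many clients over time, which avoids synchronized retry storms against a recovering dependency.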

5. Pillar 4: Performance Efficiency

Performance Efficiency is about using computing resources efficiently to meet system requirements and maintaining that efficiency as demand changes and technologies evolve. The four areas are: selection (right compute, storage, database, network), review (keep up with AWS innovation), monitoring, and trade-offs (use caching, compression, eventual consistency where appropriate).

Right Instance Selection and Graviton

AWS Graviton processors (Arm-based) offer, according to AWS, up to 40% better price-performance than comparable x86 instances for many general-purpose workloads. Switching ECS Fargate, Lambda (arm64), or EC2 to Graviton is often a one-line change in your CDK or Terraform code, provided your container images and native dependencies are built for arm64:

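// Assumes: import * as ecs from 'aws-cdk-lib/aws-ecs';
//          import * as lambda from 'aws-cdk-lib/aws-lambda';
// ECS Fargate task with Graviton (arm64)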
const taskDefinition = new ecs.FargateTaskDefinition(this, 'Task', {
  cpu: 512,
  memoryLimitMiB: 1024,
  runtimePlatform: {
    cpuArchitecture: ecs.CpuArchitecture.ARM64,
    operatingSystemFamily: ecs.OperatingSystemFamily.LINUX,
  },
});

// Lambda on arm64: lower per-GB-second price than x86, often with similar or better performance
const fn = new lambda.Function(this, 'Fn', {
  runtime: lambda.Runtime.PYTHON_3_12,
  architecture: lambda.Architecture.ARM_64,  // arm64
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
});

Caching Strategy

Implement caching at multiple layers to reduce latency and database load. A well-designed caching strategy can reduce database calls by 90% for read-heavy workloads:

CloudFront — CDN caching for static assets and cacheable API responses. Cache hit ratios above 80% are achievable for typical web workloads.
ElastiCache (Redis) — application-level caching for session data, computed results, and frequently queried database records. Use cluster mode for horizontal scaling.
DAX (DynamoDB Accelerator) — in-memory cache for DynamoDB. Microsecond response times for cached reads, with code changes limited to swapping in the DAX client.
API Gateway caching — cache API responses at the gateway level for GET endpoints. With a TTL of 60–300 seconds, cache hits are served without invoking Lambda at all, so they avoid cold starts entirely.
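
The application-level layer usually follows the cache-aside pattern: check the cache, fall back to the source on a miss, then store the result with a TTL. The sketch below uses an in-memory map as a stand-in for ElastiCache so that it runs on its own; the class and function names are illustrative.

```typescript
// Tiny in-memory TTL cache standing in for Redis. The clock is injectable
// so expiry can be exercised without waiting.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Cache-aside read: return the cached value, or load it and cache it on a miss.
function cacheAside<V>(cache: TtlCache<V>, key: string, load: () => V): V {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const value = load();
  cache.set(key, value);
  return value;
}

const orderCache = new TtlCache<string>(60_000);
const firstRead = cacheAside(orderCache, "order:42", () => "loaded from database");
const secondRead = cacheAside(orderCache, "order:42", () => "should not be called");
console.log(firstRead, "|", secondRead);
```

Choosing the TTL is the real design decision: it bounds how stale a read can be, so it should follow from how quickly the underlying data changes.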

Benchmarking and Load Testing

Never guess at performance — measure it. Use AWS Distributed Load Testing or k6 to run load tests from within your VPC before every major launch:

# Run a distributed load test using the Distributed Load Testing on AWS solution
# Deploy the solution's CloudFormation template first, then trigger tests via its console or API

# Or use k6 from an EC2 instance in the target region
k6 run --vus 1000 --duration 5m \
  --out json=results.json \
  script.js

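# Analyse with CloudWatch Logs Insights (date -d is GNU date syntax;
# fetch the results afterwards with: aws logs get-query-results --query-id <id>)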
aws logs start-query \
  --log-group-name /aws/lambda/orders-api \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'filter @type = "REPORT"
    | stats avg(@duration), max(@duration),
            pct(@duration, 95), pct(@duration, 99)
      by bin(5m)'
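
If you are post-processing raw k6 or application timings yourself, percentiles are straightforward to compute. The sketch below uses the nearest-rank method, which is one of several common definitions; CloudWatch's pct() may interpolate differently, so small discrepancies between the two are expected.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Illustrative latencies in milliseconds: 1, 2, ..., 100.
const latenciesMs: number[] = Array.from({ length: 100 }, (_, i) => i + 1);

console.log(`p50=${percentile(latenciesMs, 50)} p95=${percentile(latenciesMs, 95)} p99=${percentile(latenciesMs, 99)}`);
```

Averages hide tail latency, which is why the checklist below asks for P99 targets rather than mean response time.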

Performance Efficiency Checklist

#	Practice	Action
1	Graviton instances used for all new compute resources	arm64 architecture in ECS task defs and Lambda functions
2	CloudFront CDN in front of all public APIs and static assets	CloudFront cache hit ratio > 70% in CloudWatch metrics
3	ElastiCache or DAX for database read caching	Cache hit ratio monitored; miss rate alerts configured
4	Auto Scaling configured with predictive scaling for known traffic patterns	Target tracking policy on ECS service and ALB request count
5	Lambda memory right-sized using AWS Lambda Power Tuning	Lambda Power Tuning tool run on all production functions
6	RDS instance class right-sized monthly	CloudWatch CPU and connection utilisation < 70% average
7	P99 latency targets defined and alerted on	CloudWatch alarm on latency exceeding SLA threshold
8	Load test run before every major launch	Load test report attached to change request
9	Serverless-first for event-driven and batch workloads	Step Functions + Lambda for all batch pipelines
10	Spot Instances used for all fault-tolerant batch compute	EC2 Spot Fleet or Fargate Spot in ECS Capacity Providers

6. Pillar 5: Cost Optimization

Cost Optimization is about delivering business value at the lowest price point. The five areas are: practise cloud financial management, expenditure and usage awareness, cost-effective resources, manage demand and supply, and optimise over time. Industry surveys commonly put wasted cloud spend at around 30%, and much of that is recoverable with the right practices.

Cost Visibility with AWS Cost Explorer

You cannot optimise what you cannot see. Start by tagging all resources and creating Cost Explorer reports grouped by service, tag, and region:

# Query cost by service for the last 30 days, highest first
# (a 30-day window usually spans two calendar months, so a service can appear once per month)
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[].Groups[].[Keys[0],Metrics.BlendedCost.Amount]' \
  --output text | sort -t$'\t' -k2,2 -rn

# Find daily spend on resources missing the cost-centre tag
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics BlendedCost \
  --filter '{
    "Tags": {
      "Key": "cost-centre",
      "MatchOptions": ["ABSENT"]
    }
  }' \
  --query 'ResultsByTime[*].Total.BlendedCost.Amount'

# Get Savings Plan recommendations
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS

Reserved Instances and Savings Plans

For steady-state workloads running more than 50% of the time, Compute Savings Plans save up to 66% over on-demand pricing with maximum flexibility (they apply automatically to EC2, Lambda, and Fargate usage). Standard Reserved Instances save up to 72% for EC2 but are tied to an instance family and region; the largest discounts in both cases require a three-year term:

Commitment Type	Savings vs On-Demand	Flexibility	Best For
On-Demand	0%	Maximum	Unpredictable spiky workloads
Spot Instances	Up to 90%	Must handle interruption	Batch, ML training, stateless web
Compute Savings Plans	Up to 66%	High (any EC2, Lambda, Fargate)	Stable compute baseline
EC2 Reserved Instances	Up to 72%	Low (instance type locked)	Stable, known instance families
Savings Plans (1yr)	~42%	High	Moderate savings with flexibility
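
The "more than 50% of the time" rule of thumb follows from simple arithmetic: a commitment is billed for every hour, while on-demand is billed only for hours used, so the commitment wins once utilisation exceeds one minus the discount. The sketch below makes that explicit under a simplifying assumption that the commitment is sized to exactly one resource and there is no partial coverage.

```typescript
// Utilisation (fraction of hours the resource runs) above which a commitment
// with the given discount is cheaper than paying on-demand.
function breakEvenUtilisation(discount: number): number {
  if (discount < 0 || discount >= 1) throw new RangeError("discount must be in [0, 1)");
  return 1 - discount;
}

function cheaperOption(utilisation: number, discount: number): "commitment" | "on-demand" {
  return utilisation > breakEvenUtilisation(discount) ? "commitment" : "on-demand";
}

// Discount figures taken from the table above.
for (const discount of [0.42, 0.66]) {
  const breakEven = Math.round(breakEvenUtilisation(discount) * 100);
  console.log(`${Math.round(discount * 100)}% discount: break-even at ${breakEven}% utilisation`);
}
```

At a 42% discount the break-even sits near 58% utilisation, so a workload that runs only half the time would still be cheaper on demand.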

S3 Lifecycle Policies

S3 storage is often the largest hidden cost. Implement lifecycle policies to transition objects to cheaper storage classes automatically:

aws s3api put-bucket-lifecycle-configuration \
  --bucket company-data-archive \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-and-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {"Days": 2555},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }]
  }'
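
To sanity-check a rule before applying it, it can help to model which storage class an object should be in at a given age. The sketch below encodes the thresholds from the rule above. It is an approximation: S3 applies lifecycle actions asynchronously, so real transitions can lag the configured day.

```typescript
interface Transition {
  days: number;
  storageClass: string;
}

// Storage class an object is expected to occupy at a given age, or "EXPIRED"
// once it has passed the expiration threshold.
function storageClassAtAge(ageDays: number, transitions: Transition[], expirationDays: number): string {
  if (ageDays >= expirationDays) return "EXPIRED";
  let current = "STANDARD";
  const ordered = [...transitions].sort((a, b) => a.days - b.days);
  for (const transition of ordered) {
    if (ageDays >= transition.days) current = transition.storageClass;
  }
  return current;
}

// Thresholds copied from the lifecycle rule above.
const logTransitions: Transition[] = [
  { days: 30, storageClass: "STANDARD_IA" },
  { days: 90, storageClass: "GLACIER_IR" },
  { days: 365, storageClass: "DEEP_ARCHIVE" },
];

for (const age of [10, 45, 100, 400, 3000]) {
  console.log(`${age} days: ${storageClassAtAge(age, logTransitions, 2555)}`);
}
```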

Cost Optimization Checklist

#	Practice	Tool / Command
1	All resources tagged with cost-centre, environment, team, service	Tag policies in AWS Organizations; Config rule `REQUIRED_TAGS`
2	AWS Cost Anomaly Detection enabled with daily email alerts	Cost Explorer → Cost Anomaly Detection → Create monitor
3	Compute Savings Plans covering baseline EC2/Fargate/Lambda usage	`aws ce get-savings-plans-purchase-recommendation`
4	Spot Instances for all fault-tolerant batch and ML training workloads	Spot Fleet or ECS Capacity Provider with Fargate Spot
5	S3 lifecycle policies on all buckets older than 30 days	`aws s3api get-bucket-lifecycle-configuration` audit
6	RDS instance right-sizing reviewed monthly using CloudWatch metrics	AWS Compute Optimizer for RDS recommendations
7	Dev and staging environments shut down outside business hours	Instance Scheduler or AWS Systems Manager State Manager
8	Unattached EBS volumes, unused Elastic IPs, and idle load balancers cleaned up weekly	AWS Trusted Advisor low utilisation checks
9	Lambda functions right-sized using AWS Lambda Power Tuning	AWS Lambda Power Tuning open-source Step Functions tool
10	Monthly FinOps review meeting with engineering and finance	Cost Explorer custom dashboard shared with both teams

7. Pillar 6: Sustainability

Added in 2021, the Sustainability pillar focuses on minimising the environmental impacts of running cloud workloads. AWS cites commissioned research estimating that moving from on-premises infrastructure to AWS can reduce a workload's carbon footprint by up to 96%, but the choices you make within AWS still matter. The six areas are: region selection, user behaviour patterns, software and architecture patterns, data patterns, hardware and services patterns, and development and deployment process.

Carbon Footprint and Region Selection

AWS regions vary significantly in their energy mix. AWS's renewable energy commitments mean some regions run on 100% renewable energy already. For new workloads, check the AWS Customer Carbon Footprint Tool and prefer regions with high renewable energy percentages:

Regions with 100% renewable energy commitment (2026): eu-west-1 (Ireland), eu-north-1 (Stockholm), eu-central-1 (Frankfurt), ca-central-1 (Canada), us-west-2 (Oregon). For global deployments with flexibility on latency, prefer these regions as primary.

Efficient Code and Managed Services

Managed services such as Lambda, DynamoDB on-demand, and Aurora Serverless v2 scale capacity down automatically when idle, and Fargate runs only the tasks you ask for, whereas an always-on EC2 fleet keeps consuming resources regardless of load. Migrating batch workloads from always-on EC2 to Step Functions + Lambda can reduce compute resource consumption by as much as 80% for intermittent workloads.

ARM Processors for Sustainability

Graviton processors consume significantly less power than x86 equivalents for the same amount of compute work. AWS states that Graviton3-based instances use up to 60% less energy than comparable EC2 instances for the same performance, so moving a fleet from c5.2xlarge (Intel) to c7g.2xlarge (Graviton3) should become visible in the AWS Customer Carbon Footprint Tool, although its data lags by a few months.

Sustainability Checklist

#	Practice	Benefit
1	Deployed in a region with high renewable energy percentage	Direct carbon reduction proportional to workload size
2	Graviton (arm64) used for all compute where possible	Up to 60% less energy for the same performance vs comparable x86 instances
3	Serverless-first for event-driven workloads (scale to zero)	No idle compute resource consumption
4	EC2 instances right-sized (CPU < 70% average)	Avoids powering hardware that sits mostly idle
5	S3 Intelligent-Tiering enabled for unpredictable access patterns	Data stored at lowest-energy storage tier automatically
6	Data compression enabled (gzip, Snappy) on all data transfers	Reduces bytes transferred → less network energy
7	Spot Instances used for batch workloads	Uses spare capacity that would otherwise be idle
8	Carbon Footprint Tool reviewed quarterly	Tracks improvement over time; surfaces high-impact areas

8. Well-Architected Tool: Running a Review via CLI

The Well-Architected Tool is available in the AWS Console and via API. Running reviews programmatically via CLI allows you to integrate them into your engineering workflow — for example, as part of a quarterly architecture governance process.

# Create a workload and capture its ID from the response
WORKLOAD_ID=$(aws wellarchitected create-workload \
  --workload-name "Orders Service" \
  --description "Core order processing microservice" \
  --environment PRODUCTION \
  --aws-regions us-east-1 eu-west-1 \
  --review-owner "platform-team@mycompany.com" \
  --lenses wellarchitected \
  --query 'WorkloadId' --output text)

# List all questions for the Security pillar
aws wellarchitected list-answers \
  --workload-id $WORKLOAD_ID \
  --lens-alias wellarchitected \
  --pillar-id security \
  --query 'AnswerSummaries[*].[QuestionId,QuestionTitle,Risk]' \
  --output table

# Update an answer (mark best practices selected)
# Question and choice IDs change between lens versions; confirm them with list-answers first
aws wellarchitected update-answer \
  --workload-id $WORKLOAD_ID \
  --lens-alias wellarchitected \
  --question-id securely-operate \
  --selected-choices "sec_securely_operate_aws_account" "sec_securely_operate_control_objectives" \
  --notes "Using AWS Organizations with SCPs, GuardDuty delegated admin, Security Hub in all regions."

# Record a milestone (a point-in-time snapshot of the review)
aws wellarchitected create-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-name "Q2-2026-Baseline"

# Get risk summary
aws wellarchitected get-workload \
  --workload-id $WORKLOAD_ID \
  --query 'Workload.RiskCounts'

Interpreting the Report

The risk summary returns workload-level counts of HRIs and MRIs; run aws wellarchitected get-lens-review for the per-pillar breakdown. A first review of a new workload often surfaces 20–40 HRIs. This is normal. The goal is to reduce HRIs to zero and MRIs to acceptable levels over successive quarters.

Risk Level Guide: Prioritise HRIs in the Security and Reliability pillars first — these can cause outages and breaches. HRIs in Cost Optimization and Sustainability can wait. MRIs in Performance Efficiency are often acceptable trade-offs for early-stage products.

9. Lenses: SaaS, Serverless, Analytics

AWS Well-Architected Lenses extend the core framework with domain-specific questions and best practices. The lenses most relevant to modern cloud-native teams are:

Serverless Application Lens

Adds 58 additional questions covering Lambda function design, event-driven architectures, API Gateway patterns, Step Functions state machine design, and DynamoDB access patterns. Key additions over the core framework:

Function payload size and cold start optimisation
Event-driven error handling (DLQs, idempotency)
Step Functions vs SQS vs EventBridge for orchestration/choreography trade-offs
DynamoDB single-table design validation

SaaS Lens

Adds 45 questions specific to multi-tenant SaaS architectures: tenant isolation models (silo, pool, bridge), per-tenant metering, tenant-aware deployment pipelines, and onboarding automation. Particularly important if you are building a B2B product where one tenant's data or compute must never affect another.

Analytics Lens

Adds questions covering data lake architecture (S3, Glue, Athena, Redshift), streaming pipelines (Kinesis, MSK), data quality controls, and cost optimisation specific to data workloads (Redshift Reserved Nodes, Athena query optimisation, Glue DPU right-sizing).

# Associate additional lenses with a workload
# (aliases shown are the commonly documented ones; confirm them with list-lenses below)
aws wellarchitected associate-lenses \
  --workload-id $WORKLOAD_ID \
  --lens-aliases serverless softwareasaservice

# List available lenses
aws wellarchitected list-lenses \
  --lens-type AWS_OFFICIAL \
  --query 'LensSummaries[*].[LensAlias,LensName,LensVersion]' \
  --output table

10. Implementing Improvements: Prioritisation & IaC Remediation

A Well-Architected review produces a list of findings. Without a structured process for remediation, reports accumulate and nothing gets fixed. Use this prioritisation matrix to decide what to fix first:

Priority	Risk Level	Pillar	Typical Remediation Timeline
P1	HRI	Security, Reliability	Within 2 weeks; assign dedicated engineer
P2	HRI	Operational Excellence, Performance Efficiency	Within 4 weeks; add to sprint backlog
P3	MRI	Any pillar except Cost Optimization and Sustainability	Within next quarter; add to roadmap
P4	HRI/MRI	Cost Optimization, Sustainability	Quarterly FinOps review cycle
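
The matrix can be encoded as a small lookup so that every finding gets a priority the same way. This Python sketch is one possible reading of the table, not an official algorithm: it assumes the risk values HIGH and MEDIUM and the pillar IDs of the core lens, and it assumes that Cost and Sustainability findings always go to P4, even when they are MRIs. The findings listed are hypothetical.

```python
URGENT = {"security", "reliability"}
DEFERRED = {"costOptimization", "sustainability"}


def remediation_priority(risk, pillar_id):
    """Map a finding to P1..P4 following the matrix; return None when it is not a risk."""
    if risk not in ("HIGH", "MEDIUM"):
        return None  # NONE, UNANSWERED, NOT_APPLICABLE fall outside the matrix
    if pillar_id in DEFERRED:
        return "P4"  # assumption: the FinOps cycle takes precedence for these pillars
    if risk == "MEDIUM":
        return "P3"
    return "P1" if pillar_id in URGENT else "P2"


# Hypothetical findings: (title, risk, pillar ID)
findings = [
    ("Right-size instances", "HIGH", "costOptimization"),
    ("Runbooks for incidents", "HIGH", "operationalExcellence"),
    ("Root user MFA", "HIGH", "security"),
    ("Caching strategy", "MEDIUM", "performance"),
]

# Print the backlog in priority order, P1 first.
for title, risk, pillar in sorted(findings, key=lambda f: remediation_priority(f[1], f[2])):
    print(remediation_priority(risk, pillar), title)
```

Checking the deferred pillars first keeps the function a plain top-to-bottom rule list, which is easy to adjust if your organisation ranks cost findings higher.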

Terraform Remediation Examples

Most architectural improvements can be delivered as Terraform or CDK changes. The following examples fix common HRIs:

# HRI Fix: Enable RDS Multi-AZ (Reliability pillar)
# Assumes aws_kms_key.rds is defined elsewhere in the configuration.
resource "aws_db_instance" "orders" {
  identifier        = "orders-db"
  engine            = "postgres"
  engine_version    = "16.2"
  instance_class    = "db.t4g.medium" # Graviton: cost + sustainability win
  allocated_storage = 100

  # Placeholder master user; RDS generates the password and keeps it in Secrets Manager
  username                    = "orders_admin"
  manage_master_user_password = true

  multi_az            = true # Was: false (HRI fixed)
  storage_encrypted   = true # Was: false (Security HRI fixed); forces replacement of an unencrypted instance
  kms_key_id          = aws_kms_key.rds.arn
  deletion_protection = true # Operational Excellence HRI fixed
  skip_final_snapshot = false

  # Static name: timestamp() is re-evaluated on every run and would show a diff on every plan
  final_snapshot_identifier = "orders-db-final"

  backup_retention_period = 30 # Was: 7 (Reliability MRI improved)
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"] # OE improvement

  tags = {
    cost-centre = "platform"
    environment = "production"
    service     = "orders"
  }
}

# HRI Fix: Enable S3 versioning and encryption (Reliability + Security)
resource "aws_s3_bucket" "data" {
  bucket = "company-orders-data"
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.arn
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_public_access_block" "data" {
  bucket                  = aws_s3_bucket.data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# HRI Fix: Enable CloudTrail in all regions with integrity validation
resource "aws_cloudtrail" "org_trail" {
  name                          = "org-audit-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  is_multi_region_trail         = true         # Was: false — HRI fixed
  enable_log_file_validation    = true         # Integrity validation
  include_global_service_events = true
  kms_key_id                    = aws_kms_key.cloudtrail.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3:::company-orders-data/"]
    }
  }
}

Track Improvements Over Time

Create a milestone in the Well-Architected Tool after completing each remediation sprint. Milestones snapshot the current risk state, so comparing them quarter over quarter shows stakeholders how the architecture has matured.

# Create quarterly milestone after remediations
aws wellarchitected create-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-name "Q3-2026-After-Sprint-22"

# Compare risk counts between milestones
aws wellarchitected get-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-number 1 \
  --query 'Milestone.Workload.RiskCounts'

aws wellarchitected get-milestone \
  --workload-id $WORKLOAD_ID \
  --milestone-number 2 \
  --query 'Milestone.Workload.RiskCounts'
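
Reading two sets of counts side by side gets tedious, so the comparison can be reduced to a per-level delta. The sketch below is a minimal Python illustration, assuming each input is a RiskCounts map as returned by the commands above. The numbers are hypothetical.

```python
def risk_delta(before, after, levels=("HIGH", "MEDIUM")):
    """Return after minus before for each risk level; negative means fewer open risks."""
    return {level: after.get(level, 0) - before.get(level, 0) for level in levels}


# Hypothetical RiskCounts from milestone 1 and milestone 2.
milestone_1 = {"HIGH": 28, "MEDIUM": 13, "NONE": 16, "UNANSWERED": 0}
milestone_2 = {"HIGH": 17, "MEDIUM": 15, "NONE": 25, "UNANSWERED": 0}

for level, change in risk_delta(milestone_1, milestone_2).items():
    print(f"{level}: {milestone_1[level]} -> {milestone_2[level]} ({change:+d})")
```

To feed real data into it, add `--output json` to each get-milestone call, save the result to a file, and load it with `json.load`. A rise in MRIs alongside a drop in HRIs, as in this sample, often just means HRIs were partially remediated.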

11. Frequently Asked Questions

Q: Is the Well-Architected Tool free?
Yes. The Well-Architected Tool itself is free. You pay for the AWS resources you use to remediate findings.

Q: How long does a Well-Architected Review take?
A self-service review of a single workload against all six pillars typically takes 2–4 hours for an experienced architect. A formal AWS-facilitated review with a Solutions Architect takes 4–8 hours spread across multiple sessions.

Q: Do I need to be Well-Architected to go live?
No. The framework describes best practices, not requirements. However, many organisations include a WAR in their production readiness reviews. A sensible approach is to run a baseline review before launch and resolve all Security and Reliability HRIs before go-live.

Q: Which pillar should I start with?
Start with Security and Reliability. A breach or outage causes far more business damage than inefficient spend or suboptimal performance. Once those two pillars are green, move on to Operational Excellence and Performance Efficiency, then Cost Optimization and Sustainability, which matches the prioritisation matrix in Section 10.

Q: Can I create custom lenses?
Yes. You can create custom lenses via the Well-Architected Tool API or console. Custom lenses use the same JSON schema as AWS lenses and support custom questions, choices, and risk rules. This is useful for enforcing company-specific standards across all workloads.
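
As a rough illustration, the Python sketch below builds a one-question custom lens and runs two light sanity checks on it. The field names follow the custom lens JSON format as I understand it from the Well-Architected Tool documentation. The lens name, IDs, question text, and the `check_lens` helper are all hypothetical, so verify the structure against the official format specification before importing anything.

```python
import json

lens = {
    "schemaVersion": "2021-11-01",
    "name": "Company Platform Standards",  # hypothetical lens name
    "description": "Internal standards every production workload must meet.",
    "pillars": [
        {
            "id": "platform_security",
            "name": "Platform Security",
            "questions": [
                {
                    "id": "ps_q1",
                    "title": "How do you manage secrets?",
                    "description": "Covers storage and rotation of credentials.",
                    "choices": [
                        {
                            "id": "ps_q1_store",
                            "title": "Secrets are stored in a managed secrets service",
                            "improvementPlan": {"displayText": "Move secrets out of code."},
                        },
                        {
                            "id": "ps_q1_rotate",
                            "title": "Secrets are rotated automatically",
                            "improvementPlan": {"displayText": "Enable scheduled rotation."},
                        },
                    ],
                    # Both choices selected: no risk. Only storage: MRI. Anything else: HRI.
                    "riskRules": [
                        {"condition": "ps_q1_store && ps_q1_rotate", "risk": "NO_RISK"},
                        {"condition": "ps_q1_store", "risk": "MEDIUM_RISK"},
                        {"condition": "default", "risk": "HIGH_RISK"},
                    ],
                }
            ],
        }
    ],
}


def check_lens(doc):
    """Light sanity checks before import: unique choice IDs and a default risk rule."""
    problems = []
    for pillar in doc["pillars"]:
        for question in pillar["questions"]:
            ids = [choice["id"] for choice in question["choices"]]
            if len(ids) != len(set(ids)):
                problems.append(f'{question["id"]}: duplicate choice IDs')
            if not any(rule["condition"] == "default" for rule in question["riskRules"]):
                problems.append(f'{question["id"]}: no default risk rule')
    return problems


print(check_lens(lens))  # empty list when both checks pass
print(json.dumps(lens, indent=2))
```

Saving the printed JSON to a file and passing it to `aws wellarchitected import-lens --json-string file://company-lens.json` should create a draft lens. The file name here is a placeholder.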