AWS SNS to SQS Fan-Out Pattern: Event-Driven Architecture at Scale (2026)
The SNS-to-SQS fan-out pattern is the backbone of scalable, loosely coupled event-driven systems on AWS. One SNS topic fans out a single event to dozens of independent SQS queues — each consumed by a different microservice — with zero coupling between producers and consumers. This guide goes beyond the basics: you'll get subscription filter policies for intelligent message routing, Lambda consumer tuning, dead-letter queue automation, cross-account fan-out, full observability with CloudWatch and X-Ray, and production-grade code in Python boto3, Terraform, and Java Spring Cloud AWS.
Table of Contents
- 1. The Fan-Out Pattern — Why It Solves Tight Coupling
- 2. SNS Deep-Dive: Topics, Subscriptions, Message Attributes
- 3. SQS Deep-Dive: Standard vs FIFO, Visibility Timeout, DLQ
- 4. Fan-Out Setup: CLI + Terraform
- 5. Message Filtering — Attribute-Based Routing
- 6. Lambda Consumers — Event Source Mapping and Error Handling
- 7. Producer + Consumer Code: Python boto3 and Java Spring
- 8. Advanced Patterns: Fan-In, Cross-Account, Cross-Region
- 9. Dead Letter Queues — Setup, Monitoring, Auto-Reprocessing
- 10. Observability — CloudWatch Metrics and X-Ray Tracing
1. The Fan-Out Pattern — Why It Solves Tight Coupling
In a tightly coupled system, when an order is placed the Order Service directly calls the Inventory Service, the Notification Service, the Analytics Service, the Fraud Detection Service, and the Loyalty Points Service — one by one, synchronously. If any downstream service is slow or unavailable, the entire order placement fails or slows to a crawl. Adding a sixth downstream consumer requires editing the Order Service code and redeploying it.
The fan-out pattern eliminates this coupling entirely. The Order Service publishes one event to a single SNS topic. SNS delivers that same event in parallel to N independent SQS queues — one per consuming service — without the producer knowing or caring who is listening. Each consumer processes at its own pace, with its own concurrency and scaling rules. Adding a new downstream service means creating a new SQS queue and subscribing it to the topic; the Order Service is never touched.
Real E-Commerce Order Processing Fan-Out
Consider this production architecture at an e-commerce company processing 500,000 orders per day:
Order Service (producer)
│
▼
SNS Topic: orders-events
│
┌────┼────────────────────────────────────────┐
│ │ │
▼ ▼ ▼
SQS: SQS: SQS:
inventory notifications analytics
│ │ │
▼ ▼ ▼
Lambda Lambda ECS Task
(update (send email/ (clickstream
stock) push/SMS) pipeline)
Also subscribed:
SQS: fraud-detection → Lambda (ML scoring)
SQS: loyalty-points → Lambda (award points)
SQS: erp-sync → ECS (SAP integration)
Each queue is independent. The analytics queue can be a standard SQS queue with high throughput. The loyalty-points queue can be FIFO to ensure a customer never gets double-awarded. The fraud-detection queue has a 30-second visibility timeout because the ML model takes time. None of these constraints affect any other service.
The fan-out pattern also gives you natural circuit breaking. If the ERP sync service goes down for 4 hours, messages accumulate in its SQS queue (retained for up to 14 days by default). When the service recovers, it drains the backlog without any re-publishing from the Order Service. No events are lost.
When to Use Fan-Out
| Scenario | Use Fan-Out? | Reasoning |
|---|---|---|
| Multiple services need the same event | Yes | Core use case — avoids N point-to-point integrations |
| Consumers process at different speeds | Yes | Each SQS queue buffers independently |
| New consumers added frequently | Yes | Subscribe new SQS queue, no producer changes |
| Only one consumer ever | No | Direct SQS publish is simpler |
| Request-response needed | No | Use SQS with replyTo pattern instead |
2. SNS Deep-Dive: Topics, Subscriptions, Message Attributes
Amazon SNS is a pub/sub messaging service that pushes messages to subscribers in milliseconds. For the fan-out pattern, the critical concepts are standard topics (for SQS fan-out), FIFO topics (for ordered fan-out), message attributes (for filter policies), and the raw message delivery option.
Topic Types
SNS offers two topic types. Standard topics deliver messages with best-effort ordering and at-least-once delivery, supporting up to 300 million subscriptions and unlimited throughput. FIFO topics guarantee strict ordering per message group and exactly-once delivery to SQS FIFO queues — useful for financial events where ordering matters, at the cost of 300 TPS per API action. For most fan-out patterns, standard topics are the right choice.
Message Attributes
Message attributes are key-value pairs you attach to an SNS message. They travel alongside the message body and are the mechanism behind subscription filter policies — SNS evaluates them before deciding whether to deliver a message to a given SQS subscriber. Each message can carry up to 10 attributes of types String, Number, Binary, or String.Array.
# Publish with message attributes (CLI)
aws sns publish \
--topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
--message '{"orderId":"ORD-9821","total":249.99,"region":"us-east","type":"PREMIUM"}' \
--message-attributes '{
"orderType": {
"DataType": "String",
"StringValue": "PREMIUM"
},
"region": {
"DataType": "String",
"StringValue": "us-east"
},
"totalAmount": {
"DataType": "Number",
"StringValue": "249.99"
}
}'
Raw Message Delivery
By default, SNS wraps the original message body in a JSON envelope when delivering to SQS. The SQS message body looks like this:
{
"Type": "Notification",
"MessageId": "a7c3f8d2-...",
"TopicArn": "arn:aws:sns:us-east-1:123456789012:orders-events",
"Subject": null,
"Message": "{\"orderId\":\"ORD-9821\",\"total\":249.99}",
"Timestamp": "2026-06-08T10:30:00.000Z",
"SignatureVersion": "1",
"Signature": "...",
"MessageAttributes": { ... }
}
If you enable Raw Message Delivery on the subscription, the SQS message body contains exactly what you published — no envelope. This simplifies consumer code and reduces message size. Enable it unless you need the SNS metadata (timestamps, topic ARN) in the consumer.
# Enable raw message delivery on a subscription
aws sns set-subscription-attributes \
--subscription-arn arn:aws:sns:us-east-1:123456789012:orders-events:abc123 \
--attribute-name RawMessageDelivery \
--attribute-value true
SQS Queue Policy for SNS
For SNS to send messages to an SQS queue, the queue must have a resource-based policy allowing SNS to call sqs:SendMessage. Without this, deliveries silently fail.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Allow-SNS-SendMessage",
"Effect": "Allow",
"Principal": {
"Service": "sns.amazonaws.com"
},
"Action": "sqs:SendMessage",
"Resource": "arn:aws:sqs:us-east-1:123456789012:inventory-queue",
"Condition": {
"ArnEquals": {
"aws:SourceArn": "arn:aws:sns:us-east-1:123456789012:orders-events"
}
}
}
]
}
3. SQS Deep-Dive: Standard vs FIFO, Visibility Timeout, Long Polling
In a fan-out pattern, each SQS queue is an independent buffer. Understanding the tuning knobs for each queue — visibility timeout, message retention, long polling, receive message wait time — is essential for production reliability.
Visibility Timeout
When a consumer reads a message from SQS, the message becomes invisible to other consumers for the visibility timeout duration. If the consumer successfully processes and deletes the message before the timeout expires, it is gone. If the consumer crashes or takes too long, the message reappears in the queue and another consumer picks it up. Set the visibility timeout to at least 6× your average processing time to avoid duplicate processing due to slow consumers.
# Set visibility timeout to 120 seconds
aws sqs set-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue \
--attributes VisibilityTimeout=120
# Extend visibility timeout mid-processing (Python)
sqs.change_message_visibility(
QueueUrl=queue_url,
ReceiptHandle=receipt_handle,
VisibilityTimeout=120 # reset the clock
)
Long Polling
Short polling (WaitTimeSeconds=0) queries a random subset of SQS servers and returns immediately — even if the queue is empty — causing empty responses that cost money and waste CPU cycles. Long polling (WaitTimeSeconds=1–20) waits up to 20 seconds for a message to arrive before returning. Always use long polling in production.
aws sqs set-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue \
--attributes ReceiveMessageWaitTimeSeconds=20
Standard vs FIFO for Fan-Out
| Queue Type | Best For (in fan-out) | Limitations |
|---|---|---|
| Standard | Analytics, notifications, logging — high throughput, ordering irrelevant | At-least-once; implement idempotency in consumer |
| FIFO | Inventory updates, ledger entries — must process in order per entity | 300 TPS; SNS topic must also be FIFO |
Dead-Letter Queue Setup
Every SQS queue in a fan-out should have a dedicated DLQ. Set maxReceiveCount to 3–5: after that many failed delivery attempts, the message moves to the DLQ instead of being deleted. Without a DLQ, a poison-pill message loops forever and blocks your consumer.
# Create DLQ first
aws sqs create-queue --queue-name inventory-queue-dlq
# Get DLQ ARN
DLQ_ARN=$(aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue-dlq \
--attribute-names QueueArn \
--query 'Attributes.QueueArn' --output text)
# Attach DLQ to main queue
aws sqs set-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue \
--attributes '{
"RedrivePolicy": "{\"deadLetterTargetArn\":\"'"$DLQ_ARN"'\",\"maxReceiveCount\":\"3\"}"
}'
4. Fan-Out Setup: CLI + Terraform
Let's build the full fan-out topology: one SNS topic, three SQS queues (inventory, notifications, analytics), three DLQs, and three SNS subscriptions.
CLI Setup
#!/bin/bash
REGION="us-east-1"
ACCOUNT="123456789012"
# 1. Create SNS topic
TOPIC_ARN=$(aws sns create-topic \
--name orders-events \
--region $REGION \
--query 'TopicArn' --output text)
echo "SNS Topic: $TOPIC_ARN"
# 2. Create queues + DLQs
for SERVICE in inventory notifications analytics; do
# Create DLQ
aws sqs create-queue --queue-name "${SERVICE}-queue-dlq" --region $REGION
DLQ_ARN="arn:aws:sqs:${REGION}:${ACCOUNT}:${SERVICE}-queue-dlq"
# Create main queue with DLQ
QUEUE_URL=$(aws sqs create-queue \
--queue-name "${SERVICE}-queue" \
--attributes '{
"VisibilityTimeout":"60",
"ReceiveMessageWaitTimeSeconds":"20",
"MessageRetentionPeriod":"86400",
"RedrivePolicy":"{\"deadLetterTargetArn\":\"'"$DLQ_ARN"'\",\"maxReceiveCount\":\"3\"}"
}' \
--region $REGION \
--query 'QueueUrl' --output text)
QUEUE_ARN="arn:aws:sqs:${REGION}:${ACCOUNT}:${SERVICE}-queue"
# 3. Grant SNS permission to send to this queue
aws sqs set-queue-attributes \
--queue-url "$QUEUE_URL" \
--attributes '{
"Policy": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"sns.amazonaws.com\"},\"Action\":\"sqs:SendMessage\",\"Resource\":\"'"$QUEUE_ARN"'\",\"Condition\":{\"ArnEquals\":{\"aws:SourceArn\":\"'"$TOPIC_ARN"'\"}}}]}"
}'
# 4. Subscribe queue to SNS topic
aws sns subscribe \
--topic-arn "$TOPIC_ARN" \
--protocol sqs \
--notification-endpoint "$QUEUE_ARN" \
--attributes RawMessageDelivery=true \
--region $REGION
echo "Created: ${SERVICE}-queue → subscribed to SNS"
done
Terraform
variable "region" { default = "us-east-1" }
variable "services" { default = ["inventory", "notifications", "analytics"] }
provider "aws" { region = var.region }
# SNS Topic
resource "aws_sns_topic" "orders_events" {
name = "orders-events"
}
# DLQs
resource "aws_sqs_queue" "dlq" {
for_each = toset(var.services)
name = "${each.key}-queue-dlq"
message_retention_seconds = 1209600 # 14 days
}
# Main queues
resource "aws_sqs_queue" "main" {
for_each = toset(var.services)
name = "${each.key}-queue"
visibility_timeout_seconds = 60
receive_wait_time_seconds = 20
message_retention_seconds = 86400
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.dlq[each.key].arn
maxReceiveCount = 3
})
}
# Queue policies (allow SNS to send)
resource "aws_sqs_queue_policy" "allow_sns" {
for_each = toset(var.services)
queue_url = aws_sqs_queue.main[each.key].id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "sns.amazonaws.com" }
Action = "sqs:SendMessage"
Resource = aws_sqs_queue.main[each.key].arn
Condition = {
ArnEquals = {
"aws:SourceArn" = aws_sns_topic.orders_events.arn
}
}
}]
})
}
# SNS subscriptions
resource "aws_sns_topic_subscription" "fan_out" {
for_each = toset(var.services)
topic_arn = aws_sns_topic.orders_events.arn
protocol = "sqs"
endpoint = aws_sqs_queue.main[each.key].arn
raw_message_delivery = true
}
output "topic_arn" { value = aws_sns_topic.orders_events.arn }
output "queue_urls" { value = { for s in var.services : s => aws_sqs_queue.main[s].url } }
5. Message Filtering — Attribute-Based Routing
Without filter policies, every SQS subscriber receives every message published to the SNS topic. That wastes compute and forces consumers to discard irrelevant events. SNS subscription filter policies let you route messages to specific queues based on message attributes — think of it as server-side content-based routing with zero consumer code changes.
Filter Policy Syntax
A filter policy is a JSON object where each key is a message attribute name and the value is an array of accepted values (exact match, prefix match, or numeric range).
# Only deliver to inventory-queue if orderType is STANDARD or PREMIUM
{
"orderType": ["STANDARD", "PREMIUM"]
}
# Only deliver to fraud-queue if totalAmount >= 500 (numeric range)
{
"totalAmount": [{"numeric": [">=", 500]}]
}
# Route EU orders to eu-analytics-queue (prefix match)
{
"region": [{"prefix": "eu-"}]
}
# Complex: PREMIUM orders in us-east OR eu-west
{
"orderType": ["PREMIUM"],
"region": ["us-east", "eu-west"]
} # AND logic across keys, OR logic within arrays
Applying a Filter Policy via CLI
# Apply filter to the fraud-detection subscription
# Get the subscription ARN first
SUB_ARN=$(aws sns list-subscriptions-by-topic \
--topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
--query "Subscriptions[?Endpoint=='arn:aws:sqs:us-east-1:123456789012:fraud-queue'].SubscriptionArn" \
--output text)
# Set filter: only high-value orders (>= $500)
aws sns set-subscription-attributes \
--subscription-arn "$SUB_ARN" \
--attribute-name FilterPolicy \
--attribute-value '{"totalAmount":[{"numeric":[">=",500]}]}'
# Analytics gets everything — no filter needed (default)
Real-World Routing Matrix
| SQS Queue | Filter Policy | Receives |
|---|---|---|
| inventory-queue | {"orderType":["STANDARD","PREMIUM","SUBSCRIPTION"]} | All paid orders (not REFUND events) |
| notifications-queue | (none) | All events — sends confirmation email regardless |
| fraud-queue | {"totalAmount":[{"numeric":[">=",500]}]} | High-value orders only |
| eu-compliance-queue | {"region":[{"prefix":"eu-"}]} | EU orders for GDPR audit trail |
| loyalty-queue | {"orderType":["STANDARD","PREMIUM"]} | Eligible orders (not refunds, not subscriptions) |
{"orderType":["STANDARD","PREMIUM",{"exists":false}]} to also match messages where the attribute is absent.
Terraform Filter Policies
resource "aws_sns_topic_subscription" "fraud_detection" {
topic_arn = aws_sns_topic.orders_events.arn
protocol = "sqs"
endpoint = aws_sqs_queue.main["fraud"].arn
raw_message_delivery = true
filter_policy = jsonencode({
totalAmount = [{ numeric = [">=", 500] }]
})
}
resource "aws_sns_topic_subscription" "eu_compliance" {
topic_arn = aws_sns_topic.orders_events.arn
protocol = "sqs"
endpoint = aws_sqs_queue.main["eu-compliance"].arn
raw_message_delivery = true
filter_policy = jsonencode({
region = [{ prefix = "eu-" }]
})
}
6. Lambda Consumers — Event Source Mapping and Error Handling
Lambda is the most common consumer for SQS queues in a fan-out pattern. Lambda's SQS event source mapping polls the queue on your behalf, invokes your function with batches of messages, and handles deletion of successfully processed messages automatically.
Event Source Mapping Configuration
# Create event source mapping: Lambda processes inventory-queue
aws lambda create-event-source-mapping \
--function-name inventory-processor \
--event-source-arn arn:aws:sqs:us-east-1:123456789012:inventory-queue \
--batch-size 10 \
--maximum-batching-window-in-seconds 5 \
--function-response-types ReportBatchItemFailures
Batch Size Tuning
The batch size controls how many SQS messages are passed to a single Lambda invocation. A batch size of 10 with a batching window of 5 seconds means Lambda waits up to 5 seconds to accumulate up to 10 messages before invoking your function — reducing invocation costs for bursty traffic. For CPU-intensive processing, use batch size 1. For I/O-bound work (database writes, API calls), use 10–100.
ReportBatchItemFailures — Partial Batch Handling
Without ReportBatchItemFailures, if any message in the batch fails, the entire batch is retried — including the messages that succeeded. This causes duplicate processing. With ReportBatchItemFailures enabled, your Lambda function returns a batchItemFailures response indicating only the failed message IDs, and only those messages return to the queue.
import json
import boto3
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def handler(event, context):
"""
Lambda consumer for inventory-queue.
Uses ReportBatchItemFailures for partial failure handling.
"""
failed_message_ids = []
for record in event['Records']:
message_id = record['messageId']
try:
body = json.loads(record['body'])
process_inventory_update(body)
logger.info(f"Processed order {body['orderId']}")
except Exception as e:
logger.error(f"Failed to process {message_id}: {e}")
failed_message_ids.append({"itemIdentifier": message_id})
# Only failed messages will be retried
return {
"batchItemFailures": failed_message_ids
}
def process_inventory_update(order: dict):
"""Update inventory for each line item in the order."""
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Inventory')
for item in order.get('lineItems', []):
table.update_item(
Key={'productId': item['productId']},
UpdateExpression='SET stock = stock - :qty',
ExpressionAttributeValues={':qty': item['quantity']},
ConditionExpression='stock >= :qty'
)
Lambda DLQ for Async Invocations
Note that the SQS queue already has a DLQ. When Lambda is triggered via event source mapping (synchronous invocation by SQS), failed messages go to the SQS DLQ after maxReceiveCount retries — not to the Lambda DLQ. Configure the SQS DLQ, not the Lambda DLQ, for fan-out consumers.
7. Producer + Consumer Code: Python boto3 and Java Spring Cloud AWS
Python boto3 — Producer (Order Service)
import boto3
import json
import uuid
from datetime import datetime
from typing import Optional
class OrderEventProducer:
"""Publishes order events to SNS for fan-out distribution."""
def __init__(self, topic_arn: str, region: str = "us-east-1"):
self.sns = boto3.client('sns', region_name=region)
self.topic_arn = topic_arn
def publish_order_placed(
self,
order_id: str,
customer_id: str,
total_amount: float,
order_type: str,
region: str,
line_items: list
) -> str:
"""Publish an OrderPlaced event to SNS."""
payload = {
"eventType": "OrderPlaced",
"eventId": str(uuid.uuid4()),
"timestamp": datetime.utcnow().isoformat() + "Z",
"orderId": order_id,
"customerId": customer_id,
"totalAmount": total_amount,
"orderType": order_type,
"region": region,
"lineItems": line_items
}
response = self.sns.publish(
TopicArn=self.topic_arn,
Message=json.dumps(payload),
MessageAttributes={
"eventType": {
"DataType": "String",
"StringValue": "OrderPlaced"
},
"orderType": {
"DataType": "String",
"StringValue": order_type
},
"region": {
"DataType": "String",
"StringValue": region
},
"totalAmount": {
"DataType": "Number",
"StringValue": str(total_amount)
}
}
)
message_id = response['MessageId']
print(f"Published OrderPlaced event {message_id} for order {order_id}")
return message_id
# Usage
if __name__ == "__main__":
producer = OrderEventProducer(
topic_arn="arn:aws:sns:us-east-1:123456789012:orders-events"
)
producer.publish_order_placed(
order_id="ORD-2026-9821",
customer_id="CUST-4471",
total_amount=749.95,
order_type="PREMIUM",
region="us-east",
line_items=[
{"productId": "SKU-001", "quantity": 2, "price": 299.99},
{"productId": "SKU-042", "quantity": 1, "price": 149.97}
]
)
Python boto3 — Consumer (Notifications Service, polling SQS directly)
import boto3
import json
import time
import logging
logger = logging.getLogger(__name__)
class NotificationConsumer:
"""Polls SQS for order events and sends customer notifications."""
def __init__(self, queue_url: str, region: str = "us-east-1"):
self.sqs = boto3.client('sqs', region_name=region)
self.ses = boto3.client('ses', region_name=region)
self.queue_url = queue_url
def run(self):
logger.info(f"Starting consumer for {self.queue_url}")
while True:
messages = self.sqs.receive_message(
QueueUrl=self.queue_url,
MaxNumberOfMessages=10,
WaitTimeSeconds=20, # long polling
AttributeNames=['All'],
MessageAttributeNames=['All']
).get('Messages', [])
for msg in messages:
try:
self._process(msg)
self.sqs.delete_message(
QueueUrl=self.queue_url,
ReceiptHandle=msg['ReceiptHandle']
)
except Exception as e:
logger.error(f"Failed to process {msg['MessageId']}: {e}")
# Do NOT delete — message will reappear after visibility timeout
# After maxReceiveCount retries it goes to DLQ automatically
def _process(self, msg: dict):
order = json.loads(msg['Body']) # raw message delivery enabled
order_id = order['orderId']
customer_id = order['customerId']
# Idempotency check (DynamoDB or Redis)
if self._already_notified(order_id):
logger.info(f"Duplicate notification skipped for {order_id}")
return
self._send_confirmation_email(order)
self._mark_notified(order_id)
logger.info(f"Sent confirmation for order {order_id} to customer {customer_id}")
def _already_notified(self, order_id: str) -> bool:
# Check DynamoDB idempotency table
return False # simplified
def _mark_notified(self, order_id: str):
pass # Write to DynamoDB
def _send_confirmation_email(self, order: dict):
pass # Call SES
Java Spring Cloud AWS — Consumer
// pom.xml dependency
// <dependency>
// <groupId>io.awspring.cloud</groupId>
// <artifactId>spring-cloud-aws-starter-sqs</artifactId>
// <version>3.1.1</version>
// </dependency>
package com.techoral.order.consumer;
import io.awspring.cloud.sqs.annotation.SqsListener;
import io.awspring.cloud.sqs.listener.acknowledgement.Acknowledgement;
import org.springframework.stereotype.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;
@Component
public class InventoryConsumer {
private static final Logger log = LoggerFactory.getLogger(InventoryConsumer.class);
private final InventoryService inventoryService;
private final ObjectMapper objectMapper;
public InventoryConsumer(InventoryService inventoryService, ObjectMapper objectMapper) {
this.inventoryService = inventoryService;
this.objectMapper = objectMapper;
}
/**
* @SqsListener polls inventory-queue automatically.
* Spring Cloud AWS handles polling, threading, and deletion on success.
* Throw any exception to trigger redelivery (message stays in queue).
*/
@SqsListener(value = "${aws.sqs.inventory-queue-url}",
maxConcurrentMessages = "5",
maxMessagesPerPoll = "10")
public void handleOrderPlaced(String rawMessage, Acknowledgement acknowledgement) {
try {
Map<String, Object> order = objectMapper.readValue(
rawMessage, new com.fasterxml.jackson.core.type.TypeReference<>() {}
);
String orderId = (String) order.get("orderId");
log.info("Processing inventory for order: {}", orderId);
inventoryService.reserveStock(order);
acknowledgement.acknowledge(); // explicit ack — deletes from SQS
} catch (InsufficientStockException e) {
log.warn("Insufficient stock — sending to compensating flow: {}", e.getMessage());
// Publish compensating event to SNS (saga pattern)
acknowledgement.acknowledge(); // ack so it doesn't retry
} catch (Exception e) {
log.error("Transient failure — message will retry: {}", e.getMessage());
// Do NOT acknowledge — SQS will redeliver after visibility timeout
throw new RuntimeException("Transient failure", e);
}
}
}
// application.yml
// aws:
// sqs:
// inventory-queue-url: https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue
// region:
// static: us-east-1
8. Advanced Patterns: Fan-In, Cross-Account, Cross-Region
Fan-Out + Fan-In (Aggregation Pattern)
Fan-out distributes work; fan-in collects results. In an order processing pipeline, the Order Service fans out to five processors, each writes its result to a dedicated SQS result queue, and an aggregator Lambda polls all five queues and assembles the final order status when all processors report back.
# Fan-in with Step Functions (preferred approach)
# Step Functions Map state runs 5 Lambda functions in parallel
# and waits for all results before continuing to the aggregation state
{
"Comment": "Fan-out order processing with fan-in",
"StartAt": "PublishToSNS",
"States": {
"PublishToSNS": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "arn:aws:sns:us-east-1:123456789012:orders-events",
"Message.$": "States.JsonToString($.order)"
},
"Next": "WaitForProcessors"
},
"WaitForProcessors": {
"Type": "Wait",
"Seconds": 30,
"Next": "AggregateResults"
},
"AggregateResults": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:order-aggregator",
"End": true
}
}
}
Cross-Account Fan-Out
In enterprises, different AWS accounts own different services (security best practice). SNS can fan out to SQS queues in different AWS accounts. The SQS queue in the target account must allow the source SNS topic in its resource-based policy.
# Target account (Account B) SQS queue policy
# Allows Account A's SNS topic to send messages
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CrossAccountSNSFanOut",
"Effect": "Allow",
"Principal": {
"Service": "sns.amazonaws.com"
},
"Action": "sqs:SendMessage",
"Resource": "arn:aws:sqs:us-east-1:ACCOUNT_B:compliance-queue",
"Condition": {
"ArnEquals": {
"aws:SourceArn": "arn:aws:sns:us-east-1:ACCOUNT_A:orders-events"
}
}
}
]
}
# Subscribe Account B's SQS queue from Account A's SNS topic
# (run from Account A with cross-account permissions)
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:ACCOUNT_A:orders-events \
--protocol sqs \
--notification-endpoint arn:aws:sqs:us-east-1:ACCOUNT_B:compliance-queue \
--attributes RawMessageDelivery=true
Signature field that references the SNS topic in Account A — if Account B consumers try to validate the signature, they need network access to Account A's SNS API. Raw delivery sidesteps this entirely.
Cross-Region Fan-Out
SNS can deliver to SQS queues in a different AWS region. This is useful for disaster recovery and multi-region active-active setups. Latency increases (typically 50–200ms cross-region), and data sovereignty regulations may restrict message content crossing regions.
# Subscribe a queue in eu-west-1 to a topic in us-east-1
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
--protocol sqs \
--notification-endpoint arn:aws:sqs:eu-west-1:123456789012:eu-orders-queue \
--region us-east-1
9. Dead Letter Queues — Setup, Monitoring, and Auto-Reprocessing
A DLQ is not a graveyard — it is a quarantine zone. Messages in DLQs represent real business failures: a database was unreachable, a third-party API returned a 500, a message was malformed. Every DLQ needs an alert and, for recoverable failures, an automated reprocessing mechanism.
CloudWatch Alarm on DLQ Depth
aws cloudwatch put-metric-alarm \
--alarm-name "inventory-dlq-not-empty" \
--alarm-description "Inventory DLQ has messages — investigate immediately" \
--metric-name ApproximateNumberOfMessagesVisible \
--namespace AWS/SQS \
--dimensions Name=QueueName,Value=inventory-queue-dlq \
--statistic Sum \
--period 60 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
Terraform DLQ Alarm (all fan-out DLQs)
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
for_each = toset(var.services)
alarm_name = "${each.key}-dlq-not-empty"
alarm_description = "${each.key} DLQ has messages — investigate"
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
statistic = "Sum"
period = 60
evaluation_periods = 1
threshold = 1
comparison_operator = "GreaterThanOrEqualToThreshold"
treat_missing_data = "notBreaching"
dimensions = {
QueueName = "${each.key}-queue-dlq"
}
alarm_actions = [aws_sns_topic.ops_alerts.arn]
}
Automated DLQ Reprocessing Lambda
For transient failures (downstream service temporarily unavailable), automate DLQ message replay. This Lambda reads from a DLQ and republishes messages to the original queue, with exponential back-off between batches.
import boto3
import json
import time
import os
import logging
logger = logging.getLogger(__name__)
sqs = boto3.client('sqs')
DLQ_URL = os.environ['DLQ_URL']
TARGET_URL = os.environ['TARGET_QUEUE_URL']
BATCH_SIZE = int(os.environ.get('BATCH_SIZE', '10'))
def handler(event, context):
"""
Replays messages from DLQ back to the main queue.
Triggered by CloudWatch Events (schedule) or manual invocation.
"""
total_replayed = 0
total_failed = 0
while True:
response = sqs.receive_message(
QueueUrl=DLQ_URL,
MaxNumberOfMessages=min(BATCH_SIZE, 10),
WaitTimeSeconds=2
)
messages = response.get('Messages', [])
if not messages:
logger.info(f"DLQ empty. Replayed={total_replayed}, Failed={total_failed}")
break
for msg in messages:
try:
# Re-send original message body to the main queue
sqs.send_message(
QueueUrl=TARGET_URL,
MessageBody=msg['Body'],
MessageAttributes=msg.get('MessageAttributes', {})
)
sqs.delete_message(
QueueUrl=DLQ_URL,
ReceiptHandle=msg['ReceiptHandle']
)
total_replayed += 1
except Exception as e:
logger.error(f"Failed to replay {msg['MessageId']}: {e}")
total_failed += 1
time.sleep(0.5) # Respect main queue rate limits
return {
"statusCode": 200,
"replayed": total_replayed,
"failed": total_failed
}
10. Observability — CloudWatch Metrics and X-Ray Tracing
A fan-out topology with 6 SQS queues, 6 DLQs, and 6 Lambda consumers has 60+ CloudWatch metrics worth monitoring. Here are the signals that matter and how to wire them together for end-to-end observability.
Key CloudWatch SQS Metrics
| Metric | Alarm Threshold | Meaning |
|---|---|---|
| ApproximateNumberOfMessagesVisible | > 1000 sustained 5 min | Consumer falling behind — scale out or check for errors |
| ApproximateAgeOfOldestMessage | > 300 seconds | Consumer stalled or dead — messages aging dangerously |
| NumberOfMessagesSent | < 1 over 10 min (business hours) | Producer may be failing — no events flowing |
| NumberOfMessagesDeleted | N/A (track rate) | Consumer throughput — compare with Sent for backlog rate |
| ApproximateNumberOfMessagesNotVisible | High relative to Visible | Messages in-flight — Lambda concurrency may be maxed |
CloudWatch Dashboard (CLI)
aws cloudwatch put-dashboard \
--dashboard-name "OrdersFanOutHealth" \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"properties": {
"title": "SQS Queue Depths",
"metrics": [
["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","inventory-queue"],
["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","notifications-queue"],
["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","analytics-queue"],
["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","fraud-queue"]
],
"period": 60,
"stat": "Maximum",
"view": "timeSeries"
}
},
{
"type": "metric",
"properties": {
"title": "DLQ Depths (should be 0)",
"metrics": [
["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","inventory-queue-dlq"],
["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","notifications-queue-dlq"],
["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","fraud-queue-dlq"]
],
"period": 60,
"stat": "Sum",
"view": "timeSeries"
}
}
]
}'
X-Ray End-to-End Tracing
AWS X-Ray traces a message from the moment it is published to SNS, through SQS, into the Lambda consumer, and into any downstream services the consumer calls. Enable active tracing on Lambda and use the X-Ray SDK to propagate the trace context.
# Enable X-Ray active tracing on Lambda
aws lambda update-function-configuration \
--function-name inventory-processor \
--tracing-config Mode=Active
# Enable X-Ray on the SNS topic (for publisher tracing)
aws sns set-topic-attributes \
--topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
--attribute-name TracingConfig \
--attribute-value Active
# Python Lambda with X-Ray SDK
from aws_xray_sdk.core import xray_recorder, patch_all
import boto3, json
patch_all() # auto-instrument boto3 calls
@xray_recorder.capture('process_inventory_update')
def process_inventory_update(order: dict):
xray_recorder.current_subsegment().put_annotation('orderId', order['orderId'])
xray_recorder.current_subsegment().put_annotation('orderType', order['orderType'])
# All downstream boto3 calls (DynamoDB, etc.) appear in the X-Ray trace
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Inventory')
# ... update stock ...
SNS Metrics to Monitor
# SNS publishes these metrics automatically:
# NumberOfMessagesSent — messages published by producers
# NumberOfNotificationsDelivered — successfully delivered to subscribers
# NumberOfNotificationsFailed — failed deliveries (alert on this!)
# NumberOfNotificationsFilteredOut — filtered by subscription filter policies
aws cloudwatch get-metric-statistics \
--namespace AWS/SNS \
--metric-name NumberOfNotificationsFailed \
--dimensions Name=TopicName,Value=orders-events \
--start-time 2026-06-08T00:00:00Z \
--end-time 2026-06-08T23:59:59Z \
--period 3600 \
--statistics Sum
ApproximateAgeOfOldestMessage alarms per queue to enforce these SLOs automatically.
Putting It All Together — Production Checklist
- Every SQS queue has a DLQ with
maxReceiveCount=3 - Every DLQ has a CloudWatch alarm (threshold: >= 1 message)
- Raw message delivery enabled on all SQS subscriptions
- Long polling (
ReceiveMessageWaitTimeSeconds=20) on all queues - Lambda consumers use
ReportBatchItemFailures - Lambda concurrency reserved per function (prevent noisy-neighbor)
- SNS subscription filter policies applied where consumers don't need all events
- X-Ray active tracing on all Lambda consumers and the SNS topic
- CloudWatch dashboard covering queue depth, DLQ depth, Lambda error rate
- Producer uses message attributes for all filterable fields