AWS SNS to SQS Fan-Out Pattern: Event-Driven Architecture at Scale (2026)

The SNS-to-SQS fan-out pattern is the backbone of scalable, loosely coupled event-driven systems on AWS. One SNS topic fans out a single event to dozens of independent SQS queues — each consumed by a different microservice — with zero coupling between producers and consumers. This guide goes beyond the basics: you'll get subscription filter policies for intelligent message routing, Lambda consumer tuning, dead-letter queue automation, cross-account fan-out, full observability with CloudWatch and X-Ray, and production-grade code in Python boto3, Terraform, and Java Spring Cloud AWS.

1. The Fan-Out Pattern — Why It Solves Tight Coupling
2. SNS Deep-Dive: Topics, Subscriptions, Message Attributes
3. SQS Deep-Dive: Standard vs FIFO, Visibility Timeout, DLQ
4. Fan-Out Setup: CLI + Terraform
5. Message Filtering — Attribute-Based Routing
6. Lambda Consumers — Event Source Mapping and Error Handling
7. Producer + Consumer Code: Python boto3 and Java Spring
8. Advanced Patterns: Fan-In, Cross-Account, Cross-Region
9. Dead Letter Queues — Setup, Monitoring, Auto-Reprocessing
10. Observability — CloudWatch Metrics and X-Ray Tracing

1. The Fan-Out Pattern — Why It Solves Tight Coupling

In a tightly coupled system, when an order is placed the Order Service directly calls the Inventory Service, the Notification Service, the Analytics Service, the Fraud Detection Service, and the Loyalty Points Service — one by one, synchronously. If any downstream service is slow or unavailable, the entire order placement fails or slows to a crawl. Adding a sixth downstream consumer requires editing the Order Service code and redeploying it.

The fan-out pattern eliminates this coupling entirely. The Order Service publishes one event to a single SNS topic. SNS delivers that same event in parallel to N independent SQS queues — one per consuming service — without the producer knowing or caring who is listening. Each consumer processes at its own pace, with its own concurrency and scaling rules. Adding a new downstream service means creating a new SQS queue and subscribing it to the topic; the Order Service is never touched.

Real E-Commerce Order Processing Fan-Out

Consider this production architecture at an e-commerce company processing 500,000 orders per day:

Order Service (producer)
        │
        ▼
  SNS Topic: orders-events
        │
   ┌────┼────────────────────────────────────────┐
   │    │                                        │
   ▼    ▼                                        ▼
SQS:  SQS:                                     SQS:
inventory  notifications                       analytics
   │         │                                   │
   ▼         ▼                                   ▼
Lambda    Lambda                              ECS Task
(update   (send email/                       (clickstream
 stock)    push/SMS)                          pipeline)

        Also subscribed:
        SQS: fraud-detection → Lambda (ML scoring)
        SQS: loyalty-points  → Lambda (award points)
        SQS: erp-sync        → ECS  (SAP integration)

Each queue is independent. The analytics queue can be tuned for high throughput with large batches. The loyalty-points consumer deduplicates on order ID so a customer never gets double-awarded (an SQS FIFO queue is not an option on this topic, because only SNS FIFO topics can deliver to FIFO queues). The fraud-detection queue has a visibility timeout longer than the 30-second default because the ML model takes time. None of these constraints affect any other service.

Key Design Principle: SNS delivers messages to SQS with at-least-once semantics. Each SQS subscriber gets its own independent copy of the message — a message consumed (and deleted) from the inventory queue does not affect the copy sitting in the notifications queue. This is fundamentally different from a single queue with competing consumers.

The fan-out pattern also gives you natural circuit breaking. If the ERP sync service goes down for 4 hours, messages accumulate in its SQS queue (retained for 4 days by default, configurable up to 14 days). When the service recovers, it drains the backlog without any re-publishing from the Order Service. No events are lost as long as the outage is shorter than the queue's retention period.
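The independence of the per-subscriber copies can be illustrated with a small in-memory model. This is only a sketch: the `Topic` class and the queue names are stand-ins invented for the illustration and have nothing to do with the real SNS or SQS APIs.

```python
from collections import deque
from copy import deepcopy


class Topic:
    """In-memory stand-in for an SNS topic (illustration only, not the AWS API)."""

    def __init__(self) -> None:
        self.queues: dict[str, deque] = {}

    def subscribe(self, name: str) -> deque:
        """Register a new queue; nothing on the producer side changes."""
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, event: dict) -> None:
        """Deliver an independent copy of the event to every subscribed queue."""
        for queue in self.queues.values():
            queue.append(deepcopy(event))


topic = Topic()
inventory = topic.subscribe("inventory")
erp_sync = topic.subscribe("erp-sync")

for n in range(3):
    topic.publish({"orderId": f"ORD-{n}"})

# The inventory consumer drains its queue; the erp-sync copies are untouched
# and simply wait until that consumer comes back.
while inventory:
    inventory.popleft()

print(len(inventory), len(erp_sync))  # 0 3
```

Deleting a message from one queue never touches the copy held by another, which is why a stalled consumer only builds up its own backlog.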

When to Use Fan-Out

Scenario	Use Fan-Out?	Reasoning
Multiple services need the same event	Yes	Core use case — avoids N point-to-point integrations
Consumers process at different speeds	Yes	Each SQS queue buffers independently
New consumers added frequently	Yes	Subscribe new SQS queue, no producer changes
Only one consumer ever	No	Direct SQS publish is simpler
Request-response needed	No	Use SQS with replyTo pattern instead

2. SNS Deep-Dive: Topics, Subscriptions, Message Attributes

Amazon SNS is a pub/sub messaging service that pushes messages to subscribers in milliseconds. For the fan-out pattern, the critical concepts are standard topics (for SQS fan-out), FIFO topics (for ordered fan-out), message attributes (for filter policies), and the raw message delivery option.

Topic Types

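SNS offers two topic types. Standard topics deliver messages with best-effort ordering and at-least-once delivery, supporting up to 12.5 million subscriptions per topic and nearly unlimited throughput. FIFO topics guarantee strict ordering per message group and deduplicate messages within a 5-minute window, which gives exactly-once processing when paired with SQS FIFO queues. That is useful for financial events where ordering matters, at the cost of a far lower throughput quota than standard topics (historically 300 publishes per second; check the current SNS quotas) and a limit of 100 subscriptions per topic. For most fan-out patterns, standard topics are the right choice.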

Message Attributes

Message attributes are key-value pairs you attach to an SNS message. They travel alongside the message body and are the mechanism behind subscription filter policies: SNS evaluates them before deciding whether to deliver a message to a given SQS subscriber. Supported types are String, String.Array, Number, and Binary. A message may carry at most 10 attributes when it is delivered to an SQS subscription with raw message delivery enabled.

# Publish with message attributes (CLI)
aws sns publish \
  --topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
  --message '{"orderId":"ORD-9821","total":249.99,"region":"us-east","type":"PREMIUM"}' \
  --message-attributes '{
    "orderType": {
      "DataType": "String",
      "StringValue": "PREMIUM"
    },
    "region": {
      "DataType": "String",
      "StringValue": "us-east"
    },
    "totalAmount": {
      "DataType": "Number",
      "StringValue": "249.99"
    }
  }'

Raw Message Delivery

By default, SNS wraps the original message body in a JSON envelope when delivering to SQS. The SQS message body looks like this:

{
  "Type": "Notification",
  "MessageId": "a7c3f8d2-...",
  "TopicArn": "arn:aws:sns:us-east-1:123456789012:orders-events",
  "Subject": null,
  "Message": "{\"orderId\":\"ORD-9821\",\"total\":249.99}",
  "Timestamp": "2026-06-08T10:30:00.000Z",
  "SignatureVersion": "1",
  "Signature": "...",
  "MessageAttributes": { ... }
}

If you enable Raw Message Delivery on the subscription, the SQS message body contains exactly what you published — no envelope. This simplifies consumer code and reduces message size. Enable it unless you need the SNS metadata (timestamps, topic ARN) in the consumer.

# Enable raw message delivery on a subscription
aws sns set-subscription-attributes \
  --subscription-arn arn:aws:sns:us-east-1:123456789012:orders-events:abc123 \
  --attribute-name RawMessageDelivery \
  --attribute-value true

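Production tip: Enable Raw Message Delivery for SQS subscribers unless your consumer logic specifically needs the SNS envelope fields. Consumers then parse one JSON document instead of two, and each delivered message is smaller because the envelope and its signature fields are omitted. Message attributes are still available: they arrive as SQS message attributes instead of inside the body.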
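A consumer that has to cope with both delivery modes (for example during a migration to raw delivery) can detect the envelope and unwrap it. The sketch below assumes the published payload is a JSON object; the `extract_payload` helper and its detection heuristic are illustrative, not part of any AWS SDK.

```python
import json


def extract_payload(sqs_body: str) -> dict:
    """Return the published payload whether or not raw delivery is enabled."""
    body = json.loads(sqs_body)
    # Heuristic (an assumption of this sketch): the SNS envelope is a JSON
    # object with Type "Notification" and a string-valued "Message" field.
    is_envelope = (
        isinstance(body, dict)
        and body.get("Type") == "Notification"
        and isinstance(body.get("Message"), str)
    )
    return json.loads(body["Message"]) if is_envelope else body


raw = '{"orderId": "ORD-9821", "total": 249.99}'
wrapped = json.dumps({
    "Type": "Notification",
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:orders-events",
    "Message": raw,
})

print(extract_payload(raw) == extract_payload(wrapped))  # True
```

A payload that itself happens to contain `Type` and `Message` fields would fool this check, so a real consumer is better off knowing which mode its subscription uses.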

SQS Queue Policy for SNS

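For SNS to send messages to an SQS queue, the queue must have a resource-based policy allowing SNS to call sqs:SendMessage. Without this, deliveries fail without any error reaching the publisher; the only signal is the topic's NumberOfNotificationsFailed metric in CloudWatch.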

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Allow-SNS-SendMessage",
      "Effect": "Allow",
      "Principal": {
        "Service": "sns.amazonaws.com"
      },
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:123456789012:inventory-queue",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:sns:us-east-1:123456789012:orders-events"
        }
      }
    }
  ]
}

3. SQS Deep-Dive: Standard vs FIFO, Visibility Timeout, Long Polling

In a fan-out pattern, each SQS queue is an independent buffer. Understanding the tuning knobs for each queue (visibility timeout, message retention, and long polling via the receive message wait time) is essential for production reliability.

Visibility Timeout

When a consumer reads a message from SQS, the message becomes invisible to other consumers for the visibility timeout duration. If the consumer successfully processes and deletes the message before the timeout expires, it is gone. If the consumer crashes or takes too long, the message reappears in the queue and another consumer picks it up. Set the visibility timeout comfortably above your worst-case processing time, not the average, to avoid duplicate processing caused by slow consumers. For Lambda consumers, AWS recommends at least six times the function timeout, plus the batching window.

# Set visibility timeout to 120 seconds
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue \
  --attributes VisibilityTimeout=120

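# Extend visibility timeout mid-processing (Python)
# queue_url and receipt_handle come from the receive_message call that returned the message
import boto3

sqs = boto3.client("sqs")
sqs.change_message_visibility(
    QueueUrl=queue_url,
    ReceiptHandle=receipt_handle,
    VisibilityTimeout=120  # message stays hidden for another 120 s, counted from this call
)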
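For work whose duration is hard to predict, the extension call can be repeated in the background while the message is being processed. The following is a minimal sketch of such a heartbeat; the `VisibilityHeartbeat` class and its parameter names are hypothetical, and `sqs_client` is assumed to be a boto3 SQS client or any object with a compatible `change_message_visibility` method.

```python
import threading


class VisibilityHeartbeat:
    """Context manager that keeps re-extending a message's visibility timeout."""

    def __init__(self, sqs_client, queue_url: str, receipt_handle: str,
                 extend_to: int = 120, interval: float = 60.0) -> None:
        self._sqs = sqs_client
        self._queue_url = queue_url
        self._receipt_handle = receipt_handle
        self._extend_to = extend_to  # new timeout in seconds, counted from each call
        self._interval = interval    # seconds between calls; keep it below extend_to
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop has been set.
        while not self._stop.wait(self._interval):
            self._sqs.change_message_visibility(
                QueueUrl=self._queue_url,
                ReceiptHandle=self._receipt_handle,
                VisibilityTimeout=self._extend_to,
            )

    def __enter__(self) -> "VisibilityHeartbeat":
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self._stop.set()
        self._thread.join()


# Intended usage (process() stands for your own handler):
#   with VisibilityHeartbeat(sqs, queue_url, msg["ReceiptHandle"]):
#       process(msg)
```

SQS caps the total time a message can stay invisible at 12 hours from the first receive, so a heartbeat postpones redelivery but cannot replace a sensible upper bound on processing time.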

Long Polling

Short polling (WaitTimeSeconds=0) queries a random subset of SQS servers and returns immediately — even if the queue is empty — causing empty responses that cost money and waste CPU cycles. Long polling (WaitTimeSeconds=1–20) waits up to 20 seconds for a message to arrive before returning. Always use long polling in production.

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue \
  --attributes ReceiveMessageWaitTimeSeconds=20

Standard vs FIFO for Fan-Out

Queue Type	Best For (in fan-out)	Limitations
Standard	Analytics, notifications, logging — high throughput, ordering irrelevant	At-least-once; implement idempotency in consumer
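FIFO	Inventory updates, ledger entries: must process in order per entity	300 API calls per second per action without batching (3,000 messages per second with batching) unless high-throughput mode is enabled; SNS topic must also be FIFO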

FIFO Fan-Out Constraint: An SQS FIFO queue can only subscribe to an SNS FIFO topic; a standard topic cannot deliver to it. The reverse is allowed: SNS FIFO topics accept both FIFO and standard SQS queues as subscribers, although a standard queue does not preserve the ordering or deduplication guarantees. If most of your consumers do not need ordering, keep a Standard SNS topic → Standard SQS for them, and publish the ordered events to a separate dedicated FIFO topic → FIFO SQS.

Dead-Letter Queue Setup

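Every SQS queue in a fan-out should have a dedicated DLQ. Set maxReceiveCount to 3–5: once a message has been received that many times without being deleted, SQS moves it to the DLQ instead of making it visible again. Without a DLQ, a poison-pill message keeps cycling through your consumer until its retention period expires, wasting capacity and, on FIFO queues, blocking its message group.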

# Create DLQ first
aws sqs create-queue --queue-name inventory-queue-dlq

# Get DLQ ARN
DLQ_ARN=$(aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue-dlq \
  --attribute-names QueueArn \
  --query 'Attributes.QueueArn' --output text)

# Attach DLQ to main queue
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"'"$DLQ_ARN"'\",\"maxReceiveCount\":\"3\"}"
  }'
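The shell quoting needed for nested JSON is easy to get wrong, so it can help to build the same attributes in Python. The sketch below combines the SNS access policy from section 2 with the redrive policy shown above; `build_queue_attributes` is a hypothetical helper, not an AWS API.

```python
import json


def build_queue_attributes(queue_arn: str, topic_arn: str, dlq_arn: str,
                           max_receive_count: int = 3) -> dict[str, str]:
    """Build the Attributes map for an SQS queue subscribed to one SNS topic.

    SQS expects every attribute value as a string, so the access policy and
    the redrive policy are passed as serialized JSON documents.
    """
    access_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "Allow-SNS-SendMessage",
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    return {
        "Policy": json.dumps(access_policy),
        "RedrivePolicy": json.dumps(redrive_policy),
    }


attrs = build_queue_attributes(
    queue_arn="arn:aws:sqs:us-east-1:123456789012:inventory-queue",
    topic_arn="arn:aws:sns:us-east-1:123456789012:orders-events",
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:inventory-queue-dlq",
)
print(json.loads(attrs["RedrivePolicy"])["maxReceiveCount"])  # 3

# With boto3 installed and credentials configured, the map could be applied with:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```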

4. Fan-Out Setup: CLI + Terraform

Let's build the full fan-out topology: one SNS topic, three SQS queues (inventory, notifications, analytics), three DLQs, and three SNS subscriptions.

CLI Setup

#!/bin/bash
REGION="us-east-1"
ACCOUNT="123456789012"

# 1. Create SNS topic
TOPIC_ARN=$(aws sns create-topic \
  --name orders-events \
  --region $REGION \
  --query 'TopicArn' --output text)

echo "SNS Topic: $TOPIC_ARN"

# 2. Create queues + DLQs
for SERVICE in inventory notifications analytics; do
  # Create DLQ
  aws sqs create-queue --queue-name "${SERVICE}-queue-dlq" --region $REGION

  DLQ_ARN="arn:aws:sqs:${REGION}:${ACCOUNT}:${SERVICE}-queue-dlq"

  # Create main queue with DLQ
  QUEUE_URL=$(aws sqs create-queue \
    --queue-name "${SERVICE}-queue" \
    --attributes '{
      "VisibilityTimeout":"60",
      "ReceiveMessageWaitTimeSeconds":"20",
      "MessageRetentionPeriod":"86400",
      "RedrivePolicy":"{\"deadLetterTargetArn\":\"'"$DLQ_ARN"'\",\"maxReceiveCount\":\"3\"}"
    }' \
    --region $REGION \
    --query 'QueueUrl' --output text)

  QUEUE_ARN="arn:aws:sqs:${REGION}:${ACCOUNT}:${SERVICE}-queue"

  # 3. Grant SNS permission to send to this queue
  aws sqs set-queue-attributes \
    --queue-url "$QUEUE_URL" \
    --attributes '{
      "Policy": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"sns.amazonaws.com\"},\"Action\":\"sqs:SendMessage\",\"Resource\":\"'"$QUEUE_ARN"'\",\"Condition\":{\"ArnEquals\":{\"aws:SourceArn\":\"'"$TOPIC_ARN"'\"}}}]}"
    }'

  # 4. Subscribe queue to SNS topic
  aws sns subscribe \
    --topic-arn "$TOPIC_ARN" \
    --protocol sqs \
    --notification-endpoint "$QUEUE_ARN" \
    --attributes RawMessageDelivery=true \
    --region $REGION

  echo "Created: ${SERVICE}-queue → subscribed to SNS"
done

Terraform

variable "region" { default = "us-east-1" }
variable "services" { default = ["inventory", "notifications", "analytics"] }

provider "aws" { region = var.region }

# SNS Topic
resource "aws_sns_topic" "orders_events" {
  name = "orders-events"
}

# DLQs
resource "aws_sqs_queue" "dlq" {
  for_each                   = toset(var.services)
  name                       = "${each.key}-queue-dlq"
  message_retention_seconds  = 1209600  # 14 days
}

# Main queues
resource "aws_sqs_queue" "main" {
  for_each                     = toset(var.services)
  name                         = "${each.key}-queue"
  visibility_timeout_seconds   = 60
  receive_wait_time_seconds    = 20
  message_retention_seconds    = 86400

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq[each.key].arn
    maxReceiveCount     = 3
  })
}

# Queue policies (allow SNS to send)
resource "aws_sqs_queue_policy" "allow_sns" {
  for_each  = toset(var.services)
  queue_url = aws_sqs_queue.main[each.key].id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "sns.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.main[each.key].arn
      Condition = {
        ArnEquals = {
          "aws:SourceArn" = aws_sns_topic.orders_events.arn
        }
      }
    }]
  })
}

# SNS subscriptions
resource "aws_sns_topic_subscription" "fan_out" {
  for_each               = toset(var.services)
  topic_arn              = aws_sns_topic.orders_events.arn
  protocol               = "sqs"
  endpoint               = aws_sqs_queue.main[each.key].arn
  raw_message_delivery   = true
}

output "topic_arn"  { value = aws_sns_topic.orders_events.arn }
output "queue_urls" { value = { for s in var.services : s => aws_sqs_queue.main[s].url } }

5. Message Filtering — Attribute-Based Routing

Without filter policies, every SQS subscriber receives every message published to the SNS topic. That wastes compute and forces consumers to discard irrelevant events. SNS subscription filter policies let you route messages to specific queues based on message attributes — think of it as server-side content-based routing with zero consumer code changes.

Filter Policy Syntax

A filter policy is a JSON object where each key is a message attribute name and the value is an array of accepted values (exact match, prefix match, or numeric range).

# Only deliver to inventory-queue if orderType is STANDARD or PREMIUM
{
  "orderType": ["STANDARD", "PREMIUM"]
}

# Only deliver to fraud-queue if totalAmount >= 500 (numeric range)
{
  "totalAmount": [{"numeric": [">=", 500]}]
}

# Route EU orders to eu-analytics-queue (prefix match)
{
  "region": [{"prefix": "eu-"}]
}

# Complex: PREMIUM orders in us-east OR eu-west
{
  "orderType": ["PREMIUM"],
  "region": ["us-east", "eu-west"]
}  # AND logic across keys, OR logic within arrays
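To check a policy before attaching it, the matching rules above can be approximated locally. This is a simplified sketch, not SNS's implementation: it covers only exact string match, prefix, numeric comparisons, and exists, and it assumes attributes are given as a plain dict of Python values. The function names are invented for the example.

```python
from typing import Any

_OPS = {
    "=": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}


def _matches_condition(cond: Any, value: Any) -> bool:
    """Check one entry of a policy array against one attribute value (None = absent)."""
    if isinstance(cond, dict):
        if "exists" in cond:
            return (value is not None) == cond["exists"]
        if value is None:
            return False
        if "prefix" in cond:
            return isinstance(value, str) and value.startswith(cond["prefix"])
        if "numeric" in cond:
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            spec = cond["numeric"]  # e.g. [">=", 500] or [">", 0, "<=", 100]
            pairs = zip(spec[0::2], spec[1::2])
            return all(_OPS[op](value, bound) for op, bound in pairs)
        return False  # operator not covered by this sketch
    # Plain strings are exact matches.
    return isinstance(cond, str) and cond == value


def matches(policy: dict, attributes: dict) -> bool:
    """AND across policy keys, OR across the conditions listed for each key."""
    return all(
        any(_matches_condition(cond, attributes.get(key)) for cond in conditions)
        for key, conditions in policy.items()
    )


premium = {"orderType": ["PREMIUM"], "region": ["us-east", "eu-west"]}
fraud = {"totalAmount": [{"numeric": [">=", 500]}]}

print(matches(premium, {"orderType": "PREMIUM", "region": "eu-west"}))  # True
print(matches(fraud, {"totalAmount": 249.99}))                          # False
```

A helper like this is handy in unit tests for routing rules, but the subscription's behavior on real traffic remains the authority.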

Applying a Filter Policy via CLI

# Apply filter to the fraud-detection subscription
# Get the subscription ARN first
SUB_ARN=$(aws sns list-subscriptions-by-topic \
  --topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
  --query "Subscriptions[?Endpoint=='arn:aws:sqs:us-east-1:123456789012:fraud-queue'].SubscriptionArn" \
  --output text)

# Set filter: only high-value orders (>= $500)
aws sns set-subscription-attributes \
  --subscription-arn "$SUB_ARN" \
  --attribute-name FilterPolicy \
  --attribute-value '{"totalAmount":[{"numeric":[">=",500]}]}'

# Analytics gets everything — no filter needed (default)

Real-World Routing Matrix

SQS Queue	Filter Policy	Receives
inventory-queue	`{"orderType":["STANDARD","PREMIUM","SUBSCRIPTION"]}`	All paid orders (not REFUND events)
notifications-queue	(none)	All events — sends confirmation email regardless
fraud-queue	`{"totalAmount":[{"numeric":[">=",500]}]}`	High-value orders only
eu-compliance-queue	`{"region":[{"prefix":"eu-"}]}`	EU orders for GDPR audit trail
loyalty-queue	`{"orderType":["STANDARD","PREMIUM"]}`	Eligible orders (not refunds, not subscriptions)

Filter Policy Gotcha: If a message has no message attributes at all and a subscription has a filter policy, the message is not delivered to that subscription. If you want a subscription to receive both filtered and attribute-free messages, use {"orderType":["STANDARD","PREMIUM",{"exists":false}]} to also match messages where the attribute is absent.

Terraform Filter Policies

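# Assumes "fraud" and "eu-compliance" queues exist in aws_sqs_queue.main
# (they are not in the default var.services list) and are excluded from the
# generic fan_out subscriptions, so each queue is subscribed only once.
resource "aws_sns_topic_subscription" "fraud_detection" {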
  topic_arn            = aws_sns_topic.orders_events.arn
  protocol             = "sqs"
  endpoint             = aws_sqs_queue.main["fraud"].arn
  raw_message_delivery = true

  filter_policy = jsonencode({
    totalAmount = [{ numeric = [">=", 500] }]
  })
}

resource "aws_sns_topic_subscription" "eu_compliance" {
  topic_arn            = aws_sns_topic.orders_events.arn
  protocol             = "sqs"
  endpoint             = aws_sqs_queue.main["eu-compliance"].arn
  raw_message_delivery = true

  filter_policy = jsonencode({
    region = [{ prefix = "eu-" }]
  })
}

6. Lambda Consumers — Event Source Mapping and Error Handling

Lambda is the most common consumer for SQS queues in a fan-out pattern. Lambda's SQS event source mapping polls the queue on your behalf, invokes your function with batches of messages, and handles deletion of successfully processed messages automatically.

Event Source Mapping Configuration

# Create event source mapping: Lambda processes inventory-queue
aws lambda create-event-source-mapping \
  --function-name inventory-processor \
  --event-source-arn arn:aws:sqs:us-east-1:123456789012:inventory-queue \
  --batch-size 10 \
  --maximum-batching-window-in-seconds 5 \
  --function-response-types ReportBatchItemFailures

Batch Size Tuning

The batch size controls how many SQS messages are passed to a single Lambda invocation. A batch size of 10 with a batching window of 5 seconds means Lambda waits up to 5 seconds to accumulate up to 10 messages before invoking your function, which cuts the number of invocations when messages trickle in. For CPU-intensive processing, use batch size 1. For I/O-bound work (database writes, API calls), use 10–100. Batch sizes above 10 require a batching window of at least 1 second and are available only for standard queues; FIFO queues cap the batch at 10.

ReportBatchItemFailures — Partial Batch Handling

Without ReportBatchItemFailures, if any message in the batch fails, the entire batch is retried — including the messages that succeeded. This causes duplicate processing. With ReportBatchItemFailures enabled, your Lambda function returns a batchItemFailures response indicating only the failed message IDs, and only those messages return to the queue.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    """
    Lambda consumer for inventory-queue.
    Uses ReportBatchItemFailures for partial failure handling.
    """
    failed_message_ids = []

    for record in event['Records']:
        message_id = record['messageId']
        try:
            body = json.loads(record['body'])
            process_inventory_update(body)
            logger.info(f"Processed order {body['orderId']}")
        except Exception as e:
            logger.error(f"Failed to process {message_id}: {e}")
            failed_message_ids.append({"itemIdentifier": message_id})

    # Only failed messages will be retried
    return {
        "batchItemFailures": failed_message_ids
    }

# Create the table handle once per execution environment and reuse it across invocations
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Inventory')

def process_inventory_update(order: dict):
    """Update inventory for each line item in the order."""

    for item in order.get('lineItems', []):
        table.update_item(
            Key={'productId': item['productId']},
            UpdateExpression='SET stock = stock - :qty',
            ExpressionAttributeValues={':qty': item['quantity']},
            ConditionExpression='stock >= :qty'
        )

Lambda DLQ for Async Invocations

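Note that the SQS queue already has a DLQ. A Lambda function's own DLQ applies only to asynchronous invocations. With an SQS event source mapping, the Lambda poller invokes the function synchronously, so failed messages return to the queue and move to the SQS DLQ once maxReceiveCount is exceeded; they never reach the Lambda DLQ. Configure the SQS DLQ, not the Lambda DLQ, for fan-out consumers.

Lambda Concurrency for Fan-Out: For standard queues, the event source mapping starts with five concurrent batches and scales out while a backlog remains, up to the concurrency available to the function. For FIFO queues, concurrency is bounded by the number of active message groups. Cap each consumer so that one noisy queue cannot consume all account-level Lambda concurrency and starve the other queues in your fan-out: either reserve concurrency on the function or, preferably, set a maximum concurrency on the event source mapping, which avoids throttled invocations pushing messages toward the DLQ.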

7. Producer + Consumer Code: Python boto3 and Java Spring Cloud AWS

Python boto3 — Producer (Order Service)

import boto3
import json
import uuid
from datetime import datetime, timezone

class OrderEventProducer:
    """Publishes order events to SNS for fan-out distribution."""

    def __init__(self, topic_arn: str, region: str = "us-east-1"):
        self.sns = boto3.client('sns', region_name=region)
        self.topic_arn = topic_arn

    def publish_order_placed(
        self,
        order_id: str,
        customer_id: str,
        total_amount: float,
        order_type: str,
        region: str,
        line_items: list
    ) -> str:
        """Publish an OrderPlaced event to SNS."""

        payload = {
            "eventType": "OrderPlaced",
            "eventId": str(uuid.uuid4()),
            # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
            "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
            "orderId": order_id,
            "customerId": customer_id,
            "totalAmount": total_amount,
            "orderType": order_type,
            "region": region,
            "lineItems": line_items
        }

        response = self.sns.publish(
            TopicArn=self.topic_arn,
            Message=json.dumps(payload),
            MessageAttributes={
                "eventType": {
                    "DataType": "String",
                    "StringValue": "OrderPlaced"
                },
                "orderType": {
                    "DataType": "String",
                    "StringValue": order_type
                },
                "region": {
                    "DataType": "String",
                    "StringValue": region
                },
                "totalAmount": {
                    "DataType": "Number",
                    "StringValue": str(total_amount)
                }
            }
        )

        message_id = response['MessageId']
        print(f"Published OrderPlaced event {message_id} for order {order_id}")
        return message_id


# Usage
if __name__ == "__main__":
    producer = OrderEventProducer(
        topic_arn="arn:aws:sns:us-east-1:123456789012:orders-events"
    )

    producer.publish_order_placed(
        order_id="ORD-2026-9821",
        customer_id="CUST-4471",
        total_amount=749.95,
        order_type="PREMIUM",
        region="us-east",
        line_items=[
            {"productId": "SKU-001", "quantity": 2, "price": 299.99},
            {"productId": "SKU-042", "quantity": 1, "price": 149.97}
        ]
    )

Python boto3 — Consumer (Notifications Service, polling SQS directly)

import boto3
import json
import time
import logging

logger = logging.getLogger(__name__)

class NotificationConsumer:
    """Polls SQS for order events and sends customer notifications."""

    def __init__(self, queue_url: str, region: str = "us-east-1"):
        self.sqs = boto3.client('sqs', region_name=region)
        self.ses = boto3.client('ses', region_name=region)
        self.queue_url = queue_url

    def run(self):
        logger.info(f"Starting consumer for {self.queue_url}")
        while True:
            messages = self.sqs.receive_message(
                QueueUrl=self.queue_url,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,          # long polling
                AttributeNames=['All'],
                MessageAttributeNames=['All']
            ).get('Messages', [])

            for msg in messages:
                try:
                    self._process(msg)
                    self.sqs.delete_message(
                        QueueUrl=self.queue_url,
                        ReceiptHandle=msg['ReceiptHandle']
                    )
                except Exception as e:
                    logger.error(f"Failed to process {msg['MessageId']}: {e}")
                    # Do NOT delete — message will reappear after visibility timeout
                    # After maxReceiveCount retries it goes to DLQ automatically

    def _process(self, msg: dict):
        order = json.loads(msg['Body'])  # raw message delivery enabled
        order_id = order['orderId']
        customer_id = order['customerId']

        # Idempotency check (DynamoDB or Redis)
        if self._already_notified(order_id):
            logger.info(f"Duplicate notification skipped for {order_id}")
            return

        self._send_confirmation_email(order)
        self._mark_notified(order_id)
        logger.info(f"Sent confirmation for order {order_id} to customer {customer_id}")

    def _already_notified(self, order_id: str) -> bool:
        # Check DynamoDB idempotency table
        return False  # simplified

    def _mark_notified(self, order_id: str):
        pass  # Write to DynamoDB

    def _send_confirmation_email(self, order: dict):
        pass  # Call SES
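The idempotency helpers above are stubs. One common way to fill them in is a DynamoDB conditional write that succeeds only the first time a key is seen. The sketch below assumes a hypothetical idempotency table whose partition key is `orderId`; `table` is expected to be a boto3 DynamoDB Table resource, and the `claim_once` name and `expiresAt` attribute are inventions for this example.

```python
import time


def claim_once(table, order_id: str, ttl_seconds: int = 86400) -> bool:
    """Return True the first time an order ID is seen, False on duplicates."""
    try:
        table.put_item(
            Item={
                "orderId": order_id,
                # Optional expiry timestamp, useful if DynamoDB TTL is enabled
                # on this attribute.
                "expiresAt": int(time.time()) + ttl_seconds,
            },
            # The write succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(orderId)",
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the service error code on .response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise
```

Claiming before sending means a crash between the claim and the email loses that notification, while claiming after sending can produce a duplicate email; which side to err on depends on the consumer.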

Java Spring Cloud AWS — Consumer

// pom.xml dependency
// <dependency>
//   <groupId>io.awspring.cloud</groupId>
//   <artifactId>spring-cloud-aws-starter-sqs</artifactId>
//   <version>3.1.1</version>
// </dependency>

package com.techoral.order.consumer;

import io.awspring.cloud.sqs.annotation.SqsListener;
import io.awspring.cloud.sqs.listener.acknowledgement.Acknowledgement;
import org.springframework.stereotype.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

@Component
public class InventoryConsumer {

    private static final Logger log = LoggerFactory.getLogger(InventoryConsumer.class);
    private final InventoryService inventoryService;
    private final ObjectMapper objectMapper;

    public InventoryConsumer(InventoryService inventoryService, ObjectMapper objectMapper) {
        this.inventoryService = inventoryService;
        this.objectMapper = objectMapper;
    }

    /**
     * @SqsListener polls inventory-queue automatically.
     * Spring Cloud AWS handles polling and threading. Because this method takes an
     * Acknowledgement parameter, the listener must run in MANUAL acknowledgement mode:
     * a message is deleted only when acknowledge() is called.
     * Throw any exception to trigger redelivery (message stays in queue).
     */
    @SqsListener(value = "${aws.sqs.inventory-queue-url}",
                 maxConcurrentMessages = "5",
                 maxMessagesPerPoll = "10",
                 acknowledgementMode = "MANUAL")
    public void handleOrderPlaced(String rawMessage, Acknowledgement acknowledgement) {
        try {
            Map<String, Object> order = objectMapper.readValue(
                rawMessage, new com.fasterxml.jackson.core.type.TypeReference<>() {}
            );

            String orderId = (String) order.get("orderId");
            log.info("Processing inventory for order: {}", orderId);

            inventoryService.reserveStock(order);
            acknowledgement.acknowledge(); // explicit ack — deletes from SQS

        } catch (InsufficientStockException e) {
            log.warn("Insufficient stock — sending to compensating flow: {}", e.getMessage());
            // Publish compensating event to SNS (saga pattern)
            acknowledgement.acknowledge(); // ack so it doesn't retry
        } catch (Exception e) {
            log.error("Transient failure — message will retry: {}", e.getMessage());
            // Do NOT acknowledge — SQS will redeliver after visibility timeout
            throw new RuntimeException("Transient failure", e);
        }
    }
}

// application.yml
// aws:
//   sqs:
//     inventory-queue-url: https://sqs.us-east-1.amazonaws.com/123456789012/inventory-queue
// spring:
//   cloud:
//     aws:
//       region:
//         static: us-east-1   # Spring Cloud AWS 3.x reads the region from spring.cloud.aws.*

8. Advanced Patterns: Fan-In, Cross-Account, Cross-Region

Fan-Out + Fan-In (Aggregation Pattern)

Fan-out distributes work; fan-in collects results. In an order processing pipeline, the Order Service fans out to five processors, each writes its result to a dedicated SQS result queue, and an aggregator Lambda polls all five queues and assembles the final order status when all processors report back.

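# Fan-in with Step Functions (preferred approach)
# This simplified state machine publishes the order to SNS, waits a fixed
# 30 seconds for the processors, then invokes the aggregator Lambda.
# The fixed Wait is only a placeholder: production workflows usually wait on
# task-token callbacks or poll for completion instead of guessing a delay.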

{
  "Comment": "Fan-out order processing with fan-in",
  "StartAt": "PublishToSNS",
  "States": {
    "PublishToSNS": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:orders-events",
        "Message.$": "States.JsonToString($.order)"
      },
      "Next": "WaitForProcessors"
    },
    "WaitForProcessors": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "AggregateResults"
    },
    "AggregateResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:order-aggregator",
      "End": true
    }
  }
}
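The aggregation step itself comes down to tracking which processors have reported for each order. A minimal in-memory sketch follows; the processor names and the `OrderAggregator` class are hypothetical, and a real aggregator Lambda would keep this state in a durable store such as DynamoDB rather than in process memory.

```python
from typing import Optional

# Hypothetical names for the five processors in this example.
EXPECTED = frozenset({
    "inventory", "notifications", "analytics", "fraud-detection", "loyalty-points",
})


class OrderAggregator:
    """Collects per-processor results and reports once every processor has answered."""

    def __init__(self, expected=EXPECTED) -> None:
        self._expected = frozenset(expected)
        self._results: dict[str, dict[str, str]] = {}

    def record(self, order_id: str, processor: str, status: str) -> Optional[dict]:
        """Store one result; return the order summary once all processors reported."""
        if processor not in self._expected:
            raise ValueError(f"unknown processor: {processor}")
        results = self._results.setdefault(order_id, {})
        # A redelivered result overwrites the earlier one, so duplicates are harmless.
        results[processor] = status
        if set(results) != self._expected:
            return None
        all_ok = all(s == "OK" for s in results.values())
        return {
            "orderId": order_id,
            "status": "COMPLETED" if all_ok else "FAILED",
            "results": dict(results),
        }


agg = OrderAggregator(expected={"inventory", "fraud-detection"})
print(agg.record("ORD-1", "inventory", "OK"))                   # None
print(agg.record("ORD-1", "fraud-detection", "OK")["status"])   # COMPLETED
```

Because results are keyed by processor name, the at-least-once delivery of the result queues does not distort the outcome.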

Cross-Account Fan-Out

In enterprises, different AWS accounts own different services (security best practice). SNS can fan out to SQS queues in different AWS accounts. The SQS queue in the target account must allow the source SNS topic in its resource-based policy.

# Target account (Account B) SQS queue policy
# Allows Account A's SNS topic to send messages
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CrossAccountSNSFanOut",
      "Effect": "Allow",
      "Principal": {
        "Service": "sns.amazonaws.com"
      },
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-1:ACCOUNT_B:compliance-queue",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:sns:us-east-1:ACCOUNT_A:orders-events"
        }
      }
    }
  ]
}

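# Subscribe Account B's SQS queue to Account A's SNS topic.
# If the topic owner (Account A) creates the subscription, SNS sends a
# confirmation message to the queue and Account B must confirm it.
# If the queue owner (Account B) runs this command instead, the topic policy
# must grant Account B sns:Subscribe and no confirmation step is needed.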
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT_A:orders-events \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-1:ACCOUNT_B:compliance-queue \
  --attributes RawMessageDelivery=true

Cross-Account Note: Raw message delivery is recommended for cross-account subscriptions because Account B's consumers receive exactly the payload that was published and do not depend on the SNS envelope format. If those consumers need to verify that a message really came from Account A's topic, keep the envelope instead and validate its Signature field against the certificate referenced by SigningCertURL.

Cross-Region Fan-Out

SNS can deliver to SQS queues in a different AWS region. This is useful for disaster recovery and multi-region active-active setups. Latency increases (typically 50–200ms cross-region), and data sovereignty regulations may restrict message content crossing regions.

# Subscribe a queue in eu-west-1 to a topic in us-east-1
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:eu-west-1:123456789012:eu-orders-queue \
  --region us-east-1
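Before subscribing, it can be useful to check whether a topic and queue pair crosses a region or account boundary, since that determines which policies and latency expectations apply. The sketch below only parses the two ARNs; the function name classify_subscription is an assumption made for illustration.

```python
def classify_subscription(topic_arn: str, queue_arn: str) -> dict:
    """Compare an SNS topic ARN and an SQS queue ARN.

    ARN layout: arn:partition:service:region:account-id:resource-name
    """
    topic = topic_arn.split(":")
    queue = queue_arn.split(":")
    if len(topic) != 6 or len(queue) != 6 or topic[2] != "sns" or queue[2] != "sqs":
        raise ValueError("expected one SNS topic ARN and one SQS queue ARN")
    return {
        "cross_region": topic[3] != queue[3],
        "cross_account": topic[4] != queue[4],
        # The subscribe call is sent to the region that hosts the topic.
        "subscribe_region": topic[3],
    }


result = classify_subscription(
    "arn:aws:sns:us-east-1:123456789012:orders-events",
    "arn:aws:sqs:eu-west-1:123456789012:eu-orders-queue",
)
```

For the example ARNs from the command above, the result marks the subscription as cross-region but not cross-account.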

9. Dead Letter Queues — Setup, Monitoring, and Auto-Reprocessing

A DLQ is not a graveyard — it is a quarantine zone. Messages in DLQs represent real business failures: a database was unreachable, a third-party API returned a 500, a message was malformed. Every DLQ needs an alert and, for recoverable failures, an automated reprocessing mechanism.

CloudWatch Alarm on DLQ Depth

aws cloudwatch put-metric-alarm \
  --alarm-name "inventory-dlq-not-empty" \
  --alarm-description "Inventory DLQ has messages — investigate immediately" \
  --metric-name ApproximateNumberOfMessagesVisible \
  --namespace AWS/SQS \
  --dimensions Name=QueueName,Value=inventory-queue-dlq \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

Terraform DLQ Alarm (all fan-out DLQs)

resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  for_each = toset(var.services)

  alarm_name          = "${each.key}-dlq-not-empty"
  alarm_description   = "${each.key} DLQ has messages — investigate"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = "${each.key}-queue-dlq"
  }

  alarm_actions = [aws_sns_topic.ops_alerts.arn]
}
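For teams that manage alarms from Python rather than Terraform, the same alarm can be expressed as boto3 put_metric_alarm parameters. This is a sketch under the naming convention used above (service-queue-dlq); the helper names dlq_alarm_params and create_dlq_alarms are hypothetical.

```python
def dlq_alarm_params(service: str, alarm_topic_arn: str) -> dict:
    """Build put_metric_alarm keyword arguments for one service's DLQ."""
    return {
        "AlarmName": f"{service}-dlq-not-empty",
        "AlarmDescription": f"{service} DLQ has messages, investigate",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Namespace": "AWS/SQS",
        "Dimensions": [{"Name": "QueueName", "Value": f"{service}-queue-dlq"}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }


def create_dlq_alarms(cloudwatch, services, alarm_topic_arn: str) -> None:
    """Create one alarm per service.

    cloudwatch is expected to be a boto3 CloudWatch client,
    for example boto3.client("cloudwatch").
    """
    for service in services:
        cloudwatch.put_metric_alarm(**dlq_alarm_params(service, alarm_topic_arn))
```

Passing the client in as an argument keeps the parameter-building logic testable without AWS credentials.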

Automated DLQ Reprocessing Lambda

For transient failures (downstream service temporarily unavailable), automate DLQ message replay. This Lambda reads from a DLQ and resends each message to the original queue, pausing briefly between batches.

import boto3
import json
import time
import os
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # the default level would suppress the info-level summary below
sqs = boto3.client('sqs')

DLQ_URL    = os.environ['DLQ_URL']
TARGET_URL = os.environ['TARGET_QUEUE_URL']
BATCH_SIZE = int(os.environ.get('BATCH_SIZE', '10'))


def handler(event, context):
    """
    Replays messages from the DLQ back to the main queue.
    Triggered by an EventBridge schedule (formerly CloudWatch Events) or manual invocation.
    """
    total_replayed = 0
    total_failed   = 0

    while True:
        response = sqs.receive_message(
            QueueUrl=DLQ_URL,
            MaxNumberOfMessages=min(BATCH_SIZE, 10),
            MessageAttributeNames=['All'],  # attributes are only returned when requested
            WaitTimeSeconds=2
        )
        messages = response.get('Messages', [])

        if not messages:
            logger.info(f"DLQ empty. Replayed={total_replayed}, Failed={total_failed}")
            break

        for msg in messages:
            try:
                # Re-send original message body to the main queue
                sqs.send_message(
                    QueueUrl=TARGET_URL,
                    MessageBody=msg['Body'],
                    MessageAttributes=msg.get('MessageAttributes', {})
                )
                sqs.delete_message(
                    QueueUrl=DLQ_URL,
                    ReceiptHandle=msg['ReceiptHandle']
                )
                total_replayed += 1
            except Exception as e:
                logger.error(f"Failed to replay {msg['MessageId']}: {e}")
                total_failed += 1

        time.sleep(0.5)  # Respect main queue rate limits

    return {
        "statusCode": 200,
        "replayed": total_replayed,
        "failed": total_failed
    }
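As an alternative to a hand-written replay loop, SQS offers a server-side DLQ redrive through the StartMessageMoveTask API, exposed in boto3 as start_message_move_task. The sketch below wraps that call; the function name start_dlq_redrive and the default rate of 50 messages per second are assumptions chosen for illustration.

```python
def start_dlq_redrive(sqs_client, dlq_arn: str, target_arn: str,
                      max_per_second: int = 50) -> str:
    """Start a server-side move of messages from a DLQ to a target queue.

    sqs_client is expected to be a boto3 SQS client. The source queue must
    already be configured as a dead-letter queue of another queue.
    Returns the task handle, which identifies the running move task.
    """
    response = sqs_client.start_message_move_task(
        SourceArn=dlq_arn,
        DestinationArn=target_arn,
        # Throttle the move so the consumer is not flooded (assumed value).
        MaxNumberOfMessagesPerSecond=max_per_second,
    )
    return response["TaskHandle"]
```

Because the move runs inside SQS, the caller does not need to stay alive for the duration of the replay, which avoids Lambda timeout concerns for large DLQs.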

DLQ Best Practice: Never replay DLQ messages automatically without first understanding why they failed. Add structured logging in your consumers so the failure reason for each DLQ message can be looked up by message ID. Mark messages that are safe to replay, for example by moving them to a separate replay-pending queue, so they stay distinct from messages that need manual inspection.
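One way to apply this practice is to split DLQ messages into a replayable group and a manual-inspection group before any replay runs. The sketch below assumes a failure_reasons mapping (message ID to reason) assembled from the consumers' structured logs; the reason labels in TRANSIENT_REASONS are hypothetical examples, not values defined by AWS.

```python
# Hypothetical failure labels that indicate a transient, retry-safe cause.
TRANSIENT_REASONS = {"DOWNSTREAM_TIMEOUT", "THROTTLED", "DB_UNAVAILABLE"}


def split_by_replay_safety(messages, failure_reasons):
    """Partition DLQ messages into (replayable, needs_inspection).

    messages: list of dicts as returned by receive_message (each has MessageId).
    failure_reasons: dict mapping MessageId to a logged failure reason.
    Messages with an unknown or non-transient reason go to manual inspection.
    """
    replayable, needs_inspection = [], []
    for msg in messages:
        reason = failure_reasons.get(msg["MessageId"])
        if reason in TRANSIENT_REASONS:
            replayable.append(msg)
        else:
            needs_inspection.append(msg)
    return replayable, needs_inspection
```

Defaulting unknown reasons to manual inspection keeps the replay path conservative: a message is only resent when its failure is positively identified as transient.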

10. Observability — CloudWatch Metrics and X-Ray Tracing

A fan-out topology with 6 SQS queues, 6 DLQs, and 6 Lambda consumers has 60+ CloudWatch metrics worth monitoring. Here are the signals that matter and how to wire them together for end-to-end observability.

Key CloudWatch SQS Metrics

Metric	Alarm Threshold	Meaning
ApproximateNumberOfMessagesVisible	> 1000 sustained 5 min	Consumer falling behind — scale out or check for errors
ApproximateAgeOfOldestMessage	> 300 seconds	Consumer stalled or dead — messages aging dangerously
NumberOfMessagesSent	< 1 over 10 min (business hours)	Producer may be failing — no events flowing
NumberOfMessagesDeleted	N/A (track rate)	Consumer throughput — compare with Sent for backlog rate
ApproximateNumberOfMessagesNotVisible	High relative to Visible	Messages in-flight — Lambda concurrency may be maxed
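The thresholds in the table can be read as a small decision rule per queue. The following sketch encodes them in a pure function; the name evaluate_queue_health and the in_flight_ratio default of 2.0 (a stand-in for "high relative to Visible") are assumptions, and the inputs are expected to be one-minute samples already fetched from CloudWatch.

```python
def evaluate_queue_health(visible_samples, oldest_age_seconds, not_visible,
                          in_flight_ratio=2.0):
    """Return a list of findings for one queue, based on the table's thresholds.

    visible_samples: ApproximateNumberOfMessagesVisible, one value per minute,
                     oldest first.
    oldest_age_seconds: latest ApproximateAgeOfOldestMessage.
    not_visible: latest ApproximateNumberOfMessagesNotVisible.
    """
    findings = []
    # "Sustained 5 min": each of the last five 1-minute samples exceeds 1000.
    recent = visible_samples[-5:]
    if len(recent) == 5 and all(v > 1000 for v in recent):
        findings.append("backlog")
    if oldest_age_seconds > 300:
        findings.append("stalled")
    latest_visible = visible_samples[-1] if visible_samples else 0
    # In-flight count is high relative to the visible count (assumed ratio).
    if not_visible > in_flight_ratio * max(latest_visible, 1):
        findings.append("in_flight_saturated")
    return findings
```

An empty list means none of the thresholds tripped; the producer-side check on NumberOfMessagesSent is left out because it depends on business hours.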

CloudWatch Dashboard (CLI)

aws cloudwatch put-dashboard \
  --dashboard-name "OrdersFanOutHealth" \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "properties": {
          "title": "SQS Queue Depths",
          "metrics": [
            ["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","inventory-queue"],
            ["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","notifications-queue"],
            ["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","analytics-queue"],
            ["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","fraud-queue"]
          ],
          "region": "us-east-1",
          "period": 60,
          "stat": "Maximum",
          "view": "timeSeries"
        }
      },
      {
        "type": "metric",
        "properties": {
          "title": "DLQ Depths (should be 0)",
          "metrics": [
            ["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","inventory-queue-dlq"],
            ["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","notifications-queue-dlq"],
            ["AWS/SQS","ApproximateNumberOfMessagesVisible","QueueName","fraud-queue-dlq"]
          ],
          "region": "us-east-1",
          "period": 60,
          "stat": "Sum",
          "view": "timeSeries"
        }
      }
    ]
  }'
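Hand-editing the dashboard JSON gets error-prone as queues are added, so the body can be generated from a list of queue names instead. This is a sketch assuming the DLQ naming convention used throughout this guide (queue name plus -dlq); the function name build_queue_depth_dashboard and the default region are assumptions.

```python
import json


def build_queue_depth_dashboard(queues, region="us-east-1"):
    """Return a CloudWatch dashboard body (JSON string) for queue and DLQ depth."""

    def widget(title, names, stat):
        return {
            "type": "metric",
            "properties": {
                "title": title,
                "metrics": [
                    ["AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", name]
                    for name in names
                ],
                "region": region,
                "period": 60,
                "stat": stat,
                "view": "timeSeries",
            },
        }

    body = {
        "widgets": [
            widget("SQS Queue Depths", queues, "Maximum"),
            # DLQ names are derived from the main queue names (assumed convention).
            widget("DLQ Depths (should be 0)", [f"{q}-dlq" for q in queues], "Sum"),
        ]
    }
    return json.dumps(body)


dashboard_body = build_queue_depth_dashboard(
    ["inventory-queue", "notifications-queue", "analytics-queue", "fraud-queue"]
)
```

The returned string can be supplied as the DashboardBody argument of a put_dashboard call, so a new queue only requires adding its name to the list.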

X-Ray End-to-End Tracing

AWS X-Ray traces a message from the moment it is published to SNS, through SQS, into the Lambda consumer, and into any downstream services the consumer calls. Enable active tracing on Lambda and use the X-Ray SDK to propagate the trace context.

# Enable X-Ray active tracing on Lambda
aws lambda update-function-configuration \
  --function-name inventory-processor \
  --tracing-config Mode=Active

# Enable X-Ray on the SNS topic (for publisher tracing)
aws sns set-topic-attributes \
  --topic-arn arn:aws:sns:us-east-1:123456789012:orders-events \
  --attribute-name TracingConfig \
  --attribute-value Active

# Python Lambda with X-Ray SDK
from aws_xray_sdk.core import xray_recorder, patch_all
import boto3, json

patch_all()  # auto-instrument boto3 calls

@xray_recorder.capture('process_inventory_update')
def process_inventory_update(order: dict):
    xray_recorder.current_subsegment().put_annotation('orderId', order['orderId'])
    xray_recorder.current_subsegment().put_annotation('orderType', order['orderType'])

    # All downstream boto3 calls (DynamoDB, etc.) appear in the X-Ray trace
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('Inventory')
    # ... update stock ...
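When correlating logs with traces, it helps to log the trace ID that arrived with each SQS record. In a Lambda SQS event, the trace context, when present, is carried in the record's AWSTraceHeader system attribute as semicolon-separated key=value pairs. The helper names below are hypothetical, and the sample header is made up for illustration.

```python
def parse_trace_header(header: str) -> dict:
    """Split a trace header such as 'Root=...;Parent=...;Sampled=1' into a dict."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip fragments that are not key=value pairs
            fields[key] = value
    return fields


def trace_id_for_record(record: dict):
    """Return the X-Ray root trace ID for one SQS event record, or None."""
    header = record.get("attributes", {}).get("AWSTraceHeader")
    if not header:
        return None
    return parse_trace_header(header).get("Root")


sample_record = {
    "messageId": "example-message-id",  # placeholder value
    "attributes": {
        "AWSTraceHeader": "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
    },
}
trace_id = trace_id_for_record(sample_record)
```

Logging this value next to the orderId makes it possible to jump from a log line to the matching trace in the X-Ray console.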

SNS Metrics to Monitor

# SNS publishes these metrics automatically:
# NumberOfMessagesPublished        - messages published to the topic by producers
# NumberOfNotificationsDelivered   - successfully delivered to subscribers
# NumberOfNotificationsFailed      - failed deliveries (alert on this!)
# NumberOfNotificationsFilteredOut - filtered by subscription filter policies

aws cloudwatch get-metric-statistics \
  --namespace AWS/SNS \
  --metric-name NumberOfNotificationsFailed \
  --dimensions Name=TopicName,Value=orders-events \
  --start-time 2026-06-08T00:00:00Z \
  --end-time   2026-06-08T23:59:59Z \
  --period 3600 \
  --statistics Sum
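Raw failure counts are easier to alert on when expressed as a rate against delivered notifications. The sketch below works on responses shaped like the output of get_metric_statistics (a Datapoints list whose entries carry a Sum value); the function names are hypothetical and the numbers in the usage lines are invented sample data.

```python
def sum_datapoints(response: dict) -> float:
    """Add up the Sum statistic across all datapoints in one metric response."""
    return sum(dp["Sum"] for dp in response.get("Datapoints", []))


def delivery_failure_rate(delivered_response: dict, failed_response: dict) -> float:
    """Fraction of delivery attempts that failed, or 0.0 when there was no traffic."""
    delivered = sum_datapoints(delivered_response)
    failed = sum_datapoints(failed_response)
    total = delivered + failed
    return failed / total if total else 0.0


# Invented sample data in the shape returned by get_metric_statistics.
delivered = {"Datapoints": [{"Sum": 90.0}, {"Sum": 9.0}]}
failed = {"Datapoints": [{"Sum": 1.0}]}
rate = delivery_failure_rate(delivered, failed)
```

Here delivered would come from NumberOfNotificationsDelivered and failed from NumberOfNotificationsFailed, queried over the same time window.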

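End-to-End Latency SLO: For a fan-out pattern, define separate SLOs per queue based on business criticality. Fraud detection should process within 10 seconds (P99), notifications within 30 seconds, and analytics within 5 minutes. Use ApproximateAgeOfOldestMessage alarms per queue to detect SLO breaches automatically. Because SQS publishes this metric at one-minute granularity, sub-minute targets such as the fraud and notification SLOs also need latency measured inside the consumer.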

Putting It All Together — Production Checklist

Every SQS queue has a DLQ with maxReceiveCount=3
Every DLQ has a CloudWatch alarm (threshold: >= 1 message)
Raw message delivery enabled on all SQS subscriptions
Long polling (ReceiveMessageWaitTimeSeconds=20) on all queues
Lambda consumers use ReportBatchItemFailures
Lambda concurrency reserved per function (prevent noisy-neighbor)
SNS subscription filter policies applied where consumers don't need all events
X-Ray active tracing on all Lambda consumers and the SNS topic
CloudWatch dashboard covering queue depth, DLQ depth, Lambda error rate
Producer uses message attributes for all filterable fields
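Parts of this checklist can be verified automatically. The sketch below audits the first and fourth items (DLQ with maxReceiveCount=3, long polling at 20 seconds) against an attributes dictionary shaped like the Attributes field returned by get_queue_attributes, where values are strings and RedrivePolicy is a JSON string. The function name audit_queue_attributes and the sample data are assumptions.

```python
import json


def audit_queue_attributes(attrs: dict) -> list:
    """Return a list of checklist violations for one queue (empty list = pass)."""
    problems = []
    redrive = json.loads(attrs.get("RedrivePolicy", "{}"))
    if not redrive.get("deadLetterTargetArn"):
        problems.append("no DLQ configured")
    elif int(redrive.get("maxReceiveCount", 0)) != 3:
        problems.append("maxReceiveCount is not 3")
    if int(attrs.get("ReceiveMessageWaitTimeSeconds", "0")) != 20:
        problems.append("long polling is not set to 20 seconds")
    return problems


# Invented sample attributes for a compliant queue.
good_queue = {
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:inventory-queue-dlq",
        "maxReceiveCount": 3,
    }),
    "ReceiveMessageWaitTimeSeconds": "20",
}
issues = audit_queue_attributes(good_queue)
```

Running such a check in CI or on a schedule catches queues that were created outside the Terraform module and missed the defaults.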