AWS Step Functions: Orchestrating Serverless Workflows (2026)
AWS Step Functions is a serverless orchestration service that lets you coordinate distributed applications using visual state machines. Instead of embedding workflow logic inside Lambda functions or dealing with brittle hand-rolled retry mechanisms, Step Functions gives you a durable, auditable execution engine where each state transition is tracked, retried, and logged automatically. As of 2026, Step Functions supports direct SDK integrations with over 220 AWS services — meaning you can call DynamoDB, ECS, SageMaker, and SNS directly without a Lambda wrapper.
State Machine Concepts
A Step Functions state machine is defined in Amazon States Language (ASL), a JSON-based specification. Every execution follows a directed graph: it starts at the StartAt state and terminates when it reaches a state with "End": true, a Succeed state, or a Fail state.
Key concepts you need to internalize before building anything:
- States: Individual units of work or control flow. Types include Task, Choice, Wait, Parallel, Map, Pass, Succeed, Fail.
- Input/Output processing: Each state receives a JSON input. You use InputPath, Parameters, ResultSelector, ResultPath, and OutputPath to shape what flows through.
- Transitions: Each non-terminal state specifies a Next state or sets "End": true.
- Execution context: The $$ prefix gives you access to execution metadata (execution name, start time, state name) inside Parameters blocks.
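To make the ResultPath step concrete, here is a minimal Python sketch of how a task result is merged back into the state input. It is a simplified model, not the service's implementation: the helper name apply_result_path is hypothetical, and only simple dotted paths such as $.paymentResult are handled (no array indexes or wildcards).

```python
import copy


def apply_result_path(state_input, result, result_path):
    """Simplified model of ResultPath for dotted paths like "$.a.b"."""
    if result_path is None:
        # "ResultPath": null discards the result and passes the input through.
        return copy.deepcopy(state_input)
    if result_path == "$":
        # The default: the result replaces the whole input.
        return copy.deepcopy(result)
    if not result_path.startswith("$."):
        raise ValueError("this sketch only supports paths of the form $.a.b")

    output = copy.deepcopy(state_input)
    keys = result_path[2:].split(".")
    node = output
    for key in keys[:-1]:
        # Create intermediate objects as needed.
        node = node.setdefault(key, {})
    node[keys[-1]] = copy.deepcopy(result)
    return output


order = {"orderId": "ORD-001", "totalAmount": 150}
payment = {"status": "CHARGED"}

print(apply_result_path(order, payment, "$.paymentResult"))  # input plus result
print(apply_result_path(order, payment, "$"))                # result only
print(apply_result_path(order, payment, None))               # input only
```

The three calls correspond to the three choices you usually weigh: keep the input and attach the result, replace the input, or drop the result entirely.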
Standard vs Express Workflows
Choosing the wrong workflow type is one of the most common Step Functions mistakes. The distinction affects billing, duration, execution semantics, and observability.
| Feature | Standard Workflow | Express Workflow |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution semantics | Exactly-once | At-least-once |
| Pricing model | Per state transition | Per execution duration + requests |
| Execution history | Full history in console | CloudWatch Logs only |
| Execution rate | 2,000/sec (default) | 100,000/sec (default) |
| Best for | Order processing, human approval, long-running ETL | IoT ingestion, high-volume event processing, streaming |
| Idempotency | Built-in | You must handle duplicates |
State Types with JSON Examples
Step Functions provides eight state types. Here are the ones you'll use most, with real ASL examples.
Task State — invokes a Lambda function or AWS SDK action:
{
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessPayment",
"TimeoutSeconds": 30,
"HeartbeatSeconds": 10,
"ResultPath": "$.paymentResult",
"Next": "CheckInventory"
}
}
Choice State — branches based on input values:
{
"RouteOrder": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.orderType",
"StringEquals": "DIGITAL",
"Next": "FulfillDigital"
},
{
"Variable": "$.totalAmount",
"NumericGreaterThan": 500,
"Next": "FraudCheck"
}
],
"Default": "FulfillPhysical"
}
}
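Choice rules are evaluated in the order listed and the first match wins, which matters when an input satisfies more than one rule. The Python sketch below models only the two comparison operators used in RouteOrder. The function name route_order is hypothetical, and the sketch assumes both fields are present in the input.

```python
def route_order(state_input):
    """Return the next state name for the RouteOrder Choice state above."""
    choices = [
        # (field, operator, expected value, next state), in ASL order
        ("orderType", "StringEquals", "DIGITAL", "FulfillDigital"),
        ("totalAmount", "NumericGreaterThan", 500, "FraudCheck"),
    ]
    for field, operator, expected, next_state in choices:
        actual = state_input[field]
        if operator == "StringEquals" and actual == expected:
            return next_state
        if operator == "NumericGreaterThan" and actual > expected:
            return next_state
    # No rule matched, so fall through to the Default target.
    return "FulfillPhysical"


# A digital order over 500 matches the first rule and never reaches FraudCheck.
print(route_order({"orderType": "DIGITAL", "totalAmount": 900}))
print(route_order({"orderType": "PHYSICAL", "totalAmount": 900}))
print(route_order({"orderType": "PHYSICAL", "totalAmount": 150}))
```

If expensive digital orders should also be fraud-checked, the amount rule would need to come first or the rules would need to be combined.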
Parallel State — executes branches concurrently and waits for all to complete:
{
"FulfillOrder": {
"Type": "Parallel",
"Branches": [
{
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:ReserveInventory",
"End": true
}
}
},
{
"StartAt": "ChargeCustomer",
"States": {
"ChargeCustomer": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCustomer",
"End": true
}
}
}
],
"ResultPath": "$.fulfillmentResults",
"Next": "SendConfirmation"
}
}
Map State — iterates over an array and applies the same workflow to each element:
{
"ProcessLineItems": {
"Type": "Map",
"ItemsPath": "$.lineItems",
"MaxConcurrency": 10,
"Iterator": {
"StartAt": "ProcessItem",
"States": {
"ProcessItem": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessLineItem",
"End": true
}
}
},
"ResultPath": "$.processedItems",
"Next": "Summarize"
}
}
Wait State — pauses execution for a fixed duration or until a timestamp:
{
"WaitForShipment": {
"Type": "Wait",
"SecondsPath": "$.estimatedDeliverySeconds",
"Next": "ConfirmDelivery"
}
}
Error Handling: Retry and Catch
Step Functions has built-in retry and catch logic at the state level — no try/catch blocks in your Lambda code needed for transient errors.
{
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessPayment",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0,
"JitterStrategy": "FULL"
},
{
"ErrorEquals": ["PaymentDeclined"],
"MaxAttempts": 0
}
],
"Catch": [
{
"ErrorEquals": ["PaymentDeclined"],
"ResultPath": "$.errorInfo",
"Next": "NotifyCustomerOfDecline"
},
{
"ErrorEquals": ["States.ALL"],
"ResultPath": "$.errorInfo",
"Next": "HandleGenericFailure"
}
],
"Next": "ReserveInventory"
}
}
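As a rough illustration of what the first retrier above means in practice, the following Python sketch computes the waits implied by IntervalSeconds, MaxAttempts, and BackoffRate. The helper retry_delays is hypothetical, and the jitter branch assumes that FULL jitter picks a uniformly random wait between 0 and the computed delay; check the Step Functions documentation for the exact behavior.

```python
import random


def retry_delays(interval_seconds, max_attempts, backoff_rate, jitter=False, rng=None):
    """Return the wait in seconds before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        # The wait grows by backoff_rate after every failed attempt.
        delay = interval_seconds * (backoff_rate ** attempt)
        if jitter:
            # Assumed FULL-jitter model: uniform between 0 and the computed delay.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


# Values from the ProcessPayment retrier: 2 seconds, 3 attempts, rate 2.0.
print(retry_delays(2, 3, 2.0))
print(retry_delays(2, 3, 2.0, jitter=True))
```

Without jitter, every execution that failed at the same moment retries at the same moment; with jitter, the retries spread out across each window.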
JitterStrategy: "FULL" was added in 2023. It randomizes retry intervals within the calculated backoff window, preventing thundering herd problems when many executions fail simultaneously.
Direct SDK Integrations
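SDK integrations let you call AWS APIs directly from Step Functions without a Lambda intermediary. The Resource ARN uses the format arn:aws:states:::aws-sdk:serviceName:apiAction. A smaller set of optimized integrations, such as the DynamoDB and SNS tasks in the example below, use the shorter form arn:aws:states:::serviceName:apiAction.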
Three integration patterns are available:
- Request-response (default): Calls the API and moves to the next state as soon as the API returns its response, without waiting for any job that call started. Good for fire-and-forget operations.
- Sync (.sync): Waits for the job/operation to complete. Used with ECS, Glue, SageMaker, CodeBuild. Nested Step Functions executions can also use the .sync:2 variant.
- Wait for callback (.waitForTaskToken): Sends a task token to a worker; waits until SendTaskSuccess or SendTaskFailure is called. Used for human approvals or external systems.
{
"PutOrderToDynamoDB": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:putItem",
"Parameters": {
"TableName": "Orders",
"Item": {
"orderId": { "S.$": "$.orderId" },
"status": { "S": "CONFIRMED" },
"timestamp": { "S.$": "$$.Execution.StartTime" }
}
},
"ResultPath": null,
"Next": "SendSNSNotification"
},
"SendSNSNotification": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "arn:aws:sns:us-east-1:123456789012:OrderNotifications",
"Message.$": "States.Format('Order {} confirmed', $.orderId)"
},
"End": true
}
}
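The three integration patterns listed earlier differ only in the suffix on the Resource ARN. The Python sketch below assembles such ARNs; integration_resource is a hypothetical helper, and it does not check whether a given service actually supports the chosen pattern.

```python
def integration_resource(service, action, pattern="request_response", sdk=False):
    """Build a Step Functions Resource ARN for a service integration."""
    suffixes = {
        "request_response": "",          # default: continue once the API responds
        "sync": ".sync",                 # wait for the job to finish
        "callback": ".waitForTaskToken", # wait for SendTaskSuccess/SendTaskFailure
    }
    if pattern not in suffixes:
        raise ValueError(f"unknown pattern: {pattern}")
    # AWS SDK integrations carry an extra "aws-sdk:" segment.
    prefix = "arn:aws:states:::aws-sdk:" if sdk else "arn:aws:states:::"
    return f"{prefix}{service}:{action}{suffixes[pattern]}"


print(integration_resource("dynamodb", "putItem"))
print(integration_resource("ecs", "runTask", "sync"))
print(integration_resource("sqs", "sendMessage", "callback"))
print(integration_resource("s3", "getObject", sdk=True))
```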
Use the States.Format, States.StringToJson, States.JsonToString, and States.Array intrinsic functions to manipulate data inside ASL without Lambda. This avoids Lambda cold starts and invocation cost.
Step Functions Local for Testing
Step Functions Local lets you run state machine executions on your laptop against mock Lambda responses, eliminating the deploy-test loop for workflow logic.
# Start Step Functions Local via Docker
docker run -p 8083:8083 \
-e AWS_DEFAULT_REGION=us-east-1 \
amazon/aws-stepfunctions-local
# Create a state machine pointing at local mock Lambda
aws stepfunctions create-state-machine \
--endpoint-url http://localhost:8083 \
--name "OrderProcessingTest" \
--definition file://order-workflow.json \
--role-arn "arn:aws:iam::123456789012:role/StepFunctionsRole"
# Start an execution
aws stepfunctions start-execution \
--endpoint-url http://localhost:8083 \
--state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessingTest" \
--name "my-exec" \
--input '{"orderId":"ORD-001","orderType":"PHYSICAL","totalAmount":150}'
# Check execution status
aws stepfunctions describe-execution \
--endpoint-url http://localhost:8083 \
--execution-arn "arn:aws:states:us-east-1:123456789012:execution:OrderProcessingTest:my-exec"
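Create a MockConfigFile.json to define mock responses for each Lambda function, mount it into the container, and reference it with -e SFN_MOCK_CONFIG=/home/user/MockConfigFile.json in the docker run command. The path is the file's location inside the container, not on your laptop.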
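For orientation, here is a Python sketch that writes a small MockConfigFile.json. The layout (StateMachines, TestCases, MockedResponses, numbered Return/Throw entries) reflects my understanding of the Step Functions Local mock format and should be checked against the current documentation. The state name ProcessPayment and the test case and response names are assumptions about what order-workflow.json contains.

```python
import json

mock_config = {
    "StateMachines": {
        # Must match the name used in create-state-machine.
        "OrderProcessingTest": {
            "TestCases": {
                # Each test case maps a state name to a mocked response name.
                "HappyPath": {"ProcessPayment": "PaymentSucceeded"},
                "Declined": {"ProcessPayment": "PaymentDeclined"},
            }
        }
    },
    "MockedResponses": {
        # "0" is the first invocation of the state in an execution.
        "PaymentSucceeded": {
            "0": {"Return": {"status": "CHARGED"}}
        },
        "PaymentDeclined": {
            "0": {"Throw": {"Error": "PaymentDeclined", "Cause": "Card declined"}}
        },
    },
}

with open("MockConfigFile.json", "w", encoding="utf-8") as handle:
    json.dump(mock_config, handle, indent=2)

print(json.dumps(mock_config["StateMachines"], indent=2))
```

As far as I know, a test case is selected by appending its name to the state machine ARN in start-execution, for example ...stateMachine:OrderProcessingTest#HappyPath.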
X-Ray Tracing
Enable X-Ray tracing on a state machine to get end-to-end trace maps across all Lambda invocations, SDK calls, and nested state machines.
# Enable tracing when creating a state machine
aws stepfunctions create-state-machine \
--name "OrderProcessing" \
--definition file://order-workflow.json \
--role-arn "arn:aws:iam::123456789012:role/StepFunctionsRole" \
--tracing-configuration enabled=true
# Update tracing on existing state machine
aws stepfunctions update-state-machine \
--state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessing" \
--tracing-configuration enabled=true
Your IAM role for Step Functions needs xray:PutTraceSegments, xray:PutTelemetryRecords, xray:GetSamplingRules, and xray:GetSamplingTargets permissions. Once enabled, the X-Ray service map will show each state as a node with latency percentiles.
Real Example: Order Processing Workflow
Here's a complete order processing state machine with parallel fulfillment steps, error handling, and SDK integrations.
{
"Comment": "E-commerce order processing with parallel fulfillment",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "ValidateOrder",
"Payload.$": "$"
},
"ResultSelector": { "validated.$": "$.Payload.valid", "orderId.$": "$.Payload.orderId" },
"ResultPath": "$.validation",
"Retry": [{ "ErrorEquals": ["Lambda.ServiceException"], "MaxAttempts": 2, "IntervalSeconds": 1 }],
"Next": "CheckValidation"
},
"CheckValidation": {
"Type": "Choice",
"Choices": [
{ "Variable": "$.validation.validated", "BooleanEquals": true, "Next": "ParallelFulfillment" }
],
"Default": "RejectOrder"
},
"ParallelFulfillment": {
"Type": "Parallel",
"Branches": [
{
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:updateItem",
"Parameters": {
"TableName": "Inventory",
"Key": { "productId": { "S.$": "$.productId" } },
"UpdateExpression": "SET reserved = reserved + :qty",
"ExpressionAttributeValues": { ":qty": { "N.$": "States.Format('{}', $.quantity)" } }
},
"End": true
}
}
},
{
"StartAt": "ChargePayment",
"States": {
"ChargePayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargePayment",
"Retry": [{ "ErrorEquals": ["Lambda.ServiceException"], "MaxAttempts": 1 }],
"Catch": [{ "ErrorEquals": ["PaymentDeclined"], "Next": "PaymentFailed" }],
"End": true
},
"PaymentFailed": { "Type": "Fail", "Error": "PaymentDeclined", "Cause": "Card declined" }
}
}
],
"Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RollbackOrder" }],
"ResultPath": "$.fulfillment",
"Next": "SendConfirmationEmail"
},
"SendConfirmationEmail": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "arn:aws:sns:us-east-1:123456789012:OrderConfirmations",
"Message.$": "States.Format('Your order {} has been confirmed!', $.orderId)"
},
"End": true
},
"RejectOrder": { "Type": "Fail", "Error": "ValidationFailed", "Cause": "Order failed validation" },
"RollbackOrder": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:RollbackOrder", "End": true }
}
}
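Every Next, Default, and Catch target in a definition like the one above must name a state in the same scope, so a small lint pass can catch typos before deployment. The Python sketch below is a hypothetical helper (check_transitions is not part of any AWS tool) and checks only that transition targets exist; it does not validate the rest of ASL.

```python
def check_transitions(definition, path="root"):
    """Return a list of problems: bad StartAt or dangling transition targets."""
    problems = []
    states = definition.get("States", {})
    if definition.get("StartAt") not in states:
        problems.append(f"{path}: StartAt {definition.get('StartAt')!r} is not a state")

    for name, state in states.items():
        targets = []
        if "Next" in state:
            targets.append(state["Next"])
        if "Default" in state:
            targets.append(state["Default"])
        for rule in state.get("Choices", []):
            targets.append(rule["Next"])
        for catcher in state.get("Catch", []):
            targets.append(catcher["Next"])
        for target in targets:
            if target not in states:
                problems.append(f"{path}.{name}: target {target!r} is not a state")

        # Parallel branches and Map iterators are nested state machines
        # with their own scope for state names.
        for index, branch in enumerate(state.get("Branches", [])):
            problems += check_transitions(branch, f"{path}.{name}[{index}]")
        for key in ("Iterator", "ItemProcessor"):
            if key in state:
                problems += check_transitions(state[key], f"{path}.{name}.{key}")
    return problems


# A cut-down definition with a deliberate typo in the Choice target.
definition = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {"Type": "Pass", "Next": "CheckValidation"},
        "CheckValidation": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.ok", "BooleanEquals": True, "Next": "ParallelFulfilment"}
            ],
            "Default": "RejectOrder",
        },
        "ParallelFulfillment": {"Type": "Succeed"},
        "RejectOrder": {"Type": "Fail", "Error": "ValidationFailed"},
    },
}

for problem in check_transitions(definition):
    print(problem)
```

Running a check like this in CI is cheaper than discovering a misspelled state name when create-state-machine rejects the definition.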
FAQ
Q: When should I use Step Functions vs a simple Lambda chaining approach?
Use Step Functions when you need durability (executions survive Lambda crashes), visibility (audit trail in the console), or when workflow logic would otherwise live inside application code. For a single two-step async operation, direct Lambda invocation is simpler. For anything with branching, retries, parallel steps, or human approval gates, Step Functions pays for itself quickly.
Q: What does Step Functions cost?
Standard Workflows cost $0.025 per 1,000 state transitions. Express Workflows cost $1.00 per million requests plus $0.00001667 per GB-second of duration. A typical order processing workflow with 10 state transitions costs $0.00025 per order (10 × $0.000025), which is effectively free at moderate scale. Watch out for Map states iterating over large arrays: every state executed for every item counts as a transition.
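The arithmetic behind those figures can be sketched in a few lines of Python. The prices are the ones quoted above; the function names are hypothetical, and the sketch ignores the free tier and any tiered or regional pricing.

```python
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1000   # USD per state transition
EXPRESS_PRICE_PER_REQUEST = 1.00 / 1_000_000   # USD per request
EXPRESS_PRICE_PER_GB_SECOND = 0.00001667       # USD per GB-second


def standard_cost(transitions):
    """Cost of a Standard Workflow execution with this many transitions."""
    return transitions * STANDARD_PRICE_PER_TRANSITION


def express_cost(requests, gb_seconds):
    """Cost of Express Workflow usage: request charge plus duration charge."""
    return requests * EXPRESS_PRICE_PER_REQUEST + gb_seconds * EXPRESS_PRICE_PER_GB_SECOND


# One order through a workflow with 10 state transitions.
print(f"{standard_cost(10):.5f}")

# A Map state over 1,000 items with 3 states per item,
# ignoring the Map state's own transitions.
print(f"{standard_cost(1000 * 3):.3f}")

# One million Express executions, each using 0.0625 GB-seconds.
print(f"{express_cost(1_000_000, 1_000_000 * 0.0625):.2f}")
```

The Map case shows why large arrays dominate a Standard Workflow bill: the per-item states cost 300 times as much as the 10-transition order around them.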
Q: Can I pass large payloads between states?
The state input/output limit is 256 KB. For larger payloads, write to S3 and pass the S3 key. SDK integration tasks can read and write S3 objects directly (for example with the S3 GetObject and PutObject actions), so no Lambda function is needed just to move the data. A common pattern is to store order details in DynamoDB at the start and pass only the orderId key through subsequent states.
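The pass-a-pointer pattern can be sketched as follows. This Python example uses an in-memory dictionary as a stand-in for an S3 bucket so it runs anywhere; to_state_payload, fake_bucket, and the payloadKey field are all hypothetical names, and a real producer would call S3 instead of writing to the dictionary.

```python
import json

MAX_STATE_BYTES = 256 * 1024  # the state payload limit quoted above


def to_state_payload(order, store, key):
    """Return the order itself if it fits, else store it and return a pointer."""
    body = json.dumps(order)
    if len(body.encode("utf-8")) <= MAX_STATE_BYTES:
        return order
    store[key] = body  # stand-in for an S3 PutObject call
    return {"payloadKey": key}


fake_bucket = {}  # hypothetical in-memory substitute for an S3 bucket
small = {"orderId": "ORD-001", "items": ["sku-1"]}
large = {"orderId": "ORD-002", "notes": "x" * (300 * 1024)}

print(to_state_payload(small, fake_bucket, "orders/ORD-001.json"))
print(to_state_payload(large, fake_bucket, "orders/ORD-002.json"))
```

Downstream states then receive either the full order or a small object carrying the key, and only the states that need the full document fetch it.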
Q: How do I implement a human approval step?
Use a Task state with "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken". Embed $$.Task.Token in the message. Your approval Lambda or application calls SendTaskSuccess or SendTaskFailure with the token when the human acts. The execution waits up to 1 year (Standard) for the callback.
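The callback side of that flow might look like the Python sketch below. It assumes the workflow put the token in a taskToken field of the SQS message body, which is a naming choice and not something Step Functions dictates. The send_task_success and send_task_failure calls follow the boto3 Step Functions client; RecordingClient is a hypothetical stand-in so the example runs without AWS credentials.

```python
import json


def complete_approval(sfn_client, message_body, approved, reviewer):
    """Resume the waiting execution using the token carried in the message."""
    message = json.loads(message_body)
    token = message["taskToken"]  # assumed field name chosen by the workflow author
    if approved:
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=token,
            error="ApprovalRejected",
            cause=f"Rejected by {reviewer}",
        )


class RecordingClient:
    """Test double standing in for boto3.client("stepfunctions")."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("failure", kwargs))


client = RecordingClient()
body = json.dumps({"orderId": "ORD-001", "taskToken": "example-token"})
complete_approval(client, body, approved=True, reviewer="alice")
print(client.calls)
```

In a real deployment you would pass boto3.client("stepfunctions") instead of the recording client. A rejection surfaces in the state machine as an ApprovalRejected error, which a Catch block on the waiting Task state can route to a rejection path.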
Q: How do I debug a failed execution?
In the Step Functions console, open the failed execution and click on the failed state. The Event History shows exact input, output, and error details for every state transition. For Express Workflows, you need CloudWatch Logs with ALL log level enabled — the console doesn't store Express execution history.