AWS Batch: Run Large-Scale Batch Computing Jobs on AWS (2026)
AWS Batch is a fully managed service that enables you to run hundreds of thousands of batch computing jobs on AWS without managing the underlying compute infrastructure. Whether you're processing genomics pipelines, running ML training jobs, crunching financial simulations, or executing ETL workloads, AWS Batch provisions exactly the right amount of compute resources — EC2, Spot, or Fargate — and scales them down to zero when work is done. This guide covers every layer of AWS Batch from architecture concepts to production-grade GPU workloads, multi-node MPI jobs, and cost optimization strategies.
Table of Contents
- AWS Batch vs Lambda vs ECS vs Glue
- Core Concepts: Compute Environments, Job Queues, Job Definitions
- Compute Environment Setup: EC2 Spot and Fargate
- Job Definitions: Container Properties, Retry, Timeout
- Submitting Jobs: Single, Array, and Dependent Jobs
- GPU Workloads: P4d, G4 Instances, CUDA Training
- Multi-Node Parallel Jobs and MPI Workloads
- Step Functions Integration: Map State + Batch
- Cost Optimization: Spot, Fargate Spot, Queue Priority
- Monitoring: CloudWatch Metrics and EventBridge Alerts
AWS Batch vs Lambda vs ECS vs Glue: When to Use What
Choosing the right AWS compute service for batch workloads is one of the most common architectural decisions teams face. Each service has a clear sweet spot, and picking the wrong one leads to either over-engineering or hitting hard limits at the worst possible time.
| Factor | AWS Batch | Lambda | ECS (Fargate) | Glue |
|---|---|---|---|---|
| Max job duration | No limit | 15 minutes | No limit | 48 hours |
| GPU support | Yes (P4d, G4, G5) | No | No | No |
| MPI / multi-node | Yes | No | No | No |
| Spot instance support | Native, automatic retry | No | Manual Spot config | Limited (Flex) |
| HPC / tightly coupled | Yes (EFA, placement groups) | No | No | No |
| Managed runtime | Docker container (any) | AWS-managed | Docker container | Spark / Python built-in |
| Queue + priority | Yes (multiple queues) | No | No (service-based) | No |
| Array jobs (1000x parallel) | Native | Fan-out via SQS/SNS | Manual | No |
| Cost model | Pay per EC2/Fargate used | Per 1ms invocation | Per vCPU/memory second | Per DPU-hour |
| Best for | Long-running, HPC, GPU, parallel | Short event-driven tasks | Web APIs, always-on services | Spark ETL on structured data |
AWS Glue is excellent when your data is tabular and your team prefers Spark SQL over custom Docker images. Lambda is ideal for short sub-second to 15-minute event handlers. ECS is for persistent services, not ephemeral batch. AWS Batch sits between ECS and HPC cluster managers — it manages scheduling, retries, Spot interruptions, and infrastructure so you focus on the job logic inside your container.
Core Concepts: Compute Environments, Job Queues, Job Definitions
AWS Batch has four primitive objects that compose every batch workload. Understanding how they relate is critical before writing a single line of infrastructure code.
Compute Environments define the pool of EC2 or Fargate capacity that Batch manages. A managed compute environment lets AWS provision, scale, and terminate instances automatically. An unmanaged compute environment gives you full control — you provision your own EC2 instances, register them with ECS (Batch uses ECS under the hood), and Batch only handles scheduling.
Job Queues receive submitted jobs and forward them to a compute environment. A queue has a priority and one or more compute environments ordered by preference. Jobs wait in RUNNABLE state until resources are available (PENDING means a job is still waiting on a dependency). You can have a high-priority queue (e.g., urgent ML inference) backed by On-Demand compute and a low-priority queue backed by Spot instances, and the same compute environment can serve multiple queues.
Job Definitions are versioned templates for a job. They specify the container image, vCPU/memory requirements, IAM role, environment variables, volumes, retry strategy, and timeout. Think of them as reusable blueprints — you override specific fields (e.g., environment variables, command) at submission time.
Jobs are instances of a job definition submitted to a queue. A job progresses through states: SUBMITTED → PENDING → RUNNABLE → STARTING → RUNNING → SUCCEEDED or FAILED. Array jobs spawn N child jobs all running the same definition; each child receives its index via the AWS_BATCH_JOB_ARRAY_INDEX environment variable.
Compute Environment Setup: EC2 Spot and Fargate
Compute environments are where the infrastructure decisions happen. The choice between managed EC2 (On-Demand or Spot), Fargate, and Fargate Spot has significant cost and performance implications.
A managed EC2 Spot compute environment automatically requests Spot capacity and can save up to 90% compared to On-Demand pricing. When an instance is reclaimed, Batch retries the interrupted job as long as its retry strategy has attempts left: the job moves back to RUNNABLE and restarts on a new instance. Batch does not checkpoint for you, so a retried job starts from the beginning unless your code saves and restores its own progress. This native retry flow is the main advantage over managing Spot manually in ECS.
# Create managed EC2 Spot compute environment via CLI
aws batch create-compute-environment \
--compute-environment-name spot-ml-env \
--type MANAGED \
--state ENABLED \
--compute-resources '{
"type": "SPOT",
"allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
"minvCpus": 0,
"maxvCpus": 2048,
"desiredvCpus": 0,
"instanceTypes": ["optimal"],
"subnets": ["subnet-abc123", "subnet-def456"],
"securityGroupIds": ["sg-xyz789"],
"instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
"bidPercentage": 60,
"spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
"tags": {"Project": "ml-pipeline", "Env": "prod"}
}' \
--service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole
For jobs that don't need persistent local storage and fit within Fargate's limits (up to 16 vCPUs and 120 GB of memory), Fargate is often simpler: no EC2 instances, no AMI management, no ECS container instance overhead.
# Terraform: Fargate compute environment + job queue
resource "aws_batch_compute_environment" "fargate" {
compute_environment_name = "fargate-batch-env"
type = "MANAGED"
compute_resources {
type = "FARGATE_SPOT"
max_vcpus = 256
subnets = var.private_subnet_ids
security_group_ids = [aws_security_group.batch.id]
}
service_role = aws_iam_role.batch_service.arn
depends_on = [aws_iam_role_policy_attachment.batch_service]
}
resource "aws_batch_job_queue" "default" {
name = "default-queue"
state = "ENABLED"
priority = 10
compute_environment_order {
order = 1
compute_environment = aws_batch_compute_environment.fargate.arn
}
}
resource "aws_batch_job_queue" "high_priority" {
name = "high-priority-queue"
state = "ENABLED"
priority = 100
compute_environment_order {
order = 1
compute_environment = aws_batch_compute_environment.fargate.arn
}
}
Use SPOT_CAPACITY_OPTIMIZED for batch workloads where interruption rate matters more than price. It picks the Spot pools with the most available capacity, reducing interruptions. Use BEST_FIT, which favors the lowest-priced instance types, only for fault-tolerant jobs that checkpoint aggressively.
Job Definitions: Container Properties, Retry, and Timeout
A job definition is the heart of AWS Batch configuration. It captures everything the scheduler needs to run your container: what image to pull, how many vCPUs and memory to allocate, which IAM role to assume, how many times to retry on failure, and when to give up entirely. Versioning is built-in — every update creates a new revision, and you can pin submissions to a specific revision or always use the latest.
{
"jobDefinitionName": "ml-training-job",
"type": "container",
"containerProperties": {
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-training:v1.2.0",
"vcpus": 4,
"memory": 16384,
"jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"command": ["python", "train.py", "--epochs", "50"],
"environment": [
{"name": "S3_INPUT_BUCKET", "value": "my-data-bucket"},
{"name": "S3_OUTPUT_BUCKET", "value": "my-results-bucket"},
{"name": "MODEL_TYPE", "value": "xgboost"}
],
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password"
}
],
"mountPoints": [
{"containerPath": "/tmp/scratch", "readOnly": false, "sourceVolume": "scratch"}
],
"volumes": [
{"name": "scratch", "host": {"sourcePath": "/tmp"}}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/aws/batch/ml-training",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "batch"
}
}
},
"retryStrategy": {
"attempts": 3,
"evaluateOnExit": [
{
"onStatusReason": "Host EC2*",
"action": "RETRY"
},
{
"onReason": "CannotPullContainerError*",
"action": "RETRY"
},
{
"onExitCode": "1",
"action": "EXIT"
}
]
},
"timeout": {
"attemptDurationSeconds": 7200
},
"tags": {
"Project": "ml-pipeline",
"Team": "data-science"
}
}
The evaluateOnExit block lets you retry on Spot interruptions (status reason starting with "Host EC2") while failing fast on application errors (exit code 1). Each rule's action is either RETRY or EXIT, and patterns may end with an asterisk to match on a prefix. This prevents wasting Spot budget retrying broken code.
Register the definition with:
aws batch register-job-definition \
--cli-input-json file://job-definition.json
Here is a Dockerfile pattern for a typical batch job that reads from S3, processes data, and writes results back:
# Dockerfile for AWS Batch job
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
awscli \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
# Entrypoint reads AWS_BATCH_JOB_ARRAY_INDEX for array jobs
ENTRYPOINT ["python", "src/process.py"]
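# src/process.py: array job pattern
import os
import boto3

def process_file(s3, input_bucket, key, output_bucket):
    # Placeholder: the real transformation is application-specific.
    # Here the object is copied unchanged to the output bucket.
    s3.copy_object(
        Bucket=output_bucket,
        Key=key,
        CopySource={'Bucket': input_bucket, 'Key': key},
    )

def main():
    # Batch sets both variables on every array child job
    array_index = int(os.environ.get('AWS_BATCH_JOB_ARRAY_INDEX', 0))
    array_size = int(os.environ.get('AWS_BATCH_JOB_ARRAY_SIZE', 1))
    input_bucket = os.environ['S3_INPUT_BUCKET']
    output_bucket = os.environ['S3_OUTPUT_BUCKET']
    s3 = boto3.client('s3')
    # Each array child processes a different partition
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=input_bucket, Prefix='data/')
    all_keys = [obj['Key'] for page in pages for obj in page.get('Contents', [])]
    # Divide work: child i takes keys i, i + size, i + 2*size, ...
    chunk = all_keys[array_index::array_size]
    for key in chunk:
        process_file(s3, input_bucket, key, output_bucket)

if __name__ == '__main__':
    main()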
Submitting Jobs: Single, Array, and Dependent Jobs
Job submission is where the rubber meets the road. AWS Batch supports three modes: single jobs (one container run), array jobs (N identical containers each with a unique index), and dependent jobs (DAG-style ordering). All three are available via the console, CLI, and boto3.
Single job submission:
aws batch submit-job \
--job-name ml-training-run-001 \
--job-queue high-priority-queue \
--job-definition ml-training-job:5 \
--container-overrides '{
"environment": [
{"name": "MODEL_TYPE", "value": "random-forest"},
{"name": "S3_INPUT_BUCKET", "value": "my-data-bucket"}
],
"command": ["python", "train.py", "--epochs", "100"]
}'
Array job — 1000 parallel tasks:
aws batch submit-job \
--job-name parallel-feature-extraction \
--job-queue default-queue \
--job-definition feature-extractor:3 \
--array-properties '{"size": 1000}'
Each of the 1000 child jobs receives AWS_BATCH_JOB_ARRAY_INDEX (0 to 999) and AWS_BATCH_JOB_ARRAY_SIZE (1000). Your code uses the index to determine which partition of data to process: for example, rows 0–999 for index 0, rows 1000–1999 for index 1, and so on.
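The index-to-partition mapping can be sketched as a small helper. This is a minimal illustration, assuming the work is a contiguous range of rows whose total count is known up front; `partition_bounds` and the `total_rows` value are hypothetical and not part of any AWS API.

```python
import os


def partition_bounds(index: int, size: int, total_rows: int) -> range:
    """Rows handled by array child `index` out of `size` children.

    Rows are split into contiguous blocks; when total_rows is not a
    multiple of size, the first few children take one extra row each.
    """
    if not 0 <= index < size:
        raise ValueError("index must be in [0, size)")
    base, extra = divmod(total_rows, size)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    # Defaults let the script run outside Batch as a single partition
    idx = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    size = int(os.environ.get("AWS_BATCH_JOB_ARRAY_SIZE", "1"))
    rows = partition_bounds(idx, size, total_rows=1_000_000)
    print(f"child {idx} of {size} handles rows {rows.start}..{rows.stop - 1}")
```

Contiguous blocks suit row-oriented inputs; a strided split (every size-th item) is an equally valid choice when the items are independent files.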
Dependent jobs with Python boto3:
import boto3
batch = boto3.client('batch', region_name='us-east-1')
# Step 1: ingest raw data
ingest = batch.submit_job(
jobName='ingest-raw-data',
jobQueue='default-queue',
jobDefinition='data-ingest:2',
containerOverrides={
'environment': [{'name': 'DATE', 'value': '2026-06-09'}]
}
)
ingest_id = ingest['jobId']
# Step 2: transform waits for step 1 to succeed. A dependency that lists
# only a jobId is a plain "run after" dependency.
transform = batch.submit_job(
    jobName='transform-data',
    jobQueue='default-queue',
    jobDefinition='data-transform:4',
    dependsOn=[{'jobId': ingest_id}],
    arrayProperties={'size': 50}  # fan out to 50 parallel transformers
)
transform_id = transform['jobId']
# Step 3: aggregate waits for ALL transform children to finish (N_TO_N would
# apply only if step 3 were itself an array job of the same size)
aggregate = batch.submit_job(
    jobName='aggregate-results',
    jobQueue='default-queue',
    jobDefinition='data-aggregate:1',
    dependsOn=[{'jobId': transform_id}]
)
print(f"Pipeline submitted: ingest={ingest_id}, transform={transform_id}, "
      f"aggregate={aggregate['jobId']}")
A dependency that lists only a jobId makes the dependent job start after the parent reaches SUCCEEDED; if the parent fails, the dependent job fails too. The SEQUENTIAL type applies to array jobs and makes the children of that array run one at a time in index order, so it would serialize the 50 transformers above. N_TO_N is also for array job children: child index N of the dependent job waits for child index N of the parent. Use N_TO_N when you have two array jobs of the same size and each child of the second job needs the specific output of the matching child in the first job.
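The difference between an index-matched (N_TO_N) dependency and waiting for a whole parent array can be illustrated with a toy model. This sketch only simulates the eligibility rule described above; it is not how the Batch scheduler is implemented, and `eligible_children` is a hypothetical name.

```python
def eligible_children(size: int, parent_succeeded: set, n_to_n: bool) -> list:
    """Indices of the dependent array job whose dependency is satisfied.

    n_to_n=True : child i only needs parent child i to have succeeded.
    n_to_n=False: every child waits until all `size` parent children succeeded.
    """
    if n_to_n:
        return [i for i in range(size) if i in parent_succeeded]
    all_done = all(i in parent_succeeded for i in range(size))
    return list(range(size)) if all_done else []


# Parent array of size 4 where children 0 and 2 have finished so far
print(eligible_children(4, {0, 2}, n_to_n=True))
print(eligible_children(4, {0, 2}, n_to_n=False))
```

With N_TO_N, downstream children start as soon as their matching upstream child finishes, which shortens the pipeline when child runtimes vary widely.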
GPU Workloads: P4d, G4 Instances, CUDA Training
AWS Batch has first-class GPU support via EC2 GPU instances. The P4d (A100 GPUs) and G4dn (T4 GPUs) families are the workhorses for deep learning training and inference. Batch handles requesting these instances and mapping GPU devices into your container — you just declare the GPU count in the job definition's resource requirements.
Your compute environment must use GPU-capable instance types. Avoid the "optimal" instance type selector for GPU jobs — it won't select GPU instances. Explicitly list the families you need:
aws batch create-compute-environment \
--compute-environment-name gpu-spot-env \
--type MANAGED \
--state ENABLED \
--compute-resources '{
"type": "SPOT",
"allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
"minvCpus": 0,
"maxvCpus": 512,
"instanceTypes": ["p3.2xlarge", "p3.8xlarge", "g4dn.xlarge", "g4dn.2xlarge", "g4dn.4xlarge"],
"subnets": ["subnet-abc123"],
"securityGroupIds": ["sg-gpu123"],
"instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
"bidPercentage": 70,
"spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
"ec2Configuration": [
{
"imageType": "ECS_AL2_NVIDIA"
}
]
}' \
--service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole
The ECS_AL2_NVIDIA image type tells Batch to use the Amazon Linux 2 ECS-optimized AMI with NVIDIA drivers pre-installed. No manual driver setup.
GPU job definition with CUDA resource requirements:
{
"jobDefinitionName": "pytorch-training-gpu",
"type": "container",
"containerProperties": {
"image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2",
"vcpus": 8,
"memory": 61440,
"jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
"command": ["python", "train_distributed.py", "--batch-size", "256"],
"resourceRequirements": [
{
"type": "GPU",
"value": "1"
}
],
"environment": [
{"name": "NCCL_DEBUG", "value": "INFO"},
{"name": "CUDA_VISIBLE_DEVICES", "value": "0"}
],
"mountPoints": [
{"containerPath": "/dev/shm", "readOnly": false, "sourceVolume": "shm"}
],
"volumes": [
{"name": "shm", "host": {"sourcePath": "/dev/shm"}}
]
},
"retryStrategy": {"attempts": 2},
"timeout": {"attemptDurationSeconds": 86400}
}
# Dockerfile for GPU batch job
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2
WORKDIR /workspace
COPY requirements-gpu.txt .
RUN pip install --no-cache-dir -r requirements-gpu.txt
COPY training/ ./training/
ENTRYPOINT ["python", "training/train.py"]
The AWS Deep Learning Containers PyTorch image (763104351884.dkr.ecr.*.amazonaws.com/pytorch-training:*) ships with PyTorch, CUDA, cuDNN, NCCL, and all required system libraries pre-installed and tested together. Using it avoids days of environment debugging. Pull from the ECR Public Gallery or your regional ECR to avoid cross-region data transfer costs.
Multi-Node Parallel Jobs and MPI Workloads
For tightly coupled HPC workloads that need multiple nodes communicating via MPI (Message Passing Interface), AWS Batch supports multi-node parallel jobs. This is the pattern for CFD simulations, molecular dynamics, weather modeling, and large distributed ML training that spans more nodes than a single GPU instance can provide.
A multi-node job definition specifies node groups. The main node (index 0) acts as the MPI master; worker nodes register with it and receive work distribution commands. Batch launches the main node first and starts the worker nodes once it is running. For low-latency communication, configure the compute environment to launch its instances in a cluster placement group.
{
"jobDefinitionName": "mpi-simulation",
"type": "multinode",
"nodeProperties": {
"numNodes": 8,
"mainNode": 0,
"nodeRangeProperties": [
{
"targetNodes": "0:0",
"container": {
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/mpi-master:latest",
"vcpus": 32,
"memory": 131072,
"jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
"command": ["/workspace/run_master.sh"],
"environment": [
{"name": "ROLE", "value": "master"},
{"name": "NUM_WORKERS", "value": "7"}
]
}
},
{
"targetNodes": "1:",
"container": {
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/mpi-worker:latest",
"vcpus": 32,
"memory": 131072,
"jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
"command": ["/workspace/run_worker.sh"],
"environment": [
{"name": "ROLE", "value": "worker"}
]
}
}
]
},
"retryStrategy": {"attempts": 1},
"timeout": {"attemptDurationSeconds": 43200}
}
In each container, Batch injects several environment variables that MPI startup scripts use for node discovery:
#!/bin/bash
# run_master.sh: MPI master startup
set -e
# AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS is only set on worker nodes,
# so the main node looks up its own address
MAIN_HOST=$(hostname -I | awk '{print $1}')
NUM_NODES="$AWS_BATCH_JOB_NUM_NODES"
echo "Master starting on $MAIN_HOST with $NUM_NODES nodes"
# Generate hostfile from Batch environment.
# gen_hostfile.py must block until all workers have registered.
python3 /workspace/gen_hostfile.py > /tmp/hostfile
# Launch MPI job (32 ranks per node, matching the 32 vCPUs requested per node)
mpirun -np "$((NUM_NODES * 32))" \
  -hostfile /tmp/hostfile \
  --map-by socket \
  /workspace/simulation_binary \
  --config /workspace/config.yaml
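The startup script calls gen_hostfile.py, which the article does not show. One piece of it, rendering the hostfile once the node addresses are known, might look like the sketch below. It assumes Open MPI's hostfile format (`<host> slots=<n>`) and assumes the worker IP addresses have already been collected by some mechanism of your own (for example, workers reporting to the main node), since Batch only provides workers with the main node's address. The function name `render_hostfile` is hypothetical.

```python
def render_hostfile(node_ips, slots_per_node: int) -> str:
    """Render an Open MPI style hostfile: one '<ip> slots=<n>' line per node.

    Duplicate addresses are dropped while preserving first-seen order, so
    the main node stays on the first line if it was listed first.
    """
    if slots_per_node < 1:
        raise ValueError("slots_per_node must be at least 1")
    unique = []
    for ip in node_ips:
        if ip not in unique:
            unique.append(ip)
    return "".join(f"{ip} slots={slots_per_node}\n" for ip in unique)


# Example with made-up private addresses; main node first
print(render_hostfile(["10.0.1.10", "10.0.1.11", "10.0.1.12"], 32), end="")
```

Setting slots to the per-node vCPU count keeps the hostfile consistent with the rank count passed to mpirun.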
Step Functions Integration: Map State + AWS Batch
AWS Batch and Step Functions are natural partners. Step Functions orchestrates the workflow — branching, error handling, retries, and parallel fan-out — while Batch handles the heavy compute execution. The combination is more powerful than either service alone.
The optimized Batch integration (using the .sync resource suffix) means Step Functions tracks job completion for you: your state machine pauses, Batch runs the job, and execution resumes when the job finishes. No polling Lambda needed.
{
"Comment": "ML pipeline: ingest -> transform (parallel) -> aggregate",
"StartAt": "IngestData",
"States": {
"IngestData": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName": "ingest-data",
"JobDefinition": "data-ingest:2",
"JobQueue": "default-queue",
"ContainerOverrides": {
"Environment": [
{"Name": "DATE", "Value.$": "$.date"}
]
}
},
"ResultPath": "$.ingestResult",
"Retry": [
{
"ErrorEquals": ["Batch.BatchException", "States.TaskFailed"],
"IntervalSeconds": 30,
"MaxAttempts": 2,
"BackoffRate": 2.0
}
],
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "NotifyFailure",
"ResultPath": "$.error"
}
],
"Next": "ParallelTransform"
},
"ParallelTransform": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName": "parallel-transform",
"JobDefinition": "data-transform:4",
"JobQueue": "default-queue",
"ArrayProperties": {
"Size": 100
}
},
"ResultPath": "$.transformResult",
"Next": "Aggregate"
},
"Aggregate": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName": "aggregate-results",
"JobDefinition": "data-aggregate:1",
"JobQueue": "default-queue"
},
"ResultPath": "$.aggregateResult",
"Next": "PipelineSucceeded"
},
"PipelineSucceeded": {
"Type": "Succeed"
},
"NotifyFailure": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
"Message.$": "States.Format('Pipeline failed: {}', $.error)"
},
"Next": "PipelineFailed"
},
"PipelineFailed": {
"Type": "Fail",
"Error": "PipelineError",
"Cause": "One or more batch jobs failed"
}
}
}
See the Step Functions complete guide for Map state patterns, error handling, and workflow design best practices.
Cost Optimization: Spot, Fargate Spot, and Queue Priority
AWS Batch's cost story is compelling, but only if you apply the right strategies. The biggest lever is Spot — EC2 Spot instances can cut compute costs by 60–90%. But Spot requires workloads that tolerate interruptions. Here is a complete cost optimization playbook for AWS Batch.
1. Use SPOT_CAPACITY_OPTIMIZED allocation strategy. This picks the Spot pool with the most available capacity, reducing the probability of interruption compared to the cheapest pool. For batch jobs, lower interruption rate usually beats slightly lower price.
2. Enable automatic checkpointing in your job code. Design jobs to write progress to S3 periodically. On Spot interruption and retry, read the checkpoint and resume from there instead of starting from scratch.
# Checkpoint pattern in Python
import json
import os
import boto3

s3 = boto3.client('s3')
CHECKPOINT_BUCKET = os.environ['S3_OUTPUT_BUCKET']
# The job ID stays the same across retry attempts, so a retried attempt
# finds the checkpoint written by the interrupted one
JOB_ID = os.environ['AWS_BATCH_JOB_ID']
CHECKPOINT_KEY = f'checkpoints/{JOB_ID}.json'

def save_checkpoint(state: dict):
    s3.put_object(
        Bucket=CHECKPOINT_BUCKET,
        Key=CHECKPOINT_KEY,
        Body=json.dumps(state)
    )

def load_checkpoint() -> dict:
    try:
        obj = s3.get_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY)
        return json.loads(obj['Body'].read())
    except s3.exceptions.NoSuchKey:
        return {}  # no checkpoint, start fresh

def do_work(i: int):
    pass  # placeholder for the application-specific unit of work

def process():
    checkpoint = load_checkpoint()
    # Resume at the first index that has not been processed yet
    start_index = checkpoint.get('next_index', 0)
    for i in range(start_index, 100000):
        do_work(i)
        if (i + 1) % 500 == 0:
            save_checkpoint({'next_index': i + 1})
    # Clean up checkpoint on success
    s3.delete_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY)
3. Use multiple compute environments with priority. Configure your job queue with a Spot environment at order 1 and an On-Demand environment at order 2. Batch fills from Spot first; if capacity is unavailable, jobs spill over to On-Demand. Critical jobs that cannot wait use a dedicated On-Demand queue.
resource "aws_batch_job_queue" "smart_queue" {
name = "smart-cost-queue"
state = "ENABLED"
priority = 10
compute_environment_order {
order = 1
compute_environment = aws_batch_compute_environment.spot.arn
}
compute_environment_order {
order = 2
compute_environment = aws_batch_compute_environment.on_demand.arn
}
}
4. Right-size vCPU and memory requests. Batch packs jobs onto instances based on declared vCPU/memory requirements. Over-provisioning wastes capacity; under-provisioning causes OOM kills. Profile your job with CloudWatch Container Insights before setting production requirements. As a rule of thumb, set memory to p95 observed usage plus 20% headroom.
5. Set minvCpus to 0. Never set a positive minvCpus unless you have a strict cold-start SLA. Keeping warm instances running 24/7 when jobs only run a few hours per day is a major cost leak.
6. Use Fargate Spot for short jobs under 4 vCPUs. Fargate Spot offers up to 70% savings over regular Fargate with no instance management. For jobs under 20 minutes, the Spot interruption risk is low and the savings are significant.
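The memory rule of thumb from step 4 (p95 of observed usage plus 20% headroom) is easy to turn into a small calculation. The sketch below is one way to do it, assuming you have exported per-run peak memory samples in MiB from your monitoring tool; the nearest-rank percentile method, the 256 MiB rounding step, and the name `recommend_memory_mib` are illustrative choices, not AWS recommendations.

```python
import math


def recommend_memory_mib(samples_mib, headroom_pct: int = 20, round_to: int = 256) -> int:
    """p95 of observed peak memory plus headroom, rounded up to `round_to` MiB."""
    if not samples_mib:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mib)
    # Nearest-rank p95: smallest sample with at least 95% of samples at or below it
    rank = -(-95 * len(ordered) // 100)  # integer ceiling of 0.95 * n
    p95 = ordered[rank - 1]
    target = p95 * (100 + headroom_pct) / 100
    return math.ceil(target / round_to) * round_to


# Made-up samples: 20 runs with peaks from 1000 to 2900 MiB
samples = list(range(1000, 3000, 100))
print(recommend_memory_mib(samples))
```

On Fargate, memory must be one of the values allowed for the chosen vCPU size, so the result may need to be rounded up further to the next supported value.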
Monitoring: CloudWatch Metrics and EventBridge Alerts
Operational visibility in AWS Batch comes from two sources: CloudWatch metrics for queue and job-level capacity signals, and EventBridge events for state change notifications. Together they give you proactive alerting before backlogs grow and immediate notification on job failures.
Key CloudWatch metrics:
| Metric | Namespace | What it signals |
|---|---|---|
| PendingJobCount | AWS/Batch | Jobs waiting on dependencies; a sustained high value points to slow or stuck upstream jobs |
| RunnableJobCount | AWS/Batch | Jobs waiting on capacity but eligible to run — high value = insufficient vCPUs |
| RunningJobCount | AWS/Batch | Currently executing jobs |
| SucceededJobCount | AWS/Batch | Completed successfully in the last minute |
| FailedJobCount | AWS/Batch | Failed in the last minute — alert on any non-zero value |
| CPUUtilization | AWS/ECS | Container-level CPU usage for rightsizing |
Create a CloudWatch alarm on RunnableJobCount to detect capacity bottlenecks:
aws cloudwatch put-metric-alarm \
--alarm-name "batch-runnable-backlog" \
--alarm-description "Job queue backlog growing — check compute environment" \
--namespace AWS/Batch \
--metric-name RunnableJobCount \
--dimensions Name=JobQueue,Value=default-queue \
--statistic Average \
--period 300 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 3 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:batch-ops-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789012:batch-ops-alerts
EventBridge (CloudWatch Events) captures every job state transition. Route FAILED transitions to an SNS topic for immediate notification:
{
"source": ["aws.batch"],
"detail-type": ["Batch Job State Change"],
"detail": {
"status": ["FAILED"],
"jobQueue": [
"arn:aws:batch:us-east-1:123456789012:job-queue/default-queue",
"arn:aws:batch:us-east-1:123456789012:job-queue/high-priority-queue"
]
}
}
# Terraform: EventBridge rule for failed batch jobs
resource "aws_cloudwatch_event_rule" "batch_failed" {
name = "batch-job-failed"
description = "Notify on any Batch job failure"
event_pattern = jsonencode({
source = ["aws.batch"]
"detail-type" = ["Batch Job State Change"]
detail = {
status = ["FAILED"]
}
})
}
resource "aws_cloudwatch_event_target" "batch_failed_sns" {
rule = aws_cloudwatch_event_rule.batch_failed.name
target_id = "BatchFailedToSNS"
arn = aws_sns_topic.alerts.arn
input_transformer {
input_paths = {
jobName = "$.detail.jobName"
jobId = "$.detail.jobId"
reason = "$.detail.statusReason"
}
input_template = "\"Batch job FAILED: <jobName> (ID: <jobId>). Reason: <reason>\""
}
}
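If the notification target is a Lambda function or another consumer instead of SNS with an input transformer, the same three fields can be pulled out of the event in code. This is a minimal sketch, assuming the event carries `jobName`, `jobId`, `status`, and `statusReason` under `detail`, as the input paths above imply; `format_failure` is a hypothetical helper and the sample event is made up.

```python
def format_failure(event: dict) -> str:
    """Build a one-line alert from a Batch job state change event."""
    detail = event.get("detail") or {}
    if event.get("source") != "aws.batch" or detail.get("status") != "FAILED":
        raise ValueError("expected a FAILED Batch job state change event")
    return (
        f"Batch job FAILED: {detail.get('jobName', 'unknown')} "
        f"(ID: {detail.get('jobId', 'unknown')}). "
        f"Reason: {detail.get('statusReason', 'not provided')}"
    )


sample_event = {
    "source": "aws.batch",
    "detail-type": "Batch Job State Change",
    "detail": {
        "jobName": "transform-data",
        "jobId": "abc-123",
        "status": "FAILED",
        "statusReason": "Essential container in task exited",
    },
}
print(format_failure(sample_event))
```

Rejecting events that do not match keeps a misrouted rule from producing misleading alerts.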
Job logs go to the /aws/batch/job log group by default when using the awslogs driver. The stream name has the form {jobDefinitionName}/default/{ecsTaskId}. You can query logs using CloudWatch Logs Insights across all jobs in a definition to find error patterns at scale.
Enable Container Insights on the ECS clusters that Batch creates (named AWSBatch_*) to get per-task CPU, memory, network, and storage metrics. This is the best tool for rightsizing vCPU and memory declarations in job definitions. Navigate to ECS → Clusters → AWSBatch_* → Metrics tab.
Frequently Asked Questions
What happens to a running job when a Spot instance is interrupted?
When a Spot instance is reclaimed, the ECS container is stopped and the Batch job moves back to RUNNABLE state (if retry attempts remain). The job is rescheduled on a new instance automatically. Use the evaluateOnExit block with onStatusReason: "Host EC2*" to explicitly retry Spot interruptions while exiting immediately on application errors.
How many child jobs can an array job have?
AWS Batch supports array jobs up to 10,000 children. For workloads larger than 10,000 units, break them into multiple array jobs and use Step Functions or a parent job to coordinate them. The 10,000-child limit is generous enough for the vast majority of batch workloads.
Can I mix Fargate and EC2 compute environments in the same job queue?
No. Fargate and EC2 compute environments cannot be mixed in the same job queue. Create separate queues for Fargate and EC2 workloads. However, a single job queue can reference multiple EC2 compute environments (e.g., Spot + On-Demand) as fallback options ordered by priority.
How do I pass large input data to a Batch job?
Never put large data in environment variables or job parameters (limits are in KB). The standard pattern is to write input data to S3 before submitting the job, pass only the S3 URI as an environment variable or command argument, and have the job fetch the data at startup. For structured job parameters, put a JSON config file in S3 and pass the S3 URI.
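Inside the job, the S3 URI has to be split into a bucket and key before the object can be fetched. A minimal sketch of that step follows, using only the standard library; `split_s3_uri` is a hypothetical helper and the example URI is made up.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key


bucket, key = split_s3_uri("s3://my-data-bucket/configs/run-42.json")
print(bucket, key)
```

The resulting pair can be passed to an S3 client's get_object call as the Bucket and Key arguments at job startup.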
Quick Reference
Job states:
SUBMITTED → PENDING → RUNNABLE → STARTING → RUNNING → SUCCEEDED / FAILED
Max array size: 10,000 children
Max job timeout: No hard limit
Spot savings: Up to 90%
GPU instances: P4d (A100), G4dn (T4), G5 (A10G)
Fargate max: 16 vCPU / 120 GB