AWS Batch: Run Large-Scale Batch Computing Jobs on AWS (2026)

AWS Batch is a fully managed service that enables you to run hundreds of thousands of batch computing jobs on AWS without managing the underlying compute infrastructure. Whether you're processing genomics pipelines, running ML training jobs, crunching financial simulations, or executing ETL workloads, AWS Batch provisions exactly the right amount of compute resources — EC2, Spot, or Fargate — and scales them down to zero when work is done. This guide covers every layer of AWS Batch from architecture concepts to production-grade GPU workloads, multi-node MPI jobs, and cost optimization strategies.

AWS Batch vs Lambda vs ECS vs Glue
Core Concepts: Compute Environments, Job Queues, Job Definitions
Compute Environment Setup: EC2 Spot and Fargate
Job Definitions: Container Properties, Retry, Timeout
Submitting Jobs: Single, Array, and Dependent Jobs
GPU Workloads: P4d, G4 Instances, CUDA Training
Multi-Node Parallel Jobs and MPI Workloads
Step Functions Integration: Map State + Batch
Cost Optimization: Spot, Fargate Spot, Queue Priority
Monitoring: CloudWatch Metrics and EventBridge Alerts

AWS Batch vs Lambda vs ECS vs Glue: When to Use What

Choosing the right AWS compute service for batch workloads is one of the most common architectural decisions teams face. Each service has a clear sweet spot, and picking the wrong one leads to either over-engineering or hitting hard limits at the worst possible time.

Factor	AWS Batch	Lambda	ECS (Fargate)	Glue
Max job duration	No limit	15 minutes	No limit	48 hours
GPU support	Yes (P4d, G4, G5)	No	No	No
MPI / multi-node	Yes	No	No	No
Spot instance support	Native, automatic retry	No	Manual Spot config	Limited (Flex)
HPC / tightly coupled	Yes (EFA, placement groups)	No	No	No
Managed runtime	Docker container (any)	AWS-managed	Docker container	Spark / Python built-in
Queue + priority	Yes (multiple queues)	No	No (service-based)	No
Array jobs (1000x parallel)	Native	Fan-out via SQS/SNS	Manual	No
Cost model	Pay per EC2/Fargate used	Per 1ms invocation	Per vCPU/memory second	Per DPU-hour
Best for	Long-running, HPC, GPU, parallel	Short event-driven tasks	Web APIs, always-on services	Spark ETL on structured data

When AWS Batch wins: Your jobs run longer than 15 minutes, need GPUs, require tight inter-node communication (MPI/EFA), must process thousands of items in parallel with Spot savings, or need job dependencies and retry policies baked into the scheduler — not built manually in Lambda or Step Functions.

AWS Glue is excellent when your data is tabular and your team prefers Spark SQL over custom Docker images. Lambda is ideal for short sub-second to 15-minute event handlers. ECS is for persistent services, not ephemeral batch. AWS Batch sits between ECS and HPC cluster managers — it manages scheduling, retries, Spot interruptions, and infrastructure so you focus on the job logic inside your container.

Core Concepts: Compute Environments, Job Queues, Job Definitions

AWS Batch has four primitive objects that compose every batch workload. Understanding how they relate is critical before writing a single line of infrastructure code.

Compute Environments define the pool of EC2 or Fargate capacity that Batch manages. A managed compute environment lets AWS provision, scale, and terminate instances automatically. An unmanaged compute environment gives you full control — you provision your own EC2 instances, register them with ECS (Batch uses ECS under the hood), and Batch only handles scheduling.

Job Queues receive submitted jobs and forward them to a compute environment. A queue has a priority and one or more compute environments ordered by preference. Jobs with unmet dependencies wait in PENDING; once eligible to run, they wait in RUNNABLE until resources are available. You can have a high-priority queue (e.g., urgent ML inference) backed by On-Demand compute and a low-priority queue backed by Spot instances, and the same compute environment can serve multiple queues.

Job Definitions are versioned templates for a job. They specify the container image, vCPU/memory requirements, IAM role, environment variables, volumes, retry strategy, and timeout. Think of them as reusable blueprints — you override specific fields (e.g., environment variables, command) at submission time.

Jobs are instances of a job definition submitted to a queue. A job progresses through states: SUBMITTED → PENDING → RUNNABLE → STARTING → RUNNING → SUCCEEDED or FAILED. Array jobs spawn N child jobs all running the same definition; each child receives its index via the AWS_BATCH_JOB_ARRAY_INDEX environment variable.
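The state progression above can be observed from a client by polling the job until it reaches a terminal state. The following is a minimal sketch, assuming `batch` is a boto3 Batch client (for example `boto3.client("batch")`); the helper name `wait_for_job` and the poll interval are illustrative choices, not part of the Batch API.

```python
import time

# The two terminal states from the job lifecycle described above
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(batch, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll describe_jobs until the job is SUCCEEDED or FAILED.

    `batch` is assumed to be a boto3 Batch client. Returns the distinct
    statuses observed, in the order they were seen.
    """
    seen = []
    while True:
        jobs = batch.describe_jobs(jobs=[job_id])["jobs"]
        if not jobs:
            raise LookupError(f"job {job_id} not found")
        status = jobs[0]["status"]
        # Record a status only when it changes
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)
```

Polling is fine for scripts and notebooks; for production pipelines, event-driven notification avoids the repeated API calls.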

Batch uses ECS internally: On EC2 and Fargate compute environments, every job runs as an ECS task on a Batch-managed ECS cluster. You can see these tasks in the ECS console. This means your container image must be pullable from the compute environment (for example from ECR or Docker Hub), and the job's IAM role follows the same ECS task role pattern.

Compute Environment Setup: EC2 Spot and Fargate

Compute environments are where the infrastructure decisions happen. The choice between managed EC2 (On-Demand or Spot), Fargate, and Fargate Spot has significant cost and performance implications.

A managed EC2 Spot compute environment automatically requests Spot capacity and can save up to 90% compared to On-Demand pricing. Batch does not checkpoint your work: when a Spot instance is reclaimed, the running attempt fails, and if the job's retry strategy has attempts remaining, Batch moves the job back to RUNNABLE and restarts it on a new instance. This is simpler than managing Spot manually in ECS because the retry flow is built into the scheduler.

# Create managed EC2 Spot compute environment via CLI
aws batch create-compute-environment \
  --compute-environment-name spot-ml-env \
  --type MANAGED \
  --state ENABLED \
  --compute-resources '{
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 2048,
    "desiredvCpus": 0,
    "instanceTypes": ["optimal"],
    "subnets": ["subnet-abc123", "subnet-def456"],
    "securityGroupIds": ["sg-xyz789"],
    "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    "bidPercentage": 60,
    "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
    "tags": {"Project": "ml-pipeline", "Env": "prod"}
  }' \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole

For jobs that don't need persistent local storage and are under 16 vCPUs / 120 GB memory, Fargate is often simpler — no EC2 instances, no AMI management, no ECS container instance overhead.

# Terraform: Fargate compute environment + job queue
resource "aws_batch_compute_environment" "fargate" {
  compute_environment_name = "fargate-batch-env"
  type                     = "MANAGED"

  compute_resources {
    type               = "FARGATE_SPOT"
    max_vcpus          = 256
    subnets            = var.private_subnet_ids
    security_group_ids = [aws_security_group.batch.id]
  }

  service_role = aws_iam_role.batch_service.arn
  depends_on   = [aws_iam_role_policy_attachment.batch_service]
}

resource "aws_batch_job_queue" "default" {
  name     = "default-queue"
  state    = "ENABLED"
  priority = 10

  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.fargate.arn
  }
}

resource "aws_batch_job_queue" "high_priority" {
  name     = "high-priority-queue"
  state    = "ENABLED"
  priority = 100

  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.fargate.arn
  }
}

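SPOT_CAPACITY_OPTIMIZED vs SPOT_PRICE_CAPACITY_OPTIMIZED: Use SPOT_CAPACITY_OPTIMIZED for batch workloads where interruption rate matters more than price. It picks the Spot pools with the most available capacity, reducing interruptions. SPOT_PRICE_CAPACITY_OPTIMIZED weighs both price and capacity, favoring pools that are cheap and unlikely to be interrupted, which makes it a sensible default for most Spot workloads. Reserve lowest-price selection (the BEST_FIT strategy) for fault-tolerant jobs that checkpoint aggressively.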

Job Definitions: Container Properties, Retry, and Timeout

A job definition is the heart of AWS Batch configuration. It captures everything the scheduler needs to run your container: what image to pull, how many vCPUs and memory to allocate, which IAM role to assume, how many times to retry on failure, and when to give up entirely. Versioning is built-in — every update creates a new revision, and you can pin submissions to a specific revision or always use the latest.

{
  "jobDefinitionName": "ml-training-job",
  "type": "container",
  "containerProperties": {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-training:v1.2.0",
    "vcpus": 4,
    "memory": 16384,
    "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    "command": ["python", "train.py", "--epochs", "50"],
    "environment": [
      {"name": "S3_INPUT_BUCKET", "value": "my-data-bucket"},
      {"name": "S3_OUTPUT_BUCKET", "value": "my-results-bucket"},
      {"name": "MODEL_TYPE", "value": "xgboost"}
    ],
    "secrets": [
      {
        "name": "DB_PASSWORD",
        "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password"
      }
    ],
    "mountPoints": [
      {"containerPath": "/tmp/scratch", "readOnly": false, "sourceVolume": "scratch"}
    ],
    "volumes": [
      {"name": "scratch", "host": {"sourcePath": "/tmp"}}
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/aws/batch/ml-training",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "batch"
      }
    }
  },
  "retryStrategy": {
    "attempts": 3,
    "evaluateOnExit": [
      {
        "onStatusReason": "Host EC2*",
        "action": "RETRY"
      },
      {
        "onReason": "CannotPullContainerError*",
        "action": "RETRY"
      },
      {
        "onExitCode": "1",
        "action": "EXIT"
      }
    ]
  },
  "timeout": {
    "attemptDurationSeconds": 7200
  },
  "tags": {
    "Project": "ml-pipeline",
    "Team": "data-science"
  }
}

The evaluateOnExit block is a powerful feature that lets you retry on Spot interruptions (status reason beginning with "Host EC2") while failing fast on application errors (exit code 1). Rules are evaluated in order, the only valid actions are RETRY and EXIT, and an attempt that matches no rule is retried while attempts remain. This prevents wasting Spot budget retrying broken code.
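To reason about how a set of evaluateOnExit rules will behave before registering them, the matching logic can be modeled locally. This is a simplified sketch, not Batch's implementation: it assumes rules are checked in order, that every condition present in a rule must match, that a trailing `*` means prefix match, and that an attempt matching no rule is retried. The function names and the sample status reason string are illustrative.

```python
def _matches(pattern, value):
    """Exact match, or prefix match when the pattern ends with '*'."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def evaluate_on_exit(rules, status_reason="", reason="", exit_code=None):
    """Return "RETRY" or "EXIT" for a finished attempt (local model only)."""
    observed = {
        "onStatusReason": status_reason,
        "onReason": reason,
        # Exit codes are compared as strings; None means no exit code
        "onExitCode": "" if exit_code is None else str(exit_code),
    }
    for rule in rules:
        conditions = [key for key in observed if key in rule]
        # A rule fires only if all of its conditions match
        if conditions and all(_matches(rule[k], observed[k]) for k in conditions):
            return rule["action"].upper()
    # No rule matched: the attempt is retried (subject to the attempts limit)
    return "RETRY"


rules = [
    {"onStatusReason": "Host EC2*", "action": "RETRY"},
    {"onReason": "CannotPullContainerError*", "action": "RETRY"},
    {"onExitCode": "1", "action": "EXIT"},
]
```

The model ignores the attempts limit, which still caps the total number of runs regardless of which rule matched.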

aws batch register-job-definition \
  --cli-input-json file://job-definition.json

Here is a Dockerfile pattern for a typical batch job that reads from S3, processes data, and writes results back:

# Dockerfile for AWS Batch job
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    awscli \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/

# Entrypoint reads AWS_BATCH_JOB_ARRAY_INDEX for array jobs
ENTRYPOINT ["python", "src/process.py"]

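# src/process.py: array job pattern
import os
import boto3


def process_file(s3, input_bucket, key, output_bucket):
    # Placeholder: copies the object unchanged. Replace with real processing.
    s3.copy_object(
        Bucket=output_bucket,
        Key=key,
        CopySource={'Bucket': input_bucket, 'Key': key},
    )


def main():
    # Batch sets both variables on array children; the defaults cover single jobs
    array_index = int(os.environ.get('AWS_BATCH_JOB_ARRAY_INDEX', 0))
    array_size = int(os.environ.get('AWS_BATCH_JOB_ARRAY_SIZE', 1))
    input_bucket = os.environ['S3_INPUT_BUCKET']
    output_bucket = os.environ['S3_OUTPUT_BUCKET']

    s3 = boto3.client('s3')

    # List every input key; sort so that all children see the same order
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=input_bucket, Prefix='data/')
    all_keys = sorted(obj['Key'] for page in pages for obj in page.get('Contents', []))

    # Child i takes keys i, i + size, i + 2*size, ...
    chunk = all_keys[array_index::array_size]

    for key in chunk:
        process_file(s3, input_bucket, key, output_bucket)


if __name__ == '__main__':
    main()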

Submitting Jobs: Single, Array, and Dependent Jobs

Job submission is where the rubber meets the road. AWS Batch supports three modes: single jobs (one container run), array jobs (N identical containers each with a unique index), and dependent jobs (DAG-style ordering). All three are available via the console, CLI, and boto3.

Single job submission:

aws batch submit-job \
  --job-name ml-training-run-001 \
  --job-queue high-priority-queue \
  --job-definition ml-training-job:5 \
  --container-overrides '{
    "environment": [
      {"name": "MODEL_TYPE", "value": "random-forest"},
      {"name": "S3_INPUT_BUCKET", "value": "my-data-bucket"}
    ],
    "command": ["python", "train.py", "--epochs", "100"]
  }'

Array job — 1000 parallel tasks:

aws batch submit-job \
  --job-name parallel-feature-extraction \
  --job-queue default-queue \
  --job-definition feature-extractor:3 \
  --array-properties '{"size": 1000}'

Each of the 1000 child jobs receives AWS_BATCH_JOB_ARRAY_INDEX (0 to 999) and AWS_BATCH_JOB_ARRAY_SIZE (1000). Your code uses the index to determine which partition of data to process: for example, rows 0-999 for index 0, rows 1000-1999 for index 1, and so on.
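The index-to-partition mapping can be sketched as a small helper that turns the array index into a contiguous row range. This is one possible scheme, assuming the total item count is known up front; the function names are illustrative, and any remainder is spread over the first children so that range sizes differ by at most one.

```python
import os


def partition_bounds(index, array_size, total_items):
    """Return the half-open range [start, stop) owned by one array child."""
    if not 0 <= index < array_size:
        raise ValueError("index must be in [0, array_size)")
    base, extra = divmod(total_items, array_size)
    # The first `extra` children each take one additional item
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def current_partition(total_items, env=os.environ):
    """Read the Batch array variables and return this child's range."""
    index = int(env.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    size = int(env.get("AWS_BATCH_JOB_ARRAY_SIZE", "1"))
    return partition_bounds(index, size, total_items)
```

With 1,000,000 rows and an array size of 1000, child 0 gets rows 0 to 999 and child 1 gets rows 1000 to 1999, matching the example above.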

Dependent jobs with Python boto3:

import boto3

batch = boto3.client('batch', region_name='us-east-1')

# Step 1: ingest raw data
ingest = batch.submit_job(
    jobName='ingest-raw-data',
    jobQueue='default-queue',
    jobDefinition='data-ingest:2',
    containerOverrides={
        'environment': [{'name': 'DATE', 'value': '2026-06-09'}]
    }
)
ingest_id = ingest['jobId']

# Step 2: transform starts only after step 1 succeeds. A dependency that
# names just a jobId (no type) waits for that job to reach SUCCEEDED.
transform = batch.submit_job(
    jobName='transform-data',
    jobQueue='default-queue',
    jobDefinition='data-transform:4',
    dependsOn=[{'jobId': ingest_id}],
    arrayProperties={'size': 50}  # fan out to 50 parallel transformers
)
transform_id = transform['jobId']

# Step 3: aggregate waits for ALL transform children to finish. N_TO_N would
# apply only if step 3 were itself an array job of the same size as step 2.
aggregate = batch.submit_job(
    jobName='aggregate-results',
    jobQueue='default-queue',
    jobDefinition='data-aggregate:1',
    dependsOn=[{'jobId': transform_id}]
)

print(f"Pipeline submitted: ingest={ingest_id}, transform={transform_id}, "
      f"aggregate={aggregate['jobId']}")

Dependency types: A dependency that names only a jobId makes the dependent job wait until that job reaches SUCCEEDED; if the dependency is an array job, all of its children must succeed first. SEQUENTIAL applies to array jobs and runs the children one after another in index order instead of in parallel. N_TO_N is for array job children: child index N of the dependent job waits for child index N of the parent. Use N_TO_N when you have two array jobs of the same size and each child of the second job needs the specific output of the matching child in the first job.
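The N_TO_N case can be sketched as two array jobs of equal size, where the second names the first in dependsOn. This is a minimal sketch, assuming `batch` is a boto3 Batch client; the helper name and the job names "stage-one" and "stage-two" are hypothetical.

```python
def submit_n_to_n_pair(batch, job_queue, first_definition, second_definition, size):
    """Submit two same-size array jobs linked child-to-child with N_TO_N.

    `batch` is assumed to be a boto3 Batch client. Returns both job IDs.
    """
    first = batch.submit_job(
        jobName="stage-one",
        jobQueue=job_queue,
        jobDefinition=first_definition,
        arrayProperties={"size": size},
    )
    second = batch.submit_job(
        jobName="stage-two",
        jobQueue=job_queue,
        jobDefinition=second_definition,
        arrayProperties={"size": size},
        # Child N of stage-two waits only for child N of stage-one
        dependsOn=[{"jobId": first["jobId"], "type": "N_TO_N"}],
    )
    return first["jobId"], second["jobId"]
```

Because each child waits only on its counterpart, fast partitions can move to the second stage while slow partitions are still in the first.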

GPU Workloads: P4d, G4 Instances, CUDA Training

AWS Batch has first-class GPU support via EC2 GPU instances. The P4d (A100 GPUs) and G4dn (T4 GPUs) families are the workhorses for deep learning training and inference. Batch handles requesting these instances and mapping GPU devices into your container — you just declare the GPU count in the job definition's resource requirements.

Your compute environment must use GPU-capable instance types. Avoid the "optimal" instance type selector for GPU jobs — it won't select GPU instances. Explicitly list the families you need:

aws batch create-compute-environment \
  --compute-environment-name gpu-spot-env \
  --type MANAGED \
  --state ENABLED \
  --compute-resources '{
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 512,
    "instanceTypes": ["p3.2xlarge", "p3.8xlarge", "g4dn.xlarge", "g4dn.2xlarge", "g4dn.4xlarge"],
    "subnets": ["subnet-abc123"],
    "securityGroupIds": ["sg-gpu123"],
    "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    "bidPercentage": 70,
    "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
    "ec2Configuration": [
      {
        "imageType": "ECS_AL2_NVIDIA"
      }
    ]
  }' \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole

The ECS_AL2_NVIDIA image type tells Batch to use the Amazon Linux 2 ECS-optimized AMI with NVIDIA drivers pre-installed. No manual driver setup.

GPU job definition with CUDA resource requirements:

{
  "jobDefinitionName": "pytorch-training-gpu",
  "type": "container",
  "containerProperties": {
    "image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2",
    "vcpus": 8,
    "memory": 61440,
    "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
    "command": ["python", "train_distributed.py", "--batch-size", "256"],
    "resourceRequirements": [
      {
        "type": "GPU",
        "value": "1"
      }
    ],
    "environment": [
      {"name": "NCCL_DEBUG", "value": "INFO"},
      {"name": "CUDA_VISIBLE_DEVICES", "value": "0"}
    ],
    "mountPoints": [
      {"containerPath": "/dev/shm", "readOnly": false, "sourceVolume": "shm"}
    ],
    "volumes": [
      {"name": "shm", "host": {"sourcePath": "/dev/shm"}}
    ]
  },
  "retryStrategy": {"attempts": 2},
  "timeout": {"attemptDurationSeconds": 86400}
}
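The job definition above sets vCPUs and memory with the top-level vcpus and memory fields; Batch also accepts them as VCPU and MEMORY entries in resourceRequirements, next to the GPU entry. A small sketch of building that list programmatically follows; the helper name is illustrative, and all values are passed as strings, as in the GPU entry above.

```python
def resource_requirements(vcpus, memory_mib, gpus=0):
    """Build a resourceRequirements list for a Batch container.

    Values are strings. MEMORY is expressed in MiB. The GPU entry is
    added only when at least one GPU is requested.
    """
    if vcpus <= 0 or memory_mib <= 0 or gpus < 0:
        raise ValueError("vcpus and memory_mib must be positive, gpus non-negative")
    requirements = [
        {"type": "VCPU", "value": str(vcpus)},
        {"type": "MEMORY", "value": str(memory_mib)},
    ]
    if gpus:
        requirements.append({"type": "GPU", "value": str(gpus)})
    return requirements
```

For the definition above, `resource_requirements(8, 61440, gpus=1)` produces the equivalent of its vcpus, memory, and GPU settings in one list.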

# Dockerfile for GPU batch job
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2

WORKDIR /workspace

COPY requirements-gpu.txt .
RUN pip install --no-cache-dir -r requirements-gpu.txt

COPY training/ ./training/

ENTRYPOINT ["python", "training/train.py"]

AWS Deep Learning Containers: The AWS DLC image (763104351884.dkr.ecr.*.amazonaws.com/pytorch-training:*) ships with PyTorch, CUDA, cuDNN, NCCL, and all required system libraries pre-installed and tested together. Using it avoids days of environment debugging. Pull from the ECR Public Gallery or your regional ECR to avoid cross-region data transfer costs.

Multi-Node Parallel Jobs and MPI Workloads

For tightly coupled HPC workloads that need multiple nodes communicating via MPI (Message Passing Interface), AWS Batch supports multi-node parallel jobs. This is the pattern for CFD simulations, molecular dynamics, weather modeling, and large distributed ML training that spans more nodes than a single GPU instance can provide.

A multi-node job definition specifies node groups. The main node (index 0) acts as the MPI master; worker nodes register with it and receive work distribution commands. Batch launches the main node first and starts the child nodes once the main node is RUNNING. For low-latency communication, configure the compute environment with a cluster placement group so that all nodes land close together in one Availability Zone.

{
  "jobDefinitionName": "mpi-simulation",
  "type": "multinode",
  "nodeProperties": {
    "numNodes": 8,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes": "0:0",
        "container": {
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/mpi-master:latest",
          "vcpus": 32,
          "memory": 131072,
          "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
          "command": ["/workspace/run_master.sh"],
          "environment": [
            {"name": "ROLE", "value": "master"},
            {"name": "NUM_WORKERS", "value": "7"}
          ]
        }
      },
      {
        "targetNodes": "1:",
        "container": {
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/mpi-worker:latest",
          "vcpus": 32,
          "memory": 131072,
          "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
          "command": ["/workspace/run_worker.sh"],
          "environment": [
            {"name": "ROLE", "value": "worker"}
          ]
        }
      }
    ]
  },
  "retryStrategy": {"attempts": 1},
  "timeout": {"attemptDurationSeconds": 43200}
}

In each container, Batch injects several environment variables that MPI startup scripts use for node discovery:

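#!/bin/bash
# run_master.sh: MPI master startup
set -euo pipefail

# AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS is set on child nodes;
# if it is absent on the main node, fall back to the node's own address
MAIN_HOST=${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS:-$(hostname -i)}
NUM_NODES=$AWS_BATCH_JOB_NUM_NODES

echo "Master starting on $MAIN_HOST with $NUM_NODES nodes"

# Generate hostfile once all workers have registered
# (gen_hostfile.py is your own script and must wait for the workers)
python3 /workspace/gen_hostfile.py > /tmp/hostfile

# Launch MPI job: 32 ranks per node, matching the 32 vCPUs requested per node
mpirun -np "$((NUM_NODES * 32))" \
  -hostfile /tmp/hostfile \
  --map-by socket \
  /workspace/simulation_binary \
  --config /workspace/config.yaml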
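The startup script calls gen_hostfile.py without showing it. The sketch below covers only the formatting half of such a script, assuming an Open MPI style hostfile (`address slots=N`) and assuming the worker IP addresses have already been collected by some mechanism of your own, since Batch does not hand the main node a list of worker addresses. The command-line interface shown is hypothetical.

```python
import sys


def render_hostfile(main_ip, worker_ips, slots_per_node=32):
    """Return hostfile text with the main node first, one line per node."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    hosts = list(dict.fromkeys([main_ip, *worker_ips]))
    return "".join(f"{ip} slots={slots_per_node}\n" for ip in hosts)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: gen_hostfile.py MAIN_IP WORKER_IP [WORKER_IP ...]
    sys.stdout.write(render_hostfile(sys.argv[1], sys.argv[2:]))
```

The slots value should match the number of MPI ranks each node is meant to run, which is 32 in the job definition above.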

EFA for HPC: For workloads requiring very low latency between nodes (under 5 microseconds), use Elastic Fabric Adapter (EFA)-enabled instances (hpc6a, p4d.24xlarge, c5n.18xlarge). Set the compute environment's instance types accordingly and use an EFA-enabled security group. EFA throughput is on par with on-premises InfiniBand for most HPC use cases.

Step Functions Integration: Map State + AWS Batch

AWS Batch and Step Functions are natural partners. Step Functions orchestrates the workflow — branching, error handling, retries, and parallel fan-out — while Batch handles the heavy compute execution. The combination is more powerful than either service alone.

The optimized Batch integration (using the .sync resource suffix) means Step Functions waits for job completion automatically: your state machine pauses, Batch runs the job, and execution resumes when the job finishes. No polling Lambda needed.

{
  "Comment": "ML pipeline: ingest -> transform (parallel) -> aggregate",
  "StartAt": "IngestData",
  "States": {
    "IngestData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "ingest-data",
        "JobDefinition": "data-ingest:2",
        "JobQueue": "default-queue",
        "ContainerOverrides": {
          "Environment": [
            {"Name": "DATE", "Value.$": "$.date"}
          ]
        }
      },
      "ResultPath": "$.ingestResult",
      "Retry": [
        {
          "ErrorEquals": ["Batch.BatchException", "States.TaskFailed"],
          "IntervalSeconds": 30,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifyFailure",
          "ResultPath": "$.error"
        }
      ],
      "Next": "ParallelTransform"
    },
    "ParallelTransform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "parallel-transform",
        "JobDefinition": "data-transform:4",
        "JobQueue": "default-queue",
        "ArrayProperties": {
          "Size": 100
        }
      },
      "ResultPath": "$.transformResult",
      "Next": "Aggregate"
    },
    "Aggregate": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "aggregate-results",
        "JobDefinition": "data-aggregate:1",
        "JobQueue": "default-queue"
      },
      "ResultPath": "$.aggregateResult",
      "Next": "PipelineSucceeded"
    },
    "PipelineSucceeded": {
      "Type": "Succeed"
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
        "Message.$": "States.Format('Pipeline failed: {}', $.error.Cause)"
      },
      "Next": "PipelineFailed"
    },
    "PipelineFailed": {
      "Type": "Fail",
      "Error": "PipelineError",
      "Cause": "One or more batch jobs failed"
    }
  }
}
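When a pipeline has many Batch stages, the repeated Task states can be generated instead of written by hand. The sketch below builds one Task state as a Python dictionary that serializes to the same shape as the states above; the function name and its parameters are illustrative, and it assumes the `batch:submitJob.sync` integration resource.

```python
import json

# Run-a-job integration: Step Functions waits for the Batch job to finish
BATCH_SYNC_RESOURCE = "arn:aws:states:::batch:submitJob.sync"


def batch_task_state(job_name, job_definition, job_queue,
                     next_state=None, array_size=None):
    """Return a Task state that submits a Batch job and waits for it."""
    parameters = {
        "JobName": job_name,
        "JobDefinition": job_definition,
        "JobQueue": job_queue,
    }
    if array_size is not None:
        parameters["ArrayProperties"] = {"Size": array_size}
    state = {
        "Type": "Task",
        "Resource": BATCH_SYNC_RESOURCE,
        "Parameters": parameters,
    }
    # A state either names its successor or ends the execution
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


transform = batch_task_state(
    "parallel-transform", "data-transform:4", "default-queue",
    next_state="Aggregate", array_size=100,
)
definition_fragment = json.dumps({"ParallelTransform": transform}, indent=2)
```

Retry and Catch blocks are left out here and would be added to the returned dictionary before serializing.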

Map state vs array jobs: Use Step Functions Map state when each iteration needs different input parameters (heterogeneous fan-out). Use Batch array jobs when all N tasks run identical code with only their index varying (homogeneous fan-out). Array jobs are cheaper to orchestrate — each array child doesn't create a Step Functions execution. For 10,000-item parallel processing, array jobs are almost always the right tool.

See the Step Functions complete guide for Map state patterns, error handling, and workflow design best practices.

Cost Optimization: Spot, Fargate Spot, and Queue Priority

AWS Batch's cost story is compelling, but only if you apply the right strategies. The biggest lever is Spot: EC2 Spot instances can cut compute costs by 60-90%. But Spot requires workloads that tolerate interruptions. Here is a complete cost optimization playbook for AWS Batch.

1. Use SPOT_CAPACITY_OPTIMIZED allocation strategy. This picks the Spot pool with the most available capacity, reducing the probability of interruption compared to the cheapest pool. For batch jobs, lower interruption rate usually beats slightly lower price.

2. Enable automatic checkpointing in your job code. Design jobs to write progress to S3 periodically. On Spot interruption and retry, read the checkpoint and resume from there instead of starting from scratch.

# Checkpoint pattern in Python
import json
import os

import boto3

s3 = boto3.client('s3')
CHECKPOINT_BUCKET = os.environ['S3_OUTPUT_BUCKET']
# The job ID stays the same across retry attempts, so a retry finds the checkpoint
JOB_ID = os.environ['AWS_BATCH_JOB_ID']
CHECKPOINT_KEY = f'checkpoints/{JOB_ID}.json'

def do_work(i):
    pass  # placeholder for the real unit of work

def save_checkpoint(state: dict):
    s3.put_object(
        Bucket=CHECKPOINT_BUCKET,
        Key=CHECKPOINT_KEY,
        Body=json.dumps(state)
    )

def load_checkpoint() -> dict:
    try:
        obj = s3.get_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY)
        return json.loads(obj['Body'].read())
    except s3.exceptions.NoSuchKey:
        return {}  # no checkpoint, start fresh

def process():
    checkpoint = load_checkpoint()
    # Resume at the first item that has not been completed yet
    start_index = checkpoint.get('next_index', 0)

    for i in range(start_index, 100000):
        do_work(i)
        if (i + 1) % 500 == 0:
            save_checkpoint({'next_index': i + 1})

    # Clean up checkpoint on success
    s3.delete_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY)

if __name__ == '__main__':
    process()

3. Use multiple compute environments with priority. Configure your job queue with a Spot environment at order 1 and an On-Demand environment at order 2. Batch fills from Spot first; if capacity is unavailable, jobs spill over to On-Demand. Critical jobs that cannot wait use a dedicated On-Demand queue.

resource "aws_batch_job_queue" "smart_queue" {
  name     = "smart-cost-queue"
  state    = "ENABLED"
  priority = 10

  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.spot.arn
  }

  compute_environment_order {
    order               = 2
    compute_environment = aws_batch_compute_environment.on_demand.arn
  }
}

4. Right-size vCPU and memory requests. Batch packs jobs onto instances based on declared vCPU/memory requirements. Over-provisioning wastes capacity; under-provisioning causes OOM kills. Profile your job with CloudWatch Container Insights before setting production requirements. As a rule of thumb, set memory to p95 observed usage plus 20% headroom.

5. Set minvCpus to 0. Never set a positive minvCpus unless you have a strict cold-start SLA. Keeping warm instances running 24/7 when jobs only run a few hours per day is a major cost leak.

6. Use Fargate Spot for short jobs under 4 vCPUs. Fargate Spot offers up to 70% savings over regular Fargate with no instance management. For jobs under 20 minutes, the Spot interruption risk is low and the savings are significant.
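The rule of thumb from item 4 (p95 observed memory plus 20% headroom) can be turned into a small calculation. This is a sketch under stated assumptions: the samples are peak memory readings in MiB from earlier runs of the same job, and the percentile uses the nearest-rank method; the function name is illustrative.

```python
import math


def recommended_memory_mib(samples_mib, headroom=0.20, percentile=0.95):
    """Return a memory request: the given percentile plus headroom, rounded up."""
    if not samples_mib:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_mib)
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` of the samples at or below it
    rank = max(1, math.ceil(percentile * len(ordered)))
    observed = ordered[rank - 1]
    return math.ceil(observed * (1 + headroom))
```

A single outlier run barely moves the p95, so the request tracks typical peaks without being sized for the worst case ever seen.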

Monitoring: CloudWatch Metrics and EventBridge Alerts

Operational visibility in AWS Batch comes from two sources: CloudWatch metrics for queue and job-level capacity signals, and EventBridge events for state change notifications. Together they give you proactive alerting before backlogs grow and immediate notification on job failures.

Key CloudWatch metrics:

Metric	Namespace	What it signals
PendingJobCount	AWS/Batch	Jobs waiting on dependencies; a sustained spike points at slow or failing upstream jobs
RunnableJobCount	AWS/Batch	Jobs waiting on capacity but eligible to run — high value = insufficient vCPUs
RunningJobCount	AWS/Batch	Currently executing jobs
SucceededJobCount	AWS/Batch	Completed successfully in the last minute
FailedJobCount	AWS/Batch	Failed in the last minute — alert on any non-zero value
CPUUtilization	AWS/ECS	Container-level CPU usage for rightsizing

Create a CloudWatch alarm on RunnableJobCount to detect capacity bottlenecks:

aws cloudwatch put-metric-alarm \
  --alarm-name "batch-runnable-backlog" \
  --alarm-description "Job queue backlog growing — check compute environment" \
  --namespace AWS/Batch \
  --metric-name RunnableJobCount \
  --dimensions Name=JobQueue,Value=default-queue \
  --statistic Average \
  --period 300 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:batch-ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:batch-ops-alerts
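If the queue-level metrics are not visible in your account or Region, one fallback is to publish equivalent custom metrics yourself and alarm on those. The sketch below counts jobs per status with list_jobs and publishes the counts with put_metric_data. It assumes `batch` and `cloudwatch` are boto3 clients; the `Custom/Batch` namespace and the function names are hypothetical choices.

```python
def count_jobs(batch, job_queue, status):
    """Count jobs in one status on a queue, following pagination."""
    count = 0
    token = None
    while True:
        kwargs = {"jobQueue": job_queue, "jobStatus": status}
        if token:
            kwargs["nextToken"] = token
        page = batch.list_jobs(**kwargs)
        count += len(page.get("jobSummaryList", []))
        token = page.get("nextToken")
        if not token:
            return count


def publish_queue_depth(batch, cloudwatch, job_queue,
                        statuses=("PENDING", "RUNNABLE", "RUNNING"),
                        namespace="Custom/Batch"):
    """Publish one count metric per status, e.g. RunnableJobCount."""
    metric_data = [
        {
            "MetricName": f"{status.title()}JobCount",
            "Dimensions": [{"Name": "JobQueue", "Value": job_queue}],
            "Value": count_jobs(batch, job_queue, status),
            "Unit": "Count",
        }
        for status in statuses
    ]
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)
    return metric_data
```

Run on a schedule (for example once a minute), this yields a time series that the alarm above can target by changing its namespace. Array jobs appear in list_jobs as a single parent entry, so the counts reflect parents rather than individual children.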

EventBridge (CloudWatch Events) captures every job state transition. Route FAILED transitions to an SNS topic for immediate notification:

{
  "source": ["aws.batch"],
  "detail-type": ["Batch Job State Change"],
  "detail": {
    "status": ["FAILED"],
    "jobQueue": [
      "arn:aws:batch:us-east-1:123456789012:job-queue/default-queue",
      "arn:aws:batch:us-east-1:123456789012:job-queue/high-priority-queue"
    ]
  }
}

# Terraform: EventBridge rule for failed batch jobs
resource "aws_cloudwatch_event_rule" "batch_failed" {
  name        = "batch-job-failed"
  description = "Notify on any Batch job failure"

  event_pattern = jsonencode({
    source        = ["aws.batch"]
    "detail-type" = ["Batch Job State Change"]
    detail = {
      status = ["FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "batch_failed_sns" {
  rule      = aws_cloudwatch_event_rule.batch_failed.name
  target_id = "BatchFailedToSNS"
  arn       = aws_sns_topic.alerts.arn

  input_transformer {
    input_paths = {
      jobName   = "$.detail.jobName"
      jobId     = "$.detail.jobId"
      reason    = "$.detail.statusReason"
    }
    input_template = "\"Batch job FAILED: <jobName> (ID: <jobId>). Reason: <reason>\""
  }
}
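If the rule targets a Lambda function instead of SNS, the same message can be built in code from the event's detail fields. A minimal sketch follows, assuming the standard "Batch Job State Change" event shape with jobName, jobId, status, and statusReason under detail; the function name is illustrative.

```python
def format_failure(event):
    """Return an alert string for a FAILED Batch job event, else None."""
    detail = event.get("detail", {})
    # Ignore events from other sources and non-failure transitions
    if event.get("source") != "aws.batch" or detail.get("status") != "FAILED":
        return None
    return (
        f"Batch job FAILED: {detail.get('jobName', 'unknown')} "
        f"(ID: {detail.get('jobId', 'unknown')}). "
        f"Reason: {detail.get('statusReason', 'not provided')}"
    )
```

Returning None for non-matching events keeps the handler safe even if the rule's pattern is later broadened.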

Job log access: All job stdout/stderr goes to CloudWatch Logs at /aws/batch/job by default when using the awslogs driver, unless the job definition sets its own log group. The stream name has the form {jobDefinitionName}/default/{ecsTaskId}, and the exact name for each attempt is returned by describe-jobs in the logStreamName field. You can query logs using CloudWatch Logs Insights across all jobs in a definition to find error patterns at scale.
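Rather than reconstructing stream names by hand, they can be read from the job description. This sketch assumes `batch` is a boto3 Batch client and collects the logStreamName of every attempt of a container job; the helper name is illustrative.

```python
def job_log_streams(batch, job_id):
    """Return the CloudWatch Logs stream names for all attempts of a job."""
    jobs = batch.describe_jobs(jobs=[job_id])["jobs"]
    if not jobs:
        raise LookupError(f"job {job_id} not found")
    job = jobs[0]
    streams = []
    # Earlier attempts (for example before a Spot retry) keep their own streams
    for attempt in job.get("attempts", []):
        name = attempt.get("container", {}).get("logStreamName")
        if name and name not in streams:
            streams.append(name)
    # The job-level container entry reflects the most recent attempt
    latest = job.get("container", {}).get("logStreamName")
    if latest and latest not in streams:
        streams.append(latest)
    return streams
```

Each returned name can be passed to CloudWatch Logs together with the job's log group to fetch the output of that attempt.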

Container Insights for Batch: Enable ECS Container Insights on the Batch-managed ECS cluster to get per-task CPU, memory, network, and storage metrics. The cluster name is generated by Batch; look it up in the ecsClusterArn field returned by describe-compute-environments. This is the best tool for rightsizing vCPU and memory declarations in job definitions. Navigate to ECS → Clusters → the Batch-managed cluster → Metrics tab.

Frequently Asked Questions

Q: How does AWS Batch handle Spot interruptions?

When a Spot instance is reclaimed, the ECS container is stopped and the Batch job moves back to RUNNABLE state (if retry attempts remain). The job is rescheduled on a new instance automatically. Use the evaluateOnExit block with an onStatusReason pattern such as "Host EC2*" to retry Spot interruptions explicitly while exiting immediately on application errors. Every retry, including one caused by an interruption, counts against the attempts limit.

Q: What is the maximum size of an array job?

AWS Batch supports array jobs up to 10,000 children. For workloads larger than 10,000 units, break them into multiple array jobs and use Step Functions or a parent job to coordinate them. The 10,000-child limit is generous enough for the vast majority of batch workloads.
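Splitting a larger workload into several array jobs can be sketched as follows. The helper returns an offset and size for each array job so that a child can compute its global item number as offset plus AWS_BATCH_JOB_ARRAY_INDEX; passing the offset to the job (for example as an environment variable) is an assumption of this sketch, and the function name is illustrative.

```python
MAX_ARRAY_SIZE = 10_000  # array job child limit stated above


def plan_array_jobs(total_units, max_size=MAX_ARRAY_SIZE):
    """Return (offset, size) pairs that together cover total_units items."""
    if total_units < 0 or max_size < 1:
        raise ValueError("total_units must be >= 0 and max_size >= 1")
    plans = []
    offset = 0
    while offset < total_units:
        size = min(max_size, total_units - offset)
        plans.append((offset, size))
        offset += size
    return plans
```

A final chunk of size 1 is better submitted as a plain single job, since an array job needs at least two children.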

Q: Can I use Fargate and EC2 in the same job queue?

No — Fargate and EC2 compute environments cannot be mixed in the same job queue. Create separate queues for Fargate and EC2 workloads. However, a single job queue can reference multiple EC2 compute environments (e.g., Spot + On-Demand) as fallback options ordered by priority.

Q: How do I pass large inputs to batch jobs?

Never put large data in environment variables or job parameters (limits are in KB). The standard pattern is to write input data to S3 before submitting the job, pass only the S3 URI as an environment variable or command argument, and have the job fetch the data at startup. For structured job parameters, put a JSON config file in S3 and pass the S3 URI.
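The S3 hand-off pattern can be sketched in a few lines. This assumes `s3` and `batch` are boto3 clients; the environment variable name `CONFIG_S3_URI` and the helper name are hypothetical, and the job's container is expected to download and parse the file at startup.

```python
import json


def submit_with_s3_config(s3, batch, config, bucket, key,
                          job_name, job_queue, job_definition):
    """Upload a JSON config to S3 and submit a job that receives its URI."""
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(config).encode("utf-8"),
    )
    uri = f"s3://{bucket}/{key}"
    response = batch.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        # Only the pointer travels with the job, never the payload itself
        containerOverrides={
            "environment": [{"name": "CONFIG_S3_URI", "value": uri}]
        },
    )
    return response["jobId"], uri
```

The job role needs read access to the bucket, and writing the config before calling submit_job ensures the file exists by the time the container starts.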

Quick Reference

Job states:
SUBMITTED → PENDING → RUNNABLE → STARTING → RUNNING → SUCCEEDED / FAILED

Max array size: 10,000 children

Max job timeout: No hard limit

Spot savings: Up to 90%

GPU instances: P4d (A100), G4dn (T4), G5 (A10G)

Fargate max: 16 vCPU / 120 GB
