AWS EC2 Spot Instances: Cost Savings and Interruption Handling

AWS EC2 Spot Instances let you run workloads on unused EC2 capacity at discounts of up to 90% compared to On-Demand pricing, making them one of the most powerful cost-reduction levers in your AWS toolkit. The trade-off is that AWS can reclaim Spot Instances with a two-minute warning when it needs the capacity back, so they require fault-tolerant, interruption-aware workload design. This guide walks you through the Spot pricing model, how to request and manage Spot capacity, Spot Fleet diversification, graceful interruption handling with EventBridge, integration with Auto Scaling Groups, and battle-tested patterns for CI/CD pipelines and batch workloads.

1. How Spot Pricing Works
2. Spot vs On-Demand vs Reserved Instances
3. Requesting Spot Instances (CLI & Console)
4. Spot Fleet and Diversification Strategy
5. Interruption Handling: 2-Minute Warning + EventBridge
6. Spot Instances with Auto Scaling Groups
7. CI/CD and Batch Workloads on Spot

1. How Spot Pricing Works

EC2 Spot Instances use AWS's spare compute capacity. AWS publishes a Spot price for each instance type in each Availability Zone, which adjusts gradually based on longer-term supply and demand. You no longer bid for capacity: AWS charges the current Spot price for as long as your instance runs. Setting a maximum price is optional (it defaults to the On-Demand price), so in practice interruptions happen when AWS needs the capacity back, or in the rarer case that the Spot price rises above a maximum you chose to set. This model replaced the old bidding system in November 2017.

Spot prices are typically 60–90% lower than On-Demand prices. For example, an m5.2xlarge On-Demand in us-east-1 costs roughly $0.384/hr, while its Spot price often sits around $0.04–$0.08/hr. Over a 730-hour month of continuous use, that is a saving of roughly $220–$250 on a single instance.
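The arithmetic behind that estimate can be sketched in a few lines of shell. This is an illustrative helper, not an AWS tool: the function name monthly_saving and the 730-hour default month are assumptions, and the prices are the ones quoted above.

```shell
#!/usr/bin/env bash
# monthly_saving ON_DEMAND_HOURLY SPOT_HOURLY [HOURS]
# Prints the monthly saving in dollars. HOURS defaults to 730 (assumed average month).
monthly_saving() {
  local on_demand="$1" spot="$2" hours="${3:-730}"
  awk -v od="$on_demand" -v sp="$spot" -v h="$hours" \
    'BEGIN { printf "%.2f\n", (od - sp) * h }'
}

monthly_saving 0.384 0.04   # low end of the quoted Spot price range
monthly_saving 0.384 0.08   # high end of the quoted Spot price range
```

Pass a third argument to model part-time usage, for example 176 hours for an instance that runs 8 hours a day on 22 working days.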

Key concepts to understand:

Spot capacity pool — a unique combination of instance type, OS, and Availability Zone. Each pool has its own Spot price and interruption frequency.
Spot price history — AWS exposes 90 days of historical Spot prices via the CLI and console. Use this data to select pools with stable pricing and low interruption rates.
Interruption rate — the Spot Instance Advisor rates each instance type in each Region by how often it was interrupted over the preceding month, in bands from "<5%" to ">20%". Instance types in the lower bands are better suited to longer-running jobs.
Rebalance recommendation — Before issuing a two-minute interruption notice, AWS may send a rebalance recommendation signal (EC2 Instance Rebalance Recommendation event) giving you extra time to proactively replace the instance.

# Check current Spot price for m5.2xlarge in us-east-1
aws ec2 describe-spot-price-history \
  --instance-types m5.2xlarge \
  --product-descriptions "Linux/UNIX" \
  --region us-east-1 \
  --query 'SpotPriceHistory[*].{AZ:AvailabilityZone,Price:SpotPrice,Time:Timestamp}' \
  --output table

Tip: Always check Spot price history across multiple AZs and across similar instance families (e.g., m5, m5a, m5n) before committing to a pool. A pool with a price that has been flat for 30+ days is a strong signal of abundant capacity.
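One way to act on this tip is to reduce the price history to a minimum, maximum, and spread per Availability Zone. The sketch below is a hypothetical helper (summarize_spot_prices is not an AWS command); it assumes the history has already been flattened to "AZ price" rows, which the CLI can produce with --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' --output text. The sample prices are made up for illustration.

```shell
#!/usr/bin/env bash
# summarize_spot_prices: reads "AZ<whitespace>PRICE" rows on stdin and prints
# one row per AZ with: AZ, minimum price, maximum price, and max-min spread.
summarize_spot_prices() {
  awk '
    {
      az = $1; p = $2 + 0
      if (!(az in min) || p < min[az]) min[az] = p
      if (!(az in max) || p > max[az]) max[az] = p
    }
    END {
      for (az in min)
        printf "%s %.4f %.4f %.4f\n", az, min[az], max[az], max[az] - min[az]
    }
  ' | sort
}

# Illustrative input only; in practice, pipe the CLI output in instead:
#   aws ec2 describe-spot-price-history ... --output text | summarize_spot_prices
summarize_spot_prices <<'EOF'
us-east-1a 0.0610
us-east-1a 0.0650
us-east-1b 0.0720
EOF
```

A spread close to zero over a long window is the "flat price" signal described in the tip; run the same summary per instance family to compare pools.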

2. Spot vs On-Demand vs Reserved Instances

Choosing the right purchasing model is fundamental to EC2 cost optimisation. AWS provides three primary models — On-Demand, Reserved Instances (RIs), and Spot — each suited to different workload characteristics.

Feature	On-Demand	Reserved (1-yr, no upfront)	Spot
Typical discount vs On-Demand	—	~36%	60–90%
Commitment	None	1 or 3 years	None
Interruption risk	None	None	2-min notice
Best for	Unpredictable, short spikes	Steady-state, predictable load	Fault-tolerant, flexible workloads
Availability SLA	Yes	Yes	No (best effort)

The optimal cost strategy is to combine all three: use Reserved Instances or Savings Plans to cover your baseline capacity, On-Demand to cover short-lived demand spikes that Spot cannot serve fast enough, and Spot for everything that is stateless, retryable, or time-flexible — batch jobs, CI runners, ML training, video transcoding, data processing, and web-tier scale-out.

A common production pattern is a mixed instances Auto Scaling Group where 20–30% of capacity is On-Demand (or Reserved) to guarantee a floor, and the remaining 70–80% is Spot across multiple instance families and AZs. This gives you meaningful cost savings while maintaining availability even when one Spot pool is interrupted.

3. Requesting Spot Instances (CLI & Console)

The modern way to launch Spot Instances is through a Launch Template with InstanceMarketOptions set to spot, or via the run-instances CLI command. The older request-spot-instances API still works but is considered legacy.

Using the AWS CLI

# Launch a Spot Instance using run-instances (recommended approach)
aws ec2 run-instances \
  --image-id ami-0c02fb55956c7d316 \
  --instance-type m5.xlarge \
  --key-name my-keypair \
  --subnet-id subnet-0abc1234def56789a \
  --security-group-ids sg-0123456789abcdef0 \
  --instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"one-time","InstanceInterruptionBehavior":"terminate"}}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=spot-worker}]' \
  --count 1

The InstanceInterruptionBehavior can be set to terminate, stop, or hibernate. terminate is the default. stop and hibernate let you resume quickly after an interruption, but they require a persistent Spot request and an EBS-backed root volume, and you keep paying for the EBS volumes while the instance is stopped.

Creating a Launch Template for Spot

# Create a Launch Template with Spot market options
aws ec2 create-launch-template \
  --launch-template-name spot-worker-lt \
  --version-description "Spot instance template v1" \
  --launch-template-data '{
    "ImageId": "ami-0c02fb55956c7d316",
    "InstanceType": "m5.xlarge",
    "KeyName": "my-keypair",
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "InstanceMarketOptions": {
      "MarketType": "spot",
      "SpotOptions": {
        "SpotInstanceType": "persistent",
        "InstanceInterruptionBehavior": "stop"
      }
    },
    "TagSpecifications": [{
      "ResourceType": "instance",
      "Tags": [{"Key": "Name", "Value": "spot-worker"}]
    }]
  }'

Note: persistent Spot requests automatically re-launch the instance after it is stopped or interrupted (as long as capacity is available). Use one-time for tasks that should not automatically restart — batch jobs where re-launch is managed by your own orchestration layer.

Checking Spot Instance Status

# List all running Spot Instances with their type, AZ, and Spot request ID
aws ec2 describe-instances \
  --filters "Name=instance-lifecycle,Values=spot" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,AZ:Placement.AvailabilityZone,SpotRequest:SpotInstanceRequestId}' \
  --output table

4. Spot Fleet and Diversification Strategy

A Spot Fleet is a collection of Spot (and optionally On-Demand) instances that fulfils a target capacity request across multiple launch specifications: different instance types, AMIs, and Availability Zones. Spot Fleet selects capacity pools according to the allocation strategy you choose, and replaces interrupted instances from alternate pools. This is the key to building highly available, low-cost compute fleets.

The three allocation strategies you are most likely to use are:

lowestPrice — picks the cheapest pool(s). Maximises savings but concentrates risk in one pool. Use only for batch jobs that can tolerate simultaneous interruptions.
capacityOptimized — picks the pool with the most available Spot capacity. Reduces interruption frequency at the cost of a slightly higher price. Recommended for most production workloads.
priceCapacityOptimized (newest) — balances both price and capacity. Best default choice for most workloads.

# spot-fleet-config.json: Spot Fleet diversified across 4 instance types and 2 subnets (leave this comment line out of the JSON file itself)
{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
    "AllocationStrategy": "priceCapacityOptimized",
    "TargetCapacity": 10,
    "SpotPrice": "0.15",
    "LaunchSpecifications": [
      {
        "ImageId": "ami-0c02fb55956c7d316",
        "InstanceType": "m5.xlarge",
        "SubnetId": "subnet-0abc1234",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0c02fb55956c7d316",
        "InstanceType": "m5a.xlarge",
        "SubnetId": "subnet-0abc1234",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0c02fb55956c7d316",
        "InstanceType": "m4.xlarge",
        "SubnetId": "subnet-0def5678",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0c02fb55956c7d316",
        "InstanceType": "m5.2xlarge",
        "SubnetId": "subnet-0def5678",
        "WeightedCapacity": 2
      }
    ],
    "Type": "maintain"
  }
}

# Submit the fleet request. The file wraps the config in a top-level
# "SpotFleetRequestConfig" key, so pass it with --cli-input-json.
aws ec2 request-spot-fleet \
  --cli-input-json file://spot-fleet-config.json

Using "Type": "maintain" ensures Spot Fleet continuously re-launches replacement capacity when instances are interrupted, maintaining your target of 10 weighted capacity units. The WeightedCapacity field lets you mix instance sizes: an m5.2xlarge counts as 2 units while an m5.xlarge counts as 1.
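As a rough illustration of how weights translate into instance counts, the sketch below divides the target capacity by a pool's weight and rounds up, so the fleet never lands below its target. The helper name instances_needed is hypothetical, and the weight of 3 in the last call is an invented value used only to show the rounding.

```shell
#!/usr/bin/env bash
# instances_needed TARGET_CAPACITY WEIGHT
# Number of instances of a single weight needed to reach the target (rounded up).
instances_needed() {
  local target="$1" weight="$2"
  echo $(( (target + weight - 1) / weight ))
}

instances_needed 10 1   # m5.xlarge only (weight 1)
instances_needed 10 2   # m5.2xlarge only (weight 2)
instances_needed 10 3   # hypothetical weight that does not divide the target evenly
```

In a real fleet the target is usually met by a mix of pools, so treat this as the count for the single-pool case.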

Diversification rule of thumb: Specify at least 5–10 instance types across 2–3 Availability Zones. Never rely on a single instance type for Spot capacity. The more pools you target, the lower your probability of a large simultaneous interruption event.

EC2 Fleet vs Spot Fleet: EC2 Fleet is the newer, more flexible API that supersedes Spot Fleet. It supports Spot and On-Demand capacity in a single request and offers the same allocation strategies. Reserved Instances are a billing discount rather than a launch option, so they apply automatically to matching On-Demand usage. For greenfield projects, prefer EC2 Fleet or mixed instances Auto Scaling Groups over raw Spot Fleet.

5. Interruption Handling: 2-Minute Warning + EventBridge

When AWS needs to reclaim a Spot Instance, it issues a Spot Instance Interruption Notice two minutes before the instance is terminated or stopped (hibernation is the exception and begins immediately). The notice is delivered through two channels: the EC2 instance metadata service (accessible from inside the instance) and Amazon EventBridge (for centralised, account-level automation). EventBridge delivery is best-effort, so treat the metadata service as the authoritative signal on the instance itself. Properly handling these signals is what separates resilient Spot workloads from fragile ones.

Polling the Instance Metadata (from inside the instance)

#!/bin/bash
# interruption-checker.sh
# Run this in a loop on your Spot Instance to detect the 2-minute warning

IMDS="http://169.254.169.254/latest"

while true; do
  # Request a fresh IMDSv2 token on every poll; a single long-lived token
  # would expire and silently break detection on long-running instances
  TOKEN=$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

  HTTP_CODE=$(curl -s --max-time 2 -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    "$IMDS/meta-data/spot/termination-time")

  if [ "$HTTP_CODE" = "200" ]; then
    echo "$(date): INTERRUPTION NOTICE received. Starting graceful shutdown..."

    # 1. Signal the application to stop accepting new work
    systemctl stop my-worker.service

    # 2. Upload any partial results to S3
    aws s3 sync /var/app/results/ s3://my-results-bucket/partial/

    # 3. Deregister from load balancer or service registry
    # (application-specific)

    # 4. Exit cleanly — let the OS terminate
    break
  fi

  sleep 5
done

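The metadata endpoint /latest/meta-data/spot/termination-time returns HTTP 404 normally. When a termination is scheduled, it returns HTTP 200 with the ISO-8601 timestamp at which the instance will be terminated. You have the window between detection and that timestamp (up to about two minutes) to complete graceful shutdown tasks. If your instances use the stop or hibernate interruption behaviour, poll /latest/meta-data/spot/instance-action instead, which reports the pending action and its time for all three behaviours.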
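Because the endpoint returns a timestamp, a shutdown script can budget its remaining time instead of assuming a full two minutes. The following is a minimal sketch: seconds_until is a hypothetical helper, it relies on GNU date (the -d flag, as found on Amazon Linux and most Linux distributions), and the timestamps in the example are illustrative.

```shell
#!/usr/bin/env bash
# seconds_until TERMINATION_TIME [NOW_EPOCH]
# Prints the seconds left before TERMINATION_TIME (an ISO-8601 UTC timestamp).
# NOW_EPOCH defaults to the current time; passing it makes the result reproducible.
seconds_until() {
  local termination_time="$1"
  local now="${2:-$(date -u +%s)}"
  local deadline
  deadline=$(date -u -d "$termination_time" +%s) || return 1
  echo $(( deadline - now ))
}

# Example: a notice detected at 10:00:00Z for a termination at 10:02:00Z
detected_at=$(date -u -d "2026-06-06T10:00:00Z" +%s)
seconds_until "2026-06-06T10:02:00Z" "$detected_at"
```

A shutdown routine could compare this value against the expected duration of each step and skip optional work, such as uploading partial results, when too little time remains.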

EventBridge Rule for Centralised Interruption Handling

For fleet-scale automation (draining instances from a load balancer, notifying your job queue, triggering a Lambda to spin up replacements), use EventBridge rather than per-instance polling. AWS emits an EC2 Spot Instance Interruption Warning event to EventBridge when an interruption notice is issued.

# Create an EventBridge rule that triggers a Lambda on Spot interruption
aws events put-rule \
  --name "SpotInterruptionHandler" \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"]
  }' \
  --state ENABLED

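# Allow EventBridge to invoke the function (without this, the rule matches but the Lambda never runs)
aws lambda add-permission \
  --function-name handle-spot-interruption \
  --statement-id spot-interruption-rule \
  --action lambda:InvokeFunction --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/SpotInterruptionHandler

# Add the Lambda function as a target
aws events put-targets \
  --rule "SpotInterruptionHandler" \
  --targets '[{
    "Id": "SpotInterruptionLambda",
    "Arn": "arn:aws:lambda:us-east-1:123456789012:function:handle-spot-interruption"
  }]'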

The Lambda receives an event payload like this:

{
  "version": "0",
  "id": "12345678-1234-1234-1234-123456789012",
  "detail-type": "EC2 Spot Instance Interruption Warning",
  "source": "aws.ec2",
  "account": "123456789012",
  "time": "2026-06-06T10:00:00Z",
  "region": "us-east-1",
  "detail": {
    "instance-id": "i-0abcd1234efgh5678",
    "instance-action": "terminate"
  }
}

From the Lambda, you can call deregister-targets on an ALB target group, send a drain signal to your container orchestrator, or update a DynamoDB table that your job scheduler reads to avoid dispatching new work to the instance.

Rebalance Recommendation: AWS may send an EC2 Instance Rebalance Recommendation EventBridge event when a Spot Instance is at elevated risk of interruption. This softer signal can arrive ahead of the two-minute notice, although extra time is not guaranteed and it may arrive together with the notice. Use it to launch a replacement proactively and minimise the gap in capacity.

6. Spot Instances with Auto Scaling Groups

The most operationally mature way to use Spot Instances in production is through mixed instances Auto Scaling Groups (ASGs). A mixed instances ASG lets you specify a base capacity of On-Demand instances plus additional Spot capacity, all managed by the Auto Scaling service which handles interruption replacement automatically.

When a Spot Instance in the ASG is interrupted, Auto Scaling detects the termination and launches a replacement, choosing a pool among your configured instance types according to the Spot allocation strategy. This happens without any manual intervention. Combined with health checks, lifecycle hooks, and capacity rebalancing, mixed instances ASGs are production-grade.

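# Create a mixed instances Auto Scaling Group using a Launch Template.
# The template must not set InstanceMarketOptions: Auto Scaling rejects templates that
# request Spot themselves when they are used in a mixed instances policy.
# "worker-lt" is a placeholder name for a template without market options.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name "web-tier-mixed-asg" \
  --min-size 2 \
  --max-size 20 \
  --desired-capacity 6 \
  --vpc-zone-identifier "subnet-0abc1234,subnet-0def5678,subnet-0ghi9012" \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "worker-lt",
        "Version": "$Latest"
      },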
      "Overrides": [
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m5a.xlarge"},
        {"InstanceType": "m5n.xlarge"},
        {"InstanceType": "m4.xlarge"},
        {"InstanceType": "m5.2xlarge", "WeightedCapacity": "2"},
        {"InstanceType": "m5a.2xlarge", "WeightedCapacity": "2"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 20,
      "SpotAllocationStrategy": "price-capacity-optimized"
    }
  }' \
  --health-check-type "ELB" \
  --health-check-grace-period 120

The key fields in InstancesDistribution:

OnDemandBaseCapacity: 2 — the first 2 instances are always On-Demand, guaranteeing a floor even if all Spot capacity is unavailable.
OnDemandPercentageAboveBaseCapacity: 20 — of capacity above the base, 20% is On-Demand and 80% is Spot.
SpotAllocationStrategy: price-capacity-optimized — the best general strategy. It selects the Spot pool that offers the lowest price among the pools with the highest available capacity.
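To make the first two fields concrete, here is a minimal sketch of the arithmetic they describe. The function name nominal_split is illustrative, and it reports only the nominal (fractional) split: Auto Scaling works in whole capacity units, and how it rounds is deliberately not modelled here.

```shell
#!/usr/bin/env bash
# nominal_split DESIRED ON_DEMAND_BASE ON_DEMAND_PERCENT_ABOVE_BASE
# Prints the nominal On-Demand and Spot shares of the desired capacity.
nominal_split() {
  local desired="$1" base="$2" pct="$3"
  awk -v d="$desired" -v b="$base" -v p="$pct" 'BEGIN {
    if (b > d) b = d                  # the base cannot exceed the desired capacity
    above = d - b                     # capacity subject to the percentage split
    od = b + above * p / 100          # base plus the On-Demand share above it
    printf "on_demand=%.1f spot=%.1f\n", od, d - od
  }'
}

nominal_split 6 2 20    # the values used in the Auto Scaling Group above
nominal_split 12 2 20   # a capacity where the split comes out in whole units
```

With a small desired capacity the On-Demand base dominates, so the effective Spot share is well below 80%; the percentage only approaches its nominal value as the group scales out.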

To enable Capacity Rebalancing (which proactively replaces at-risk Spot Instances before they receive an interruption notice):

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "web-tier-mixed-asg" \
  --capacity-rebalance

With Capacity Rebalancing enabled, when Auto Scaling receives a Rebalance Recommendation event for one of its Spot Instances, it proactively launches a replacement and only terminates the at-risk instance after the replacement passes health checks. This can result in brief moments of over-capacity but dramatically reduces any service disruption from Spot interruptions.

7. CI/CD and Batch Workloads on Spot

Spot Instances are an ideal fit for workloads that are inherently retryable and whose value lies in throughput rather than individual instance uptime. CI/CD build runners, ML training jobs, video encoding, data pipeline ETL, nightly report generation, and large-scale web scraping are all textbook Spot use cases. The following patterns cover the two most common production scenarios.

CI/CD Build Runners (GitHub Actions / Jenkins)

For GitHub Actions, you can use the actions-runner-controller (ARC) on EKS with Spot node groups, or use EC2-based self-hosted runners launched as Spot Instances via an Auto Scaling Group. Each runner registers with GitHub, picks up a build job, and terminates when the job completes. If a runner is interrupted mid-job, GitHub marks the job as failed once it loses contact with the runner. It does not re-queue the job for you, so plan to re-run failed jobs manually or through your own retry automation.

# CloudFormation snippet: Auto Scaling Group for GitHub Actions Spot runners
# (UserData registers runner on launch, deregisters on termination via lifecycle hook)

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  RunnerLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: github-runner-spot
      LaunchTemplateData:
        ImageId: ami-0c02fb55956c7d316   # Amazon Linux 2023 AMI
        InstanceType: c5.xlarge
        IamInstanceProfile:
          Arn: !GetAtt RunnerInstanceProfile.Arn
        # No InstanceMarketOptions here: the ASG's MixedInstancesPolicy below decides the
        # Spot/On-Demand split, and Auto Scaling rejects templates that request Spot themselves.
        UserData:
          Fn::Base64: |
            #!/bin/bash
            # Install runner
            cd /home/ec2-user
            curl -o actions-runner.tar.gz -L \
              https://github.com/actions/runner/releases/download/v2.317.0/actions-runner-linux-x64-2.317.0.tar.gz
            tar xzf actions-runner.tar.gz
            # Register (token fetched from SSM Parameter Store)
            TOKEN=$(aws ssm get-parameter --name /github/runner-token \
              --with-decryption --query Parameter.Value --output text)
            # User data runs as root; the runner scripts refuse to run as root unless this is set
            export RUNNER_ALLOW_RUNASROOT=1
            ./config.sh --url https://github.com/my-org --token "$TOKEN" \
              --labels spot,x64 --unattended --ephemeral
            ./run.sh

  RunnerASG:
    Type: AWS::AutoScaling::AutoScalingGroup
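    Properties:
      VPCZoneIdentifier:       # placeholder subnet IDs; an ASG needs subnets or AZs
        - subnet-0abc1234
        - subnet-0def5678
      MinSize: 0
      MaxSize: 50
      DesiredCapacity: 0     # Scaled by GitHub webhooks / KEDA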
      MixedInstancesPolicy:
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateName: github-runner-spot
            Version: '$Latest'
          Overrides:
            - InstanceType: c5.xlarge
            - InstanceType: c5a.xlarge
            - InstanceType: c5n.xlarge
            - InstanceType: c4.xlarge
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: price-capacity-optimized

Batch Processing with AWS Batch on Spot

AWS Batch natively supports Spot Instances through managed compute environments. When a Spot interruption terminates the instance running a job, Batch can retry the job on a new instance, provided the job's retry strategy allows more than one attempt (the default is a single attempt). You control this via the attempts parameter of the retry strategy in each job definition.

# Create a Managed Compute Environment using Spot
aws batch create-compute-environment \
  --compute-environment-name spot-batch-env \
  --type MANAGED \
  --state ENABLED \
  --compute-resources '{
    "type": "SPOT",
    "bidPercentage": 60,
    "minvCpus": 0,
    "maxvCpus": 256,
    "instanceTypes": ["m5", "m5a", "m4", "c5", "c5a", "r5"],
    "subnets": ["subnet-0abc1234", "subnet-0def5678"],
    "securityGroupIds": ["sg-0123456789abcdef0"],
    "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "tags": {"Project": "batch-pipeline", "CostCenter": "data-eng"}
  }' \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole

Setting "bidPercentage": 60 means Batch will only use Spot pools where the current price is at or below 60% of the On-Demand price — a useful guard against running in a high-price environment that erodes your savings. Setting "minvCpus": 0 ensures you pay nothing when there is no work to process.
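The bidPercentage rule reduces to a single comparison, sketched below. The helper within_bid_percentage is hypothetical (Batch applies this check internally), and the Spot prices in the example are invented; only the $0.384/hr On-Demand figure comes from earlier in this guide.

```shell
#!/usr/bin/env bash
# within_bid_percentage ON_DEMAND_PRICE SPOT_PRICE BID_PERCENTAGE
# Exit status 0 if the Spot price is at or below the given percentage
# of the On-Demand price, 1 otherwise.
within_bid_percentage() {
  local on_demand="$1" spot="$2" pct="$3"
  awk -v od="$on_demand" -v sp="$spot" -v p="$pct" \
    'BEGIN { exit !(sp <= od * p / 100) }'
}

# With On-Demand at $0.384/hr and bidPercentage 60, the ceiling is $0.2304/hr.
for spot in 0.08 0.25; do
  if within_bid_percentage 0.384 "$spot" 60; then
    echo "spot=$spot eligible"
  else
    echo "spot=$spot skipped"
  fi
done
```

A lower percentage protects your savings but shrinks the set of usable pools, so jobs may wait longer in the queue when prices are elevated.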

Best Practice — Checkpointing: For long-running batch jobs (>30 minutes), implement regular checkpointing to S3 or EFS. When a job is interrupted and retried, it should resume from the last checkpoint rather than restarting from scratch. This makes Spot economically viable even for jobs that take hours.
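The checkpoint-and-resume pattern can be sketched with a local file standing in for S3 or EFS. Everything here is illustrative: process_with_checkpoint, its item-counter model of "work", and the STOP_AFTER argument (which simulates an interruption) are assumptions, not part of AWS Batch. In a real job you would copy the checkpoint to durable storage after each update.

```shell
#!/usr/bin/env bash
# process_with_checkpoint TOTAL_ITEMS CHECKPOINT_FILE [STOP_AFTER]
# Processes items 1..TOTAL_ITEMS, recording the last finished item in
# CHECKPOINT_FILE. STOP_AFTER > 0 simulates an interruption after that
# many items in this run (returns 1).
process_with_checkpoint() {
  local total="$1" checkpoint="$2" stop_after="${3:-0}"
  local start=1 done_this_run=0 i

  # Resume after the last item recorded by a previous attempt, if any.
  if [ -s "$checkpoint" ]; then
    start=$(( $(cat "$checkpoint") + 1 ))
  fi

  for (( i = start; i <= total; i++ )); do
    echo "processing item $i"        # placeholder for the real unit of work
    # Write to a temp file, then rename, so a crash never leaves a partial checkpoint.
    echo "$i" > "$checkpoint.tmp" && mv "$checkpoint.tmp" "$checkpoint"
    done_this_run=$(( done_this_run + 1 ))
    if [ "$stop_after" -gt 0 ] && [ "$done_this_run" -ge "$stop_after" ]; then
      return 1
    fi
  done
}

ckpt="$(mktemp)"
process_with_checkpoint 5 "$ckpt" 2 || echo "interrupted after item $(cat "$ckpt")"
process_with_checkpoint 5 "$ckpt"    # second attempt resumes at item 3
rm -f "$ckpt"
```

The retried attempt repeats at most the one item that was in flight, which is why checkpoint granularity should be chosen so that a single unit of work fits comfortably inside the two-minute notice.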

Cost Savings Summary

To put numbers on the opportunity: a mid-size engineering team running 20 CI build runners for 8 hours/day, 22 days/month on c5.xlarge On-Demand uses 3,520 instance-hours, which at the us-east-1 rate of roughly $0.17/hr comes to approximately $598/month. On Spot at an average 75% discount, the same workload costs around $150/month, a saving of roughly $5,400 per year on CI infrastructure alone. At fleet scale across data pipelines, ML training, and test environments, Spot savings can reach six figures annually.