AWS SageMaker: Machine Learning Deployment Guide

AWS SageMaker is a managed ML platform that covers the entire machine learning lifecycle, from data preparation and model training to deployment, monitoring, and retraining, without requiring you to manage the underlying infrastructure. This guide walks through the major SageMaker capabilities you'll encounter in production: training jobs with spot instances, real-time and serverless endpoints, MLOps pipelines, Feature Store, and model monitoring. Code examples use boto3 and the SageMaker Python SDK throughout.

SageMaker Platform Overview
Training Jobs: Built-in Algorithms vs Custom Containers
Spot Training: 70% Cost Savings
Hyperparameter Tuning Jobs
Model Deployment: Endpoints and Inference Modes
SageMaker Pipelines: MLOps CI/CD
Feature Store: Online and Offline
Model Monitoring: Drift, Quality, Bias
SageMaker Canvas: No-Code ML
Cost Optimization Strategies
Frequently Asked Questions

SageMaker Platform Overview

SageMaker is not a single service — it is a family of integrated tools that cover every phase of ML development. Understanding what each component does prevents confusion and helps you pick the right tool for each task.

Component	Purpose	When to Use
SageMaker Studio	Browser-based IDE for ML	Exploratory work, notebook-first development
SageMaker Notebooks	Managed Jupyter notebooks	Quick experiments, no Studio needed
Training Jobs	Managed distributed training on EC2	Training any model at scale
Processing Jobs	Managed data preprocessing / evaluation	ETL, feature engineering, batch scoring
Real-time Endpoints	Always-on HTTPS inference endpoint	Low-latency online predictions
Serverless Inference	Pay-per-invocation inference	Sporadic or unpredictable traffic
Batch Transform	Offline bulk predictions on S3 data	Scoring large datasets overnight
Async Inference	Queued inference for large payloads	Large inputs (video, documents)
Pipelines	ML CI/CD orchestration	Repeatable model build + deploy workflows
Feature Store	Centralised feature repository	Sharing features across teams/models
Model Monitor	Production data/model quality checks	Detecting drift and degradation
Canvas	No-code AutoML UI	Business analysts, rapid prototyping

Architecture Tip: A production ML system typically uses SageMaker Pipelines to orchestrate training → evaluation → registration → deployment, Feature Store for consistent feature retrieval at training and inference time, and Model Monitor to detect when the deployed model needs retraining.

Training Jobs: Built-in Algorithms vs Custom Containers

SageMaker training jobs run on fully managed compute. You supply your training script and data location; SageMaker provisions the instance, copies the data, runs your code, saves the model artifact to S3, and terminates the instance. You are only billed for the seconds the instance is running.

Built-in Algorithms

SageMaker ships ~20 built-in algorithms as Docker images maintained by AWS. Common ones include XGBoost, Linear Learner, K-Means, Random Cut Forest (anomaly detection), BlazingText (NLP), and Object Detection. Using built-ins means zero container maintenance — just point to the image URI and supply hyperparameters.

Custom Training Scripts with Framework Containers

For PyTorch, TensorFlow, Scikit-learn, or HuggingFace, SageMaker provides framework-specific managed containers. You write your training script as a standard Python file and pass it to the estimator — SageMaker handles the rest.

import boto3
import sagemaker
from sagemaker.pytorch import PyTorch

# Initialise session
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()

# Define PyTorch estimator
estimator = PyTorch(
    entry_point="train.py",           # Your training script
    source_dir="./src",               # Directory containing train.py + requirements.txt
    role=role,
    instance_type="ml.p3.2xlarge",    # GPU instance
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    hyperparameters={
        "epochs": 30,
        "batch-size": 64,
        "learning-rate": 0.001,
    },
    output_path=f"s3://{bucket}/models/",
    environment={"WANDB_DISABLED": "true"},
)

# Launch training job — blocks until complete
estimator.fit({
    "train": f"s3://{bucket}/data/train/",
    "val":   f"s3://{bucket}/data/val/",
})

# Model artifact location
print(estimator.model_data)
# s3://my-bucket/models/pytorch-training-2026-06-06-12-00-00-000/output/model.tar.gz

Inside train.py, SageMaker injects the hyperparameters as CLI arguments and sets environment variables like SM_CHANNEL_TRAIN and SM_MODEL_DIR so your script knows where to read data and write the model artifact.
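
As a rough sketch of what this looks like inside train.py: the hyperparameters from the estimator arrive as --key value flags, and the data and model locations come from the environment. The parse_args helper and the local fallback paths are illustrative assumptions, not part of the original script.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Read hyperparameters from CLI flags and SageMaker paths from env vars."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Hyperparameters: flag names match the keys passed to the estimator
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    # Paths set by SageMaker; the fallbacks are assumed defaults for local runs
    parser.add_argument("--train-dir", default=env.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--val-dir", default=env.get("SM_CHANNEL_VAL", "./data/val"))
    parser.add_argument("--model-dir", default=env.get("SM_MODEL_DIR", "./model"))
    # parse_known_args tolerates extra flags the script does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Calling parse_args() with no arguments reads sys.argv and os.environ, so the same script runs unchanged on a laptop and inside the training container.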

Custom Containers: If you need a dependency not available in any managed container (e.g., a specific CUDA version, a proprietary library), build your own Docker image, push it to Amazon ECR, and pass the image URI to the image_uri parameter of the generic Estimator class. The SageMaker Training Toolkit is open source — you can add it to any base image to get the environment variable injection for free.

Spot Training: 70% Cost Savings

SageMaker Managed Spot Training runs your training job on Spot instances. AWS can interrupt the job at any time; SageMaker restarts it when capacity becomes available, and training resumes from the last checkpoint provided your script saves and reloads one. Savings of up to 70% are possible: as an illustrative figure, a 10-hour job that would cost $80 on demand would cost about $24.

import boto3
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role=role,
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    # --- Spot training settings ---
    use_spot_instances=True,
    max_run=7200,           # Maximum total training time: 2 hours
    max_wait=10800,         # Max time including spot waits: 3 hours
    checkpoint_s3_uri=f"s3://{bucket}/checkpoints/my-job/",
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={"epochs": 50},
)

estimator.fit({"train": f"s3://{bucket}/data/train/"})

# Check savings in the training job metadata
job_name = estimator.latest_training_job.name
sm = boto3.client("sagemaker")
desc = sm.describe_training_job(TrainingJobName=job_name)
billed   = desc["BillableTimeInSeconds"]
total    = desc["TrainingTimeInSeconds"]
savings  = round((1 - billed / total) * 100, 1)
print(f"Spot savings: {savings}%  (billed {billed}s of {total}s)")

Checkpoint Requirement: Your training script must periodically save a checkpoint to checkpoint_local_path and resume from it on startup if a checkpoint exists. SageMaker syncs this path with checkpoint_s3_uri automatically. Without checkpointing, an interruption restarts training from epoch 0.
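
A minimal sketch of the save-and-resume pattern, assuming the only state worth keeping is the epoch counter. The state.json file name and the helper functions are hypothetical; a real PyTorch script would also persist model and optimizer state (for example with torch.save).

```python
import json
import os

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # must match checkpoint_local_path


def load_checkpoint(ckpt_dir=CHECKPOINT_DIR):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    path = os.path.join(ckpt_dir, "state.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def save_checkpoint(state, ckpt_dir=CHECKPOINT_DIR):
    """Write to a temp file, then rename, so an interruption cannot leave a partial file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    tmp = os.path.join(ckpt_dir, "state.json.tmp")
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, os.path.join(ckpt_dir, "state.json"))


def train(total_epochs, ckpt_dir=CHECKPOINT_DIR):
    """Run the remaining epochs, checkpointing after each one."""
    state = load_checkpoint(ckpt_dir)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would go here ...
        state = {"epoch": epoch + 1}
        save_checkpoint(state, ckpt_dir)
    return state
```

If the job is interrupted after epoch 3, the restarted container finds state.json restored from S3 and the loop continues from epoch 3 instead of 0.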

Hyperparameter Tuning Jobs

SageMaker Automatic Model Tuning (AMT) runs multiple training jobs in parallel, using Bayesian optimisation to find the best hyperparameter combination. You define the metric to optimise and the search ranges; SageMaker handles the rest.

from sagemaker.tuner import (
    HyperparameterTuner, ContinuousParameter, IntegerParameter, CategoricalParameter
)

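tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    # Script-mode estimators need a regex that extracts the metric from the logs.
    # Assumption: train.py prints lines such as "validation:accuracy=0.93".
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": r"validation:accuracy=([0-9.]+)"},
    ],
    hyperparameter_ranges={
        "learning-rate": ContinuousParameter(1e-4, 1e-1, scaling_type="Logarithmic"),
        "batch-size":    CategoricalParameter([32, 64, 128]),
        "dropout":       ContinuousParameter(0.1, 0.5),
        "hidden-units":  IntegerParameter(64, 512),
    },
    max_jobs=20,
    max_parallel_jobs=4,
    strategy="Bayesian",  # or "Random", "Grid", "Hyperband"
)

# The validation channel is needed to compute validation:accuracy
tuner.fit({
    "train": f"s3://{bucket}/data/train/",
    "val":   f"s3://{bucket}/data/val/",
})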
tuner.wait()

# Get the best training job
best = tuner.best_training_job()
print(f"Best job: {best}")
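
For script-mode estimators such as PyTorch, the objective metric is scraped from the training logs with a regular expression. The sketch below shows how such a pattern pulls the value out of a log line. The log format (validation:accuracy=0.93) is an assumption about what train.py prints, and extract_metric is a hypothetical helper for testing the regex locally.

```python
import re

# Hypothetical metric definition; the regex must match what train.py actually logs
metric_definition = {
    "Name": "validation:accuracy",
    "Regex": r"validation:accuracy=([0-9.]+)",
}


def extract_metric(log_line, definition=metric_definition):
    """Return the captured metric value as a float, or None if the line does not match."""
    match = re.search(definition["Regex"], log_line)
    return float(match.group(1)) if match else None
```

Checking the pattern against a few real log lines before launching a 20-job tuning run is cheap insurance against a regex that never matches.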

Model Deployment: Endpoints and Inference Modes

SageMaker supports four inference patterns. Choosing the right one has a major impact on cost and latency.

Mode	Latency	Cost Model	Best For
Real-time endpoint	<100ms	Per instance-hour (always on)	Online serving, <6MB payload
Serverless inference	Cold start ~1s	Per invocation + GB-seconds	Sporadic traffic, dev/test
Batch transform	Minutes to hours	Per instance-hour (job duration)	Offline bulk scoring, large datasets
Async inference	Seconds to minutes	Per instance-hour, can scale to zero when idle	Large payloads (>6MB), long inference
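
The rules of thumb in the table can be written down as a small decision helper. This is only a sketch of the table's logic: the function name and its parameters are hypothetical, and it ignores factors such as latency budgets and per-mode service limits.

```python
def choose_inference_mode(payload_mb, offline_bulk=False, sporadic_traffic=False):
    """Map a workload description to one of the four inference modes."""
    if offline_bulk:
        return "batch transform"        # whole dataset in S3, no live caller
    if payload_mb > 6:
        return "async inference"        # too large for a synchronous request
    if sporadic_traffic:
        return "serverless inference"   # pay per use, tolerate cold starts
    return "real-time endpoint"         # steady traffic, low latency


print(choose_inference_mode(payload_mb=0.5))                        # real-time endpoint
print(choose_inference_mode(payload_mb=0.5, sporadic_traffic=True)) # serverless inference
print(choose_inference_mode(payload_mb=200))                        # async inference
```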

Real-time Endpoint Deployment

import boto3
import json

# Deploy from a completed training job
predictor = estimator.deploy(
    initial_instance_count=2,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-pytorch-endpoint",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

# Invoke the endpoint
response = predictor.predict({"inputs": [[1.2, 3.4, 5.6, 7.8]]})
print(response)  # {"predictions": [0.97]}

# --- Or invoke via boto3 directly ---
sm_rt = boto3.client("sagemaker-runtime")
response = sm_rt.invoke_endpoint(
    EndpointName="my-pytorch-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": [[1.2, 3.4, 5.6, 7.8]]}),
)
result = json.loads(response["Body"].read())
print(result)

# Auto scaling policy for the endpoint
aas = boto3.client("application-autoscaling")
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-pytorch-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=10,
)
aas.put_scaling_policy(
    PolicyName="sagemaker-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-pytorch-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # Target: 1000 invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
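
To get a feel for what a target of 1000 invocations per instance per minute implies, the steady-state fleet size can be approximated as total traffic divided by the target, clamped to the registered capacity range. This is a simplified approximation rather than the actual scaling algorithm, which also depends on CloudWatch alarm evaluation and the cooldown periods. The function name is hypothetical.

```python
import math


def desired_instance_count(invocations_per_minute, target_per_instance=1000.0,
                           min_capacity=1, max_capacity=10):
    """Approximate steady-state instance count for a target-tracking policy."""
    needed = math.ceil(invocations_per_minute / target_per_instance)
    # Clamp to the MinCapacity/MaxCapacity registered on the scalable target
    return max(min_capacity, min(max_capacity, needed))


print(desired_instance_count(4500))   # 5
print(desired_instance_count(0))      # 1
print(desired_instance_count(50000))  # 10
```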

Serverless Inference

Serverless inference is ideal for models with sporadic or unpredictable traffic. There are no idle instances: you pay only when inference requests arrive. Cold starts add roughly 1-3 seconds to the first invocation after an idle period.

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,    # 1024, 2048, 3072, 4096, 5120, or 6144
    max_concurrency=10,        # Max concurrent invocations
)

predictor = estimator.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="my-serverless-endpoint",
)

SageMaker Pipelines: MLOps CI/CD

SageMaker Pipelines is a DAG-based orchestration engine for ML workflows. A pipeline defines steps (for example Processing, Training, Tuning, Condition, RegisterModel, CreateModel, Transform, and Lambda) and the dependencies between them. Evaluation typically runs as a Processing step, and deployment is usually handled by a Lambda step or an external trigger. Pipelines are versioned, repeatable, and can be triggered manually, on a schedule, or via EventBridge when new data arrives in S3.

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep, ProcessingStep
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.parameters import ParameterString, ParameterFloat
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
from sagemaker.inputs import TrainingInput

# Pipeline parameters — can be overridden at runtime
input_data    = ParameterString(name="InputData", default_value=f"s3://{bucket}/data/")
accuracy_gate = ParameterFloat(name="AccuracyGate", default_value=0.85)

# Step 1: Preprocessing
processor = ScriptProcessor(
    image_uri="683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1",
    command=["python3"],
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
)
preprocess_step = ProcessingStep(
    name="Preprocess",
    processor=processor,
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="val",   source="/opt/ml/processing/val"),
    ],
    code="preprocess.py",
)

# Step 2: Training
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(preprocess_step.properties.ProcessingOutputConfig
                               .Outputs["train"].S3Output.S3Uri),
        "val":   TrainingInput(preprocess_step.properties.ProcessingOutputConfig
                               .Outputs["val"].S3Output.S3Uri),
    },
)

# Step 3: Evaluate
eval_step = ProcessingStep(
    name="EvaluateModel",
    processor=processor,
    inputs=[
        ProcessingInput(source=train_step.properties.ModelArtifacts.S3ModelArtifacts,
                        destination="/opt/ml/processing/model"),
    ],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")],
    code="evaluate.py",
    property_files=[
        sagemaker.workflow.properties.PropertyFile(
            name="EvaluationReport", output_name="evaluation", path="evaluation.json"
        )
    ],
)

# Step 4: Conditional register — only if accuracy >= gate
accuracy_condition = ConditionGreaterThanOrEqualTo(
    left=JsonGet(step_name=eval_step.name, property_file="EvaluationReport", json_path="accuracy"),
    right=accuracy_gate,
)
register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    model_package_group_name="my-model-group",
    approval_status="PendingManualApproval",
)
condition_step = ConditionStep(
    name="CheckAccuracy",
    conditions=[accuracy_condition],
    if_steps=[register_step],
    else_steps=[],
)

# Assemble and upsert the pipeline
pipeline = Pipeline(
    name="my-ml-pipeline",
    parameters=[input_data, accuracy_gate],
    steps=[preprocess_step, train_step, eval_step, condition_step],
    sagemaker_session=sess,
)
pipeline.upsert(role_arn=role)

# Execute with custom parameters
execution = pipeline.start(
    parameters={"InputData": f"s3://{bucket}/data/2026-06-06/", "AccuracyGate": 0.88}
)
execution.wait()
print(execution.list_steps())

Model Registry: The RegisterModel step adds the model to a Model Package Group with PendingManualApproval status. A data scientist reviews the evaluation report in SageMaker Studio and approves or rejects it. An EventBridge rule on the Approved status change can then trigger a Lambda function to automatically deploy the model to the production endpoint.
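
The Lambda side of that approval flow might look roughly like the sketch below. The event field names follow the "SageMaker Model Package State Change" EventBridge event as I understand it, so verify them against a real event before relying on them. The deploy callback is a hypothetical stand-in for whatever deployment logic you use.

```python
def handle_model_approval(event, deploy):
    """Call deploy(model_package_arn) only for approved model packages.

    Returns the ARN that was deployed, or None if the event was ignored.
    """
    if event.get("source") != "aws.sagemaker":
        return None
    detail = event.get("detail", {})
    if detail.get("ModelApprovalStatus") != "Approved":
        return None  # Rejected or still pending: nothing to deploy
    arn = detail["ModelPackageArn"]
    deploy(arn)  # hypothetical callback, e.g. one that updates the production endpoint
    return arn
```

Passing the deployment logic in as a callback keeps the filtering testable without touching AWS.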

Feature Store: Online and Offline

SageMaker Feature Store is a centralised repository for ML features. It solves two major problems: (1) training-serving skew — features computed differently at training time vs inference time, and (2) feature duplication — multiple teams recomputing the same features independently. Feature Store has two backends that can be used independently or together:

Online Store: a low-latency store with single-digit millisecond reads that keeps only the latest value per entity. Used during real-time inference to retrieve the current feature values.
Offline Store: S3-backed Parquet files that keep all historical values with timestamps. Used during training to generate point-in-time correct feature datasets.
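
"Point-in-time correct" means that each training row only sees feature values that existed at the moment the label was observed. A minimal sketch of that lookup in plain Python, assuming the history is sorted by event_time and all timestamps share one ISO-8601 format so that string comparison matches chronological order. The function name and sample data are illustrative.

```python
from bisect import bisect_right


def point_in_time_lookup(history, as_of):
    """Return the latest value whose event_time is <= as_of, or None.

    history: list of (event_time, value) tuples sorted by event_time.
    """
    times = [event_time for event_time, _ in history]
    idx = bisect_right(times, as_of)
    return history[idx - 1][1] if idx else None


spend_history = [
    ("2026-01-01T00:00:00Z", 100.0),
    ("2026-03-01T00:00:00Z", 250.0),
    ("2026-06-01T00:00:00Z", 420.5),
]
# A label observed in mid-April must not see the June value
print(point_in_time_lookup(spend_history, "2026-04-15T00:00:00Z"))  # 250.0
```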

import boto3
import pandas as pd
import time
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import (
    FeatureDefinition, FeatureTypeEnum
)

sm_client = boto3.client("sagemaker")
fs_runtime = boto3.client("sagemaker-featurestore-runtime")

# --- Create a Feature Group ---
feature_group = FeatureGroup(
    name="customer-purchase-features",
    sagemaker_session=sess,
)
feature_group.load_feature_definitions(data_frame=pd.DataFrame({
    "customer_id":         pd.Series(dtype="str"),
    "total_spend_30d":     pd.Series(dtype="float64"),
    "num_orders_30d":      pd.Series(dtype="int64"),
    "avg_order_value":     pd.Series(dtype="float64"),
    "days_since_last_order": pd.Series(dtype="int64"),
    "event_time":          pd.Series(dtype="str"),  # ISO-8601 timestamp — required
}))
feature_group.create(
    s3_uri=f"s3://{bucket}/feature-store/",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
    enable_offline_store=True,
)

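# Wait for the Feature Group to become active; stop if creation fails
status = feature_group.describe()["FeatureGroupStatus"]
while status == "Creating":
    time.sleep(5)
    status = feature_group.describe()["FeatureGroupStatus"]
if status != "Created":
    raise RuntimeError(f"Feature Group creation ended with status {status}")
print("Feature Group is active")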

# --- Ingest features ---
records = pd.DataFrame([
    {"customer_id": "C001", "total_spend_30d": 420.50, "num_orders_30d": 5,
     "avg_order_value": 84.10, "days_since_last_order": 3,
     "event_time": "2026-06-06T10:00:00Z"},
    {"customer_id": "C002", "total_spend_30d": 112.00, "num_orders_30d": 2,
     "avg_order_value": 56.00, "days_since_last_order": 14,
     "event_time": "2026-06-06T10:00:00Z"},
])
feature_group.ingest(data_frame=records, max_workers=4, wait=True)

# --- Online retrieval at inference time ---
response = fs_runtime.get_record(
    FeatureGroupName="customer-purchase-features",
    RecordIdentifierValueAsString="C001",
    FeatureNames=["total_spend_30d", "num_orders_30d", "avg_order_value"],
)
features = {f["FeatureName"]: f["ValueAsString"] for f in response["Record"]}
print(features)
# {'total_spend_30d': '420.5', 'num_orders_30d': '5', 'avg_order_value': '84.1'}

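# --- Offline: generate training dataset with Athena ---
# Feature Store generates the Glue table name, so read it from the query object
query = feature_group.athena_query()
query.run(
    query_string=f"""
        SELECT customer_id, total_spend_30d, num_orders_30d, avg_order_value,
               days_since_last_order
        FROM "{query.table_name}"
        WHERE event_time BETWEEN '2026-01-01' AND '2026-06-01'
    """,
    output_location=f"s3://{bucket}/athena-results/",
)
query.wait()                      # Block until the Athena query finishes
train_df = query.as_dataframe()   # Load the result set into pandas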

Model Monitoring: Data Drift, Model Quality, Bias Detection

Model performance degrades over time as the real-world data distribution shifts away from the training distribution. SageMaker Model Monitor continuously checks four dimensions: data quality (feature distribution drift), model quality (prediction accuracy vs ground truth), bias drift (demographic parity, equal opportunity), and explainability drift (SHAP value shifts).

The workflow is: (1) capture endpoint traffic to S3 using Data Capture, (2) create a baseline from your training dataset, (3) schedule a monitoring job that runs hourly or daily and compares live traffic against the baseline, (4) publish violation reports to CloudWatch Metrics and trigger alerts.

from sagemaker.model_monitor import DefaultModelMonitor, DataCaptureConfig
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker.model_monitor import CronExpressionGenerator

# Step 1: Enable data capture on the endpoint
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=20,          # Capture 20% of traffic
    destination_s3_uri=f"s3://{bucket}/capture/",
    capture_options=["Input", "Output"],
    csv_content_types=["text/csv"],
    json_content_types=["application/json"],
)
# Pass data_capture_config=data_capture_config when calling estimator.deploy()

# Step 2: Create a baseline from the training dataset
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
monitor.suggest_baseline(
    baseline_dataset=f"s3://{bucket}/data/train/baseline.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f"s3://{bucket}/monitor/baseline/",
    wait=True,
)

# Step 3: Schedule monitoring
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-model-monitor",
    endpoint_input="my-pytorch-endpoint",
    output_s3_uri=f"s3://{bucket}/monitor/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
print("Monitoring schedule created — violations will appear in CloudWatch")

Ground Truth Labels: Model quality monitoring requires ground truth labels (e.g., actual outcomes logged by your application) to be merged with the captured predictions. SageMaker provides a merge job utility for this. Without ground truth, you can still use data quality monitoring to detect input drift, although it cannot tell you whether the predictions were correct.

When a violation is detected, SageMaker writes a constraint_violations.json report to S3 and emits a CloudWatch metric. Wire a CloudWatch Alarm to this metric and route it to SNS → Lambda to trigger an automatic retraining pipeline execution.
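
A retraining trigger usually wants a summary of the report rather than the raw file. The sketch below parses a report with the violations / feature_name / constraint_check_type layout that Model Monitor writes. The sample JSON is made up for illustration and summarise_violations is a hypothetical helper.

```python
import json
from collections import Counter


def summarise_violations(report_json):
    """Return (count per check type, sorted list of affected features)."""
    report = json.loads(report_json)
    violations = report.get("violations", [])
    by_type = Counter(v["constraint_check_type"] for v in violations)
    features = sorted({v["feature_name"] for v in violations})
    return dict(by_type), features


# Illustrative report content, not real monitoring output
sample_report = json.dumps({
    "violations": [
        {"feature_name": "total_spend_30d",
         "constraint_check_type": "baseline_drift_check",
         "description": "illustrative description"},
        {"feature_name": "num_orders_30d",
         "constraint_check_type": "data_type_check",
         "description": "illustrative description"},
    ]
})
counts, features = summarise_violations(sample_report)
print(counts, features)
```

A Lambda function could use the counts to decide whether drift is severe enough to start a retraining execution, for example only when baseline_drift_check violations are present.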

SageMaker Canvas: No-Code ML

SageMaker Canvas lets business analysts and domain experts build, train, and deploy ML models through a point-and-click interface, with no Python required. Users import a CSV from S3 or upload it directly and select the target column; Canvas then infers the problem type (classification, regression, or time series forecasting), trains multiple candidate models, and presents an accuracy leaderboard. The winning model can be shared with a data scientist for review or deployed to a real-time endpoint with a single click.

Canvas is priced per session hour plus model training time. Session charges accrue for as long as you are logged in, so log out when you finish rather than just closing the tab. For citizen data scientists doing periodic analysis, the monthly cost is usually modest compared with the engineering effort of building an equivalent pipeline by hand. Canvas models are compatible with the SageMaker Model Registry: a Canvas model can be registered, approved, and deployed through the same workflow as a programmatically trained model.

Cost Optimization Strategies

SageMaker costs accumulate in three places: training compute, inference compute, and Studio notebook instances. Each requires a different optimisation strategy.

Strategy	Saving	Applies To
Managed Spot Training	Up to 70%	Training jobs
Serverless Inference	100% during idle time	Sporadic inference traffic
Multi-Model Endpoints (MME)	50-90%	Many models with low traffic each
Scale to 0 (async endpoints) or delete idle endpoints	100% during off-hours	Dev/staging endpoints
Graviton3 instances (ml.m7g)	~20%	CPU inference endpoints
Inf2 instances (AWS Inferentia)	Up to 50%	Deep learning inference at scale
Right-size instances with CloudWatch	10-40%	Any endpoint or training job
Lifecycle configs to auto-stop notebooks	Eliminates waste	Studio / classic notebooks

Multi-Model Endpoints

If you have 500 customer-specific models that each receive only a few requests per day, creating 500 individual endpoints would cost tens of thousands of dollars per month. A Multi-Model Endpoint (MME) hosts all models on a single fleet. SageMaker lazy-loads models into memory on first request and evicts least-recently-used models when memory is full. You pay for one endpoint regardless of how many models it hosts.
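
The lazy-load and least-recently-used eviction behaviour can be illustrated with a small in-memory cache. This is a conceptual sketch only, not SageMaker's implementation: it treats capacity as a model count where the real service works with instance memory, and the ModelCache class and loader callback are hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Load models on first use and evict the least recently used when full."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # callable that fetches a model by name
        self._models = OrderedDict()    # oldest entry first

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)   # mark as most recently used
            return self._models[name]
        model = self.loader(name)            # first request: pay the load cost
        self._models[name] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)  # evict least recently used
        return model

    def loaded(self):
        return list(self._models)
```

The practical consequence is that the first request to a rarely used model is slow, so MMEs suit workloads that can tolerate occasional load latency.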

import json

import boto3
from sagemaker.multidatamodel import MultiDataModel

# All model artifacts live in the same S3 prefix
model_data_prefix = f"s3://{bucket}/multi-model-artifacts/"

mme = MultiDataModel(
    name="customer-churn-mme",
    model_data_prefix=model_data_prefix,
    model=estimator.create_model(),   # Base container
    sagemaker_session=sess,
)

predictor = mme.deploy(
    initial_instance_count=2,
    instance_type="ml.m5.2xlarge",
    endpoint_name="customer-churn-mme-endpoint",
)

# Copy a model into the MME (can be done without restarting the endpoint)
mme.add_model(model_data_source=f"s3://{bucket}/models/customer-A/model.tar.gz",
              model_data_path="customer-A.tar.gz")

# Invoke a specific model by name
sm_rt = boto3.client("sagemaker-runtime")
response = sm_rt.invoke_endpoint(
    EndpointName="customer-churn-mme-endpoint",
    ContentType="application/json",
    TargetModel="customer-A.tar.gz",    # SageMaker routes to this specific model
    Body=json.dumps({"inputs": [[1.2, 3.4, 5.6]]}),
)
print(json.loads(response["Body"].read()))

Auto-stop Notebooks: Studio apps and classic notebook instances keep running, and keep billing, after the browser tab is closed. For classic notebook instances, use a lifecycle configuration script that calls aws sagemaker stop-notebook-instance once the instance has been idle for more than 60 minutes. For Studio, use an idle-shutdown lifecycle configuration; AWS provides a reference implementation in the sagemaker-studio-lifecycle-config-examples GitHub repo.

Frequently Asked Questions

What is the difference between SageMaker Training Jobs and SageMaker Processing Jobs?

Training Jobs are designed for fitting ML models — they integrate with the model artifact pipeline, Model Registry, and Estimator classes. Processing Jobs are for arbitrary Python scripts that don't produce a model: data preprocessing, feature engineering, post-training evaluation, or batch inference scoring. Processing Jobs use ScriptProcessor, SKLearnProcessor, PySparkProcessor, etc. Both run on managed compute that is billed per-second and terminated when the job completes.

When should I use Serverless Inference vs a real-time endpoint?

Use Serverless Inference when traffic is sporadic — for example, an internal tool used only during business hours, or a model that gets a few hundred requests per day. At that scale, the always-on cost of a real-time endpoint (minimum ~$50/month for a ml.t3.medium) exceeds the per-invocation cost of serverless. For models that need consistent sub-100ms latency or receive sustained traffic (thousands of requests per minute), real-time endpoints with auto scaling are more cost-effective and predictable.
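
The break-even point can be estimated with simple arithmetic. The sketch below compares the two cost models. All prices in the usage example are hypothetical placeholders, not AWS list prices, and the model ignores data-processing charges and any free tier.

```python
def serverless_monthly_cost(requests, seconds_per_request, price_per_compute_second):
    """Cost when you pay only for the compute time each request uses."""
    return requests * seconds_per_request * price_per_compute_second


def realtime_monthly_cost(instance_price_per_hour, instance_count=1, hours=730):
    """Cost of always-on instances, independent of traffic."""
    return instance_price_per_hour * instance_count * hours


def break_even_requests(seconds_per_request, price_per_compute_second,
                        instance_price_per_hour, hours=730):
    """Monthly request volume above which the always-on endpoint is cheaper."""
    per_request = seconds_per_request * price_per_compute_second
    return realtime_monthly_cost(instance_price_per_hour, hours=hours) / per_request


# Hypothetical prices: $0.0001 per compute-second, $0.10 per instance-hour
print(round(break_even_requests(0.2, 0.0001, 0.10)))
```

Plug in the current prices for your region and memory size; the point of the exercise is the shape of the comparison, not the placeholder numbers.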

How do I do A/B testing with SageMaker endpoints?

SageMaker supports production variants — multiple model versions deployed behind a single endpoint with configurable traffic weights. Set up two variants with InitialVariantWeight of 90/10 to send 90% of traffic to the champion model and 10% to the challenger. CloudWatch metrics are emitted per-variant, so you can compare latency and invocation counts. Once the challenger wins, update the weights to 0/100 and delete the old variant — all without restarting the endpoint or changing client code.
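
As a sketch, the 90/10 split corresponds to a ProductionVariants list like the one built below, which would be passed to create_endpoint_config. The variant names, model names, and instance type are placeholders. Traffic is divided in proportion to each weight over the sum of all weights, which the second helper makes explicit.

```python
def production_variants(champion_model, challenger_model,
                        champion_weight=90.0, challenger_weight=10.0,
                        instance_type="ml.m5.xlarge"):
    """Build a two-variant ProductionVariants list for an endpoint config."""
    def variant(name, model, weight):
        return {
            "VariantName": name,
            "ModelName": model,               # placeholder model names
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }
    return [
        variant("champion", champion_model, champion_weight),
        variant("challenger", challenger_model, challenger_weight),
    ]


def traffic_fractions(variants):
    """Share of traffic per variant: weight divided by the sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = production_variants("model-v1", "model-v2")
print(traffic_fractions(variants))
```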

Can SageMaker Pipelines trigger automatically when new data arrives?

Yes. Enable EventBridge notifications on the bucket, create an EventBridge rule that matches Object Created events under your data prefix, and route it to a Lambda function that starts the pipeline (the boto3 call is start_pipeline_execution). Alternatively, set the pipeline itself as the target of the EventBridge rule, which starts an execution without any Lambda code. For time-based execution (e.g., nightly retraining), use a scheduled EventBridge rule with a cron expression.
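
A minimal sketch of the Lambda route, assuming the event is an S3 "Object Created" notification delivered through EventBridge and that the pipeline takes an InputData parameter as in the pipeline example earlier. The function name is hypothetical, and the SageMaker client is passed in so the logic can be exercised without AWS access.

```python
def start_pipeline_for_object(event, sm_client, pipeline_name="my-ml-pipeline"):
    """Start a pipeline execution pointing at the folder of the new object."""
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]
    # Use the object's parent "folder" as the input prefix
    prefix = key.rsplit("/", 1)[0] + "/" if "/" in key else ""
    response = sm_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineParameters=[
            {"Name": "InputData", "Value": f"s3://{bucket}/{prefix}"},
        ],
    )
    return response["PipelineExecutionArn"]
```

In a real handler you would create the client once with boto3.client("sagemaker") outside the function and call start_pipeline_for_object(event, client) from the Lambda entry point.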

What is the SageMaker Model Registry used for?

The Model Registry is a versioned catalogue of trained models. Each model version stores the model artifact URI, container image, inference specification, metrics from evaluation, and approval status. In a CI/CD ML pipeline, the registry acts as the handoff point between the data science team (who train and register models) and the platform team (who deploy approved models). When a model version is approved, a downstream system can automatically deploy it — decoupling training from deployment and providing a full audit trail of which model is running in production at any point in time.
