AWS Bedrock: Build Generative AI Apps with Foundation Models

AWS Bedrock is the fastest path from a generative AI idea to a production-ready application on AWS. Instead of managing GPU clusters, downloading model weights, or negotiating direct API agreements with AI labs, you get a single unified API for a curated marketplace of foundation models: Claude from Anthropic, Llama from Meta, Mistral, Amazon Titan, Stable Diffusion for images, and more. On-demand usage is billed per token (or per image for image models) with nothing to provision. The models run on AWS-managed infrastructure, and you can reach them privately from your VPC through interface endpoints (AWS PrivateLink) while applying the usual AWS security and compliance controls such as IAM, KMS, and CloudTrail.

This guide covers everything you need to build production generative AI applications on Bedrock: direct model invocation with streaming, Retrieval-Augmented Generation with Knowledge Bases, multi-step agentic workflows with Bedrock Agents, content safety with Guardrails, fine-tuning your own model, and cost optimisation strategies. All code examples use Python boto3 and run against the real Bedrock API.

What Is Bedrock — The Foundation Model Marketplace
Bedrock vs SageMaker vs OpenAI API
Invoking Models — InvokeModel API with boto3
Streaming Responses — Real-time Token Output
Knowledge Bases — RAG with S3 and OpenSearch
Bedrock Agents — Multi-Step Reasoning and Tool Use
Fine-Tuning and Continued Pre-Training
Guardrails — Content Filtering and PII Redaction
Prompt Management — Flows and Versioning
Cost Optimisation — On-Demand vs Provisioned Throughput
Frequently Asked Questions

What Is Bedrock — The Foundation Model Marketplace

AWS Bedrock is a fully managed service that provides access to high-performance foundation models (FMs) through a single API. AWS handles all the infrastructure: GPU cluster management, model serving, scaling, availability, and security. You interact with models via HTTP endpoints — no servers, no containers, no model weights on your disk.

The model catalogue as of mid-2026 includes the following provider families (availability varies by region):

Provider	Models Available	Best For
Anthropic	Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus	Reasoning, coding, long context, instruction following
Meta	Llama 3.1 8B / 70B / 405B Instruct	Open-weight flexibility, cost efficiency, fine-tuning base
Mistral AI	Mistral Large, Mistral 7B, Mixtral 8x7B	Multilingual, code generation, fast inference
Amazon	Titan Text Lite, Titan Text Express, Titan Embeddings V2, Titan Image Generator	AWS-native, embeddings, image generation
Stability AI	Stable Diffusion 3 Large, SDXL 1.0	Image generation, image editing
Cohere	Command R+, Command R, Embed	RAG retrieval, enterprise search, multilingual embeddings
AI21 Labs	Jamba 1.5 Large, Jamba 1.5 Mini	Long-context, structured output, summarisation

Model Access: Before calling any foundation model, you must request access in the AWS Console under Bedrock → Model access. Access is per-region. Most models are approved instantly; some (like certain Llama versions) require a brief agreement. Once granted, access is permanent — there is no per-API-call approval step.

Bedrock's killer feature is not just model access — it's the managed capability layer built on top: Knowledge Bases for RAG, Agents for orchestration, Guardrails for safety, Prompt Management for versioning, and Model Evaluation for benchmarking. These are production features you would otherwise spend months building yourself.

Bedrock vs SageMaker vs OpenAI API — When to Use What

Three services dominate the conversation when teams are choosing an AI backend. Each occupies a distinct niche. Choosing the wrong one means either overpaying for capabilities you don't need or under-investing in infrastructure that bites you at scale.

Dimension	AWS Bedrock	AWS SageMaker	OpenAI API
Model choice	Multi-provider FM marketplace	Any model (HuggingFace, custom)	OpenAI models only (GPT-4o, o1, etc.)
Infrastructure	Fully serverless — zero management	Managed but you configure instances	Fully serverless — zero management
Custom models	Fine-tuning on supported base models	Full control — any framework, any GPU	Fine-tuning on select GPT models
Data residency	Stays in your AWS region / VPC	Stays in your AWS region / VPC	Leaves your environment (US servers)
RAG / Agents	Built-in (Knowledge Bases, Agents)	DIY with LangChain/LlamaIndex	Assistants API (limited RAG)
Pricing model	Per token (on-demand) or throughput reservation	Per instance-hour + storage	Per token
AWS service integration	Native (S3, Lambda, CloudWatch, VPC)	Native + deepest ML tooling	API only — you wire the integrations
Best for	Gen AI apps, RAG, agents, multi-model	Custom training, MLOps, non-FM models	Teams already on OpenAI, prototyping

Decision Rule of Thumb: If your use case centres on prompting frontier language models and you want AWS-native data controls, choose Bedrock. If you need to train your own neural network from scratch or serve a specialised model not in Bedrock's catalogue, choose SageMaker. If your team is small and already using OpenAI in production, don't switch for switching's sake — the marginal benefit of Bedrock only materialises when AWS integration or multi-model routing matters.

Invoking Models — InvokeModel API with boto3

Every Bedrock foundation model is reachable via the bedrock-runtime boto3 client. The invoke_model call takes a JSON body whose schema varies by model provider, but the outer API is always the same: you supply the modelId, a JSON body serialised to bytes, and content-type headers. The response contains a JSON body with the model's output.

IAM Policy Required

Your calling role needs the bedrock:InvokeModel permission scoped to the specific model ARN. Using a wildcard is acceptable for development but tighten it in production:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockInvokeModels",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
        "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-express-v1"
      ]
    }
  ]
}

Invoking Claude 3.5 Sonnet

Anthropic models use the Messages API format. The body contains a messages array following the user/assistant turn structure, an optional system prompt, and generation parameters. Note that max_tokens is required for Anthropic models — there is no default.

import boto3
import json

# Create the Bedrock Runtime client
client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)

# Invoke Claude 3.5 Sonnet using the Messages API
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": "You are a senior AWS solutions architect. Be concise and practical.",
    "messages": [
        {
            "role": "user",
            "content": "Explain the difference between Bedrock Knowledge Bases and a custom RAG pipeline in 3 bullet points."
        }
    ],
    "temperature": 0.3,
    "top_p": 0.9,
})

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    body=body,
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
answer = result["content"][0]["text"]
print(answer)

# Usage stats
usage = result["usage"]
print(f"Input tokens: {usage['input_tokens']}, Output tokens: {usage['output_tokens']}")

Invoking Amazon Titan Text Express

Amazon's Titan models use a different request schema with an inputText field and a textGenerationConfig block. This is useful for workloads where you want a fully AWS-native model with no third-party data agreements.

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "inputText": "Summarise the main benefits of serverless computing in 5 sentences.",
    "textGenerationConfig": {
        "maxTokenCount": 512,
        "temperature": 0.5,
        "topP": 0.9,
        "stopSequences": [],
    },
})

response = client.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=body,
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
print(f"Token count: {result['results'][0]['tokenCount']}")

Invoking Meta Llama 3.1 70B

Meta's Llama models on Bedrock take a raw prompt string through invoke_model rather than a messages array. You build the prompt yourself using Llama's chat template, with special tokens marking the system and user turns, and the model returns the assistant's continuation of that prompt.

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Llama 3.1 uses a specific chat template
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful Python tutor.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Write a Python function that implements binary search with type hints.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

body = json.dumps({
    "prompt": prompt,
    "max_gen_len": 512,
    "temperature": 0.2,
    "top_p": 0.9,
})

response = client.invoke_model(
    modelId="meta.llama3-1-70b-instruct-v1:0",
    body=body,
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(result["generation"])
print(f"Prompt tokens: {result['prompt_token_count']}, Generation tokens: {result['generation_token_count']}")

Converse API (recommended): AWS introduced the Bedrock Converse API as a model-agnostic interface: one request format works with Claude, Titan, Llama, Mistral and Cohere. Use client.converse(modelId=..., messages=[...]) instead of invoke_model when you want to swap models without changing request code. The trade-off is that provider-specific parameters (like Anthropic's top_k) are not part of the unified inferenceConfig block; they have to be passed separately through additionalModelRequestFields.
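A minimal sketch of a Converse call, assuming a bedrock-runtime client created as in the earlier examples. The helper names (build_converse_request, extract_text, ask) and the default inference settings are illustrative choices, not part of the Bedrock SDK; only the converse() request and response fields come from the API.

```python
def build_converse_request(model_id, user_text, system_text=None, top_k=None):
    """Build keyword arguments for the bedrock-runtime converse() call."""
    request = {
        "modelId": model_id,
        # Converse content is a list of blocks, not a bare string
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.3, "topP": 0.9},
    }
    if system_text:
        request["system"] = [{"text": system_text}]
    if top_k is not None:
        # Provider-specific parameters travel outside the unified schema
        request["additionalModelRequestFields"] = {"top_k": top_k}
    return request


def extract_text(response):
    """Join the text blocks of a converse() response."""
    blocks = response["output"]["message"]["content"]
    return "".join(block.get("text", "") for block in blocks)


def ask(client, model_id, user_text, **kwargs):
    """Send one user turn and return (answer_text, usage_dict)."""
    response = client.converse(**build_converse_request(model_id, user_text, **kwargs))
    return extract_text(response), response.get("usage", {})


# Usage with a real client (requires AWS credentials and model access):
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   text, usage = ask(client, "anthropic.claude-3-5-sonnet-20241022-v2:0",
#                     "Summarise serverless computing in two sentences.")
#   print(text, usage.get("inputTokens"), usage.get("outputTokens"))
```

Because only the modelId string changes between providers, swapping Claude for Llama or Mistral should not require touching the request-building code, as long as top_k is left unset for models that do not accept it.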

Streaming Responses — Real-time Token Output

By default invoke_model waits until the model finishes generating the entire response before returning it. For a 1,000-token response that can mean many seconds of silence before anything appears, with the exact delay depending on the model and current load. For interactive chatbots and UI applications, streaming is non-negotiable: users see tokens as they are generated, so the response feels immediate.

invoke_model_with_response_stream returns an event stream. Each event carries a chunk of the response body as JSON bytes. For Anthropic models the chunks follow the Messages API streaming event types (message_start, content_block_delta, message_delta, message_stop), delivered in the Bedrock event stream format. The code below handles the full event lifecycle:

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "system": "You are an expert cloud architect.",
    "messages": [
        {
            "role": "user",
            "content": "Write a step-by-step guide to implementing blue/green deployments on AWS ECS."
        }
    ],
    "temperature": 0.4,
})

response = client.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    body=body,
    contentType="application/json",
    accept="application/json",
)

# Stream events from the response
stream = response.get("body")
full_text = ""
input_tokens = 0
output_tokens = 0

for event in stream:
    chunk = event.get("chunk")
    if chunk:
        chunk_data = json.loads(chunk.get("bytes").decode())
        event_type = chunk_data.get("type")

        if event_type == "content_block_delta":
            delta = chunk_data.get("delta", {})
            if delta.get("type") == "text_delta":
                text_piece = delta.get("text", "")
                full_text += text_piece
                print(text_piece, end="", flush=True)  # Real-time output

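        elif event_type == "message_delta":
            # The closing message_delta carries the output token count and the stop reason
            usage = chunk_data.get("usage", {})
            output_tokens = usage.get("output_tokens", 0)
            stop_reason = chunk_data.get("delta", {}).get("stop_reason")
            if stop_reason:
                print(f"\n\n[Stop reason: {stop_reason}]")

        elif event_type == "message_start":
            usage = chunk_data.get("message", {}).get("usage", {})
            input_tokens = usage.get("input_tokens", 0)

        elif event_type == "message_stop":
            # message_stop only marks the end of the stream; it has no stop_reason field
            pass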

print(f"\n[Tokens — Input: {input_tokens}, Output: {output_tokens}]")
print(f"[Total characters: {len(full_text)}]")

FastAPI + Streaming: To stream Bedrock responses through a FastAPI endpoint to a browser, wrap the event loop in an async generator and return a StreamingResponse with media_type="text/event-stream". The browser receives Server-Sent Events and can render tokens in real time without WebSockets.
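The conversion step described above can be sketched without FastAPI itself: a generator that turns Bedrock stream events (Anthropic event schema) into Server-Sent Events frames. The function name sse_frames, the {"text": ...} payload shape, and the [DONE] sentinel are assumptions chosen for illustration, not a Bedrock or FastAPI convention. A plain generator is used because boto3's event stream is synchronous.

```python
import json


def sse_frames(events):
    """Yield one SSE frame per text delta in a Bedrock response stream."""
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue  # Skip non-chunk events
        data = json.loads(chunk["bytes"].decode())
        if data.get("type") != "content_block_delta":
            continue  # Only text deltas are forwarded to the browser
        delta = data.get("delta", {})
        if delta.get("type") == "text_delta":
            payload = json.dumps({"text": delta.get("text", "")})
            # An SSE frame is a "data:" line followed by a blank line
            yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop reading
    yield "data: [DONE]\n\n"
```

In a FastAPI route this generator would typically be passed to StreamingResponse(sse_frames(response["body"]), media_type="text/event-stream"), where response is the result of invoke_model_with_response_stream.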

For the Converse API, streaming is available through client.converse_stream(), which returns the same event-stream format but uses a unified schema regardless of the underlying model — making it the preferred approach when your application supports multiple model providers.
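As a rough illustration of the unified schema, the sketch below consumes the "stream" member of a converse_stream() response. The helper name collect_converse_stream and the on_text callback are hypothetical; the event keys (contentBlockDelta, messageStop, metadata) are the ones the Converse streaming API emits.

```python
def collect_converse_stream(stream, on_text=None):
    """Consume converse_stream events; return (full_text, stop_reason, usage)."""
    parts = []
    stop_reason = None
    usage = {}
    for event in stream:
        if "contentBlockDelta" in event:
            piece = event["contentBlockDelta"]["delta"].get("text", "")
            parts.append(piece)
            if on_text:
                on_text(piece)  # e.g. print the token as soon as it arrives
        elif "messageStop" in event:
            stop_reason = event["messageStop"].get("stopReason")
        elif "metadata" in event:
            # Token usage arrives in a trailing metadata event
            usage = event["metadata"].get("usage", {})
    return "".join(parts), stop_reason, usage


# Usage with a real client (requires AWS credentials and model access):
#   response = client.converse_stream(
#       modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
#       messages=[{"role": "user", "content": [{"text": "Hello"}]}],
#   )
#   text, stop_reason, usage = collect_converse_stream(
#       response["stream"], on_text=lambda t: print(t, end="", flush=True))
```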

Knowledge Bases — RAG with S3 and OpenSearch Serverless

Retrieval-Augmented Generation (RAG) solves one of the most common limitations of LLMs: they have a knowledge cutoff and cannot access your proprietary documents. Bedrock Knowledge Bases manages the entire RAG pipeline: document ingestion from S3, chunking, embedding (using Amazon Titan Embeddings V2 or Cohere Embed), vector storage in OpenSearch Serverless or Aurora PostgreSQL (pgvector), and retrieval at query time — all without you managing any of the infrastructure.

Step 1 — Create the OpenSearch Serverless Collection

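Bedrock needs a vector store to hold document embeddings. OpenSearch Serverless is the easiest option to set up because it requires no cluster sizing and scales capacity automatically. It does not scale to zero, though: a collection is billed for a minimum number of OpenSearch Compute Units (OCUs) even when idle, so delete test collections you no longer need.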

import boto3
import json
import time

aoss = boto3.client("opensearchserverless", region_name="us-east-1")

# Create encryption policy
aoss.create_security_policy(
    name="bedrock-kb-encryption",
    type="encryption",
    policy=json.dumps({
        "Rules": [{"Resource": ["collection/bedrock-knowledge-base"], "ResourceType": "collection"}],
        "AWSOwnedKey": True,
    }),
)

# Create network policy (VPC endpoint or public — use public for simplicity)
aoss.create_security_policy(
    name="bedrock-kb-network",
    type="network",
    policy=json.dumps([
        {
            "Rules": [
                {"Resource": ["collection/bedrock-knowledge-base"], "ResourceType": "collection"},
                {"Resource": ["collection/bedrock-knowledge-base"], "ResourceType": "dashboard"},
            ],
            "AllowFromPublic": True,
        }
    ]),
)

# Create access policy — allow Bedrock service role to access indices
bedrock_role_arn = "arn:aws:iam::123456789012:role/AmazonBedrockExecutionRoleForKnowledgeBase"
aoss.create_access_policy(
    name="bedrock-kb-access",
    type="data",
    policy=json.dumps([
        {
            "Rules": [
                {
                    "Resource": ["index/bedrock-knowledge-base/*"],
                    "Permission": [
                        "aoss:CreateIndex", "aoss:DeleteIndex", "aoss:UpdateIndex",
                        "aoss:DescribeIndex", "aoss:ReadDocument", "aoss:WriteDocument",
                    ],
                    "ResourceType": "index",
                }
            ],
            "Principal": [bedrock_role_arn],
        }
    ]),
)

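# Create the collection. Creation is asynchronous: wait until the collection is ACTIVE,
# then create the vector index named in Step 2 before calling create_knowledge_base,
# because the API expects that index to exist already.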
response = aoss.create_collection(name="bedrock-knowledge-base", type="VECTORSEARCH")
collection_id = response["createCollectionDetail"]["id"]
collection_endpoint = f"https://{collection_id}.us-east-1.aoss.amazonaws.com"

print(f"Collection ID: {collection_id}")
print(f"Endpoint: {collection_endpoint}")

Step 2 — Create the Knowledge Base

import boto3
import json

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

kb_response = bedrock_agent.create_knowledge_base(
    name="product-documentation-kb",
    description="Knowledge base for product documentation and FAQs",
    roleArn="arn:aws:iam::123456789012:role/AmazonBedrockExecutionRoleForKnowledgeBase",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": f"arn:aws:aoss:us-east-1:123456789012:collection/{collection_id}",
            "vectorIndexName": "bedrock-knowledge-base-default-index",
            "fieldMapping": {
                "vectorField": "bedrock-knowledge-base-default-vector",
                "textField": "AMAZON_BEDROCK_TEXT_CHUNK",
                "metadataField": "AMAZON_BEDROCK_METADATA",
            },
        },
    },
)

kb_id = kb_response["knowledgeBase"]["knowledgeBaseId"]
print(f"Knowledge Base ID: {kb_id}")

# Step 3 — Add an S3 data source
ds_response = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="product-docs-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::my-product-documentation",
            "inclusionPrefixes": ["docs/", "faqs/"],  # Only ingest these prefixes
        },
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 512,        # Tokens per chunk
                "overlapPercentage": 20, # 20% overlap between chunks for context continuity
            },
        }
    },
)

ds_id = ds_response["dataSource"]["dataSourceId"]
print(f"Data Source ID: {ds_id}")

# Step 4 — Start ingestion job (sync S3 → vector store)
ingest = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds_id,
)
job_id = ingest["ingestionJob"]["ingestionJobId"]
print(f"Ingestion job started: {job_id}")
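Ingestion runs asynchronously, so queries issued immediately after start_ingestion_job may find an empty index. One way to wait for it, sketched here under the assumption that the bedrock-agent client and the IDs from the steps above are available, is to poll get_ingestion_job. The function name wait_for_ingestion and its timing defaults are illustrative.

```python
import time


def wait_for_ingestion(client, kb_id, ds_id, job_id, poll_seconds=15, timeout_seconds=1800):
    """Poll an ingestion job until it reaches a terminal status.

    Returns (status, statistics). Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        status = job["status"]
        if status in ("COMPLETE", "FAILED", "STOPPED"):
            return status, job.get("statistics", {})
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion job {job_id} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage with the objects created above:
#   status, stats = wait_for_ingestion(bedrock_agent, kb_id, ds_id, job_id)
#   print(status, stats)
```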

Step 5 — Query the Knowledge Base (Retrieve + Generate)

import boto3

bedrock_agent_rt = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# RetrieveAndGenerate — single call that retrieves context + generates answer
response = bedrock_agent_rt.retrieve_and_generate(
    input={"text": "What is the return policy for enterprise subscriptions?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,  # Retrieve top 5 chunks
                    "overrideSearchType": "HYBRID",  # Hybrid: semantic + keyword
                }
            },
            "generationConfiguration": {
                "promptTemplate": {
                    "textPromptTemplate": "You are a support agent. Use only the provided context to answer. Context: $search_results$ Question: $query$"
                }
            },
        },
    },
)

print(response["output"]["text"])
print("\n--- Citations ---")
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        loc = ref["location"]["s3Location"]
        print(f"  Source: {loc['uri']} (score: {ref.get('score', 'N/A')})")
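When you want the retrieved chunks and their relevance scores without a generated answer (for example, to feed your own prompt), the bedrock-agent-runtime client also offers a retrieve call. The sketch below assumes the same client and kb_id as above; the helper name top_chunks and the shape of the returned dicts are illustrative.

```python
def top_chunks(client, kb_id, query, k=5):
    """Return the top-k retrieved chunks as dicts with text, uri and score."""
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": k}
        },
    )
    chunks = []
    for item in response.get("retrievalResults", []):
        chunks.append({
            "text": item["content"]["text"],
            # Location is only an S3 URI for S3 data sources
            "uri": item.get("location", {}).get("s3Location", {}).get("uri"),
            "score": item.get("score"),
        })
    return chunks


# Usage:
#   for chunk in top_chunks(bedrock_agent_rt, kb_id, "What is the return policy?"):
#       print(chunk["score"], chunk["uri"])
```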

Re-sync after S3 changes: When you add or update documents in S3, call bedrock_agent.start_ingestion_job() again to re-sync. Syncs are incremental: Bedrock processes only documents that were added, modified, or deleted since the last sync, so re-syncing a large corpus after a small change is usually much faster than the initial ingestion.

Bedrock Agents — Multi-Step Reasoning and Tool Use

Bedrock Agents extends the model beyond a single prompt-response cycle. An Agent can reason over a goal, decide which tools (called action groups) to invoke, call Lambda functions or API endpoints, observe the results, and continue reasoning until the task is complete. This is the Bedrock equivalent of ReAct / function-calling patterns — but the orchestration loop is managed by AWS, not your application code.

The architecture has three parts: (1) the Agent itself, which holds the system prompt and reasoning configuration; (2) Action Groups, which define the tools the agent can call — each action group maps to an OpenAPI schema and a Lambda function; (3) an optional Knowledge Base attachment for retrieval during reasoning.

Creating an Agent with a Lambda Action Group

import boto3
import json

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Step 1: Create the Agent
agent_response = bedrock_agent.create_agent(
    agentName="order-support-agent",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/AmazonBedrockExecutionRoleForAgents",
    foundationModel="anthropic.claude-3-5-sonnet-20241022-v2:0",
    instruction="""You are an order support agent for Techoral Store.
When a customer asks about their order:
1. Use get_order_status to look up the current status.
2. If the order is delayed, use check_carrier_tracking for real-time carrier data.
3. If the customer wants to cancel, use initiate_cancellation.
Always be polite and concise. Never make up order information.""",
    idleSessionTTLInSeconds=600,
)

agent_id = agent_response["agent"]["agentId"]
print(f"Agent ID: {agent_id}")

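# Step 2: Define the action group OpenAPI schema
# This schema tells the Agent what functions are available and their parameters.
# Note: check_carrier_tracking is named in the instruction above but has no path here;
# add a matching path before relying on it, or the agent has no tool to call for that step.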
openapi_schema = {
    "openapi": "3.0.0",
    "info": {"title": "Order Support API", "version": "1.0.0"},
    "paths": {
        "/get_order_status": {
            "get": {
                "operationId": "get_order_status",
                "description": "Retrieve the current status of a customer order",
                "parameters": [
                    {
                        "name": "order_id",
                        "in": "query",
                        "required": True,
                        "description": "The order ID (format: ORD-XXXXXX)",
                        "schema": {"type": "string"},
                    }
                ],
                "responses": {"200": {"description": "Order status returned"}},
            }
        },
        "/initiate_cancellation": {
            "post": {
                "operationId": "initiate_cancellation",
                "description": "Initiate cancellation of an order",
                "requestBody": {
                    "required": True,
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "properties": {
                                    "order_id": {"type": "string"},
                                    "reason": {"type": "string"},
                                },
                                "required": ["order_id", "reason"],
                            }
                        }
                    },
                },
                "responses": {"200": {"description": "Cancellation initiated"}},
            }
        },
    },
}

# Step 3: Attach the action group to the agent
bedrock_agent.create_agent_action_group(
    agentId=agent_id,
    agentVersion="DRAFT",
    actionGroupName="order-operations",
    actionGroupExecutor={
        "lambda": "arn:aws:lambda:us-east-1:123456789012:function:order-support-handler"
    },
    apiSchema={
        "payload": json.dumps(openapi_schema),
    },
    description="Functions for looking up and managing customer orders",
)

# Step 4: Prepare the agent (compiles the instruction + action groups)
bedrock_agent.prepare_agent(agentId=agent_id)
print("Agent preparation started; poll get_agent until agentStatus is PREPARED")
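prepare_agent returns before preparation finishes, so invoking the agent straight away can fail. A minimal polling sketch, assuming the bedrock-agent client and agent_id from above; the helper name wait_for_agent and its retry limits are illustrative.

```python
import time


def wait_for_agent(client, agent_id, target="PREPARED", poll_seconds=5, max_polls=60):
    """Poll get_agent until the agent reaches the target status."""
    for _ in range(max_polls):
        status = client.get_agent(agentId=agent_id)["agent"]["agentStatus"]
        if status == target:
            return status
        if status == "FAILED":
            raise RuntimeError(f"Agent {agent_id} entered FAILED state")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Agent {agent_id} did not reach {target} after {max_polls} polls")


# Usage:
#   bedrock_agent.prepare_agent(agentId=agent_id)
#   wait_for_agent(bedrock_agent, agent_id)
```

The same helper can be called with target="NOT_PREPARED" right after create_agent, before attaching action groups, since a newly created agent is briefly in the CREATING state.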

Lambda Handler for Action Groups

When the Agent decides to call one of the actions, Bedrock invokes the Lambda with a structured event. Because this action group is defined with an OpenAPI schema, the event identifies the operation by apiPath and httpMethod (the function field is only used for action groups defined with function details). Your Lambda handler extracts the path and parameters, executes the business logic, and returns a result in the expected format.

import json

def lambda_handler(event, context):
    """Lambda handler for a Bedrock Agent action group defined with an OpenAPI schema."""
    action_group = event.get("actionGroup")
    api_path = event.get("apiPath")
    http_method = event.get("httpMethod")

    # Query/path parameters arrive as a list of {name, type, value} dicts
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    # JSON request-body fields arrive in the same shape under requestBody
    body_props = (
        event.get("requestBody", {})
        .get("content", {})
        .get("application/json", {})
        .get("properties", [])
    )
    params.update({p["name"]: p["value"] for p in body_props})
    print(f"Agent calling: {action_group}::{http_method} {api_path} with {params}")

    status_code = 200
    if api_path == "/get_order_status":
        order_id = params.get("order_id")
        # --- Your real business logic here ---
        result = {
            "order_id": order_id,
            "status": "SHIPPED",
            "estimated_delivery": "2026-06-10",
            "carrier": "FedEx",
            "tracking_number": "7749283746271",
        }

    elif api_path == "/initiate_cancellation":
        order_id = params.get("order_id")
        reason = params.get("reason")
        # --- Call your order management system ---
        result = {
            "success": True,
            "cancellation_id": "CXL-998877",
            "message": f"Order {order_id} cancellation initiated. Refund in 3–5 business days.",
        }

    else:
        status_code = 404
        result = {"error": f"Unknown operation: {api_path}"}

    # Response structure Bedrock expects for OpenAPI-schema action groups
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": action_group,
            "apiPath": api_path,
            "httpMethod": http_method,
            "httpStatusCode": status_code,
            "responseBody": {"application/json": {"body": json.dumps(result)}},
        },
    }

Invoking the Agent in Your Application

import boto3
import uuid

bedrock_agent_rt = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

session_id = str(uuid.uuid4())  # Maintain session state across turns

response = bedrock_agent_rt.invoke_agent(
    agentId=agent_id,
    agentAliasId="TSTALIASID",  # Use "TSTALIASID" for the DRAFT alias
    sessionId=session_id,
    inputText="My order ORD-447829 hasn't arrived. It was supposed to be here 3 days ago.",
    enableTrace=True,  # Returns the agent's reasoning steps
)

# Stream the agent's response (it may invoke multiple tools before answering)
event_stream = response["completion"]
for event in event_stream:
    if "chunk" in event:
        chunk = event["chunk"]
        print(chunk["bytes"].decode(), end="", flush=True)
    elif "trace" in event:
        trace = event["trace"]["trace"]
        if "orchestrationTrace" in trace:
            orch = trace["orchestrationTrace"]
            if "invocationInput" in orch:
                inv = orch["invocationInput"]
                if inv.get("invocationType") == "ACTION_GROUP":
                    ag_inv = inv["actionGroupInvocationInput"]
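                    # OpenAPI-schema action groups report apiPath; function-schema groups report function
                    target = ag_inv.get("apiPath") or ag_inv.get("function")
                    print(f"\n[Agent calling: {target}({ag_inv.get('parameters', [])})]")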

Fine-Tuning and Continued Pre-Training on Bedrock

Fine-tuning adapts a foundation model to your specific domain, tone, or task format. Bedrock supports two customisation modes on eligible base models (currently Amazon Titan, Cohere Command, and Meta Llama): Fine-tuning (supervised, uses labelled prompt-completion pairs) and Continued Pre-Training (unsupervised, uses raw domain text to shift the model's knowledge distribution without requiring explicit labels).

When to Fine-Tune vs Prompt Engineer

Fine-tuning is expensive in time and data preparation. Before committing, try these cheaper alternatives in order: (1) better system prompts with examples (few-shot), (2) Bedrock Knowledge Bases for factual grounding, (3) model parameter tuning (temperature, top-p). Only fine-tune when you have 1,000+ high-quality labelled examples and consistent quality requirements that prompting alone cannot meet.

Preparing Fine-Tuning Data

Fine-tuning data must be in JSONL format, one example per line. Each line is a JSON object with prompt and completion fields. Upload the file to S3 before starting the job.

{"prompt": "Classify the sentiment of this support ticket: 'The product broke on day 1 and support is unresponsive.'", "completion": "NEGATIVE"}
{"prompt": "Classify the sentiment of this support ticket: 'Setup was seamless and the team loves it!'", "completion": "POSITIVE"}
{"prompt": "Classify the sentiment of this support ticket: 'Works as expected, nothing special.'", "completion": "NEUTRAL"}
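A malformed line can fail a customisation job after it has already been queued, so it may be worth checking the file locally first. The validator below is a sketch of that idea; validate_jsonl is a hypothetical helper, and it checks only the prompt/completion shape shown above, not any model-specific length limits.

```python
import json


def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples for malformed training records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for field in ("prompt", "completion"):
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{field}'"))
    return problems


# Usage:
#   with open("train.jsonl", encoding="utf-8") as handle:
#       for number, problem in validate_jsonl(handle):
#           print(f"line {number}: {problem}")
```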

Starting a Fine-Tuning Job

import boto3
import json

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_customization_job(
    jobName="support-ticket-classifier-v1",
    customModelName="support-classifier-titan",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",
    trainingDataConfig={
        "s3Uri": "s3://my-training-data/fine-tune/train.jsonl",
    },
    validationDataConfig={
        "validators": [{"s3Uri": "s3://my-training-data/fine-tune/val.jsonl"}]
    },
    outputDataConfig={
        "s3Uri": "s3://my-training-data/fine-tune/output/",
    },
    hyperParameters={
        "epochCount": "5",
        "batchSize": "8",
        "learningRate": "0.00005",
    },
)

job_arn = response["jobArn"]
print(f"Fine-tuning job ARN: {job_arn}")

# Poll for completion
import time
while True:
    status = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    job_status = status["status"]
    print(f"Status: {job_status}")
    if job_status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)

if job_status == "Completed":
    custom_model_arn = status["outputModelArn"]
    print(f"Custom model ARN: {custom_model_arn}")
    # Once Provisioned Throughput is purchased for it, invoke the model via invoke_model using the provisioned model ARN

Provisioned Throughput Required: Custom (fine-tuned) Bedrock models cannot be invoked on-demand. You must purchase Provisioned Throughput (even a 1-month commitment works) and invoke the model through the resulting provisioned model ARN. This adds roughly $7–30/hour depending on model size, so check current pricing and factor it into your cost model before committing to fine-tuning.
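The purchase step and its cost impact can be sketched as follows, assuming a boto3 "bedrock" client and the custom_model_arn from the job above. The helper names, the 730-hour month, and the example hourly rate are illustrative assumptions, not AWS pricing.

```python
def provision_custom_model(bedrock_client, custom_model_arn, name, model_units=1, commitment=None):
    """Purchase Provisioned Throughput for a custom model; return the provisioned model ARN."""
    kwargs = {
        "modelUnits": model_units,
        "provisionedModelName": name,
        "modelId": custom_model_arn,
    }
    if commitment:
        # "OneMonth" or "SixMonths"; leaving it out requests a no-commitment term
        kwargs["commitmentDuration"] = commitment
    response = bedrock_client.create_provisioned_model_throughput(**kwargs)
    return response["provisionedModelArn"]


def monthly_cost(hourly_rate, model_units=1, hours_per_month=730):
    """Estimate the monthly bill for always-on provisioned capacity."""
    return hourly_rate * model_units * hours_per_month


# Usage (this call incurs charges as soon as it succeeds):
#   arn = provision_custom_model(bedrock, custom_model_arn, "support-classifier-pt",
#                                commitment="OneMonth")
#   # Pass arn as modelId to bedrock-runtime invoke_model once the status is InService.
#   print(monthly_cost(7.0))  # hypothetical $7/hour rate
```

Even at the low end of the hourly range quoted above, one model unit running all month costs several thousand dollars, which is why the fine-tune-or-not decision deserves a cost estimate up front.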

Guardrails — Content Filtering, PII Redaction, and Topic Denial

Bedrock Guardrails is a moderation layer you attach to any model invocation. It intercepts both the user's input and the model's output and applies a configurable set of policies: filter harmful content (hate, insults, sexual content, violence, misconduct, and prompt attacks), block topics you define (e.g., "competitor products", "legal advice"), redact PII (names, SSNs, credit card numbers, email addresses), and enforce word-level deny lists. Guardrails work with all Bedrock models and also with Knowledge Bases and Agents, so a single guardrail config can protect your entire generative AI surface.

import boto3
import json

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create a guardrail
guardrail_response = bedrock.create_guardrail(
    name="production-safety-guardrail",
    description="Safety guardrail for customer-facing chatbot",

    # Block harmful content categories
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "SEXUAL",    "inputStrength": "HIGH",   "outputStrength": "HIGH"},
            {"type": "VIOLENCE",  "inputStrength": "MEDIUM", "outputStrength": "HIGH"},
            {"type": "HATE",      "inputStrength": "HIGH",   "outputStrength": "HIGH"},
            {"type": "INSULTS",   "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
            {"type": "MISCONDUCT","inputStrength": "HIGH",   "outputStrength": "HIGH"},
        ]
    },

    # Deny specific topics entirely
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "legal-advice",
                "definition": "Requests for specific legal advice, interpretation of contracts, or legal representation",
                "examples": [
                    "Can I sue my employer for this?",
                    "Is my NDA enforceable?",
                ],
                "type": "DENY",
            },
            {
                "name": "competitor-comparison",
                "definition": "Questions asking to compare Techoral products against specific named competitors",
                "examples": ["How does Techoral compare to Competitor X?"],
                "type": "DENY",
            },
        ]
    },

    # Redact PII from inputs and outputs
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL",              "action": "ANONYMIZE"},
            {"type": "PHONE",              "action": "ANONYMIZE"},
            {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
            {"type": "NAME",               "action": "ANONYMIZE"},
        ]
    },

    # Custom word blocklist
    wordPolicyConfig={
        "wordsConfig": [
            {"text": "competitors-product-name"},
        ],
        "managedWordListsConfig": [{"type": "PROFANITY"}],
    },

    blockedInputMessaging="I cannot process that request. Please rephrase your question.",
    blockedOutputsMessaging="I cannot provide that information. How else can I help you?",
)

guardrail_id = guardrail_response["guardrailId"]
guardrail_version = guardrail_response["version"]
print(f"Guardrail ID: {guardrail_id}, Version: {guardrail_version}")

# Apply guardrail to a model invocation
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "My email is john@example.com. What is your refund policy?"}],
    }),
    contentType="application/json",
    accept="application/json",
    guardrailIdentifier=guardrail_id,
    guardrailVersion=guardrail_version,
    trace="ENABLED",
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
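# With the ANONYMIZE action, detected PII is replaced by a type placeholder
# such as {EMAIL} wherever the guardrail masks text; entities set to BLOCK
# cause the blocked message to be returned instead. Check the trace to see
# exactly which policy acted on this request.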
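It is often useful to know whether the guardrail acted on a given call. The sketch below assumes the invoke_model response body carries an amazon-bedrock-guardrailAction field (INTERVENED or NONE) and, when trace is enabled, an amazon-bedrock-trace field; verify both against the responses you actually receive. The helper names and sample payloads are hypothetical.

```python
def guardrail_intervened(result: dict) -> bool:
    """True when the parsed response body reports a guardrail intervention."""
    return result.get("amazon-bedrock-guardrailAction") == "INTERVENED"


def guardrail_trace(result: dict) -> dict:
    """Return the guardrail section of the trace, or {} when tracing is off."""
    return result.get("amazon-bedrock-trace", {}).get("guardrail", {})


# Hypothetical parsed bodies, shaped like json.loads(response["body"].read())
blocked = {
    "amazon-bedrock-guardrailAction": "INTERVENED",
    "amazon-bedrock-trace": {"guardrail": {"input": {}}},
    "content": [{"type": "text", "text": "I cannot process that request."}],
}
passed = {
    "amazon-bedrock-guardrailAction": "NONE",
    "content": [{"type": "text", "text": "Refunds are available within 30 days."}],
}

for body in (blocked, passed):
    print(guardrail_intervened(body), bool(guardrail_trace(body)))
```

Logging this flag next to the request ID makes it easier to separate guardrail blocks from ordinary model refusals when reviewing transcripts.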

Guardrail Metrics in CloudWatch: Bedrock publishes guardrail metrics to CloudWatch in the AWS/Bedrock/Guardrails namespace, including Invocations and InvocationsIntervened, with dimensions that allow a per-policy breakdown. Create a CloudWatch Dashboard showing daily intervention rates per guardrail policy type. Spikes indicate either prompt injection attempts or legitimate user confusion about what the chatbot can help with.
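A minimal sketch of the spike check described above. It assumes the daily invocation and intervention counts have already been pulled from CloudWatch into two lists; the 7-day window and the 2x threshold are arbitrary illustrative choices, and the sample numbers are made up.

```python
from statistics import mean


def intervention_rates(invocations, interventions):
    """Per-day share of guardrail invocations that ended in an intervention."""
    return [i / n if n else 0.0 for n, i in zip(invocations, interventions)]


def spike_days(rates, window=7, factor=2.0):
    """Indices of days whose rate exceeds `factor` times the trailing-window mean."""
    spikes = []
    for idx in range(window, len(rates)):
        baseline = mean(rates[idx - window:idx])
        if baseline > 0 and rates[idx] > factor * baseline:
            spikes.append(idx)
    return spikes


# Hypothetical daily counts: seven quiet days followed by one noisy day
daily_invocations = [1000] * 8
daily_interventions = [20, 22, 18, 21, 19, 20, 20, 75]

rates = intervention_rates(daily_invocations, daily_interventions)
print(spike_days(rates))
```

Running the same check per policy type helps tell a burst of denied-topic hits (often user confusion) from a burst of content-filter hits (often abuse).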

Prompt Management — Prompt Flows and Versioning

Bedrock Prompt Management lets you store, version, and deploy prompts as first-class resources, separate from application code. This solves a common problem: prompt changes are made ad hoc by engineers, are not reviewed or tracked, and the production regressions they cause are difficult to diagnose. With Prompt Management, every prompt has a unique ARN and a version history, so your application can reference a specific prompt version instead of hardcoding prompt text.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Create a versioned prompt
prompt_response = bedrock_agent.create_prompt(
    name="support-chat-system-prompt",
    description="System prompt for the customer support chatbot",
    variants=[
        {
            "name": "default",
            "templateType": "TEXT",
            "templateConfiguration": {
                "text": {
                    "text": """You are a friendly and knowledgeable support agent for {{company_name}}.
Your goals:
- Resolve customer issues on the first interaction
- Be empathetic and solution-focused
- Escalate to a human agent if the issue cannot be resolved in 3 turns
- Never discuss pricing without checking the current rate card

Current date: {{current_date}}
Agent name: {{agent_name}}""",
                    "inputVariables": [
                        {"name": "company_name"},
                        {"name": "current_date"},
                        {"name": "agent_name"},
                    ],
                }
            },
            "modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0",
            "inferenceConfiguration": {
                "text": {"temperature": 0.3, "maxTokens": 2048}
            },
        }
    ],
)

prompt_id = prompt_response["id"]
print(f"Prompt ID: {prompt_id}")

# Create a version (immutable snapshot)
version_response = bedrock_agent.create_prompt_version(
    promptIdentifier=prompt_id,
    description="Version 1 — initial production release",
)
version_number = version_response["version"]
print(f"Prompt version: {version_number}")

# Invoke the model via the Converse API.
# The system prompt is inlined here for brevity (see the comment inside the call).
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[
        {
            "text": "You are a support agent for Techoral. Be helpful and concise."
            # In practice, fetch prompt text from Bedrock Prompt Management
            # and substitute variables before passing here
        }
    ],
    messages=[{"role": "user", "content": [{"text": "How do I reset my password?"}]}],
)

print(response["output"]["message"]["content"][0]["text"])
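The comment in the call above mentions substituting variables before passing the prompt text to the model. Here is a minimal sketch of that step, assuming the template text has already been retrieved (for example with the bedrock-agent get_prompt call) and uses the {{name}} placeholder syntax shown earlier. The render_prompt helper and the sample values are hypothetical.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder; fail loudly if a value is missing."""
    missing = set(_PLACEHOLDER.findall(template)) - set(variables)
    if missing:
        raise KeyError(f"Missing prompt variables: {sorted(missing)}")
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


template = (
    "You are a friendly and knowledgeable support agent for {{company_name}}.\n"
    "Current date: {{current_date}}\n"
    "Agent name: {{agent_name}}"
)

system_text = render_prompt(
    template,
    {"company_name": "Techoral", "current_date": "2026-01-15", "agent_name": "Sam"},
)
print(system_text)
```

Raising on a missing variable is a deliberate choice: sending a literal {{agent_name}} to the model is harder to notice than a failed request.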

Bedrock Flows (formerly Prompt Flows) goes further: it provides a visual no-code pipeline builder where you chain model invocations, knowledge base retrievals, Lambda calls, and conditional branching. A flow is defined as a directed acyclic graph and can be tested, versioned, and deployed behind a stable alias that your application invokes via invoke_flow.
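To illustrate why the acyclic property matters, the sketch below models a flow as a plain dependency mapping and derives a valid execution order with the standard library. This is not the Bedrock flow definition format; the node names are hypothetical and stand in for the kinds of steps listed above.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical flow: each node maps to the set of nodes it depends on
flow = {
    "input": set(),
    "retrieve_kb": {"input"},
    "draft_answer": {"input", "retrieve_kb"},
    "policy_check": {"draft_answer"},
    "output": {"policy_check"},
}


def execution_order(graph: dict) -> list:
    """Return an order in which every node runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


print(execution_order(flow))

# A cycle has no valid order, which is why it is rejected up front
try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError as exc:
    print("rejected:", exc.args[0])
```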

Cost Optimisation — On-Demand vs Provisioned Throughput

Bedrock pricing has two modes. Understanding both is essential to avoiding bill shock and choosing the right architecture for your traffic profile.

On-Demand Pricing

You pay per token processed — input tokens and output tokens priced separately. There is no minimum commitment, no idle cost, and no capacity reservation. This is the right choice for variable or unpredictable workloads, prototypes, and applications with bursts of traffic separated by long periods of inactivity.

Model	Input (per 1K tokens)	Output (per 1K tokens)	Notes
Claude 3.5 Sonnet	$0.003	$0.015	Best quality/cost for reasoning tasks
Claude 3 Haiku	$0.00025	$0.00125	12x cheaper than Sonnet, good for classification/extraction
Llama 3.1 70B Instruct	$0.00265	$0.0035	Strong open-weight alternative
Llama 3.1 8B Instruct	$0.00022	$0.00022	Ultra-low cost, fast, for simple tasks
Titan Text Express	$0.0008	$0.0016	AWS-native, no third-party agreement
Titan Embeddings V2	$0.00002	N/A	Input tokens only; embeddings produce no output tokens
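The per-request arithmetic behind the table can be sketched as follows. The prices are copied from the rows above and will drift over time; the dictionary keys are hypothetical labels for this example, not Bedrock model IDs.

```python
# (input, output) USD per 1K tokens, taken from the table above
PRICES_PER_1K = {
    "claude-3.5-sonnet": (0.003, 0.015),
    "claude-3-haiku": (0.00025, 0.00125),
    "llama-3.1-8b": (0.00022, 0.00022),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """On-demand cost in USD for a single request."""
    price_in, price_out = PRICES_PER_1K[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


# Same request shape on two models: 2,000 input tokens, 500 output tokens
sonnet = request_cost("claude-3.5-sonnet", 2000, 500)
haiku = request_cost("claude-3-haiku", 2000, 500)
print(f"Sonnet ${sonnet:.6f}  Haiku ${haiku:.6f}  ratio {sonnet / haiku:.1f}x")
```

Multiplying the per-request figure by expected daily volume gives a quick first estimate before any traffic exists.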

Provisioned Throughput

Provisioned Throughput reserves Model Units (MUs) — each MU guarantees a fixed number of tokens per minute. You pay per hour regardless of whether you use the full throughput. The break-even point compared to on-demand is typically around 60–70% sustained utilisation. Provisioned Throughput is the right choice for high-volume, consistent workloads where predictable latency matters and usage is above the on-demand break-even.
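The break-even reasoning can be written down as a small calculation. Every number in this sketch is a placeholder: the hourly rate, the blended on-demand price, and the per-MU capacity are assumptions chosen only to show the shape of the arithmetic, so substitute the figures AWS quotes for your model.

```python
def break_even_tokens_per_hour(pt_hourly_usd: float, on_demand_per_1k_usd: float) -> float:
    """Tokens per hour at which on-demand spend equals the provisioned hourly fee."""
    return pt_hourly_usd / on_demand_per_1k_usd * 1000


def break_even_utilisation(
    pt_hourly_usd: float, on_demand_per_1k_usd: float, mu_tokens_per_minute: float
) -> float:
    """Share of one model unit's capacity that must be used to break even."""
    capacity_per_hour = mu_tokens_per_minute * 60
    return break_even_tokens_per_hour(pt_hourly_usd, on_demand_per_1k_usd) / capacity_per_hour


# Placeholder inputs (assumptions, not published AWS figures)
utilisation = break_even_utilisation(
    pt_hourly_usd=20.0,
    on_demand_per_1k_usd=0.001,   # blended input/output price
    mu_tokens_per_minute=500_000,
)
print(f"Break-even at {utilisation:.0%} sustained utilisation")
```

If measured utilisation sits well below the computed break-even, on-demand remains cheaper even when traffic feels steady.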

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Purchase provisioned throughput — 1 model unit, 1-month commitment
pt_response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="production-claude-haiku",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    modelUnits=1,               # Throughput per MU varies by model; confirm the figure for your model with AWS
    commitmentDuration="OneMonth",  # "OneMonth" or "SixMonths"; longer terms are billed at a lower hourly rate
)

provisioned_model_arn = pt_response["provisionedModelArn"]
print(f"Provisioned model ARN: {provisioned_model_arn}")

# Invoke using the provisioned model ARN instead of the base model ID
import json
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId=provisioned_model_arn,   # Use provisioned ARN for guaranteed throughput
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Classify this ticket: billing issue"}],
    }),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(response["body"].read())["content"][0]["text"])

Token Counting Before Invocation

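The Bedrock runtime API includes a CountTokens operation (count_tokens in boto3) that returns the input token count for a request without invoking the model. It accepts either a Converse-style or an InvokeModel-style request and is available only for models that support token counting. Use this in batch pipelines to pre-filter large inputs that would exceed model context windows or budget thresholds before paying for the full invocation.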

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Count tokens without invoking the model (no charge)
messages = [
    {"role": "user", "content": [{"text": "Summarise this 50-page document: " + "x" * 50000}]}
]

token_count_response = client.count_tokens(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    input={"converse": {"messages": messages}},  # wrap the Converse-style request
)

total_tokens = token_count_response["inputTokens"]
print(f"Input token count: {total_tokens}")

MAX_TOKENS = 100_000
if total_tokens > MAX_TOKENS:
    print(f"Input exceeds {MAX_TOKENS} token budget — truncating or chunking required")
else:
    # Safe to invoke
    response = client.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        messages=messages,
    )
    print(response["output"]["message"]["content"][0]["text"])
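When the budget check fails, one option is to split the input before invoking. The sketch below is a rough character-based chunker; the 4-characters-per-token ratio is a heuristic assumption rather than the model's tokenizer, so each chunk should still be verified with count_tokens. The function name is hypothetical.

```python
def chunk_text(text: str, max_tokens: int, chars_per_token: float = 4.0) -> list:
    """Split text into pieces of at most roughly `max_tokens` tokens each."""
    max_chars = int(max_tokens * chars_per_token)
    if max_chars <= 0:
        raise ValueError("max_tokens must allow at least one character")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut at the last space so words stay intact
            split = text.rfind(" ", start, end)
            if split > start:
                end = split
        chunks.append(text[start:end].strip())
        start = end
    return [c for c in chunks if c]


document = "word " * 1000  # 5,000 characters of sample text
pieces = chunk_text(document, max_tokens=100)
print(len(pieces), max(len(p) for p in pieces))
```

Cutting on whitespace keeps words intact; a production pipeline would more likely split on paragraph or section boundaries so each chunk stays coherent.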

Cost Optimisation Checklist: (1) Use Claude 3 Haiku for classification, extraction, and routing tasks; it is 12x cheaper than Sonnet with comparable accuracy on structured tasks. (2) Cache system prompts using prompt caching on supported models (50–90% input token savings on repeated prompts). (3) Set max_tokens to the minimum needed, because you are charged for every output token generated. (4) Use Knowledge Bases to reduce context window usage instead of stuffing full documents into prompts. (5) Monitor token usage per model using the CloudWatch metrics InputTokenCount and OutputTokenCount in the AWS/Bedrock namespace.
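For item (2), the Converse API marks the end of a cacheable prefix with a cachePoint content block. The sketch below only builds the system list; it assumes the chosen model supports prompt caching and that the prompt is long enough to meet that model's minimum cacheable size, both of which should be checked in the Bedrock documentation. The helper name is hypothetical.

```python
def cached_system_blocks(system_prompt: str) -> list:
    """Build a Converse `system` list with a cache checkpoint after the prompt."""
    return [
        {"text": system_prompt},
        {"cachePoint": {"type": "default"}},  # everything before this block is cacheable
    ]


system = cached_system_blocks("You are a support agent for Techoral. Be helpful and concise.")
print(system)

# Intended use (not executed here):
# client.converse(modelId=..., system=system, messages=[...])
```

Caching only pays off when the prefix is byte-for-byte identical across calls, so keep per-request values such as the current date out of the cached portion.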

Frequently Asked Questions

Does Bedrock train on my data?

No. AWS explicitly states that data you send to Bedrock for model inference is not used to train or improve foundation models. Your prompts and completions are not stored beyond the session unless you explicitly enable logging to S3 via Bedrock Model Invocation Logging. This is a key differentiator from some third-party AI API providers and is backed by AWS data processing agreements (DPA) and SOC 2 / ISO 27001 compliance certifications.

Which Bedrock model should I use for my chatbot?

For general-purpose customer-facing chatbots, Claude 3.5 Sonnet offers the best combination of quality, instruction-following, and safety. For high-volume classification, routing, or extraction tasks where cost per call matters more than maximum capability, Claude 3 Haiku cuts costs by 12x with acceptable quality degradation. For open-weight requirements (compliance, audit, local deployment fallback), Llama 3.1 70B is the strongest option. Run a model evaluation using Bedrock's built-in Model Evaluation feature — it benchmarks multiple models against your specific test dataset before you commit to one.

What is the context window limit for Bedrock models?

Context limits vary by model: Claude 3.5 Sonnet supports 200K tokens (~150,000 words), Llama 3.1 405B supports 128K tokens, Titan Text Express supports 8K tokens, and Mistral Large supports 32K tokens. For documents exceeding the context limit, use Bedrock Knowledge Bases to retrieve only relevant chunks rather than stuffing the entire document into the prompt — this also reduces cost significantly.

How do I use Bedrock with LangChain or LlamaIndex?

Both LangChain and LlamaIndex have native Bedrock integrations. In LangChain, use from langchain_aws import ChatBedrock and pass the model_id and region_name. In LlamaIndex, use from llama_index.llms.bedrock import Bedrock. Both abstractions sit on top of the boto3 client, so they inherit your IAM role's Bedrock permissions automatically — no separate API keys required.

Can I run Bedrock in a private VPC with no internet access?

Yes. Create a VPC endpoint for com.amazonaws.us-east-1.bedrock-runtime (Interface endpoint). Once the endpoint is active, all Bedrock API traffic stays on the AWS private network — it never traverses the public internet. Combine this with SCPs that deny Bedrock calls not made through the VPC endpoint for defence-in-depth. This configuration is required for many financial services and government compliance frameworks.
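The deny rule mentioned above could look roughly like the following policy, built here as a Python dictionary so it can be serialised to JSON. It relies on the aws:SourceVpce global condition key; the endpoint ID is a placeholder, the function name is hypothetical, and the action list is an assumption that covers model invocation only.

```python
import json


def deny_outside_vpce_policy(vpce_id: str) -> dict:
    """Policy that denies Bedrock invocations not arriving via the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBedrockInvokeOutsideVpce",
                "Effect": "Deny",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": "*",
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


policy = deny_outside_vpce_policy("vpce-0123456789abcdef0")  # placeholder endpoint ID
print(json.dumps(policy, indent=2))
```

Test a policy like this in a sandbox account first, since a deny on every non-endpoint path also blocks console and CI usage that does not route through the VPC.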