AWS Bedrock: Build Generative AI Apps with Foundation Models
AWS Bedrock is the fastest path from a generative AI idea to a production-ready application on AWS. Instead of managing GPU clusters, downloading model weights, or negotiating direct API agreements with AI labs, you call a single unified API that fronts a curated marketplace of foundation models: Claude from Anthropic, Llama from Meta, Mistral, Amazon Titan, Stable Diffusion for images, and more. With on-demand pricing you pay only for the tokens you process, and there is nothing to provision. Models run on AWS-managed infrastructure in the region you choose; you can reach them privately from your VPC through an interface endpoint (AWS PrivateLink) and apply the usual AWS security and compliance controls.
This guide covers everything you need to build production generative AI applications on Bedrock: direct model invocation with streaming, Retrieval-Augmented Generation with Knowledge Bases, multi-step agentic workflows with Bedrock Agents, content safety with Guardrails, fine-tuning your own model, and cost optimisation strategies. All code examples use Python boto3 and run against the real Bedrock API.
Table of Contents
- What Is Bedrock — The Foundation Model Marketplace
- Bedrock vs SageMaker vs OpenAI API
- Invoking Models — InvokeModel API with boto3
- Streaming Responses — Real-time Token Output
- Knowledge Bases — RAG with S3 and OpenSearch
- Bedrock Agents — Multi-Step Reasoning and Tool Use
- Fine-Tuning and Continued Pre-Training
- Guardrails — Content Filtering and PII Redaction
- Prompt Management — Flows and Versioning
- Cost Optimisation — On-Demand vs Provisioned Throughput
- Frequently Asked Questions
What Is Bedrock — The Foundation Model Marketplace
AWS Bedrock is a fully managed service that provides access to high-performance foundation models (FMs) through a single API. AWS handles all the infrastructure: GPU cluster management, model serving, scaling, availability, and security. You interact with models via HTTP endpoints — no servers, no containers, no model weights on your disk.
The model catalogue as of mid-2026 covers most of the major model families, although availability varies by region:
| Provider | Models Available | Best For |
|---|---|---|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus | Reasoning, coding, long context, instruction following |
| Meta | Llama 3.1 8B / 70B / 405B Instruct | Open-weight flexibility, cost efficiency, fine-tuning base |
| Mistral AI | Mistral Large, Mistral 7B, Mixtral 8x7B | Multilingual, code generation, fast inference |
| Amazon | Titan Text Lite, Titan Text Express, Titan Embeddings V2, Titan Image Generator | AWS-native, embeddings, image generation |
| Stability AI | Stable Diffusion 3 Large, SDXL 1.0 | Image generation, image editing |
| Cohere | Command R+, Command R, Embed | RAG retrieval, enterprise search, multilingual embeddings |
| AI21 Labs | Jamba 1.5 Large, Jamba 1.5 Mini | Long-context, structured output, summarisation |
Bedrock's killer feature is not just model access — it's the managed capability layer built on top: Knowledge Bases for RAG, Agents for orchestration, Guardrails for safety, Prompt Management for versioning, and Model Evaluation for benchmarking. These are production features you would otherwise spend months building yourself.
Bedrock vs SageMaker vs OpenAI API — When to Use What
Three services dominate the conversation when teams are choosing an AI backend. Each occupies a distinct niche. Choosing the wrong one means either overpaying for capabilities you don't need or under-investing in infrastructure that bites you at scale.
| Dimension | AWS Bedrock | AWS SageMaker | OpenAI API |
|---|---|---|---|
| Model choice | Multi-provider FM marketplace | Any model (HuggingFace, custom) | OpenAI models only (GPT-4o, o1, etc.) |
| Infrastructure | Fully serverless — zero management | Managed but you configure instances | Fully serverless — zero management |
| Custom models | Fine-tuning on supported base models | Full control — any framework, any GPU | Fine-tuning on select GPT models |
| Data residency | Stays in your AWS region / VPC | Stays in your AWS region / VPC | Leaves your environment (US servers) |
| RAG / Agents | Built-in (Knowledge Bases, Agents) | DIY with LangChain/LlamaIndex | Assistants API (limited RAG) |
| Pricing model | Per token (on-demand) or throughput reservation | Per instance-hour + storage | Per token |
| AWS service integration | Native (S3, Lambda, CloudWatch, VPC) | Native + deepest ML tooling | API only — you wire the integrations |
| Best for | Gen AI apps, RAG, agents, multi-model | Custom training, MLOps, non-FM models | Teams already on OpenAI, prototyping |
Invoking Models — InvokeModel API with boto3
Every Bedrock foundation model is reachable via the bedrock-runtime boto3 client. The invoke_model call takes a JSON body whose schema varies by model provider, but the outer API is always the same: you supply the modelId, a JSON body serialised to bytes, and content-type headers. The response contains a JSON body with the model's output.
IAM Policy Required
Your calling role needs the bedrock:InvokeModel permission scoped to the specific model ARN. Using a wildcard is acceptable for development but tighten it in production:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "BedrockInvokeModels",
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": [
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
"arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-express-v1"
]
}
]
}
Invoking Claude 3.5 Sonnet
Anthropic models use the Messages API format. The body contains a messages array following the user/assistant turn structure, an optional system prompt, and generation parameters. Note that max_tokens is required for Anthropic models — there is no default.
import boto3
import json
# Create the Bedrock Runtime client
client = boto3.client(
service_name="bedrock-runtime",
region_name="us-east-1",
)
# Invoke Claude 3.5 Sonnet using the Messages API
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"system": "You are a senior AWS solutions architect. Be concise and practical.",
"messages": [
{
"role": "user",
"content": "Explain the difference between Bedrock Knowledge Bases and a custom RAG pipeline in 3 bullet points."
}
],
"temperature": 0.3,
"top_p": 0.9,
})
response = client.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=body,
contentType="application/json",
accept="application/json",
)
result = json.loads(response["body"].read())
answer = result["content"][0]["text"]
print(answer)
# Usage stats
usage = result["usage"]
print(f"Input tokens: {usage['input_tokens']}, Output tokens: {usage['output_tokens']}")
Invoking Amazon Titan Text Express
Amazon's Titan models use a different request schema with an inputText field and a textGenerationConfig block. This is useful for workloads where you want a fully AWS-native model with no third-party data agreements.
import boto3
import json
client = boto3.client("bedrock-runtime", region_name="us-east-1")
body = json.dumps({
"inputText": "Summarise the main benefits of serverless computing in 5 sentences.",
"textGenerationConfig": {
"maxTokenCount": 512,
"temperature": 0.5,
"topP": 0.9,
"stopSequences": [],
},
})
response = client.invoke_model(
modelId="amazon.titan-text-express-v1",
body=body,
contentType="application/json",
accept="application/json",
)
result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
print(f"Token count: {result['results'][0]['tokenCount']}")
Invoking Meta Llama 3.1 70B
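Meta's Llama models on Bedrock take a raw prompt string rather than a structured messages array when called through invoke_model. You build the prompt yourself using the Llama 3 chat template, with special tokens marking the system and user turns and an open assistant header at the end. The model returns the assistant's continuation of that prompt.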
import boto3
import json
client = boto3.client("bedrock-runtime", region_name="us-east-1")
# Llama 3.1 uses a specific chat template
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful Python tutor.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Write a Python function that implements binary search with type hints.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""
body = json.dumps({
"prompt": prompt,
"max_gen_len": 512,
"temperature": 0.2,
"top_p": 0.9,
})
response = client.invoke_model(
modelId="meta.llama3-1-70b-instruct-v1:0",
body=body,
contentType="application/json",
accept="application/json",
)
result = json.loads(response["body"].read())
print(result["generation"])
print(f"Prompt tokens: {result['prompt_token_count']}, Generation tokens: {result['generation_token_count']}")
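The three request bodies above each follow a provider-specific schema. For contrast, here is a minimal sketch of the single request shape accepted by the Converse API. The helper name build_converse_request and its defaults are illustrative assumptions; the keys it emits (modelId, messages, system, inferenceConfig, additionalModelRequestFields) are the Converse request fields. The actual call is left as a comment because it needs AWS credentials.

```python
from typing import Optional


def build_converse_request(
    model_id: str,
    user_text: str,
    system_text: Optional[str] = None,
    max_tokens: int = 512,
    temperature: float = 0.3,
    extra_fields: Optional[dict] = None,
) -> dict:
    """Build keyword arguments for the bedrock-runtime converse() call."""
    request: dict = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature},
    }
    if system_text:
        request["system"] = [{"text": system_text}]
    if extra_fields:
        # Provider-specific parameters (e.g. Anthropic's top_k) are passed through here
        request["additionalModelRequestFields"] = extra_fields
    return request


request = build_converse_request(
    "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "Explain blue/green deployments in two sentences.",
    system_text="You are a senior AWS solutions architect.",
    extra_fields={"top_k": 50},
)
# With credentials configured:
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.converse(**request)
#   print(response["output"]["message"]["content"][0]["text"])
```

Switching to a Llama or Titan model then only requires a different model_id; the messages and inferenceConfig stay the same.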
Use the Converse API, client.converse(modelId=..., messages=[...]), instead of invoke_model when you want to swap models without changing request code. The trade-off is that provider-specific parameters (like Anthropic's top_k) are not part of the unified schema; they have to be passed through additionalModelRequestFields.
Streaming Responses — Real-time Token Output
By default invoke_model waits until the model finishes generating the entire response before returning it. For a 1,000-token response from Claude 3.5 Sonnet, that is roughly 3–5 seconds of silence before anything appears. For interactive chatbots and UI applications, streaming is non-negotiable — users see tokens as they are generated, giving the impression of instant response.
invoke_model_with_response_stream returns an event stream. Each event carries a chunk of the response body. For Anthropic models these are Server-Sent Events encoded in the Bedrock event stream format. The code below handles the full event lifecycle:
import boto3
import json
client = boto3.client("bedrock-runtime", region_name="us-east-1")
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 2048,
"system": "You are an expert cloud architect.",
"messages": [
{
"role": "user",
"content": "Write a step-by-step guide to implementing blue/green deployments on AWS ECS."
}
],
"temperature": 0.4,
})
response = client.invoke_model_with_response_stream(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=body,
contentType="application/json",
accept="application/json",
)
# Stream events from the response
stream = response.get("body")
full_text = ""
input_tokens = 0
output_tokens = 0
stop_reason = None
for event in stream:
    chunk = event.get("chunk")
    if not chunk:
        continue
    chunk_data = json.loads(chunk.get("bytes").decode())
    event_type = chunk_data.get("type")
    if event_type == "message_start":
        # The input token count arrives with the first event
        usage = chunk_data.get("message", {}).get("usage", {})
        input_tokens = usage.get("input_tokens", 0)
    elif event_type == "content_block_delta":
        delta = chunk_data.get("delta", {})
        if delta.get("type") == "text_delta":
            text_piece = delta.get("text", "")
            full_text += text_piece
            print(text_piece, end="", flush=True)  # Real-time output
    elif event_type == "message_delta":
        # stop_reason and the final output token count arrive here, not in message_stop
        stop_reason = chunk_data.get("delta", {}).get("stop_reason")
        output_tokens = chunk_data.get("usage", {}).get("output_tokens", 0)
    elif event_type == "message_stop":
        print(f"\n\n[Stop reason: {stop_reason}]")
print(f"\n[Tokens: input {input_tokens}, output {output_tokens}]")
print(f"[Total characters: {len(full_text)}]")
To forward the stream to a browser, wrap the generator in a StreamingResponse (as provided by FastAPI or Starlette) with media_type="text/event-stream". The browser receives Server-Sent Events and can render tokens in real time without WebSockets.
For the Converse API, streaming is available through client.converse_stream(), which also returns an event stream but uses a unified event schema regardless of the underlying model. That makes it the preferred approach when your application supports multiple model providers.
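As a sketch of what that unified schema looks like in practice, the function below accumulates text, stop reason, and token usage from converse_stream events. The function name and the hand-written sample events are assumptions for illustration; in a real call you would pass response["stream"] from client.converse_stream(...) instead.

```python
from typing import Iterable, Tuple


def collect_converse_stream(events: Iterable[dict]) -> Tuple[str, str, dict]:
    """Accumulate text, stop reason and token usage from converse_stream events."""
    text_parts = []
    stop_reason = ""
    usage: dict = {}
    for event in events:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"].get("delta", {})
            text_parts.append(delta.get("text", ""))
        elif "messageStop" in event:
            stop_reason = event["messageStop"].get("stopReason", "")
        elif "metadata" in event:
            usage = event["metadata"].get("usage", {})
    return "".join(text_parts), stop_reason, usage


# Hand-written events standing in for response["stream"]
sample_events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"text": "Hello"}, "contentBlockIndex": 0}},
    {"contentBlockDelta": {"delta": {"text": ", world"}, "contentBlockIndex": 0}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"messageStop": {"stopReason": "end_turn"}},
    {"metadata": {"usage": {"inputTokens": 12, "outputTokens": 4, "totalTokens": 16}}},
]
text, stop_reason, usage = collect_converse_stream(sample_events)
print(text, stop_reason, usage["outputTokens"])
```

Because the event keys do not depend on the provider, the same loop works whether the model behind modelId is Claude, Llama, or Titan.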
Knowledge Bases — RAG with S3 and OpenSearch Serverless
Retrieval-Augmented Generation (RAG) solves one of the most common limitations of LLMs: they have a knowledge cutoff and cannot access your proprietary documents. Bedrock Knowledge Bases manages the entire RAG pipeline: document ingestion from S3, chunking, embedding (using Amazon Titan Embeddings V2 or Cohere Embed), vector storage in OpenSearch Serverless or Aurora PostgreSQL (pgvector), and retrieval at query time — all without you managing any of the infrastructure.
Step 1 — Create the OpenSearch Serverless Collection
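Bedrock needs a vector store to hold document embeddings. OpenSearch Serverless is the easiest option because it requires no cluster sizing or capacity planning. Note that a collection does not scale to zero: it bills a minimum number of OpenSearch Compute Units (OCUs) for as long as it exists, so delete collections you are not using.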
import boto3
import json
import time
aoss = boto3.client("opensearchserverless", region_name="us-east-1")
# Create encryption policy
aoss.create_security_policy(
name="bedrock-kb-encryption",
type="encryption",
policy=json.dumps({
"Rules": [{"Resource": ["collection/bedrock-knowledge-base"], "ResourceType": "collection"}],
"AWSOwnedKey": True,
}),
)
# Create network policy (VPC endpoint or public — use public for simplicity)
aoss.create_security_policy(
name="bedrock-kb-network",
type="network",
policy=json.dumps([
{
"Rules": [
{"Resource": ["collection/bedrock-knowledge-base"], "ResourceType": "collection"},
{"Resource": ["collection/bedrock-knowledge-base"], "ResourceType": "dashboard"},
],
"AllowFromPublic": True,
}
]),
)
# Create access policy — allow Bedrock service role to access indices
bedrock_role_arn = "arn:aws:iam::123456789012:role/AmazonBedrockExecutionRoleForKnowledgeBase"
aoss.create_access_policy(
name="bedrock-kb-access",
type="data",
policy=json.dumps([
{
"Rules": [
{
"Resource": ["index/bedrock-knowledge-base/*"],
"Permission": [
"aoss:CreateIndex", "aoss:DeleteIndex", "aoss:UpdateIndex",
"aoss:DescribeIndex", "aoss:ReadDocument", "aoss:WriteDocument",
],
"ResourceType": "index",
}
],
"Principal": [bedrock_role_arn],
}
]),
)
# Create the collection
response = aoss.create_collection(name="bedrock-knowledge-base", type="VECTORSEARCH")
collection_id = response["createCollectionDetail"]["id"]
collection_endpoint = f"https://{collection_id}.us-east-1.aoss.amazonaws.com"
print(f"Collection ID: {collection_id}")
print(f"Endpoint: {collection_endpoint}")
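Collection creation is asynchronous, so the collection is usually still in the CREATING state when create_collection returns. A minimal sketch of a wait loop follows, assuming the opensearchserverless client's batch_get_collection call; the helper name, the timeout values, and the fake client used for the dry run are illustrative assumptions.

```python
import time
from typing import Callable


def wait_for_collection(
    aoss_client,
    collection_id: str,
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll batch_get_collection until the collection is ACTIVE or FAILED."""
    waited = 0.0
    while True:
        details = aoss_client.batch_get_collection(ids=[collection_id])["collectionDetails"]
        status = details[0]["status"] if details else "CREATING"
        if status == "ACTIVE":
            return status
        if status == "FAILED":
            raise RuntimeError(f"Collection {collection_id} failed to create")
        if waited >= timeout_s:
            raise TimeoutError(f"Collection {collection_id} still {status} after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s


class _FakeAoss:
    """Stand-in client for a dry run: reports CREATING twice, then ACTIVE."""

    def __init__(self):
        self.calls = 0

    def batch_get_collection(self, ids):
        self.calls += 1
        status = "ACTIVE" if self.calls >= 3 else "CREATING"
        return {"collectionDetails": [{"id": ids[0], "status": status}]}


fake = _FakeAoss()
final_status = wait_for_collection(fake, "abc123", sleep=lambda s: None)
print(final_status, fake.calls)
```

In real use you would pass the aoss client and collection_id from the code above and keep the default sleep. The vector index also has to exist in the collection before the Knowledge Base can be created against it.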
Step 2 — Create the Knowledge Base
import boto3
import json
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")
kb_response = bedrock_agent.create_knowledge_base(
name="product-documentation-kb",
description="Knowledge base for product documentation and FAQs",
roleArn="arn:aws:iam::123456789012:role/AmazonBedrockExecutionRoleForKnowledgeBase",
knowledgeBaseConfiguration={
"type": "VECTOR",
"vectorKnowledgeBaseConfiguration": {
"embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
},
},
storageConfiguration={
"type": "OPENSEARCH_SERVERLESS",
"opensearchServerlessConfiguration": {
"collectionArn": f"arn:aws:aoss:us-east-1:123456789012:collection/{collection_id}",
"vectorIndexName": "bedrock-knowledge-base-default-index",
"fieldMapping": {
"vectorField": "bedrock-knowledge-base-default-vector",
"textField": "AMAZON_BEDROCK_TEXT_CHUNK",
"metadataField": "AMAZON_BEDROCK_METADATA",
},
},
},
)
kb_id = kb_response["knowledgeBase"]["knowledgeBaseId"]
print(f"Knowledge Base ID: {kb_id}")
# Step 3 — Add an S3 data source
ds_response = bedrock_agent.create_data_source(
knowledgeBaseId=kb_id,
name="product-docs-s3",
dataSourceConfiguration={
"type": "S3",
"s3Configuration": {
"bucketArn": "arn:aws:s3:::my-product-documentation",
"inclusionPrefixes": ["docs/", "faqs/"], # Only ingest these prefixes
},
},
vectorIngestionConfiguration={
"chunkingConfiguration": {
"chunkingStrategy": "FIXED_SIZE",
"fixedSizeChunkingConfiguration": {
"maxTokens": 512, # Tokens per chunk
"overlapPercentage": 20, # 20% overlap between chunks for context continuity
},
}
},
)
ds_id = ds_response["dataSource"]["dataSourceId"]
print(f"Data Source ID: {ds_id}")
# Step 4 — Start ingestion job (sync S3 → vector store)
ingest = bedrock_agent.start_ingestion_job(
knowledgeBaseId=kb_id,
dataSourceId=ds_id,
)
job_id = ingest["ingestionJob"]["ingestionJobId"]
print(f"Ingestion job started: {job_id}")
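To make the FIXED_SIZE settings above concrete, here is a rough sketch of how overlapping fixed-size windows behave. It treats whitespace-separated words as tokens, which is an assumption for illustration only; Bedrock uses its own tokenizer and boundary handling, so real chunk contents will differ.

```python
from typing import List


def fixed_size_chunks(tokens: List[str], max_tokens: int, overlap_percentage: int) -> List[List[str]]:
    """Split tokens into windows of max_tokens that overlap by the given percentage."""
    if max_tokens <= 0 or not 0 <= overlap_percentage < 100:
        raise ValueError("max_tokens must be > 0 and overlap_percentage in [0, 100)")
    overlap = max_tokens * overlap_percentage // 100
    step = max_tokens - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
        start += step
    return chunks


words = [f"w{i}" for i in range(25)]
chunks = fixed_size_chunks(words, max_tokens=10, overlap_percentage=20)
for c in chunks:
    print(c[0], "...", c[-1], f"({len(c)} tokens)")
```

With 20% overlap, the last two tokens of each window reappear at the start of the next one, which is what preserves context across chunk boundaries at the cost of storing some text twice.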
Step 5 — Query the Knowledge Base (Retrieve + Generate)
import boto3
bedrock_agent_rt = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
# RetrieveAndGenerate — single call that retrieves context + generates answer
response = bedrock_agent_rt.retrieve_and_generate(
input={"text": "What is the return policy for enterprise subscriptions?"},
retrieveAndGenerateConfiguration={
"type": "KNOWLEDGE_BASE",
"knowledgeBaseConfiguration": {
"knowledgeBaseId": kb_id,
"modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
"retrievalConfiguration": {
"vectorSearchConfiguration": {
"numberOfResults": 5, # Retrieve top 5 chunks
"overrideSearchType": "HYBRID", # Hybrid: semantic + keyword
}
},
"generationConfiguration": {
"promptTemplate": {
"textPromptTemplate": "You are a support agent. Use only the provided context to answer. Context: $search_results$ Question: $query$"
}
},
},
},
)
print(response["output"]["text"])
print("\n--- Citations ---")
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        loc = ref["location"]["s3Location"]
        print(f"  Source: {loc['uri']} (score: {ref.get('score', 'N/A')})")
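If you want to control the generation step yourself, the bedrock-agent-runtime client also offers a retrieve call that returns only the matching chunks. The sketch below shows one way to turn those results into a prompt context block. The helper name, the 0.4 score threshold, and the sample results are assumptions for illustration; the results are hand-written in the shape of response["retrievalResults"].

```python
from typing import List


def build_context(retrieval_results: List[dict], min_score: float = 0.4, max_chunks: int = 5) -> str:
    """Turn Retrieve API results into a numbered context block for a prompt."""
    kept = [r for r in retrieval_results if r.get("score", 0.0) >= min_score]
    kept.sort(key=lambda r: r.get("score", 0.0), reverse=True)
    lines = []
    for i, result in enumerate(kept[:max_chunks], start=1):
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "unknown")
        lines.append(f"[{i}] ({uri}) {result['content']['text']}")
    return "\n".join(lines)


# Hand-written results shaped like response["retrievalResults"]
sample_results = [
    {"content": {"text": "Refunds are issued within 30 days."},
     "location": {"type": "S3", "s3Location": {"uri": "s3://my-product-documentation/faqs/refunds.md"}},
     "score": 0.71},
    {"content": {"text": "Our office dog is named Biscuit."},
     "location": {"type": "S3", "s3Location": {"uri": "s3://my-product-documentation/docs/misc.md"}},
     "score": 0.12},
]
context = build_context(sample_results)
print(context)
# Real results would come from:
#   bedrock_agent_rt.retrieve(knowledgeBaseId=kb_id, retrievalQuery={"text": "..."})["retrievalResults"]
```

Filtering on score before prompting keeps weakly related chunks out of the context; the right threshold depends on your embedding model and corpus, so treat 0.4 as a starting point to tune.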
When documents in S3 change, call bedrock_agent.start_ingestion_job() again to re-sync. Bedrock tracks document checksums and only re-embeds changed files, so incremental syncs on a 10,000-document corpus typically complete in under 2 minutes.
Bedrock Agents — Multi-Step Reasoning and Tool Use
Bedrock Agents extends the model beyond a single prompt-response cycle. An Agent can reason over a goal, decide which tools (called action groups) to invoke, call Lambda functions or API endpoints, observe the results, and continue reasoning until the task is complete. This is the Bedrock equivalent of ReAct / function-calling patterns — but the orchestration loop is managed by AWS, not your application code.
The architecture has three parts: (1) the Agent itself, which holds the system prompt and reasoning configuration; (2) Action Groups, which define the tools the agent can call — each action group maps to an OpenAPI schema and a Lambda function; (3) an optional Knowledge Base attachment for retrieval during reasoning.
Creating an Agent with a Lambda Action Group
import boto3
import json
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")
# Step 1: Create the Agent
agent_response = bedrock_agent.create_agent(
agentName="order-support-agent",
agentResourceRoleArn="arn:aws:iam::123456789012:role/AmazonBedrockExecutionRoleForAgents",
foundationModel="anthropic.claude-3-5-sonnet-20241022-v2:0",
instruction="""You are an order support agent for Techoral Store.
When a customer asks about their order:
1. Use get_order_status to look up the current status.
2. If the order is delayed, use check_carrier_tracking for real-time carrier data.
3. If the customer wants to cancel, use initiate_cancellation.
Always be polite and concise. Never make up order information.""",
idleSessionTTLInSeconds=600,
)
agent_id = agent_response["agent"]["agentId"]
print(f"Agent ID: {agent_id}")
# Step 2: Define the action group OpenAPI schema
# This schema tells the Agent what functions are available and their parameters
openapi_schema = {
"openapi": "3.0.0",
"info": {"title": "Order Support API", "version": "1.0.0"},
"paths": {
"/get_order_status": {
"get": {
"operationId": "get_order_status",
"description": "Retrieve the current status of a customer order",
"parameters": [
{
"name": "order_id",
"in": "query",
"required": True,
"description": "The order ID (format: ORD-XXXXXX)",
"schema": {"type": "string"},
}
],
"responses": {"200": {"description": "Order status returned"}},
}
},
"/initiate_cancellation": {
"post": {
"operationId": "initiate_cancellation",
"description": "Initiate cancellation of an order",
"requestBody": {
"required": True,
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"order_id": {"type": "string"},
"reason": {"type": "string"},
},
"required": ["order_id", "reason"],
}
}
},
},
"responses": {"200": {"description": "Cancellation initiated"}},
}
},
},
}
# Step 3: Attach the action group to the agent
bedrock_agent.create_agent_action_group(
agentId=agent_id,
agentVersion="DRAFT",
actionGroupName="order-operations",
actionGroupExecutor={
"lambda": "arn:aws:lambda:us-east-1:123456789012:function:order-support-handler"
},
apiSchema={
"payload": json.dumps(openapi_schema),
},
description="Functions for looking up and managing customer orders",
)
# Step 4: Prepare the agent (compiles the instruction + action groups)
bedrock_agent.prepare_agent(agentId=agent_id)
print("Agent prepared — waiting for status Active...")
Lambda Handler for Action Groups
When the Agent decides to call one of the actions, Bedrock invokes the Lambda with a structured event. Because this action group is defined with an OpenAPI schema, the event identifies the operation by apiPath and httpMethod; query parameters arrive in parameters, and JSON body fields arrive under requestBody. Your handler extracts those values, runs the business logic, and returns a result in the response structure Bedrock expects for API-schema action groups.
import json

def lambda_handler(event, context):
    """Lambda handler for a Bedrock Agent action group defined by an OpenAPI schema."""
    action_group = event.get("actionGroup")
    api_path = event.get("apiPath")
    http_method = event.get("httpMethod")
    # Query parameters arrive as a list of {name, type, value} dicts
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    # JSON request-body fields arrive under requestBody in the same list form
    body_props = (
        event.get("requestBody", {})
        .get("content", {})
        .get("application/json", {})
        .get("properties", [])
    )
    params.update({p["name"]: p["value"] for p in body_props})
    print(f"Agent calling: {action_group}::{http_method} {api_path} with {params}")
    status_code = 200
    if api_path == "/get_order_status":
        # --- Your real business logic here ---
        result = {
            "order_id": params.get("order_id"),
            "status": "SHIPPED",
            "estimated_delivery": "2026-06-10",
            "carrier": "FedEx",
            "tracking_number": "7749283746271",
        }
    elif api_path == "/initiate_cancellation":
        # --- Call your order management system ---
        order_id = params.get("order_id")
        result = {
            "success": True,
            "cancellation_id": "CXL-998877",
            "message": f"Order {order_id} cancellation initiated. Refund in 3–5 business days.",
        }
    else:
        status_code = 404
        result = {"error": f"Unknown API path: {api_path}"}
    # Response structure for action groups defined with an API schema
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": action_group,
            "apiPath": api_path,
            "httpMethod": http_method,
            "httpStatusCode": status_code,
            "responseBody": {"application/json": {"body": json.dumps(result)}},
        },
    }
Invoking the Agent in Your Application
import boto3
import uuid
bedrock_agent_rt = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
session_id = str(uuid.uuid4()) # Maintain session state across turns
response = bedrock_agent_rt.invoke_agent(
agentId=agent_id,
agentAliasId="TSTALIASID", # Use "TSTALIASID" for the DRAFT alias
sessionId=session_id,
inputText="My order ORD-447829 hasn't arrived. It was supposed to be here 3 days ago.",
enableTrace=True, # Returns the agent's reasoning steps
)
# Stream the agent's response (it may invoke multiple tools before answering)
event_stream = response["completion"]
for event in event_stream:
    if "chunk" in event:
        chunk = event["chunk"]
        print(chunk["bytes"].decode(), end="", flush=True)
    elif "trace" in event:
        trace = event["trace"]["trace"]
        if "orchestrationTrace" in trace:
            orch = trace["orchestrationTrace"]
            if "invocationInput" in orch:
                inv = orch["invocationInput"]
                if inv.get("invocationType") == "ACTION_GROUP":
                    ag_inv = inv["actionGroupInvocationInput"]
                    # OpenAPI-schema action groups report verb and apiPath rather than function
                    target = ag_inv.get("function") or f"{ag_inv.get('verb', '')} {ag_inv.get('apiPath', '')}".strip()
                    print(f"\n[Agent calling: {target}({ag_inv.get('parameters', [])})]")
Fine-Tuning and Continued Pre-Training on Bedrock
Fine-tuning adapts a foundation model to your specific domain, tone, or task format. Bedrock supports two customisation modes on eligible base models (currently Amazon Titan, Cohere Command, and Meta Llama): Fine-tuning (supervised, uses labelled prompt-completion pairs) and Continued Pre-Training (unsupervised, uses raw domain text to shift the model's knowledge distribution without requiring explicit labels).
When to Fine-Tune vs Prompt Engineer
Fine-tuning is expensive in time and data preparation. Before committing, try these cheaper alternatives in order: (1) better system prompts with examples (few-shot), (2) Bedrock Knowledge Bases for factual grounding, (3) model parameter tuning (temperature, top-p). Only fine-tune when you have 1,000+ high-quality labelled examples and consistent quality requirements that prompting alone cannot meet.
Preparing Fine-Tuning Data
Fine-tuning data must be in JSONL format, one example per line. Each line is a JSON object with a prompt and completion field. Upload to S3 before starting the job.
{"prompt": "Classify the sentiment of this support ticket: 'The product broke on day 1 and support is unresponsive.'", "completion": "NEGATIVE"}
{"prompt": "Classify the sentiment of this support ticket: 'Setup was seamless and the team loves it!'", "completion": "POSITIVE"}
{"prompt": "Classify the sentiment of this support ticket: 'Works as expected, nothing special.'", "completion": "NEUTRAL"}
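Malformed lines are a common cause of failed customisation jobs, so it can help to validate examples before uploading them. The sketch below checks that every example has a non-empty prompt and completion and serialises the set as JSONL. The helper name to_jsonl and the sample tickets are illustrative assumptions.

```python
import json
from typing import Iterable, List


def to_jsonl(examples: Iterable[dict]) -> str:
    """Validate prompt/completion pairs and serialise them one JSON object per line."""
    lines: List[str] = []
    for i, ex in enumerate(examples, start=1):
        for field in ("prompt", "completion"):
            value = ex.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"example {i}: '{field}' must be a non-empty string")
        lines.append(json.dumps({"prompt": ex["prompt"], "completion": ex["completion"]}))
    return "\n".join(lines) + "\n"


examples = [
    {"prompt": "Classify the sentiment of this support ticket: 'Great onboarding.'", "completion": "POSITIVE"},
    {"prompt": "Classify the sentiment of this support ticket: 'Still waiting on a refund.'", "completion": "NEGATIVE"},
]
jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")
# Write jsonl_text to a file and upload it to S3 before starting the job
```

Using json.dumps for each line also takes care of escaping quotes and newlines inside ticket text, which hand-built JSONL often gets wrong.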
Starting a Fine-Tuning Job
import boto3
import json
bedrock = boto3.client("bedrock", region_name="us-east-1")
response = bedrock.create_model_customization_job(
jobName="support-ticket-classifier-v1",
customModelName="support-classifier-titan",
roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
baseModelIdentifier="amazon.titan-text-express-v1",
customizationType="FINE_TUNING",
trainingDataConfig={
"s3Uri": "s3://my-training-data/fine-tune/train.jsonl",
},
validationDataConfig={
"validators": [{"s3Uri": "s3://my-training-data/fine-tune/val.jsonl"}]
},
outputDataConfig={
"s3Uri": "s3://my-training-data/fine-tune/output/",
},
hyperParameters={
"epochCount": "5",
"batchSize": "8",
"learningRate": "0.00005",
},
)
job_arn = response["jobArn"]
print(f"Fine-tuning job ARN: {job_arn}")
# Poll for completion
import time
while True:
    status = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    job_status = status["status"]
    print(f"Status: {job_status}")
    if job_status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
if job_status == "Completed":
    custom_model_arn = status["outputModelArn"]
    print(f"Custom model ARN: {custom_model_arn}")
    # Custom models generally cannot be invoked on demand with this ARN directly:
    # purchase Provisioned Throughput for the model, then pass the resulting
    # provisioned model ARN as modelId to invoke_model
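Depending on the base model, a custom model may need Provisioned Throughput before invoke_model will accept it. A minimal sketch of that step follows, assuming the bedrock client's create_provisioned_model_throughput call. The helper name, the provisioned model name, and the fake client are illustrative assumptions so the example runs without AWS access.

```python
def provision_custom_model(bedrock_client, custom_model_arn: str, name: str, model_units: int = 1) -> str:
    """Purchase Provisioned Throughput for a custom model and return its ARN."""
    response = bedrock_client.create_provisioned_model_throughput(
        provisionedModelName=name,
        modelId=custom_model_arn,
        modelUnits=model_units,
        # commitmentDuration is omitted here: hourly billing with no term commitment
    )
    return response["provisionedModelArn"]


class _FakeBedrock:
    """Stand-in for boto3.client("bedrock") so the sketch runs without AWS access."""

    def create_provisioned_model_throughput(self, **kwargs):
        self.last_request = kwargs
        return {"provisionedModelArn": "arn:aws:bedrock:us-east-1:123456789012:provisioned-model/example"}


fake = _FakeBedrock()
provisioned_arn = provision_custom_model(
    fake,
    "arn:aws:bedrock:us-east-1:123456789012:custom-model/example",
    name="support-classifier-pt",
)
print(provisioned_arn)
# Pass provisioned_arn as modelId to bedrock-runtime invoke_model once it is in service
```

Provisioned Throughput bills by the hour from the moment it is created, so delete it when you finish testing.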
Guardrails — Content Filtering, PII Redaction, and Topic Denial
Bedrock Guardrails is a moderation layer you attach to any model invocation. It intercepts both the user's input and the model's output and applies a configurable set of policies: filter harmful content (hate, violence, sexual, self-harm, insults), block topics you define (e.g., "competitor products", "legal advice"), redact PII (names, SSNs, credit card numbers, email addresses), and enforce word-level deny lists. Guardrails work with all Bedrock models and also with Knowledge Bases and Agents — a single guardrail config can protect your entire generative AI surface.
import boto3
import json
bedrock = boto3.client("bedrock", region_name="us-east-1")
# Create a guardrail
guardrail_response = bedrock.create_guardrail(
name="production-safety-guardrail",
description="Safety guardrail for customer-facing chatbot",
# Block harmful content categories
contentPolicyConfig={
"filtersConfig": [
{"type": "SEXUAL", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "VIOLENCE", "inputStrength": "MEDIUM", "outputStrength": "HIGH"},
{"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "INSULTS", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
{"type": "MISCONDUCT","inputStrength": "HIGH", "outputStrength": "HIGH"},
]
},
# Deny specific topics entirely
topicPolicyConfig={
"topicsConfig": [
{
"name": "legal-advice",
"definition": "Requests for specific legal advice, interpretation of contracts, or legal representation",
"examples": [
"Can I sue my employer for this?",
"Is my NDA enforceable?",
],
"type": "DENY",
},
{
"name": "competitor-comparison",
"definition": "Questions asking to compare Techoral products against specific named competitors",
"examples": ["How does Techoral compare to Competitor X?"],
"type": "DENY",
},
]
},
# Redact PII from inputs and outputs
sensitiveInformationPolicyConfig={
"piiEntitiesConfig": [
{"type": "EMAIL", "action": "ANONYMIZE"},
{"type": "PHONE", "action": "ANONYMIZE"},
{"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
{"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
{"type": "NAME", "action": "ANONYMIZE"},
]
},
# Custom word blocklist
wordPolicyConfig={
"wordsConfig": [
{"text": "competitors-product-name"},
],
"managedWordListsConfig": [{"type": "PROFANITY"}],
},
blockedInputMessaging="I cannot process that request. Please rephrase your question.",
blockedOutputsMessaging="I cannot provide that information. How else can I help you?",
)
guardrail_id = guardrail_response["guardrailId"]
guardrail_version = guardrail_response["version"]
print(f"Guardrail ID: {guardrail_id}, Version: {guardrail_version}")
# Apply guardrail to a model invocation
client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 512,
"messages": [{"role": "user", "content": "My email is john@example.com. What is your refund policy?"}],
}),
contentType="application/json",
accept="application/json",
guardrailIdentifier=guardrail_id,
guardrailVersion=guardrail_version,
trace="ENABLED",
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
# PII in the input is anonymized before reaching the model
# "My email is [EMAIL]. What is your refund policy?"
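Guardrails can also be evaluated on their own, without a model call, through the bedrock-runtime apply_guardrail operation. This is useful for screening user input before it reaches any other system. The sketch below wraps that call; the helper name, the guardrail ID, and the canned response returned by the fake client (including the {EMAIL} placeholder) are assumptions for illustration.

```python
from typing import Tuple


def check_text(runtime_client, guardrail_id: str, guardrail_version: str,
               text: str, source: str = "INPUT") -> Tuple[bool, str]:
    """Run text through a guardrail; return (intervened, text_to_use)."""
    response = runtime_client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source=source,  # "INPUT" for user text, "OUTPUT" for model text
        content=[{"text": {"text": text}}],
    )
    intervened = response.get("action") == "GUARDRAIL_INTERVENED"
    outputs = response.get("outputs", [])
    # When the guardrail intervenes, outputs holds the masked text or the blocked message
    replaced = outputs[0]["text"] if intervened and outputs else text
    return intervened, replaced


class _FakeRuntime:
    """Stand-in for boto3.client("bedrock-runtime"); returns a canned masked response."""

    def apply_guardrail(self, **kwargs):
        return {
            "action": "GUARDRAIL_INTERVENED",
            "outputs": [{"text": "My email is {EMAIL}. What is your refund policy?"}],
        }


intervened, safe_text = check_text(
    _FakeRuntime(), "gr-example", "DRAFT",
    "My email is john@example.com. What is your refund policy?",
)
print(intervened, safe_text)
```

In production you would pass the real bedrock-runtime client together with guardrail_id and guardrail_version from the code above.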
Bedrock publishes guardrail invocation and intervention counts, along with per-policy breakdowns, to CloudWatch. Create a CloudWatch Dashboard showing daily intervention rates per guardrail policy type; spikes indicate either prompt injection attempts or legitimate user confusion about what the chatbot can help with.
Prompt Management — Prompt Flows and Versioning
Bedrock Prompt Management lets you store, version, and deploy prompts as first-class resources, separate from application code. This solves a common problem: prompt changes are made ad hoc by engineers, are not reviewed or tracked, and production regressions are difficult to diagnose. With Prompt Management, every prompt has a unique ARN and a version history, and your application can reference a specific immutable version (for example, one version for production and a newer one for staging) instead of hardcoding prompt text.
import boto3
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")
# Create a versioned prompt
prompt_response = bedrock_agent.create_prompt(
name="support-chat-system-prompt",
description="System prompt for the customer support chatbot",
variants=[
{
"name": "default",
"templateType": "TEXT",
"templateConfiguration": {
"text": {
"text": """You are a friendly and knowledgeable support agent for {{company_name}}.
Your goals:
- Resolve customer issues on the first interaction
- Be empathetic and solution-focused
- Escalate to a human agent if the issue cannot be resolved in 3 turns
- Never discuss pricing without checking the current rate card
Current date: {{current_date}}
Agent name: {{agent_name}}""",
"inputVariables": [
{"name": "company_name"},
{"name": "current_date"},
{"name": "agent_name"},
],
}
},
"modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0",
"inferenceConfiguration": {
"text": {"temperature": 0.3, "maxTokens": 2048}
},
}
],
)
prompt_id = prompt_response["id"]
print(f"Prompt ID: {prompt_id}")
# Create a version (immutable snapshot)
version_response = bedrock_agent.create_prompt_version(
promptIdentifier=prompt_id,
description="Version 1 — initial production release",
)
version_number = version_response["version"]
print(f"Prompt version: {version_number}")
# Invoke the model via the Converse API (the system prompt is hardcoded here for brevity)
client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.converse(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
system=[
{
"text": "You are a support agent for Techoral. Be helpful and concise."
# In practice, fetch prompt text from Bedrock Prompt Management
# and substitute variables before passing here
}
],
messages=[{"role": "user", "content": [{"text": "How do I reset my password?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
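The code comment above says to fetch the prompt text from Prompt Management and substitute variables before calling Converse. A minimal sketch of that step might look like the following. The helper names `render_template` and `load_system_prompt` are hypothetical, and the sketch assumes the prompt has a single TEXT variant, as in the example above, and that `get_prompt` returns the template under `variants[0]["templateConfiguration"]["text"]["text"]`.

```python
import re
from typing import Any, Dict

# Matches {{variable_name}} placeholders as used in the prompt template.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, variables: Dict[str, str]) -> str:
    """Replace every {{name}} placeholder; fail loudly if a value is missing."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Missing value for prompt variable: {name}")
        return variables[name]

    return _PLACEHOLDER.sub(substitute, template)


def load_system_prompt(
    bedrock_agent: Any, prompt_id: str, version: str, variables: Dict[str, str]
) -> str:
    """Fetch one prompt version and fill in its variables.

    `bedrock_agent` is expected to be boto3.client("bedrock-agent").
    Assumes the first variant is the TEXT template we want.
    """
    prompt = bedrock_agent.get_prompt(
        promptIdentifier=prompt_id, promptVersion=version
    )
    template = prompt["variants"][0]["templateConfiguration"]["text"]["text"]
    return render_template(template, variables)
```

The returned string can then be passed as the `text` of the `system` block in `converse`. Raising on a missing variable is a deliberate choice here: a prompt that silently ships with a literal `{{agent_name}}` in it is harder to spot than a failed request.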
Bedrock Flows (originally launched as Prompt Flows) goes further: it provides a visual, no-code pipeline builder where you chain model invocations, knowledge base retrievals, Lambda calls, and conditional branching. A flow is defined as a directed acyclic graph and can be tested, versioned, and deployed behind a stable alias that your application invokes via invoke_flow.
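As a rough sketch of what calling a deployed flow could look like, the helper below invokes a flow alias and collects the output documents from the response event stream. The function name `run_flow` is hypothetical, the flow and alias IDs are placeholders you would supply, and the input node name `FlowInputNode` is an assumption that matches the default name in console-built flows; adjust it if your flow names its input node differently.

```python
from typing import Any, List


def run_flow(client: Any, flow_id: str, alias_id: str, text: str) -> List[Any]:
    """Invoke a flow alias and return every output document it emits.

    `client` is expected to be boto3.client("bedrock-agent-runtime").
    """
    response = client.invoke_flow(
        flowIdentifier=flow_id,
        flowAliasIdentifier=alias_id,
        inputs=[
            {
                "nodeName": "FlowInputNode",   # assumed default input node name
                "nodeOutputName": "document",
                "content": {"document": text},
            }
        ],
    )

    outputs: List[Any] = []
    # The response is an event stream: output events carry the results,
    # and a completion event reports how the run ended.
    for event in response["responseStream"]:
        if "flowOutputEvent" in event:
            outputs.append(event["flowOutputEvent"]["content"]["document"])
        elif "flowCompletionEvent" in event:
            reason = event["flowCompletionEvent"]["completionReason"]
            if reason != "SUCCESS":
                raise RuntimeError(f"Flow did not complete successfully: {reason}")
    return outputs
```

A typical call would be `run_flow(boto3.client("bedrock-agent-runtime", region_name="us-east-1"), flow_id, alias_id, "How do I reset my password?")`. Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub in unit tests.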
Cost Optimisation — On-Demand vs Provisioned Throughput
Bedrock pricing has two modes. Understanding both is essential to avoiding bill shock and choosing the right architecture for your traffic profile.
On-Demand Pricing
You pay per token processed — input tokens and output tokens priced separately. There is no minimum commitment, no idle cost, and no capacity reservation. This is the right choice for variable or unpredictable workloads, prototypes, and applications with bursts of traffic separated by long periods of inactivity.
| Model | Input (per 1K tokens) | Output (per 1K tokens) | Notes |
|---|---|---|---|
| Claude 3.5 Sonnet | $0.003 | $0.015 | Best quality/cost for reasoning tasks |
| Claude 3 Haiku | $0.00025 | $0.00125 | 12x cheaper than Sonnet, good for classification/extraction |
| Llama 3.1 70B Instruct | $0.00265 | $0.0035 | Strong open-weight alternative |
| Llama 3.1 8B Instruct | $0.00022 | $0.00022 | Ultra-low cost, fast, for simple tasks |
| Titan Text Express | $0.0008 | $0.0016 | AWS-native, no third-party agreement |
| Titan Embeddings V2 | $0.00002 | N/A | Per token for embedding generation |
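To make the table concrete, here is a small worked calculation. It is only a sketch: the prices are copied from the table above and will drift over time, and the workload (one million requests a month, 500 input and 200 output tokens each) is a made-up example.

```python
# Per-1K-token prices (input, output) in USD, copied from the table above.
PRICES_PER_1K = {
    "Claude 3.5 Sonnet": (0.003, 0.015),
    "Claude 3 Haiku": (0.00025, 0.00125),
    "Llama 3.1 70B Instruct": (0.00265, 0.0035),
    "Llama 3.1 8B Instruct": (0.00022, 0.00022),
    "Titan Text Express": (0.0008, 0.0016),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the on-demand cost in USD for the given token volumes."""
    input_price, output_price = PRICES_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


if __name__ == "__main__":
    # Hypothetical workload: 1M requests/month, 500 input + 200 output tokens each.
    requests = 1_000_000
    for model in PRICES_PER_1K:
        monthly = estimate_cost(model, requests * 500, requests * 200)
        print(f"{model:<24} ${monthly:,.2f} per month")
```

For this workload the estimate comes to about $4,500 a month on Claude 3.5 Sonnet and about $375 on Claude 3 Haiku, which is the 12x gap noted in the table.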
Provisioned Throughput
Provisioned Throughput reserves Model Units (MUs) — each MU guarantees a fixed number of tokens per minute. You pay per hour regardless of whether you use the full throughput. The break-even point compared to on-demand is typically around 60–70% sustained utilisation. Provisioned Throughput is the right choice for high-volume, consistent workloads where predictable latency matters and usage is above the on-demand break-even.
import json

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Purchase provisioned throughput: 1 model unit, 1-month commitment
pt_response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="production-claude-haiku",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    modelUnits=1,  # Tokens/minute per MU varies by model; check the Bedrock console for current figures
    commitmentDuration="OneMonth",  # "OneMonth" or "SixMonths"; the longer term has a lower hourly rate
)
provisioned_model_arn = pt_response["provisionedModelArn"]
print(f"Provisioned model ARN: {provisioned_model_arn}")

# Invoke using the provisioned model ARN instead of the base model ID.
# The provisioned model must have finished provisioning before this call succeeds.
client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.invoke_model(
    modelId=provisioned_model_arn,  # Use provisioned ARN for guaranteed throughput
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Classify this ticket: billing issue"}],
    }),
    contentType="application/json",
    accept="application/json",
)
print(json.loads(response["body"].read())["content"][0]["text"])
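The break-even point mentioned above can be estimated with simple arithmetic: divide the hourly price of one Model Unit by what the same tokens would cost on demand if the unit ran flat out for an hour. The sketch below shows the calculation only. Every number in the example call is hypothetical, since the actual hourly MU price and per-MU throughput depend on the model and commitment term and should be taken from the AWS pricing page or console.

```python
def break_even_utilisation(
    hourly_mu_price: float,
    mu_tokens_per_minute: float,
    on_demand_price_per_1k: float,
) -> float:
    """Fraction of MU capacity you must sustain for provisioned to match on-demand.

    on_demand_price_per_1k is a blended input/output price per 1K tokens.
    """
    tokens_per_hour_at_full_load = mu_tokens_per_minute * 60
    on_demand_cost_at_full_load = tokens_per_hour_at_full_load / 1000 * on_demand_price_per_1k
    return hourly_mu_price / on_demand_cost_at_full_load


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    utilisation = break_even_utilisation(
        hourly_mu_price=20.0,           # USD per MU per hour (made up)
        mu_tokens_per_minute=500_000,   # MU capacity (made up)
        on_demand_price_per_1k=0.001,   # blended on-demand price (made up)
    )
    print(f"Break-even at {utilisation:.0%} sustained utilisation")
```

With these made-up inputs, a fully loaded MU would process $30 of on-demand traffic per hour, so the $20 hourly fee breaks even at roughly 67% sustained utilisation. A result above 1.0 means provisioned capacity never pays for itself on cost alone.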
Token Counting Before Invocation
The Bedrock Runtime API includes a CountTokens operation (count_tokens in boto3) that returns the input token count without invoking the model. It accepts the same request shape as Converse or InvokeModel, and is only available for some models, so check support for the model you use. Use it in batch pipelines to pre-filter large inputs that would exceed model context windows or budget thresholds before paying for the full invocation.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"
MAX_TOKENS = 100_000  # Input token budget for this pipeline

# Count tokens without invoking the model (no charge)
messages = [
    {"role": "user", "content": [{"text": "Summarise this 50-page document: " + "x" * 50000}]}
]
token_count_response = client.count_tokens(
    modelId=MODEL_ID,
    input={"converse": {"messages": messages}},  # Same shape as a Converse request
)
total_tokens = token_count_response["inputTokens"]
print(f"Input token count: {total_tokens}")

if total_tokens > MAX_TOKENS:
    print(f"Input exceeds {MAX_TOKENS} token budget; truncating or chunking required")
else:
    # Safe to invoke
    response = client.converse(
        modelId=MODEL_ID,
        messages=messages,
    )
    print(response["output"]["message"]["content"][0]["text"])
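For the batch pre-filtering use case described above, the same check can be wrapped in a helper that sorts documents into those within budget and those that need chunking. This is a sketch under assumptions: the function name `split_by_token_budget` is hypothetical, and it assumes `count_tokens` takes a Converse-shaped request under `input={"converse": ...}` and returns the count in `inputTokens`.

```python
from typing import Any, List, Tuple


def split_by_token_budget(
    client: Any, model_id: str, documents: List[str], max_input_tokens: int
) -> Tuple[List[str], List[str]]:
    """Partition documents into (within_budget, over_budget) by input token count.

    `client` is expected to be boto3.client("bedrock-runtime").
    """
    within_budget: List[str] = []
    over_budget: List[str] = []
    for document in documents:
        result = client.count_tokens(
            modelId=model_id,
            input={
                "converse": {
                    "messages": [{"role": "user", "content": [{"text": document}]}]
                }
            },
        )
        if result["inputTokens"] <= max_input_tokens:
            within_budget.append(document)
        else:
            over_budget.append(document)
    return within_budget, over_budget
```

Documents in the second list can then be chunked or routed to a Knowledge Base instead of being sent whole.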
(3) Set max_tokens to the minimum needed, because you are charged for every output token generated. (4) Use Knowledge Bases to reduce context window usage instead of stuffing full documents into prompts. (5) Monitor token usage per model using the CloudWatch metrics InputTokenCount and OutputTokenCount in the AWS/Bedrock namespace.
Frequently Asked Questions
Does Bedrock train on my data?
No. AWS explicitly states that data you send to Bedrock for model inference is not used to train or improve foundation models. Your prompts and completions are not stored beyond the session unless you explicitly enable logging to S3 via Bedrock Model Invocation Logging. This is a key differentiator from some third-party AI API providers and is backed by AWS data processing agreements (DPA) and SOC 2 / ISO 27001 compliance certifications.
Which Bedrock model should I use for my chatbot?
For general-purpose customer-facing chatbots, Claude 3.5 Sonnet offers the best combination of quality, instruction-following, and safety. For high-volume classification, routing, or extraction tasks where cost per call matters more than maximum capability, Claude 3 Haiku cuts costs by 12x with acceptable quality degradation. For open-weight requirements (compliance, audit, local deployment fallback), Llama 3.1 70B is the strongest option. Run a model evaluation using Bedrock's built-in Model Evaluation feature — it benchmarks multiple models against your specific test dataset before you commit to one.
What is the context window limit for Bedrock models?
Context limits vary by model: Claude 3.5 Sonnet supports 200K tokens (~150,000 words), Llama 3.1 405B supports 128K tokens, Titan Text Express supports 8K tokens, and Mistral Large supports 32K tokens. For documents exceeding the context limit, use Bedrock Knowledge Bases to retrieve only relevant chunks rather than stuffing the entire document into the prompt — this also reduces cost significantly.
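A quick pre-check against these limits can be sketched as follows. The limits are the ones quoted in this answer, and the words-to-tokens ratio is derived from the "200K tokens is about 150,000 words" figure above, so the result is only a rough estimate for English text. The function name and the default output reservation are hypothetical.

```python
import math

# Context window sizes in tokens, as quoted in the answer above.
CONTEXT_WINDOW_TOKENS = {
    "Claude 3.5 Sonnet": 200_000,
    "Llama 3.1 405B": 128_000,
    "Titan Text Express": 8_000,
    "Mistral Large": 32_000,
}

# 150,000 words per 200,000 tokens: a rough heuristic, not a tokenizer.
WORDS_PER_TOKEN = 0.75


def fits_in_context(model: str, text: str, reserved_output_tokens: int = 1024) -> bool:
    """Estimate whether `text` plus room for the reply fits the model's window."""
    estimated_input_tokens = math.ceil(len(text.split()) / WORDS_PER_TOKEN)
    return estimated_input_tokens + reserved_output_tokens <= CONTEXT_WINDOW_TOKENS[model]
```

When the check fails, or lands close to the limit, that is the signal to retrieve relevant chunks from a Knowledge Base rather than sending the whole document.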
How do I use Bedrock with LangChain or LlamaIndex?
Both LangChain and LlamaIndex have native Bedrock integrations. In LangChain, use from langchain_aws import ChatBedrock and pass the model_id and region_name. In LlamaIndex, use from llama_index.llms.bedrock import Bedrock. Both abstractions sit on top of the boto3 client, so they inherit your IAM role's Bedrock permissions automatically — no separate API keys required.
Can I run Bedrock in a private VPC with no internet access?
Yes. Create a VPC endpoint for com.amazonaws.us-east-1.bedrock-runtime (Interface endpoint). Once the endpoint is active, all Bedrock API traffic stays on the AWS private network — it never traverses the public internet. Combine this with SCPs that deny Bedrock calls not made through the VPC endpoint for defence-in-depth. This configuration is required for many financial services and government compliance frameworks.
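The two pieces described above can be sketched as plain data: the parameters for the Interface endpoint and a deny policy keyed on the `aws:SourceVpce` condition. The function names, the statement ID, and all resource IDs are hypothetical placeholders. The policy only covers the invocation actions; whether to extend it to other Bedrock actions is a decision for your own environment.

```python
import json
from typing import Any, Dict, List


def bedrock_endpoint_params(
    region: str, vpc_id: str, subnet_ids: List[str], security_group_ids: List[str]
) -> Dict[str, Any]:
    """Build keyword arguments for ec2.create_vpc_endpoint(**params)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.bedrock-runtime",
        "SubnetIds": subnet_ids,
        "SecurityGroupIds": security_group_ids,
        "PrivateDnsEnabled": True,  # lets the default SDK endpoint resolve privately
    }


def deny_outside_vpce_policy(vpc_endpoint_id: str) -> Dict[str, Any]:
    """Policy document that denies model invocation unless it arrives via the endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBedrockInvokeOutsideVpcEndpoint",
                "Effect": "Deny",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": "*",
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder IDs for illustration.
    print(json.dumps(deny_outside_vpce_policy("vpce-0123456789abcdef0"), indent=2))
```

The endpoint parameters can be passed to `boto3.client("ec2").create_vpc_endpoint(**params)`, and the policy document can be attached as an SCP once the endpoint ID is known.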