AWS Comprehend: Natural Language Processing API for Text Analysis (2026)
Amazon Comprehend is a fully managed Natural Language Processing (NLP) service that uses machine learning to find insights and relationships in text — with no ML expertise required. From sentiment analysis and entity recognition to custom classifiers and topic modeling, Comprehend lets you extract structured meaning from unstructured text at any scale, paying only for what you use.
In 2026, NLP is no longer a specialist skill: it is table stakes for any application that touches user-generated content, support tickets, legal documents, medical records, or social media. Comprehend removes the infrastructure burden and model-training complexity, giving developers a set of pre-trained and customizable APIs that can be called from Python through boto3. This guide covers every major feature with working code, cost notes, and a production-ready integration pattern.
1. What is Amazon Comprehend?
Amazon Comprehend is part of the AWS AI/ML services family alongside Rekognition (image/video analysis) and SageMaker (custom model training). Unlike SageMaker, Comprehend requires zero model building: AWS maintains and updates the underlying models.
Built-in vs. Custom NLP
Comprehend offers two tiers of capability:
- Built-in APIs — Pre-trained, instantly available, no training data needed. Cover sentiment, entities, key phrases, language detection, PII, syntax, and topic modeling.
- Custom models — You supply labeled training data (CSV or augmented manifest). Comprehend fine-tunes a model specific to your domain. Covers custom classification (multi-class, multi-label) and custom entity recognition.
Core Use Cases
- Customer support routing — classify tickets by topic and urgency before they hit an agent queue
- Social media monitoring — real-time sentiment tracking across product mentions
- Document intelligence — extract entities, key phrases, and relationships from contracts, invoices, or research papers
- Content moderation — detect PII before storing user-generated content
- Healthcare analytics — extract diagnoses, medications, and procedures via Comprehend Medical
- Compliance — redact sensitive information from documents at scale
Prerequisites: boto3 installed (pip install boto3), AWS credentials configured (aws configure), and an IAM role or user with comprehend:* permissions. See the IAM Roles and Policies guide for setup details.
2. Sentiment Analysis
Sentiment analysis classifies text as POSITIVE, NEGATIVE, NEUTRAL, or MIXED, and returns confidence scores for each label. It is one of the most heavily used Comprehend APIs — ideal for customer reviews, social posts, and survey responses.
Basic detect_sentiment() Call
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
response = comprehend.detect_sentiment(
    Text="The new AWS Comprehend API is incredibly fast and easy to use. Highly recommended!",
    LanguageCode='en'
)
print(f"Sentiment: {response['Sentiment']}")
print(f"Scores: {response['SentimentScore']}")
# Sentiment: POSITIVE
# Scores: {'Positive': 0.9876, 'Negative': 0.0012, 'Neutral': 0.0089, 'Mixed': 0.0023}
Response Format
The response object contains:
- Sentiment: string, one of POSITIVE | NEGATIVE | NEUTRAL | MIXED
- SentimentScore: dict with four float values summing to ~1.0
- ResponseMetadata: standard AWS HTTP metadata
Batch Processing with batch_detect_sentiment()
For throughput, use the batch API to process up to 25 documents per call, significantly reducing round-trip latency.
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
reviews = [
    "Product arrived damaged. Very disappointed.",
    "Exactly what I needed. Fast shipping too!",
    "It was okay, nothing special but does the job.",
    "Terrible customer support, waited 3 days for a reply.",
    "Outstanding quality. Would buy again without hesitation."
]
# batch_detect_sentiment accepts a list of text strings + language
response = comprehend.batch_detect_sentiment(
    TextList=reviews,
    LanguageCode='en'
)
for item in response['ResultList']:
    print(f"[{item['Index']}] {item['Sentiment']} - "
          f"Positive: {item['SentimentScore']['Positive']:.3f}")
# Check for any errors
if response['ErrorList']:
    for err in response['ErrorList']:
        print(f"Error at index {err['Index']}: {err['ErrorMessage']}")
For larger datasets, use start_sentiment_detection_job() to run an asynchronous batch job against S3 input files. Results are written back to S3 as line-delimited JSON.
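When a workload has more than 25 documents but does not justify an asynchronous job, the list can be split into batches client-side. The following is a minimal sketch, not a definitive implementation: `chunked` and `batch_sentiment` are hypothetical helper names, and `client` is assumed to be a boto3 Comprehend client (or any object exposing `batch_detect_sentiment`).

```python
from typing import Iterator, List, Optional

MAX_BATCH = 25  # per-call document limit for batch_detect_sentiment


def chunked(items: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_sentiment(client, texts: List[str], language_code: str = "en") -> List[Optional[dict]]:
    """Return one result per input text, in input order (None where the API reported an error)."""
    results: List[Optional[dict]] = [None] * len(texts)
    for batch_no, batch in enumerate(chunked(texts)):
        response = client.batch_detect_sentiment(TextList=batch, LanguageCode=language_code)
        for item in response["ResultList"]:
            # Index is relative to the batch, so shift it back to the global position
            results[batch_no * MAX_BATCH + item["Index"]] = item
    return results
```

Entries left as `None` correspond to documents that appeared in `ErrorList`, so the caller can retry or log them by position.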
3. Entity Recognition
Entity recognition (Named Entity Recognition / NER) identifies and classifies named entities in text. Comprehend recognizes 9 built-in entity types: COMMERCIAL_ITEM, DATE, EVENT, LOCATION, ORGANIZATION, OTHER, PERSON, QUANTITY, and TITLE.
import boto3
import json
comprehend = boto3.client('comprehend', region_name='us-east-1')
text = """
Amazon Web Services announced its new data center in Hyderabad, India on March 15, 2026.
The $3.5 billion facility will be managed by CEO Andy Jassy and serve customers across Asia Pacific.
"""
response = comprehend.detect_entities(
    Text=text,
    LanguageCode='en'
)
for entity in response['Entities']:
    print(f" Type: {entity['Type']:<20} Text: {entity['Text']:<30} Score: {entity['Score']:.3f}")
# Output:
# Type: ORGANIZATION Text: Amazon Web Services Score: 0.998
# Type: LOCATION Text: Hyderabad, India Score: 0.996
# Type: DATE Text: March 15, 2026 Score: 0.999
# Type: QUANTITY Text: $3.5 billion Score: 0.994
# Type: TITLE Text: CEO Score: 0.981
# Type: PERSON Text: Andy Jassy Score: 0.997
Each entity result includes:
- Text: the exact string matched in the source text
- Type: entity category
- Score: confidence from 0 to 1
- BeginOffset / EndOffset: character positions, useful for highlighting in a UI
Store BeginOffset and EndOffset alongside entity results to render inline annotations in document review tools without re-running the API.
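As an illustration of how stored offsets can drive inline annotations, here is a small sketch that works on saved entity records without calling the API again. The `annotate` function and the bracket markup are assumptions made for this example; a real review tool would likely emit HTML spans instead.

```python
from typing import Dict, List


def annotate(text: str, entities: List[Dict]) -> str:
    """Wrap each entity span as [Type: matched text] using its stored offsets."""
    out = text
    # Work right to left so earlier offsets stay valid after each insertion
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = ent["BeginOffset"], ent["EndOffset"]
        out = f"{out[:start]}[{ent['Type']}: {out[start:end]}]{out[end:]}"
    return out


# Records shaped like the items in response['Entities'], as they might be stored
stored = [
    {"Type": "ORGANIZATION", "BeginOffset": 0, "EndOffset": 6},
    {"Type": "LOCATION", "BeginOffset": 19, "EndOffset": 26},
]
print(annotate("Amazon is based in Seattle.", stored))
# [ORGANIZATION: Amazon] is based in [LOCATION: Seattle].
```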
4. Key Phrase Extraction
Key phrase extraction identifies the most important noun phrases in a document. Unlike entity recognition which targets named things, key phrases capture conceptual topics — ideal for automatic document tagging, search index enrichment, and content recommendation.
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
document = """
Kubernetes cluster autoscaling relies on the Horizontal Pod Autoscaler and Cluster Autoscaler
working in tandem. Proper resource requests and limits are essential for predictable scaling
behavior in production environments running microservices workloads.
"""
response = comprehend.detect_key_phrases(
    Text=document,
    LanguageCode='en'
)
# Sort by score descending and print top phrases
phrases = sorted(response['KeyPhrases'], key=lambda x: x['Score'], reverse=True)
for phrase in phrases[:8]:
    print(f" {phrase['Text']:<45} Score: {phrase['Score']:.3f}")
# Output:
# Horizontal Pod Autoscaler Score: 0.999
# Cluster Autoscaler Score: 0.998
# Kubernetes cluster autoscaling Score: 0.997
# production environments Score: 0.994
# predictable scaling behavior Score: 0.991
# microservices workloads Score: 0.988
# resource requests Score: 0.985
# Proper resource requests and limits Score: 0.982
Document Indexing Pipeline
A practical use case: when a new document is uploaded to S3, a Lambda function extracts key phrases and stores them as tags in DynamoDB. Users can then find documents by looking up a phrase in the tag index instead of scanning raw text, which turns each search into a key lookup.
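The tagging step of that pipeline might look roughly like the sketch below. It only shapes the data: `build_tag_items`, the `tag`/`documentId` attribute names, and the 0.9 score cutoff are all assumptions for illustration, and writing the items to DynamoDB is left to the caller.

```python
from typing import Dict, List


def build_tag_items(document_id: str, key_phrases: List[Dict],
                    min_score: float = 0.9, limit: int = 10) -> List[Dict]:
    """Turn detect_key_phrases output into one item per distinct, normalized tag."""
    seen = set()
    items = []
    for phrase in sorted(key_phrases, key=lambda p: p["Score"], reverse=True):
        tag = phrase["Text"].strip().lower()
        if phrase["Score"] < min_score or tag in seen:
            continue  # skip low-confidence phrases and case/whitespace duplicates
        seen.add(tag)
        items.append({"tag": tag, "documentId": document_id})
        if len(items) == limit:
            break
    return items
```

Keying the table on the tag (with the document ID as a sort key) is one possible design that lets a search for a phrase resolve to a single query.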
5. Language Detection
Comprehend can identify the dominant language of a text sample from over 100 languages. This is essential for multi-lingual applications where downstream processing (translation, sentiment analysis) must use the correct language code.
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
samples = [
    "Machine learning is transforming every industry.",  # English
    "El aprendizaje automático transforma cada industria.",  # Spanish
    "机器学习正在改变每个行业。",  # Chinese
    "Das maschinelle Lernen verändert jede Branche.",  # German
    "L'apprentissage automatique transforme chaque secteur.",  # French
]
for text in samples:
    response = comprehend.detect_dominant_language(Text=text)
    top = response['Languages'][0]
    print(f" Lang: {top['LanguageCode']} Score: {top['Score']:.3f} Text: {text[:50]}")
Multi-Language Content Routing
In a customer support system, detect the language first, then route to the appropriate sentiment/entity endpoint with the correct LanguageCode — or trigger an Amazon Translate job before NLP processing. This pattern keeps latency low while supporting a global user base without separate per-language pipelines.
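The routing decision can be sketched as a pure function over the `Languages` list returned by detect_dominant_language. Everything here is an assumption for illustration: the `plan_route` name, the three route labels, the 0.8 confidence cutoff, and the `SENTIMENT_LANGUAGES` set, which should be checked against the current list of languages supported by the API you call.

```python
from typing import Dict, List, Optional, Tuple

# Assumed set of language codes accepted by detect_sentiment; verify before relying on it
SENTIMENT_LANGUAGES = {"en", "es", "fr", "de", "it", "pt", "ar", "hi", "ja", "ko", "zh", "zh-TW"}


def plan_route(languages: List[Dict], supported=SENTIMENT_LANGUAGES,
               min_score: float = 0.8) -> Tuple[str, Optional[str]]:
    """Return (route, language_code) where route is 'analyze', 'translate', or 'review'."""
    if not languages:
        return ("review", None)
    top = max(languages, key=lambda lang: lang["Score"])
    code = top["LanguageCode"]
    if top["Score"] < min_score:
        return ("review", code)      # detection too uncertain to automate
    if code in supported:
        return ("analyze", code)     # call Comprehend directly with this LanguageCode
    return ("translate", code)       # translate first, then analyze
```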
6. PII Detection and Redaction
PII (Personally Identifiable Information) detection finds sensitive data like names, addresses, phone numbers, email addresses, SSNs, credit card numbers, and passport numbers. This is critical for GDPR, HIPAA, and CCPA compliance before storing or logging user content.
detect_pii_entities()
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
text = """
Please contact John Smith at john.smith@example.com or call +1-555-867-5309.
His account SSN is 123-45-6789 and credit card ending 4532 1234 5678 9012.
Shipping address: 742 Evergreen Terrace, Springfield, IL 62701.
"""
response = comprehend.detect_pii_entities(
    Text=text,
    LanguageCode='en'
)
print("PII Entities Found:")
for entity in response['Entities']:
    snippet = text[entity['BeginOffset']:entity['EndOffset']]
    print(f" Type: {entity['Type']:<20} Value: {snippet:<30} Score: {entity['Score']:.3f}")
Redaction with contains_pii_entities()
For a quick compliance gate (without needing exact positions), use contains_pii_entities() to get a boolean-style check, then apply redaction using the offset data from detect_pii_entities():
def redact_pii(text: str, language_code: str = 'en') -> str:
    """Replace each PII span with a marker naming its type, e.g. [EMAIL]."""
    comprehend = boto3.client('comprehend', region_name='us-east-1')
    response = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    # Sort entities in reverse order so offsets stay valid as we replace
    entities = sorted(response['Entities'], key=lambda e: e['BeginOffset'], reverse=True)
    result = list(text)
    for entity in entities:
        start = entity['BeginOffset']
        end = entity['EndOffset']
        replacement = f"[{entity['Type']}]"
        result[start:end] = list(replacement)
    return "".join(result)
# Usage
clean_text = redact_pii(text)
print(clean_text)
# Illustrative output (the exact spans depend on what the model detects):
# Please contact [NAME] at [EMAIL] or call [PHONE].
# His account SSN is [SSN] and credit card ending [CREDIT_DEBIT_NUMBER].
# Shipping address: [ADDRESS].
For bulk redaction, use start_pii_entities_detection_job() with a RedactionConfig to have Comprehend write pre-redacted output documents directly to S3, with no Lambda code needed for the replacement logic.
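The quick compliance gate mentioned earlier in this section can be sketched as follows. contains_pii_entities returns document-level labels (a `Labels` list of `Name` and `Score` pairs) rather than offsets. The `pii_labels` helper and its 0.5 threshold are assumptions for this example, and `client` is expected to be a boto3 Comprehend client.

```python
from typing import List


def pii_labels(client, text: str, language_code: str = "en", min_score: float = 0.5) -> List[str]:
    """Return the sorted PII label names found with at least `min_score` confidence."""
    response = client.contains_pii_entities(Text=text, LanguageCode=language_code)
    return sorted(label["Name"] for label in response["Labels"] if label["Score"] >= min_score)


def is_safe_to_store(client, text: str) -> bool:
    """Gate: True only when no PII label clears the threshold."""
    return not pii_labels(client, text)
```

A caller would run the more detailed offset-based redaction only when the gate returns `False`.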
7. Syntax Analysis
Syntax analysis (part-of-speech tagging) identifies the grammatical role of each token in a sentence — NOUN, VERB, ADJECTIVE, ADVERB, PROPN (proper noun), DET (determiner), etc. This is useful for building grammar-aware search, content simplification tools, or feeding downstream NLP pipelines.
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
response = comprehend.detect_syntax(
    Text="The serverless Lambda function processes incoming S3 events efficiently.",
    LanguageCode='en'
)
for token in response['SyntaxTokens']:
    pos = token['PartOfSpeech']
    print(f" Token: {token['Text']:<15} POS: {pos['Tag']:<8} Score: {pos['Score']:.3f}")
# Output:
# Token: The POS: DET Score: 1.000
# Token: serverless POS: ADJ Score: 0.997
# Token: Lambda POS: PROPN Score: 0.994
# Token: function POS: NOUN Score: 0.999
# Token: processes POS: VERB Score: 0.998
# Token: incoming POS: ADJ Score: 0.992
# Token: S3 POS: PROPN Score: 0.989
# Token: events POS: NOUN Score: 0.999
# Token: efficiently POS: ADV Score: 0.998
Common POS tags: ADJ, ADP (adposition), ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, O (other), PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB.
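One way to feed a downstream pipeline is to keep only tokens with selected tags. The sketch below filters the `SyntaxTokens` list for nouns and proper nouns; the `tokens_with_tags` name and the 0.9 score cutoff are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List


def tokens_with_tags(syntax_tokens: List[Dict], tags: Iterable[str] = ("NOUN", "PROPN"),
                     min_score: float = 0.9) -> List[str]:
    """Return token texts whose part-of-speech tag is in `tags` with enough confidence."""
    wanted = set(tags)
    return [
        token["Text"]
        for token in syntax_tokens
        if token["PartOfSpeech"]["Tag"] in wanted and token["PartOfSpeech"]["Score"] >= min_score
    ]
```

Applied to the example sentence above, this would keep words such as "Lambda", "function", "S3", and "events" as candidate search terms.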
8. Custom Classifiers
When the built-in sentiment labels (POSITIVE/NEGATIVE/etc.) are too coarse, train a custom classifier. Examples: routing support tickets to departments (Billing, Technical, Returns), classifying legal clauses by type, or labeling news articles by topic.
Training Data Format (CSV)
Prepare a UTF-8 CSV with two columns — label first, then text. No header row.
# training-data.csv
BILLING,"My invoice shows duplicate charges from last month."
BILLING,"I was charged twice for the annual subscription."
TECHNICAL,"The app keeps crashing when I open the settings screen."
TECHNICAL,"Login fails with error code 503 on mobile devices."
RETURNS,"I want to return the product I received yesterday."
RETURNS,"How do I initiate a refund for my recent order?"
GENERAL,"What are your business hours?"
GENERAL,"Can you tell me more about your premium plan?"
Upload the CSV to S3, then create the classifier:
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
# Step 1: Create the classifier (training takes 30–90 minutes)
response = comprehend.create_document_classifier(
    DocumentClassifierName='SupportTicketClassifier-v1',
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendDataAccessRole',
    InputDataConfig={
        'DataFormat': 'COMPREHEND_CSV',
        'S3Uri': 's3://my-comprehend-bucket/training-data/training-data.csv'
    },
    OutputDataConfig={
        'S3Uri': 's3://my-comprehend-bucket/classifier-output/'
    },
    LanguageCode='en',
    Mode='MULTI_CLASS'  # or 'MULTI_LABEL' for documents with multiple tags
)
classifier_arn = response['DocumentClassifierArn']
print(f"Classifier ARN: {classifier_arn}")
# Step 2: Wait for a terminal status (poll or use EventBridge)
import time
while True:
    status = comprehend.describe_document_classifier(
        DocumentClassifierArn=classifier_arn
    )['DocumentClassifierProperties']['Status']
    print(f"Status: {status}")
    # Failed training is reported as IN_ERROR; classifiers have no FAILED status
    if status in ('TRAINED', 'TRAINED_WITH_WARNING', 'IN_ERROR', 'STOPPED'):
        break
    time.sleep(60)
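A bare `while True` loop waits forever if the job never reaches one of the expected states. A hedged alternative is a small polling helper with a timeout, sketched below. `wait_for_status` is a hypothetical name; the status lookup is passed in as a callable so the same helper could wrap any of the describe_* calls in this guide.

```python
import time
from typing import Callable, Collection


def wait_for_status(fetch_status: Callable[[], str], terminal: Collection[str],
                    interval: float = 60.0, timeout: float = 3 * 3600,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)
```

For the classifier, `fetch_status` would be a lambda that calls describe_document_classifier and returns the `Status` field. The `sleep` and `clock` parameters exist only so the helper can be exercised in tests without waiting.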
Real-Time vs. Async Inference
Real-time endpoint — create an endpoint from a trained classifier for low-latency synchronous classification (suitable for live ticket routing):
# Create real-time endpoint
endpoint_response = comprehend.create_endpoint(
    EndpointName='support-classifier-endpoint',
    ModelArn=classifier_arn,
    DesiredInferenceUnits=1  # scale up for higher throughput
)
endpoint_arn = endpoint_response['EndpointArn']
# Classify a document synchronously
# (the endpoint must reach IN_SERVICE first; check its status with describe_endpoint)
result = comprehend.classify_document(
    Text="I was charged twice on my credit card this billing cycle.",
    EndpointArn=endpoint_arn
)
print(result['Classes'])
# [{'Name': 'BILLING', 'Score': 0.9823}, {'Name': 'GENERAL', 'Score': 0.0124}, ...]
Async batch job — use start_document_classification_job() for bulk inference against an S3 folder without a persistent endpoint (more cost-efficient for non-real-time workloads).
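For live ticket routing, the `Classes` list usually needs a decision rule on top. A minimal sketch, assuming a hypothetical `choose_queue` helper, a 0.7 confidence threshold, and a `MANUAL_REVIEW` fallback queue (none of which come from Comprehend itself):

```python
from typing import Dict, List


def choose_queue(classes: List[Dict], threshold: float = 0.7,
                 fallback: str = "MANUAL_REVIEW") -> str:
    """Pick the highest-scoring class, or the fallback queue when confidence is too low."""
    if not classes:
        return fallback
    best = max(classes, key=lambda c: c["Score"])
    return best["Name"] if best["Score"] >= threshold else fallback


print(choose_queue([{"Name": "BILLING", "Score": 0.9823}, {"Name": "GENERAL", "Score": 0.0124}]))
# BILLING
```

Sending low-confidence tickets to a human queue keeps misrouted tickets from silently landing in the wrong department.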
9. Custom Entity Recognizers
Custom entity recognizers let you teach Comprehend domain-specific entity types not covered by the 9 built-in types. Examples: product SKU codes, internal employee IDs, medical drug names, or proprietary terminology.
Annotation Format
You supply two files: a plain text document corpus and an annotations CSV mapping entity spans to their type.
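# annotations.csv (the first row is the required header; offsets are 0-based character
# positions within the line, with End Offset written here as the position just past the entity)
File,Line,Begin Offset,End Offset,Type
documents.txt,0,15,31,PRODUCT_SKU
documents.txt,1,0,16,PRODUCT_SKU
documents.txt,2,21,30,EMPLOYEE_ID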
# documents.txt (one document per line)
Order contains SKU-A1234-XL-BLK in the cart.
SKU-B9876-SM-RED was restocked today.
Assigned to employee EMP-00421 for processing.
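Counting offsets by hand is error-prone, so annotation rows are often generated from patterns. The sketch below derives them with regular expressions; `build_annotations` and both patterns are assumptions for this example, and `match.end()` is an exclusive end position, so confirm the end-offset convention Comprehend expects before training on generated rows.

```python
import re
from typing import Dict, List, Tuple

Row = Tuple[str, int, int, int, str]


def build_annotations(lines: List[str], patterns: Dict[str, str],
                      file_name: str = "documents.txt") -> List[Row]:
    """Return (file, line, begin, end, type) rows for every pattern match in every line."""
    rows: List[Row] = []
    for line_no, line in enumerate(lines):
        for entity_type, pattern in patterns.items():
            for match in re.finditer(pattern, line):
                rows.append((file_name, line_no, match.start(), match.end(), entity_type))
    return rows


docs = [
    "Order contains SKU-A1234-XL-BLK in the cart.",
    "SKU-B9876-SM-RED was restocked today.",
    "Assigned to employee EMP-00421 for processing.",
]
patterns = {
    "PRODUCT_SKU": r"SKU-[A-Z0-9]+(?:-[A-Z0-9]+)*",  # hypothetical SKU shape
    "EMPLOYEE_ID": r"EMP-\d+",                       # hypothetical employee ID shape
}
for row in build_annotations(docs, patterns):
    print(",".join(str(field) for field in row))
```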
import boto3
comprehend = boto3.client('comprehend', region_name='us-east-1')
response = comprehend.create_entity_recognizer(
    RecognizerName='ProductSKURecognizer-v1',
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendDataAccessRole',
    InputDataConfig={
        'EntityTypes': [
            {'Type': 'PRODUCT_SKU'},
            {'Type': 'EMPLOYEE_ID'}
        ],
        'Documents': {
            'S3Uri': 's3://my-comprehend-bucket/entity-training/documents.txt'
        },
        'Annotations': {
            'S3Uri': 's3://my-comprehend-bucket/entity-training/annotations.csv'
        }
    },
    LanguageCode='en'
)
recognizer_arn = response['EntityRecognizerArn']
print(f"Recognizer ARN: {recognizer_arn}")
Once trained, deploy via an endpoint and use detect_entities() with the EndpointArn parameter to invoke your custom recognizer in real time.
10. Topic Modeling
Topic modeling uses Latent Dirichlet Allocation (LDA) to discover the dominant themes across a large document corpus — without any predefined labels. It is inherently asynchronous; input and output both live in S3.
import boto3
import time
comprehend = boto3.client('comprehend', region_name='us-east-1')
# Start an async topic detection job
response = comprehend.start_topics_detection_job(
    InputDataConfig={
        'S3Uri': 's3://my-comprehend-bucket/topic-input/',
        'InputFormat': 'ONE_DOC_PER_FILE'  # or ONE_DOC_PER_LINE
    },
    OutputDataConfig={
        'S3Uri': 's3://my-comprehend-bucket/topic-output/'
    },
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendDataAccessRole',
    NumberOfTopics=10,  # 1–100; start with 10–20 and tune
    JobName='BlogTopicAnalysis-2026-06'
)
job_id = response['JobId']
print(f"Started job: {job_id}")
# Poll until the job reaches a terminal state
while True:
    job = comprehend.describe_topics_detection_job(JobId=job_id)
    status = job['TopicsDetectionJobProperties']['JobStatus']
    print(f"Status: {status}")
    # STOP_REQUESTED is transitional; STOPPED is the terminal state after a stop
    if status in ('COMPLETED', 'FAILED', 'STOPPED'):
        break
    time.sleep(30)
Parsing S3 Output
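The completed job writes a single compressed archive, output.tar.gz, under a job-specific prefix in your output S3 path. The archive contains two CSV files:
- topic-terms.csv: each row is topic,term,weight and lists the top terms defining each topic
- doc-topics.csv: each row is docname,topic,proportion and shows how much each document belongs to each topic
import boto3
import csv
import io
import tarfile
from urllib.parse import urlparse
s3 = boto3.client('s3')
comprehend = boto3.client('comprehend', region_name='us-east-1')
# The finished job reports the full S3 URI of its archive (job_id comes from the job started above)
props = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']
output = urlparse(props['OutputDataConfig']['S3Uri'])
obj = s3.get_object(Bucket=output.netloc, Key=output.path.lstrip('/'))
# Open the tar.gz in memory and read topic-terms.csv from it
with tarfile.open(fileobj=io.BytesIO(obj['Body'].read()), mode='r:gz') as archive:
    member = next(m for m in archive.getmembers() if m.name.endswith('topic-terms.csv'))
    reader = csv.DictReader(io.TextIOWrapper(archive.extractfile(member), encoding='utf-8', newline=''))
    current_topic = None
    for row in reader:
        if row['topic'] != current_topic:
            current_topic = row['topic']
            print(f"\n--- Topic {current_topic} ---")
        print(f" {row['term']:<25} weight: {float(row['weight']):.4f}")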
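The doc-topics.csv file can be reduced to a single dominant topic per document, which is often what a dashboard needs. This is a minimal sketch that works on the CSV text once it has been downloaded; `dominant_topics` is a hypothetical helper, and it assumes the file starts with a `docname,topic,proportion` header row.

```python
import csv
import io
from typing import Dict, Tuple


def dominant_topics(doc_topics_csv: str) -> Dict[str, Tuple[str, float]]:
    """Map each docname to its (topic, proportion) pair with the highest proportion."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.DictReader(io.StringIO(doc_topics_csv)):
        proportion = float(row["proportion"])
        name = row["docname"]
        if name not in best or proportion > best[name][1]:
            best[name] = (row["topic"], proportion)
    return best


sample = "docname,topic,proportion\na.txt,000,0.7\na.txt,001,0.3\nb.txt,001,0.9\n"
print(dominant_topics(sample))
# {'a.txt': ('000', 0.7), 'b.txt': ('001', 0.9)}
```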
11. Comprehend Medical
Amazon Comprehend Medical is a separate but related service optimized for clinical text — physician notes, discharge summaries, lab reports. It understands medical ontologies and can extract entities, PHI (Protected Health Information), and map findings to standard codes.
import boto3
cm = boto3.client('comprehendmedical', region_name='us-east-1')
clinical_note = """
Patient: Jane Doe, DOB 1980-03-22. Diagnosed with Type 2 Diabetes Mellitus (E11.9).
Prescribed Metformin 500mg twice daily. Blood pressure 145/92 mmHg.
Referred to Dr. Patel at Mysore General Hospital for nephrology follow-up.
"""
# Detect medical entities
response = cm.detect_entities_v2(Text=clinical_note)
print("Medical Entities:")
for entity in response['Entities']:
    print(f" Category: {entity['Category']:<22} Type: {entity['Type']:<25} Text: {entity['Text']}")
# Detect PHI separately
phi_response = cm.detect_phi(Text=clinical_note)
print("\nPHI Entities:")
for phi in phi_response['Entities']:
    print(f" Type: {phi['Type']:<20} Text: {phi['Text']}")
Comprehend Medical entity categories include: MEDICATION, MEDICAL_CONDITION, ANATOMY, TEST_TREATMENT_PROCEDURE, TIME_EXPRESSION, and PROTECTED_HEALTH_INFORMATION. It also supports infer_icd10_cm() to map conditions to ICD-10 codes and infer_rx_norm() to map medications to RxNorm identifiers — critical for EHR integration and clinical analytics.
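For clinical analytics it is often convenient to bucket the detected entities by category before storing them. The sketch below groups the `Entities` list from detect_entities_v2; `group_by_category` and the 0.8 score cutoff are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_category(entities: List[Dict], min_score: float = 0.8) -> Dict[str, List[str]]:
    """Group entity texts by Category, dropping low-confidence detections."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        if entity["Score"] >= min_score:
            grouped[entity["Category"]].append(entity["Text"])
    return dict(grouped)
```

Called as `group_by_category(response['Entities'])`, this yields a dictionary keyed by category names such as MEDICATION or MEDICAL_CONDITION.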
12. Integration Patterns
Comprehend works best as a component in an event-driven pipeline. A canonical pattern for real-time document analysis:
S3 → Lambda → Comprehend → DynamoDB Pipeline
import boto3
import json
import os
from datetime import datetime, timezone
from urllib.parse import unquote_plus
comprehend = boto3.client('comprehend', region_name=os.environ['AWS_REGION'])
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb', region_name=os.environ['AWS_REGION'])
table = dynamodb.Table(os.environ['RESULTS_TABLE'])
def lambda_handler(event, context):
    """Triggered by S3 PutObject event on the documents bucket."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record['s3']['object']['key'])
        # Fetch the document and keep at most 5000 bytes (Comprehend max);
        # errors='ignore' drops a multi-byte character cut in half by the slice
        obj = s3.get_object(Bucket=bucket, Key=key)
        text = obj['Body'].read()[:5000].decode('utf-8', errors='ignore')
        # These calls run one after another; threads or asyncio could overlap them
        sentiment = comprehend.detect_sentiment(Text=text, LanguageCode='en')
        entities = comprehend.detect_entities(Text=text, LanguageCode='en')
        phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
        # Store enriched metadata in DynamoDB (scores as strings, since floats are rejected)
        table.put_item(Item={
            'documentId': key,
            'sentiment': sentiment['Sentiment'],
            'sentimentScores': {
                k: str(round(v, 4))
                for k, v in sentiment['SentimentScore'].items()
            },
            'topEntities': [
                {'text': e['Text'], 'type': e['Type']}
                for e in sorted(entities['Entities'],
                                key=lambda x: x['Score'], reverse=True)[:5]
            ],
            'keyPhrases': [p['Text'] for p in phrases['KeyPhrases'][:10]],
            'processedAt': datetime.now(timezone.utc).isoformat(),
            'requestId': context.aws_request_id
        })
    return {'statusCode': 200, 'body': json.dumps('Processed successfully')}
Architecture details:
- S3 bucket notification triggers Lambda on new object creation
- Lambda reads the document, calls three Comprehend APIs, writes to DynamoDB
- Downstream dashboards query DynamoDB for aggregated sentiment trends, entity frequencies, and phrase clouds
- For monitoring and alerting on pipeline health, use CloudWatch metrics and alarms
- For more complex multi-step workflows (e.g., detect language → translate → analyze sentiment), orchestrate with Step Functions
- For event fan-out (e.g., routing processed results to multiple consumers), use EventBridge
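Inside the Lambda step, the three Comprehend calls are independent of each other, so they can overlap instead of running back to back. The following is a sketch using a thread pool, under the assumption that `client` is a boto3 Comprehend client (boto3 clients are generally safe to share across threads); `analyze_concurrently` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def analyze_concurrently(client, text: str, language_code: str = "en") -> Dict[str, dict]:
    """Run sentiment, entity, and key-phrase detection in parallel threads."""
    calls = {
        "sentiment": client.detect_sentiment,
        "entities": client.detect_entities,
        "key_phrases": client.detect_key_phrases,
    }
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = {
            name: pool.submit(call, Text=text, LanguageCode=language_code)
            for name, call in calls.items()
        }
        # result() re-raises any exception from the worker thread
        return {name: future.result() for name, future in futures.items()}
```

Because the calls are network-bound, total latency should approach that of the slowest call rather than the sum of all three.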
13. Cost Model
Comprehend charges per unit of 100 characters (with a minimum of 300 characters per API call). Pricing varies by API and tier.
Built-in API Pricing (us-east-1, 2026)
| API | Price per 100 chars | Notes |
|---|---|---|
| Sentiment, Entities, Key Phrases, Language | $0.0001 | First 10M units/mo; volume discounts after |
| Syntax | $0.00005 | Half the rate of the other built-in APIs; same tier structure |
| PII Detection | $0.0001 | Same tier structure |
| Async batch jobs | $0.0001 | Same rate, lower per-request overhead |
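To make the unit arithmetic concrete, here is a rough estimator based on the rules stated above: 100 characters per unit, a 3-unit minimum per request, and the $0.0001 first-tier rate. It is a sketch rather than a billing tool; rounding partial units up is an assumption, and volume discounts are ignored.

```python
import math
from typing import Iterable

UNIT_CHARS = 100         # characters per billing unit
MIN_UNITS = 3            # minimum units charged per request (300 characters)
PRICE_PER_UNIT = 0.0001  # first-tier rate from the table, in USD


def billable_units(char_count: int) -> int:
    """Units charged for one request, assuming partial units round up."""
    return max(MIN_UNITS, math.ceil(char_count / UNIT_CHARS))


def estimate_cost(char_counts: Iterable[int], price_per_unit: float = PRICE_PER_UNIT) -> float:
    """Estimated USD cost of one API applied to documents of the given lengths."""
    return sum(billable_units(n) for n in char_counts) * price_per_unit


# 10,000 reviews of 250 characters each are billed as 3 units apiece
print(f"${estimate_cost([250] * 10_000):.2f}")
# $3.00
```

The example shows why short texts cost more per character than long ones: a 250-character review is billed the same as a 300-character one.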
Custom Model Pricing
| Activity | Price |
|---|---|
| Custom classifier / entity recognizer training | $3.00 per training hour |
| Async custom inference (batch) | $0.0005 per 100 chars |
| Real-time endpoint (Inference Unit) | $0.0005 per second per IU (about $1.80 per hour), charged for as long as the endpoint exists |
Free Tier
New AWS accounts get 50,000 units (5 million characters) per month free for each of the standard APIs for the first 12 months. Because every request is billed at a minimum of 3 units, that covers at most about 16,600 short customer reviews per month at no charge (50,000 / 3), which is sufficient to build and validate a proof of concept.
Cost optimization tips:
- Use async batch jobs instead of synchronous calls for large volumes: the rate is the same but there are far fewer API round-trips.
- Pre-filter documents before calling Comprehend (e.g., skip very short strings below 20 characters where NLP results are unreliable).
- Cache results in DynamoDB keyed by a hash of the input text — avoid re-processing unchanged documents.
- Delete custom endpoints when not in use — idle endpoints still accrue the hourly inference unit charge.
- Compare against Amazon Bedrock foundation models for tasks where higher reasoning quality justifies the cost difference.
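The caching tip can be sketched with a content hash as the key. In the example below a plain dictionary stands in for the DynamoDB table, and `cache_key` and `cached_call` are hypothetical helpers. The key includes the API name and language code so that different analyses of the same text do not collide.

```python
import hashlib
from typing import Callable, Dict


def cache_key(api: str, language_code: str, text: str) -> str:
    """Build a stable key from the API name, language, and a SHA-256 of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{api}:{language_code}:{digest}"


def cached_call(cache: Dict[str, dict], api: str, language_code: str, text: str,
                compute: Callable[[str], dict]) -> dict:
    """Return the cached result for this text, calling compute(text) only on a miss."""
    key = cache_key(api, language_code, text)
    if key not in cache:
        cache[key] = compute(text)
    return cache[key]
```

With a real table, the membership test and assignment would become a get_item and put_item pair, and `compute` would wrap the Comprehend call.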