AWS Athena and Glue: Serverless Data Analytics
Amazon Athena and AWS Glue are the two pillars of serverless analytics on AWS. Athena lets you run SQL directly against files sitting in S3 — no cluster to manage, no data loading step, pay only for the bytes you scan. Glue sits alongside it as a managed ETL service and metadata catalog: crawlers discover your data, the Data Catalog stores schema, and Glue ETL jobs transform raw files into optimized Parquet at scale. Together they form a practical, low-overhead data lake that can scale from gigabytes to petabytes without you provisioning a single server. This guide covers everything you need to build, optimize, and govern a production Athena + Glue pipeline in 2026.
Table of Contents
- Athena Overview and Pricing
- Creating Tables: DDL, Partitions, MSCK REPAIR
- Analytical SQL Examples
- Athena Best Practices: Columnar Formats and Partitioning
- AWS Glue Data Catalog and Crawlers
- Glue ETL Jobs: PySpark CSV to Parquet
- Glue DataBrew: Visual Data Preparation
- Lake Formation: Fine-Grained Access Control
- Athena Federated Query
- Cost Optimization Strategies
- boto3: Running Athena Queries Programmatically
- Frequently Asked Questions
Athena Overview and Pricing
Amazon Athena is an interactive query service built on top of Presto/Trino (the open-source distributed SQL engine). You point it at an S3 location, define a table schema in the Glue Data Catalog, and run ANSI SQL. No infrastructure to start, stop, or patch.
| Dimension | Detail |
|---|---|
| Pricing | $5 per TB of data scanned. Only bytes read from S3 count — not bytes returned to you. |
| Minimum charge | 10 MB per query (even if your result is 1 KB) |
| Supported formats | CSV, JSON, ORC, Parquet, Avro, TSV, Ion, custom SerDe |
| Compression | GZIP, Snappy, ZSTD, LZO, Brotli (format-dependent) |
| DDL operations | Free — CREATE TABLE, DROP TABLE, SHOW PARTITIONS don't scan data |
| Result storage | Query results written to a designated S3 output location; cached for 24h by default |
| Concurrency | Default 20 DML queries in parallel per account; request increase via Service Quotas |
Creating Tables: DDL, Partitions, MSCK REPAIR
Athena uses the Hive DDL syntax. Tables are external — the data stays in S3 and the schema lives in the Glue Data Catalog. Here is a realistic example for an e-commerce clickstream dataset partitioned by year, month, and day:
-- Create a partitioned external table pointing at Parquet files in S3
CREATE EXTERNAL TABLE clickstream (
event_id STRING,
user_id BIGINT,
session_id STRING,
page STRING,
referrer STRING,
device_type STRING,
country_code STRING,
duration_ms INT,
revenue DOUBLE
)
PARTITIONED BY (
year STRING,
month STRING,
day STRING
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/clickstream/'
TBLPROPERTIES (
'parquet.compress'='SNAPPY',
'classification'='parquet'
);
-- After adding new partition folders to S3, make Athena discover them:
MSCK REPAIR TABLE clickstream;
-- Or add a specific partition manually (faster for a single new day):
ALTER TABLE clickstream ADD IF NOT EXISTS
PARTITION (year='2026', month='06', day='06')
LOCATION 's3://my-data-lake/clickstream/year=2026/month=06/day=06/';
key=value folder naming convention (Hive-style). Example: s3://bucket/clickstream/year=2026/month=06/day=06/. Athena automatically maps folder segments to partition columns. If your S3 paths don't follow this convention, use explicit ALTER TABLE ADD PARTITION instead of MSCK REPAIR.MSCK REPAIR TABLE scans all S3 prefixes under the table location and registers any unregistered partitions. On very large tables with thousands of partitions it can be slow — in that case, use Glue crawlers or the ALTER TABLE ADD PARTITION approach via AWS CLI or boto3.
Analytical SQL Examples
Athena supports most ANSI SQL including window functions, CTEs, UNNEST, JSON extraction, and approximate aggregations. Below are practical examples covering common analytics patterns.
-- 1. Revenue by country for June 2026 using partition pruning
-- Athena will only scan year=2026/month=06 partitions (not the full table)
SELECT
country_code,
COUNT(*) AS sessions,
ROUND(SUM(revenue), 2) AS total_revenue,
ROUND(AVG(duration_ms) / 1000.0, 1) AS avg_session_sec
FROM clickstream
WHERE year = '2026' AND month = '06'
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 20;
-- 2. Rolling 7-day revenue using a window function
SELECT
day,
SUM(revenue) AS daily_revenue,
SUM(SUM(revenue)) OVER (
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_revenue
FROM clickstream
WHERE year = '2026' AND month = '06'
GROUP BY day
ORDER BY day;
-- 3. Funnel analysis: sessions that hit /cart but not /checkout
WITH cart_sessions AS (
SELECT DISTINCT session_id
FROM clickstream
WHERE year = '2026' AND month = '06'
AND page = '/cart'
),
checkout_sessions AS (
SELECT DISTINCT session_id
FROM clickstream
WHERE year = '2026' AND month = '06'
AND page = '/checkout'
)
SELECT
COUNT(c.session_id) AS reached_cart,
COUNT(o.session_id) AS completed_checkout,
ROUND(
100.0 * COUNT(o.session_id) / NULLIF(COUNT(c.session_id), 0),
2
) AS conversion_pct
FROM cart_sessions c
LEFT JOIN checkout_sessions o USING (session_id);
-- 4. CTAS: create an optimised summary table for a dashboard
-- Results written to S3 in Parquet — future queries hit only this small table
CREATE TABLE clickstream_monthly_summary
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['year', 'month'],
external_location = 's3://my-data-lake/summaries/clickstream_monthly/'
) AS
SELECT
year,
month,
country_code,
device_type,
COUNT(*) AS sessions,
SUM(revenue) AS revenue,
AVG(duration_ms) AS avg_duration_ms
FROM clickstream
GROUP BY year, month, country_code, device_type;
approx_distinct(user_id) instead of COUNT(DISTINCT user_id). This is orders of magnitude faster and uses much less memory, with a typical error rate under 1%.Athena Best Practices: Columnar Formats and Partitioning
The three biggest levers for reducing Athena cost and query time are columnar format, compression, and partitioning. They compound: use all three together.
| Practice | Impact | Recommendation |
|---|---|---|
| Use Parquet or ORC | 60–90% less data scanned vs CSV for typical analytical queries | Parquet for most cases; ORC for Hive-native workloads |
| Snappy compression | ~3–5x size reduction vs uncompressed; faster decompression than GZIP | Default for Parquet/ORC. Use ZSTD for better ratio with acceptable CPU overhead. |
| Partition by query patterns | Eliminates entire S3 prefix scans | Always include date fields. Avoid high-cardinality partitions (e.g., user_id — millions of tiny files). |
| Target file sizes 128 MB–1 GB | Avoids the small-files problem (many tiny files = many S3 GET requests) | Use Glue ETL to compact small files during transformation |
| Workgroups | Cost control and query isolation | Create a workgroup per team; set data scanned per-query limit (e.g., 1 TB) |
| Result reuse | Free repeated reads of the same result | Enable query result reuse in workgroup settings (TTL up to 7 days) |
AWS Glue Data Catalog and Crawlers
The Glue Data Catalog is a central metadata repository — a managed Apache Hive Metastore compatible with Athena, EMR, and Spark. You define databases and tables here; Athena reads schema from it automatically.
A Glue Crawler inspects your S3 data, infers schema (column names and types), and writes table definitions to the Catalog. This beats writing DDL by hand for data you don't fully control.
Setting up a crawler via AWS CLI:
# 1. Create an IAM role for the crawler (one-time setup)
aws iam create-role \
--role-name GlueCrawlerRole \
--assume-role-policy-document '{
"Version":"2012-10-17",
"Statement":[{"Effect":"Allow","Principal":{"Service":"glue.amazonaws.com"},"Action":"sts:AssumeRole"}]
}'
aws iam attach-role-policy \
--role-name GlueCrawlerRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
# Grant read access to your S3 bucket (inline policy)
aws iam put-role-policy \
--role-name GlueCrawlerRole \
--policy-name S3ReadClickstream \
--policy-document '{
"Version":"2012-10-17",
"Statement":[{
"Effect":"Allow",
"Action":["s3:GetObject","s3:ListBucket"],
"Resource":[
"arn:aws:s3:::my-data-lake",
"arn:aws:s3:::my-data-lake/*"
]
}]
}'
# 2. Create the Glue database
aws glue create-database \
--database-input '{"Name":"ecommerce_lake","Description":"E-commerce data lake"}'
# 3. Create and run the crawler
aws glue create-crawler \
--name clickstream-crawler \
--role GlueCrawlerRole \
--database-name ecommerce_lake \
--targets '{"S3Targets":[{"Path":"s3://my-data-lake/clickstream/"}]}' \
--configuration '{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' \
--schedule 'cron(0 6 * * ? *)' # run daily at 6 AM UTC
aws glue start-crawler --name clickstream-crawler
# 4. Check crawler status
aws glue get-crawler --name clickstream-crawler \
--query 'Crawler.State'
Glue ETL Jobs: PySpark CSV to Parquet
A Glue ETL job is a serverless Apache Spark script managed by AWS. You write it in PySpark or Python shell (for smaller datasets), specify the number of DPUs (Data Processing Units), and AWS handles cluster provisioning, job scheduling, and retries. Each DPU provides 4 vCPUs and 16 GB RAM.
The following job reads raw CSV clickstream data from S3, applies schema transformation, drops bad rows, and writes Parquet partitioned by date back to S3. It uses the Glue Dynamic Frame API which integrates with the Data Catalog automatically:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, IntegerType
# ----- Initialise Glue context -----
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_bucket', 'dest_bucket'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
SOURCE_PATH = f"s3://{args['source_bucket']}/raw/clickstream/"
DEST_PATH = f"s3://{args['dest_bucket']}/clickstream/"
# ----- Read raw CSV from S3 via Glue Data Catalog -----
raw_dyf = glueContext.create_dynamic_frame.from_options(
connection_type="s3",
connection_options={"paths": [SOURCE_PATH], "recurse": True},
format="csv",
format_options={"withHeader": True, "separator": ",", "quoteChar": '"'},
transformation_ctx="raw_dyf"
)
# Convert to Spark DataFrame for richer transformations
df = raw_dyf.toDF()
# ----- Schema cleanup and type casting -----
df = (df
.withColumn("user_id", F.col("user_id").cast("bigint"))
.withColumn("duration_ms", F.col("duration_ms").cast(IntegerType()))
.withColumn("revenue", F.col("revenue").cast(DoubleType()))
.withColumn("event_time", F.to_timestamp("event_time", "yyyy-MM-dd HH:mm:ss"))
# Derive partition columns from the event timestamp
.withColumn("year", F.date_format("event_time", "yyyy"))
.withColumn("month", F.date_format("event_time", "MM"))
.withColumn("day", F.date_format("event_time", "dd"))
)
# ----- Data quality: drop rows missing mandatory fields -----
REQUIRED = ["event_id", "user_id", "session_id", "page", "event_time"]
df_clean = df.dropna(subset=REQUIRED)
bad_count = df.count() - df_clean.count()
print(f"Dropped {bad_count} rows with null mandatory fields")
# ----- Write Parquet partitioned by year/month/day -----
# Use coalesce to avoid the small-files problem within each partition
output_dyf = DynamicFrame.fromDF(df_clean, glueContext, "output_dyf")
glueContext.write_dynamic_frame.from_options(
frame=output_dyf,
connection_type="s3",
connection_options={
"path": DEST_PATH,
"partitionKeys": ["year", "month", "day"]
},
format="glueparquet",
format_options={"compression": "snappy"},
transformation_ctx="write_parquet"
)
job.commit()
print("ETL job complete. Parquet data written to", DEST_PATH)
Deploy this script by uploading it to S3 and creating the job via CLI or console:
# Upload script to S3
aws s3 cp clickstream_etl.py s3://my-scripts-bucket/glue/clickstream_etl.py
# Create the Glue ETL job
aws glue create-job \
--name clickstream-csv-to-parquet \
--role GlueCrawlerRole \
--command '{
"Name": "glueetl",
"ScriptLocation": "s3://my-scripts-bucket/glue/clickstream_etl.py",
"PythonVersion": "3"
}' \
--default-arguments '{
"--source_bucket": "my-data-lake",
"--dest_bucket": "my-data-lake",
"--job-bookmark-option": "job-bookmark-enable",
"--enable-metrics": "",
"--enable-continuous-cloudwatch-log": "true"
}' \
--glue-version "4.0" \
--number-of-workers 10 \
--worker-type G.1X \
--timeout 120
# Start a run
aws glue start-job-run --job-name clickstream-csv-to-parquet
--job-bookmark-option job-bookmark-enable so Glue tracks which S3 files it has already processed. On the next run it automatically skips them and only processes new files. This is the Glue equivalent of checkpointing in Spark Structured Streaming.Glue DataBrew: Visual Data Preparation
AWS Glue DataBrew is a visual, no-code data preparation tool aimed at analysts who prefer a spreadsheet-style interface over writing PySpark. You connect DataBrew to an S3 location or Glue Catalog table, see a sample of the data, and apply transformations from a menu (rename columns, parse dates, remove duplicates, standardize casing, flag outliers, etc.). DataBrew records every transformation as a Recipe — a reusable, versionable list of steps that can be published as a Glue job and run on a schedule.
DataBrew is particularly useful for:
- One-off exploratory data cleaning before building a formal ETL pipeline
- Business analysts who need to prepare data without help from engineering
- Profiling datasets — DataBrew generates statistics (null rates, cardinality, distribution histograms) across all columns in a single click
- PII detection — DataBrew can automatically flag columns that look like email addresses, phone numbers, SSNs, and credit card numbers
Lake Formation: Fine-Grained Access Control
AWS Lake Formation sits on top of the Glue Data Catalog and IAM to provide column-level and row-level security for your data lake. Without Lake Formation, access control is done entirely via S3 bucket policies and IAM — meaning you can grant or deny access to entire S3 prefixes but not individual columns or filtered rows. Lake Formation lets you say things like "the analyst role can query the clickstream table but cannot see the user_id column" or "the EU team can only see rows where country_code is in ('DE','FR','ES')".
| Lake Formation Permission | Scope |
|---|---|
| Database permissions | CREATE TABLE, DROP, ALTER on the whole database |
| Table permissions | SELECT, INSERT, DELETE on a specific table |
| Column permissions | SELECT on specific columns only (column-level security) |
| Row filters | SQL WHERE clause applied automatically to every query (row-level security) |
| Data cell filters | Combine column + row restrictions in one policy |
| Tag-based access (LF-Tags) | Attach sensitivity tags to tables/columns; grant access by tag rather than by resource ARN |
Athena Federated Query
Athena Federated Query extends Athena beyond S3. Using Lambda-based connectors, Athena can query RDS, Aurora, DynamoDB, Redis, DocumentDB, OpenSearch, and custom JDBC sources — all from the same SQL interface, without ETL or data movement.
Each connector is a Lambda function that implements the Athena Query Federation SDK. AWS publishes pre-built connectors in the Serverless Application Repository. You deploy the connector Lambda, register it as an Athena data source, and then reference it in queries using a catalog prefix:
-- After deploying the DynamoDB connector as catalog "dynamo_catalog":
-- Join S3-backed Athena table with a live DynamoDB table in one query
SELECT
c.user_id,
c.country_code,
c.total_spend,
u.email,
u.signup_date
FROM
-- S3 / Glue Catalog table (the default AwsDataCatalog)
"AwsDataCatalog"."ecommerce_lake"."clickstream_monthly_summary" c
-- DynamoDB table exposed through the federated connector
JOIN "dynamo_catalog"."ecommerce_lake"."users" u
ON c.user_id = CAST(u.user_id AS BIGINT)
WHERE
c.year = '2026'
AND c.month = '06'
AND c.total_spend > 500
ORDER BY c.total_spend DESC;
WHERE partition_key = X) so the DynamoDB connector uses key-based lookups rather than a table scan.Cost Optimization Strategies
Because Athena charges per byte scanned, every optimization that reduces bytes read directly reduces your bill. Here is a prioritized checklist:
- Convert CSV/JSON to Parquet with Snappy compression first. This alone typically cuts scan costs by 70–90% because Parquet's columnar layout means a SELECT on 3 of 20 columns reads only those 3 columns' data blocks.
- Partition your data by the columns most commonly in your WHERE clause — usually date fields. A query with
WHERE year='2026' AND month='06'on a daily-partitioned table skips 11/12 of the data immediately. - Use CTAS for hot queries. If a dashboard runs the same aggregation 100 times a day, run it once as a CTAS to write a small summary table, then point the dashboard at that summary. Scans drop from TB to MB.
- Enable query result reuse. Athena caches query results for up to 7 days. Identical queries (same SQL + same underlying data) return from cache instantly at zero cost. Enable this per workgroup.
- Set data scan limits on workgroups. Prevent runaway queries by configuring a per-query or per-workgroup data limit in the Athena workgroup settings.
- Right-size Parquet row group sizes. The default 128 MB row group is optimal for most cases. If your queries filter on a column that is not a partition key, consider sorting the data on that column before writing Parquet so Athena's predicate pushdown can skip row groups efficiently.
- Compact small files periodically. Schedule a daily Glue job to merge files smaller than 32 MB into larger ones. The S3 GET overhead for thousands of small files can dwarf the actual data scan cost.
boto3: Running Athena Queries Programmatically
For application code that needs to trigger Athena queries and retrieve results — for example, a nightly reporting Lambda — use the boto3 athena client. The Athena API is asynchronous: you start a query, poll for completion, then fetch the result from S3.
import boto3
import time
import csv
import io
athena = boto3.client('athena', region_name='us-east-1')
s3 = boto3.client('s3')
WORKGROUP = 'analytics'
OUTPUT_LOCATION = 's3://my-query-results/athena/'
DATABASE = 'ecommerce_lake'
def run_athena_query(sql: str, database: str = DATABASE) -> list[dict]:
"""
Execute an Athena query synchronously and return results as a list of dicts.
Raises RuntimeError on failure; polls every 2 seconds until done.
"""
# Start the query
response = athena.start_query_execution(
QueryString=sql,
QueryExecutionContext={'Database': database},
ResultConfiguration={'OutputLocation': OUTPUT_LOCATION},
WorkGroup=WORKGROUP
)
execution_id = response['QueryExecutionId']
print(f"Query started: {execution_id}")
# Poll for completion
while True:
status_resp = athena.get_query_execution(QueryExecutionId=execution_id)
state = status_resp['QueryExecution']['Status']['State']
if state == 'SUCCEEDED':
break
elif state in ('FAILED', 'CANCELLED'):
reason = status_resp['QueryExecution']['Status'].get('StateChangeReason', 'unknown')
raise RuntimeError(f"Athena query {state}: {reason}")
time.sleep(2)
# Retrieve results (paginated)
rows = []
paginator = athena.get_paginator('get_query_results')
headers = None
for page in paginator.paginate(QueryExecutionId=execution_id):
result_rows = page['ResultSet']['Rows']
if headers is None:
# First row of first page is the header
headers = [col['VarCharValue'] for col in result_rows[0]['Data']]
result_rows = result_rows[1:]
for row in result_rows:
values = [col.get('VarCharValue', '') for col in row['Data']]
rows.append(dict(zip(headers, values)))
print(f"Query returned {len(rows)} rows. Scanned: "
f"{status_resp['QueryExecution']['Statistics']['DataScannedInBytes'] / 1e9:.3f} GB")
return rows
# Example usage
if __name__ == '__main__':
results = run_athena_query("""
SELECT country_code, ROUND(SUM(revenue), 2) AS total_revenue
FROM clickstream
WHERE year = '2026' AND month = '06'
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10
""")
for row in results:
print(f"{row['country_code']:5s} ${row['total_revenue']}")
QueryExecutionId in DynamoDB and use a Step Functions wait state or EventBridge + Lambda trigger to poll for completion instead of a synchronous time.sleep loop. This avoids Lambda timeout issues on long-running queries and doesn't waste function invocation time waiting.Frequently Asked Questions
When should I use Athena vs Redshift?
Use Athena when your data lives primarily in S3, your query pattern is ad hoc or infrequent, and you don't want to manage a cluster. Athena has no upfront cost and scales automatically. Use Redshift when you need consistently fast sub-second query latency on large datasets, complex multi-table JOINs with many concurrent users, or BI tools that rely on a persistent JDBC connection. Redshift Serverless bridges the gap — it auto-scales but still maintains a Redshift engine, giving better performance for structured, repetitive workloads at a higher per-query cost than Athena.
What's the difference between Glue ETL and a Lambda function for data transformation?
Glue ETL is Apache Spark under the hood — it scales horizontally across multiple workers and is designed for large datasets (GBs to TBs). Lambda has a 15-minute timeout and 10 GB memory limit, making it suitable for transforming individual files or small payloads. Use Glue for bulk historical loads and recurring large-scale transformations; use Lambda for event-driven micro-transforms triggered by S3 events on individual files.
How do I handle schema evolution in Athena?
Parquet and ORC support schema evolution natively. If you add a new column to your data but not to your Athena table DDL, the new column is simply ignored by Athena. If you add the column to the Athena table with ALTER TABLE ADD COLUMNS, old files that don't have that column return NULL for it. Dropping or renaming columns is more complex — it's generally safer to create a new table version and use a view to abstract the schema change from consumers.
Can Athena query data in multiple AWS accounts?
Yes. Configure cross-account S3 bucket policies to allow the querying account's IAM principal read access to the source bucket, and optionally share the Glue Data Catalog via AWS RAM (Resource Access Manager). Athena can then query S3 data in other accounts as long as the IAM role running the query has s3:GetObject and s3:ListBucket permissions on the cross-account bucket.
What is the maximum query timeout for Athena?
Athena has a default query timeout of 30 minutes. This is adjustable per workgroup up to a maximum of 30 minutes for DML queries. For very long-running queries, consider splitting the query logic into smaller steps using CTAS intermediate tables, or move the computation to Glue ETL (no timeout constraint other than the job timeout you configure, up to 48 hours).