AWS Athena and Glue: Serverless Data Analytics

Amazon Athena and AWS Glue are the two pillars of serverless analytics on AWS. Athena lets you run SQL directly against files sitting in S3 — no cluster to manage, no data loading step, pay only for the bytes you scan. Glue sits alongside it as a managed ETL service and metadata catalog: crawlers discover your data, the Data Catalog stores schema, and Glue ETL jobs transform raw files into optimized Parquet at scale. Together they form a practical, low-overhead data lake that can scale from gigabytes to petabytes without you provisioning a single server. This guide covers everything you need to build, optimize, and govern a production Athena + Glue pipeline in 2026.

Athena Overview and Pricing
Creating Tables: DDL, Partitions, MSCK REPAIR
Analytical SQL Examples
Athena Best Practices: Columnar Formats and Partitioning
AWS Glue Data Catalog and Crawlers
Glue ETL Jobs: PySpark CSV to Parquet
Glue DataBrew: Visual Data Preparation
Lake Formation: Fine-Grained Access Control
Athena Federated Query
Cost Optimization Strategies
boto3: Running Athena Queries Programmatically
Frequently Asked Questions

Athena Overview and Pricing

Amazon Athena is an interactive query service built on top of Presto/Trino (the open-source distributed SQL engine). You point it at an S3 location, define a table schema in the Glue Data Catalog, and run ANSI SQL. No infrastructure to start, stop, or patch.

Dimension	Detail
Pricing	$5 per TB of data scanned. Only bytes read from S3 count — not bytes returned to you.
Minimum charge	10 MB per query (even if your result is 1 KB)
Supported formats	CSV, JSON, ORC, Parquet, Avro, TSV, Ion, custom SerDe
Compression	GZIP, Snappy, ZSTD, LZO, Brotli (format-dependent)
DDL operations	Free — CREATE TABLE, DROP TABLE, SHOW PARTITIONS don't scan data
Result storage	Query results written to a designated S3 output location; cached for 24h by default
Concurrency	Default 20 DML queries in parallel per account; request increase via Service Quotas

Key insight: At $5/TB, a query that scans 1 TB of raw CSV costs $5. The same query on Parquet with partition pruning might scan only 40 GB and cost $0.20 — a 25x cost reduction. Format and partitioning decisions are the most impactful optimization you can make.

Creating Tables: DDL, Partitions, MSCK REPAIR

Athena uses the Hive DDL syntax. Tables are external — the data stays in S3 and the schema lives in the Glue Data Catalog. Here is a realistic example for an e-commerce clickstream dataset partitioned by year, month, and day:

-- Create a partitioned external table pointing at Parquet files in S3
CREATE EXTERNAL TABLE clickstream (
  event_id      STRING,
  user_id       BIGINT,
  session_id    STRING,
  page          STRING,
  referrer      STRING,
  device_type   STRING,
  country_code  STRING,
  duration_ms   INT,
  revenue       DOUBLE
)
PARTITIONED BY (
  year  STRING,
  month STRING,
  day   STRING
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/clickstream/'
TBLPROPERTIES (
  'parquet.compress'='SNAPPY',
  'classification'='parquet'
);

-- After adding new partition folders to S3, make Athena discover them:
MSCK REPAIR TABLE clickstream;

-- Or add a specific partition manually (faster for a single new day):
ALTER TABLE clickstream ADD IF NOT EXISTS
  PARTITION (year='2026', month='06', day='06')
  LOCATION 's3://my-data-lake/clickstream/year=2026/month=06/day=06/';

Partition naming: Use the key=value folder naming convention (Hive-style). Example: s3://bucket/clickstream/year=2026/month=06/day=06/. Athena automatically maps folder segments to partition columns. If your S3 paths don't follow this convention, use explicit ALTER TABLE ADD PARTITION instead of MSCK REPAIR.

MSCK REPAIR TABLE scans all S3 prefixes under the table location and registers any unregistered partitions. On very large tables with thousands of partitions it can be slow — in that case, use Glue crawlers or the ALTER TABLE ADD PARTITION approach via AWS CLI or boto3.

Analytical SQL Examples

Athena supports most ANSI SQL including window functions, CTEs, UNNEST, JSON extraction, and approximate aggregations. Below are practical examples covering common analytics patterns.

-- 1. Revenue by country for June 2026 using partition pruning
--    Athena will only scan year=2026/month=06 partitions (not the full table)
SELECT
  country_code,
  COUNT(*)                              AS sessions,
  ROUND(SUM(revenue), 2)               AS total_revenue,
  ROUND(AVG(duration_ms) / 1000.0, 1) AS avg_session_sec
FROM clickstream
WHERE year = '2026' AND month = '06'
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 20;

-- 2. Rolling 7-day revenue using a window function
SELECT
  day,
  SUM(revenue) AS daily_revenue,
  SUM(SUM(revenue)) OVER (
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7d_revenue
FROM clickstream
WHERE year = '2026' AND month = '06'
GROUP BY day
ORDER BY day;

-- 3. Funnel analysis: sessions that hit /cart but not /checkout
WITH cart_sessions AS (
  SELECT DISTINCT session_id
  FROM clickstream
  WHERE year = '2026' AND month = '06'
    AND page = '/cart'
),
checkout_sessions AS (
  SELECT DISTINCT session_id
  FROM clickstream
  WHERE year = '2026' AND month = '06'
    AND page = '/checkout'
)
SELECT
  COUNT(c.session_id)                  AS reached_cart,
  COUNT(o.session_id)                  AS completed_checkout,
  ROUND(
    100.0 * COUNT(o.session_id) / NULLIF(COUNT(c.session_id), 0),
    2
  )                                    AS conversion_pct
FROM cart_sessions c
LEFT JOIN checkout_sessions o USING (session_id);

-- 4. CTAS: create an optimised summary table for a dashboard
--    Results written to S3 in Parquet — future queries hit only this small table
CREATE TABLE clickstream_monthly_summary
WITH (
  format           = 'PARQUET',
  parquet_compression = 'SNAPPY',
  partitioned_by   = ARRAY['year', 'month'],
  external_location = 's3://my-data-lake/summaries/clickstream_monthly/'
) AS
SELECT
  year,
  month,
  country_code,
  device_type,
  COUNT(*)           AS sessions,
  SUM(revenue)       AS revenue,
  AVG(duration_ms)   AS avg_duration_ms
FROM clickstream
GROUP BY year, month, country_code, device_type;

Approximate functions: For very large tables, use Athena's built-in HyperLogLog functions: approx_distinct(user_id) instead of COUNT(DISTINCT user_id). This is orders of magnitude faster and uses much less memory, with a typical error rate under 1%.

Athena Best Practices: Columnar Formats and Partitioning

The three biggest levers for reducing Athena cost and query time are columnar format, compression, and partitioning. They compound: use all three together.

Practice	Impact	Recommendation
Use Parquet or ORC	60—90% less data scanned vs CSV for typical analytical queries	Parquet for most cases; ORC for Hive-native workloads
Snappy compression	~3—5x size reduction vs uncompressed; faster decompression than GZIP	Default for Parquet/ORC. Use ZSTD for better ratio with acceptable CPU overhead.
Partition by query patterns	Eliminates entire S3 prefix scans	Always include date fields. Avoid high-cardinality partitions (e.g., user_id — millions of tiny files).
Target file sizes 128 MB—1 GB	Avoids the small-files problem (many tiny files = many S3 GET requests)	Use Glue ETL to compact small files during transformation
Workgroups	Cost control and query isolation	Create a workgroup per team; set data scanned per-query limit (e.g., 1 TB)
Result reuse	Free repeated reads of the same result	Enable query result reuse in workgroup settings (TTL up to 7 days)

Small files problem: If your S3 data contains thousands of files under 5 MB each, Athena must open each file individually, causing high S3 GET costs and slow queries regardless of format. Fix this by running a periodic Glue ETL compaction job that reads the small files and rewrites them as fewer, larger Parquet files.

AWS Glue Data Catalog and Crawlers

The Glue Data Catalog is a central metadata repository — a managed Apache Hive Metastore compatible with Athena, EMR, and Spark. You define databases and tables here; Athena reads schema from it automatically.

A Glue Crawler inspects your S3 data, infers schema (column names and types), and writes table definitions to the Catalog. This beats writing DDL by hand for data you don't fully control.

Setting up a crawler via AWS CLI:

# 1. Create an IAM role for the crawler (one-time setup)
aws iam create-role \
  --role-name GlueCrawlerRole \
  --assume-role-policy-document '{
    "Version":"2012-10-17",
    "Statement":[{"Effect":"Allow","Principal":{"Service":"glue.amazonaws.com"},"Action":"sts:AssumeRole"}]
  }'

aws iam attach-role-policy \
  --role-name GlueCrawlerRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Grant read access to your S3 bucket (inline policy)
aws iam put-role-policy \
  --role-name GlueCrawlerRole \
  --policy-name S3ReadClickstream \
  --policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Action":["s3:GetObject","s3:ListBucket"],
      "Resource":[
        "arn:aws:s3:::my-data-lake",
        "arn:aws:s3:::my-data-lake/*"
      ]
    }]
  }'

# 2. Create the Glue database
aws glue create-database \
  --database-input '{"Name":"ecommerce_lake","Description":"E-commerce data lake"}'

# 3. Create and run the crawler
aws glue create-crawler \
  --name clickstream-crawler \
  --role GlueCrawlerRole \
  --database-name ecommerce_lake \
  --targets '{"S3Targets":[{"Path":"s3://my-data-lake/clickstream/"}]}' \
  --configuration '{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' \
  --schedule 'cron(0 6 * * ? *)'   # run daily at 6 AM UTC

aws glue start-crawler --name clickstream-crawler

# 4. Check crawler status
aws glue get-crawler --name clickstream-crawler \
  --query 'Crawler.State'

Crawler scheduling: Crawlers support cron schedules or on-demand runs. For streaming data landing in S3 every hour, trigger the crawler via EventBridge rule + Lambda rather than a fixed cron, so new partitions appear in Athena within minutes of landing.

Glue ETL Jobs: PySpark CSV to Parquet

A Glue ETL job is a serverless Apache Spark script managed by AWS. You write it in PySpark or Python shell (for smaller datasets), specify the number of DPUs (Data Processing Units), and AWS handles cluster provisioning, job scheduling, and retries. Each DPU provides 4 vCPUs and 16 GB RAM.

The following job reads raw CSV clickstream data from S3, applies schema transformation, drops bad rows, and writes Parquet partitioned by date back to S3. It uses the Glue Dynamic Frame API which integrates with the Data Catalog automatically:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, IntegerType

# ----- Initialise Glue context -----
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_bucket', 'dest_bucket'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

SOURCE_PATH = f"s3://{args['source_bucket']}/raw/clickstream/"
DEST_PATH   = f"s3://{args['dest_bucket']}/clickstream/"

# ----- Read raw CSV from S3 via Glue Data Catalog -----
raw_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [SOURCE_PATH], "recurse": True},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"'},
    transformation_ctx="raw_dyf"
)

# Convert to Spark DataFrame for richer transformations
df = raw_dyf.toDF()

# ----- Schema cleanup and type casting -----
df = (df
    .withColumn("user_id",      F.col("user_id").cast("bigint"))
    .withColumn("duration_ms",  F.col("duration_ms").cast(IntegerType()))
    .withColumn("revenue",      F.col("revenue").cast(DoubleType()))
    .withColumn("event_time",   F.to_timestamp("event_time", "yyyy-MM-dd HH:mm:ss"))
    # Derive partition columns from the event timestamp
    .withColumn("year",         F.date_format("event_time", "yyyy"))
    .withColumn("month",        F.date_format("event_time", "MM"))
    .withColumn("day",          F.date_format("event_time", "dd"))
)

# ----- Data quality: drop rows missing mandatory fields -----
REQUIRED = ["event_id", "user_id", "session_id", "page", "event_time"]
df_clean = df.dropna(subset=REQUIRED)

bad_count = df.count() - df_clean.count()
print(f"Dropped {bad_count} rows with null mandatory fields")

# ----- Write Parquet partitioned by year/month/day -----
# Use coalesce to avoid the small-files problem within each partition
output_dyf = DynamicFrame.fromDF(df_clean, glueContext, "output_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=output_dyf,
    connection_type="s3",
    connection_options={
        "path": DEST_PATH,
        "partitionKeys": ["year", "month", "day"]
    },
    format="glueparquet",
    format_options={"compression": "snappy"},
    transformation_ctx="write_parquet"
)

job.commit()
print("ETL job complete. Parquet data written to", DEST_PATH)

Deploy this script by uploading it to S3 and creating the job via CLI or console:

# Upload script to S3
aws s3 cp clickstream_etl.py s3://my-scripts-bucket/glue/clickstream_etl.py

# Create the Glue ETL job
aws glue create-job \
  --name clickstream-csv-to-parquet \
  --role GlueCrawlerRole \
  --command '{
    "Name": "glueetl",
    "ScriptLocation": "s3://my-scripts-bucket/glue/clickstream_etl.py",
    "PythonVersion": "3"
  }' \
  --default-arguments '{
    "--source_bucket": "my-data-lake",
    "--dest_bucket":   "my-data-lake",
    "--job-bookmark-option": "job-bookmark-enable",
    "--enable-metrics": "",
    "--enable-continuous-cloudwatch-log": "true"
  }' \
  --glue-version "4.0" \
  --number-of-workers 10 \
  --worker-type G.1X \
  --timeout 120

# Start a run
aws glue start-job-run --job-name clickstream-csv-to-parquet

Job Bookmarks: Enable --job-bookmark-option job-bookmark-enable so Glue tracks which S3 files it has already processed. On the next run it automatically skips them and only processes new files. This is the Glue equivalent of checkpointing in Spark Structured Streaming.

Glue DataBrew: Visual Data Preparation

AWS Glue DataBrew is a visual, no-code data preparation tool aimed at analysts who prefer a spreadsheet-style interface over writing PySpark. You connect DataBrew to an S3 location or Glue Catalog table, see a sample of the data, and apply transformations from a menu (rename columns, parse dates, remove duplicates, standardize casing, flag outliers, etc.). DataBrew records every transformation as a Recipe — a reusable, versionable list of steps that can be published as a Glue job and run on a schedule.

DataBrew is particularly useful for:

One-off exploratory data cleaning before building a formal ETL pipeline
Business analysts who need to prepare data without help from engineering
Profiling datasets — DataBrew generates statistics (null rates, cardinality, distribution histograms) across all columns in a single click
PII detection — DataBrew can automatically flag columns that look like email addresses, phone numbers, SSNs, and credit card numbers

Pricing note: DataBrew charges $1 per interactive session (30-minute blocks) and $0.48 per DPU-hour for job runs. It is more expensive per DPU-hour than raw Glue ETL, but since it requires no coding the total cost including developer time is often lower for one-off work.

Lake Formation: Fine-Grained Access Control

AWS Lake Formation sits on top of the Glue Data Catalog and IAM to provide column-level and row-level security for your data lake. Without Lake Formation, access control is done entirely via S3 bucket policies and IAM — meaning you can grant or deny access to entire S3 prefixes but not individual columns or filtered rows. Lake Formation lets you say things like "the analyst role can query the clickstream table but cannot see the user_id column" or "the EU team can only see rows where country_code is in ('DE','FR','ES')".

Lake Formation Permission	Scope
Database permissions	CREATE TABLE, DROP, ALTER on the whole database
Table permissions	SELECT, INSERT, DELETE on a specific table
Column permissions	SELECT on specific columns only (column-level security)
Row filters	SQL WHERE clause applied automatically to every query (row-level security)
Data cell filters	Combine column + row restrictions in one policy
Tag-based access (LF-Tags)	Attach sensitivity tags to tables/columns; grant access by tag rather than by resource ARN

Enabling Lake Formation: By default, the Glue Catalog uses IAM-only access control. To switch to Lake Formation permissions, go to the Lake Formation console, set the Data Catalog settings to use Lake Formation permissions, then grant permissions explicitly to each principal. Once enabled, IAM policies alone are no longer sufficient — both IAM and Lake Formation must allow the action.

Athena Federated Query

Athena Federated Query extends Athena beyond S3. Using Lambda-based connectors, Athena can query RDS, Aurora, DynamoDB, Redis, DocumentDB, OpenSearch, and custom JDBC sources — all from the same SQL interface, without ETL or data movement.

Each connector is a Lambda function that implements the Athena Query Federation SDK. AWS publishes pre-built connectors in the Serverless Application Repository. You deploy the connector Lambda, register it as an Athena data source, and then reference it in queries using a catalog prefix:

-- After deploying the DynamoDB connector as catalog "dynamo_catalog":
-- Join S3-backed Athena table with a live DynamoDB table in one query
SELECT
  c.user_id,
  c.country_code,
  c.total_spend,
  u.email,
  u.signup_date
FROM
  -- S3 / Glue Catalog table (the default AwsDataCatalog)
  "AwsDataCatalog"."ecommerce_lake"."clickstream_monthly_summary" c
  -- DynamoDB table exposed through the federated connector
  JOIN "dynamo_catalog"."ecommerce_lake"."users" u
    ON c.user_id = CAST(u.user_id AS BIGINT)
WHERE
  c.year  = '2026'
  AND c.month = '06'
  AND c.total_spend > 500
ORDER BY c.total_spend DESC;

When to use Federated Query: It is ideal for dashboards that need fresh transactional data alongside historical S3 data without running a full ETL cycle. However, pushing large scan predicates to DynamoDB is expensive. Always add specific key filters (WHERE partition_key = X) so the DynamoDB connector uses key-based lookups rather than a table scan.

Cost Optimization Strategies

Because Athena charges per byte scanned, every optimization that reduces bytes read directly reduces your bill. Here is a prioritized checklist:

Convert CSV/JSON to Parquet with Snappy compression first. This alone typically cuts scan costs by 70—90% because Parquet's columnar layout means a SELECT on 3 of 20 columns reads only those 3 columns' data blocks.
Partition your data by the columns most commonly in your WHERE clause — usually date fields. A query with WHERE year='2026' AND month='06' on a daily-partitioned table skips 11/12 of the data immediately.
Use CTAS for hot queries. If a dashboard runs the same aggregation 100 times a day, run it once as a CTAS to write a small summary table, then point the dashboard at that summary. Scans drop from TB to MB.
Enable query result reuse. Athena caches query results for up to 7 days. Identical queries (same SQL + same underlying data) return from cache instantly at zero cost. Enable this per workgroup.
Set data scan limits on workgroups. Prevent runaway queries by configuring a per-query or per-workgroup data limit in the Athena workgroup settings.
Right-size Parquet row group sizes. The default 128 MB row group is optimal for most cases. If your queries filter on a column that is not a partition key, consider sorting the data on that column before writing Parquet so Athena's predicate pushdown can skip row groups efficiently.
Compact small files periodically. Schedule a daily Glue job to merge files smaller than 32 MB into larger ones. The S3 GET overhead for thousands of small files can dwarf the actual data scan cost.

boto3: Running Athena Queries Programmatically

For application code that needs to trigger Athena queries and retrieve results — for example, a nightly reporting Lambda — use the boto3 athena client. The Athena API is asynchronous: you start a query, poll for completion, then fetch the result from S3.

import boto3
import time
import csv
import io

athena = boto3.client('athena', region_name='us-east-1')
s3 = boto3.client('s3')

WORKGROUP       = 'analytics'
OUTPUT_LOCATION = 's3://my-query-results/athena/'
DATABASE        = 'ecommerce_lake'

def run_athena_query(sql: str, database: str = DATABASE) -> list[dict]:
    """
    Execute an Athena query synchronously and return results as a list of dicts.
    Raises RuntimeError on failure; polls every 2 seconds until done.
    """
    # Start the query
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': OUTPUT_LOCATION},
        WorkGroup=WORKGROUP
    )
    execution_id = response['QueryExecutionId']
    print(f"Query started: {execution_id}")

    # Poll for completion
    while True:
        status_resp = athena.get_query_execution(QueryExecutionId=execution_id)
        state = status_resp['QueryExecution']['Status']['State']

        if state == 'SUCCEEDED':
            break
        elif state in ('FAILED', 'CANCELLED'):
            reason = status_resp['QueryExecution']['Status'].get('StateChangeReason', 'unknown')
            raise RuntimeError(f"Athena query {state}: {reason}")

        time.sleep(2)

    # Retrieve results (paginated)
    rows = []
    paginator = athena.get_paginator('get_query_results')
    headers = None

    for page in paginator.paginate(QueryExecutionId=execution_id):
        result_rows = page['ResultSet']['Rows']
        if headers is None:
            # First row of first page is the header
            headers = [col['VarCharValue'] for col in result_rows[0]['Data']]
            result_rows = result_rows[1:]

        for row in result_rows:
            values = [col.get('VarCharValue', '') for col in row['Data']]
            rows.append(dict(zip(headers, values)))

    print(f"Query returned {len(rows)} rows. Scanned: "
          f"{status_resp['QueryExecution']['Statistics']['DataScannedInBytes'] / 1e9:.3f} GB")
    return rows


# Example usage
if __name__ == '__main__':
    results = run_athena_query("""
        SELECT country_code, ROUND(SUM(revenue), 2) AS total_revenue
        FROM clickstream
        WHERE year = '2026' AND month = '06'
        GROUP BY country_code
        ORDER BY total_revenue DESC
        LIMIT 10
    """)

    for row in results:
        print(f"{row['country_code']:5s}  ${row['total_revenue']}")

Production tip: For Lambda functions, store the QueryExecutionId in DynamoDB and use a Step Functions wait state or EventBridge + Lambda trigger to poll for completion instead of a synchronous time.sleep loop. This avoids Lambda timeout issues on long-running queries and doesn't waste function invocation time waiting.

Frequently Asked Questions

When should I use Athena vs Redshift?

Use Athena when your data lives primarily in S3, your query pattern is ad hoc or infrequent, and you don't want to manage a cluster. Athena has no upfront cost and scales automatically. Use Redshift when you need consistently fast sub-second query latency on large datasets, complex multi-table JOINs with many concurrent users, or BI tools that rely on a persistent JDBC connection. Redshift Serverless bridges the gap — it auto-scales but still maintains a Redshift engine, giving better performance for structured, repetitive workloads at a higher per-query cost than Athena.

What's the difference between Glue ETL and a Lambda function for data transformation?

Glue ETL is Apache Spark under the hood — it scales horizontally across multiple workers and is designed for large datasets (GBs to TBs). Lambda has a 15-minute timeout and 10 GB memory limit, making it suitable for transforming individual files or small payloads. Use Glue for bulk historical loads and recurring large-scale transformations; use Lambda for event-driven micro-transforms triggered by S3 events on individual files.

How do I handle schema evolution in Athena?

Parquet and ORC support schema evolution natively. If you add a new column to your data but not to your Athena table DDL, the new column is simply ignored by Athena. If you add the column to the Athena table with ALTER TABLE ADD COLUMNS, old files that don't have that column return NULL for it. Dropping or renaming columns is more complex — it's generally safer to create a new table version and use a view to abstract the schema change from consumers.

Can Athena query data in multiple AWS accounts?

Yes. Configure cross-account S3 bucket policies to allow the querying account's IAM principal read access to the source bucket, and optionally share the Glue Data Catalog via AWS RAM (Resource Access Manager). Athena can then query S3 data in other accounts as long as the IAM role running the query has s3:GetObject and s3:ListBucket permissions on the cross-account bucket.

What is the maximum query timeout for Athena?

Athena has a default query timeout of 30 minutes. This is adjustable per workgroup up to a maximum of 30 minutes for DML queries. For very long-running queries, consider splitting the query logic into smaller steps using CTAS intermediate tables, or move the computation to Glue ETL (no timeout constraint other than the job timeout you configure, up to 48 hours).

AWS Athena and Glue: Serverless Data Analytics

Table of Contents

Athena Overview and Pricing

Creating Tables: DDL, Partitions, MSCK REPAIR

Analytical SQL Examples

Athena Best Practices: Columnar Formats and Partitioning

AWS Glue Data Catalog and Crawlers

Glue ETL Jobs: PySpark CSV to Parquet

Glue DataBrew: Visual Data Preparation

Lake Formation: Fine-Grained Access Control

Athena Federated Query

Cost Optimization Strategies

boto3: Running Athena Queries Programmatically

Frequently Asked Questions

When should I use Athena vs Redshift?

What's the difference between Glue ETL and a Lambda function for data transformation?

How do I handle schema evolution in Athena?

Can Athena query data in multiple AWS accounts?

What is the maximum query timeout for Athena?

Read Next

AWS Kinesis: Real-Time Data Streaming

AWS S3 Complete Tutorial

AWS Articles