AWS Lake Formation: Build a Secure Data Lake in Hours
Every organisation reaching petabyte scale eventually hits the same wall: S3 buckets full of data that nobody trusts, IAM policies too coarse to enforce compliance, and analysts waiting weeks for data access approvals. AWS Lake Formation was purpose-built to solve all three problems. It sits as a security and governance layer on top of Amazon S3, the AWS Glue Data Catalog, and IAM — giving you column-level security, row-level filters, ACID-compliant Governed Tables, and one-click blueprints for common ingestion patterns. This guide walks through every layer, from initial setup to cross-account sharing, with real CLI commands, Python boto3 code, SQL examples, and IAM JSON you can paste directly into your environment.
Table of Contents
- Data Lake vs Data Warehouse vs Data Lakehouse
- Lake Formation Architecture
- Setting Up Lake Formation: Admin, S3 Registration, and First Database
- Data Ingestion: Crawlers, Firehose, and DMS
- Fine-Grained Permissions: Column-Level and Row-Level Security
- Governed Tables: ACID Transactions and Automatic Compaction
- Querying with Athena: Partition Projection, SQL Examples, and Cost Tips
- Cross-Account Data Sharing with RAM and Lake Formation
- Redshift Spectrum, EMR, and QuickSight Integration
- Blueprints: One-Click Data Lake Setup
- Frequently Asked Questions
Data Lake vs Data Warehouse vs Data Lakehouse
Before diving into Lake Formation specifics, it is worth being precise about the storage paradigms it governs. Teams misapply these terms constantly, leading to wrong architectural choices and wasted budget.
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage | Raw files on object storage (S3, GCS, ADLS) | Proprietary columnar blocks inside the engine (Redshift, Snowflake, BigQuery) | Open table format (Delta Lake, Apache Iceberg, Apache Hudi) on object storage |
| Schema | Schema-on-read — defined at query time | Schema-on-write — enforced at insert time | Schema-on-write enforced by the table format, stored on open storage |
| ACID | No native transactions; S3 eventual consistency | Full ACID via the warehouse engine | Full ACID via the table format (Iceberg snapshots, Delta transaction log) |
| Query performance | Good with Parquet + partitioning; no statistics | Excellent — pre-computed statistics, zone maps, result caching | Excellent — file-level statistics, data skipping, Z-order clustering |
| Cost | Lowest — S3 at $0.023/GB/month | Higher — compute + managed storage bundle | Low storage (S3) + pay-per-query compute (Athena, EMR, Spark) |
| Ideal for | Raw archive, ML feature store, cheap long-term retention | BI dashboards, sub-second OLAP, high-concurrency reporting | Unified analytical + ML workloads, GDPR delete requirements, streaming upserts |
| AWS implementation | S3 + Glue Catalog + Athena | Amazon Redshift | S3 + Glue Catalog + Athena (Iceberg) + Lake Formation |
AWS recommends the data lakehouse pattern for new projects in 2026: store data in open table formats (Iceberg or Hudi) on S3, govern it with Lake Formation, and choose the right compute engine per workload — Athena for ad hoc SQL, EMR for large-scale Spark, Redshift Spectrum for BI dashboards, and SageMaker for ML. This avoids vendor lock-in while delivering warehouse-grade performance and ACID guarantees.
Lake Formation Architecture
Lake Formation is not a storage service. It is a metadata + permissions overlay that sits between principals (IAM roles, users, SAML-federated identities) and data resources (Glue Catalog databases, tables, columns). Understanding the layers prevents hours of debugging mysterious "Access Denied" errors.
The four layers in order from bottom to top:
- Amazon S3 — The actual bytes live here. S3 bucket policies still apply but Lake Formation adds a service-level check on top.
- AWS Glue Data Catalog — The metadata layer. Databases and tables defined here point at S3 locations. Lake Formation permissions are expressed in terms of these catalog objects, not raw S3 ARNs.
- Lake Formation permissions — A separate permission model layered on top of IAM. For a query to succeed, both the IAM policy AND the Lake Formation grant must allow the action. Think of it as a second firewall.
- Compute engines — Athena, Redshift Spectrum, EMR, Glue ETL. These engines query the Glue Catalog; Lake Formation intercepts the metadata call and enforces column/row filters before returning metadata to the engine.
lakeformation:GetDataAccess or the S3 bucket policy is too restrictive.The architecture for a production data lake looks like this:
┌─────────────────────────────────────────────────────────────┐
│ Principals: IAM Roles, SSO Users, SAML-federated Groups │
└──────────────────────────┬──────────────────────────────────┘
│ query / API call
┌──────────────────────────▼──────────────────────────────────┐
│ Compute Engines │
│ Athena · Redshift Spectrum · EMR · Glue ETL · QuickSight │
└──────────────────────────┬──────────────────────────────────┘
│ GetTable / GetPartitions
┌──────────────────────────▼──────────────────────────────────┐
│ AWS Lake Formation (permission enforcement layer) │
│ · LF-Tag based policies · Column filters │
│ · Row filters (data cell filters) · Cross-account grants │
└──────────────────────────┬──────────────────────────────────┘
│ filtered metadata + S3 credentials
┌──────────────────────────▼──────────────────────────────────┐
│ AWS Glue Data Catalog │
│ Databases · Tables · Partitions · Schemas │
└──────────────────────────┬──────────────────────────────────┘
│ s3:GetObject
┌──────────────────────────▼──────────────────────────────────┐
│ Amazon S3 │
│ s3://company-data-lake/raw/ s3://company-data-lake/curated│
└─────────────────────────────────────────────────────────────┘
Setting Up Lake Formation: Admin, S3 Registration, and First Database
Initial Lake Formation setup has three steps: designate a Lake Formation administrator, register your S3 locations, and create a Glue database under Lake Formation governance. Everything after that builds on this foundation.
Step 1 — Designate a Lake Formation Administrator
The Lake Formation administrator is an IAM principal who can grant permissions to other principals. There is no default admin — you must set one explicitly, or no one can grant table access.
# Add an IAM role as Lake Formation administrator
aws lakeformation put-data-lake-settings \
--data-lake-settings '{
"DataLakeAdmins": [
{"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataLakeAdminRole"}
],
"CreateDatabaseDefaultPermissions": [],
"CreateTableDefaultPermissions": []
}' \
--region us-east-1
IAMAllowedPrincipals (essentially "any principal allowed by IAM") SELECT on all new databases and tables. Set CreateDatabaseDefaultPermissions and CreateTableDefaultPermissions to empty arrays as shown above. Otherwise Lake Formation adds no real protection — IAM alone would still control access, which defeats the purpose of enabling fine-grained governance.Step 2 — Register S3 Locations
Lake Formation must know which S3 paths it governs. Registering a location tells Lake Formation to use its own service-linked role (or a custom role you specify) for S3 access — the data access credentials that compute engines receive come from Lake Formation, not directly from the user's IAM role.
# Register the S3 location (Lake Formation will use its service-linked role)
aws lakeformation register-resource \
--resource-arn arn:aws:s3:::company-data-lake \
--use-service-linked-role \
--region us-east-1
# Verify the registration
aws lakeformation list-resources \
--region us-east-1 \
--query 'ResourceInfoList[].{ARN:ResourceArn,RoleArn:RoleArn}'
If you prefer a custom IAM role (for cross-account scenarios or tighter control), replace --use-service-linked-role with --role-arn arn:aws:iam::123456789012:role/LakeFormationS3Role and attach this policy to that role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::company-data-lake",
"arn:aws:s3:::company-data-lake/*"
]
}
]
}
Step 3 — Create a Database and Grant Permissions
# Create the Glue database
aws glue create-database \
--database-input '{
"Name": "ecommerce_lake",
"Description": "E-commerce data lake governed by Lake Formation"
}' \
--region us-east-1
# Grant the data engineering role full access to the database
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/DataEngineeringRole \
--permissions CREATE_TABLE DROP ALTER \
--resource '{"Database":{"Name":"ecommerce_lake"}}' \
--region us-east-1
# Grant the analyst role SELECT on all tables in the database (with grant option)
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/AnalystRole \
--permissions SELECT \
--permissions-with-grant-option SELECT \
--resource '{"Table":{"DatabaseName":"ecommerce_lake","TableWildcard":{}}}' \
--region us-east-1
Data Ingestion: Crawlers, Firehose, and DMS
A data lake is only as good as the data flowing into it. Lake Formation does not handle ingestion directly — that is the job of Glue crawlers, Kinesis Firehose, and AWS DMS. What Lake Formation provides is automatic governance of the tables those services create.
Glue Crawlers — Automatic Schema Discovery
Glue crawlers inspect S3 data, infer schema, and register tables in the Glue Data Catalog. When you enable Lake Formation on a database, any table a crawler creates in that database automatically falls under Lake Formation governance.
# Create a crawler that writes to the Lake Formation-governed database
aws glue create-crawler \
--name orders-crawler \
--role arn:aws:iam::123456789012:role/GlueCrawlerRole \
--database-name ecommerce_lake \
--targets '{"S3Targets":[{"Path":"s3://company-data-lake/raw/orders/"}]}' \
--schema-change-policy '{"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"LOG"}' \
--configuration '{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}' \
--region us-east-1
aws glue start-crawler --name orders-crawler --region us-east-1
lakeformation:GetDataAccess permission when Lake Formation is enabled. Without it, the crawler can list S3 objects but cannot register the table metadata in a Lake Formation-governed database. Add lakeformation:GetDataAccess and glue:* permissions to the crawler role.Kinesis Firehose — Streaming Data to S3
For real-time data (clickstream events, IoT telemetry, application logs), Kinesis Data Firehose writes data directly to S3 in the format and partitioning you specify. Combine it with a Glue crawler triggered by EventBridge to keep the Lake Formation catalog current within minutes of new data landing.
# Create a Firehose delivery stream writing Parquet to the data lake
aws firehose create-delivery-stream \
--delivery-stream-name clickstream-to-lake \
--delivery-stream-type DirectPut \
--extended-s3-destination-configuration '{
"RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
"BucketARN": "arn:aws:s3:::company-data-lake",
"Prefix": "raw/clickstream/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
"ErrorOutputPrefix": "errors/clickstream/",
"BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
"CompressionFormat": "UNCOMPRESSED",
"DataFormatConversionConfiguration": {
"Enabled": true,
"InputFormatConfiguration": {
"Deserializer": {"OpenXJsonSerDe": {}}
},
"OutputFormatConfiguration": {
"Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
},
"SchemaConfiguration": {
"RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
"DatabaseName": "ecommerce_lake",
"TableName": "clickstream",
"Region": "us-east-1",
"VersionId": "LATEST"
}
}
}' \
--region us-east-1
AWS DMS — Migrating RDS Databases to the Lake
AWS Database Migration Service (DMS) lets you continuously replicate an RDS or Aurora database into S3 in Parquet format, feeding the data lake with transactional records in near real-time via Change Data Capture (CDC).
import boto3
dms = boto3.client('dms', region_name='us-east-1')
# Create an S3 target endpoint for the data lake
response = dms.create_endpoint(
EndpointIdentifier='orders-db-lake-target',
EndpointType='target',
EngineName='s3',
S3Settings={
'BucketName': 'company-data-lake',
'BucketFolder': 'raw/orders_cdc',
'ServiceAccessRoleArn': 'arn:aws:iam::123456789012:role/DMSAccessRole',
'DataFormat': 'parquet',
'ParquetVersion': 'parquet-2-0',
'CompressionType': 'SNAPPY',
'CdcPath': 'cdc/orders',
'DatePartitionEnabled': True,
'DatePartitionSequence': 'YYYYMMDD',
'AddColumnName': True,
'PreserveTransactions': True, # retain CDC op type (I/U/D)
}
)
print("Target endpoint ARN:", response['Endpoint']['EndpointArn'])
Fine-Grained Permissions: Column-Level and Row-Level Security
This is Lake Formation's killer feature. Without it, IAM can only grant or deny access to an entire S3 prefix — you cannot hide the salary column from junior analysts or restrict European data to EU-based roles. Lake Formation makes both trivially easy.
Column-Level Security
Grant SELECT on specific columns only. When an analyst queries the table, Lake Formation rewrites the metadata response so the excluded columns are invisible — they do not appear in SHOW COLUMNS, DESCRIBE, or query results.
# Grant the analyst role SELECT on all columns EXCEPT email and salary
# First, grant SELECT on all columns
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/AnalystRole \
--permissions SELECT \
--resource '{
"TableWithColumns": {
"DatabaseName": "ecommerce_lake",
"Name": "employees",
"ColumnNames": ["employee_id","department","title","hire_date","performance_score"]
}
}' \
--region us-east-1
# The analyst now sees ONLY the 5 listed columns — email and salary are invisible
You can also use column wildcards with exclusions, which is more maintainable when tables have many columns:
# Grant access to ALL columns EXCEPT the two PII columns
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/AnalystRole \
--permissions SELECT \
--resource '{
"TableWithColumns": {
"DatabaseName": "ecommerce_lake",
"Name": "employees",
"ColumnWildcard": {
"ExcludedColumnNames": ["email", "salary"]
}
}
}' \
--region us-east-1
Row-Level Security — Data Cell Filters
A data cell filter combines a row predicate (SQL WHERE clause) with an optional column list. When a principal queries the table through Athena or Redshift Spectrum, Lake Formation transparently applies the filter so they only ever see their allowed rows — even if they write a query with no WHERE clause.
# Create a data cell filter: EU team sees only EU/UK rows, and not the salary column
aws lakeformation create-data-cells-filter \
--table-data '{
"TableCatalogId": "123456789012",
"DatabaseName": "ecommerce_lake",
"TableName": "employees",
"Name": "eu_region_filter",
"RowFilter": {
"FilterExpression": "region IN ('"'"'EU'"'"', '"'"'UK'"'"')"
},
"ColumnWildcard": {
"ExcludedColumnNames": ["salary"]
}
}' \
--region us-east-1
# Grant the EU team role access via this data cell filter
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/EUTeamRole \
--permissions SELECT \
--resource '{
"DataCellsFilter": {
"TableCatalogId": "123456789012",
"DatabaseName": "ecommerce_lake",
"TableName": "employees",
"Name": "eu_region_filter"
}
}' \
--region us-east-1
LF-Tag Based Access Control
Managing permissions by resource ARN becomes unmanageable at scale — if you have 500 tables and 30 IAM roles, that is potentially 15,000 individual grants. LF-Tags flip the model: attach sensitivity tags to tables and columns, then grant access by tag. When a new table is tagged sensitivity=confidential, all roles granted access to that tag level automatically can (or cannot) access the new table.
# 1. Create LF-Tag keys and values
aws lakeformation create-lf-tag \
--tag-key "sensitivity" \
--tag-values '["public","internal","confidential","restricted"]' \
--region us-east-1
aws lakeformation create-lf-tag \
--tag-key "domain" \
--tag-values '["ecommerce","finance","hr","marketing"]' \
--region us-east-1
# 2. Attach tags to the table
aws lakeformation add-lf-tags-to-resource \
--resource '{"Table":{"DatabaseName":"ecommerce_lake","Name":"employees"}}' \
--lf-tags '[
{"TagKey":"sensitivity","TagValues":["confidential"]},
{"TagKey":"domain","TagValues":["hr"]}
]' \
--region us-east-1
# 3. Grant the HR analytics role access to all hr-domain tables tagged internal or confidential
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/HRAnalyticsRole \
--permissions SELECT DESCRIBE \
--resource '{
"LFTagPolicy": {
"ResourceType": "TABLE",
"Expression": [
{"TagKey":"domain","TagValues":["hr"]},
{"TagKey":"sensitivity","TagValues":["internal","confidential"]}
]
}
}' \
--region us-east-1
domain=hr and sensitivity=confidential, every role with LF-Tag grants on those values automatically gets the configured access. No manual per-table grants needed. This is the recommended governance model for organizations with 50+ tables.Governed Tables: ACID Transactions and Automatic Compaction
Standard S3/Parquet data lakes have a painful limitation: no atomic updates. If you need to delete a user's records for GDPR compliance, or upsert CDC events from a relational database, you either rebuild entire partitions or accept partial reads during updates. Lake Formation Governed Tables solve this with ACID guarantees on top of S3.
Governed Tables use a transaction log stored in the Lake Formation service. Writes are staged; a commit makes them atomically visible. Concurrent readers always see a consistent snapshot. Automatic compaction merges small files in the background without downtime.
Creating a Governed Table
import boto3
lf = boto3.client('lakeformation', region_name='us-east-1')
glue = boto3.client('glue', region_name='us-east-1')
# Create the Glue table with StorageDescriptor pointing to S3
# Then enable Governed Table status through Lake Formation
table_input = {
"Name": "orders_governed",
"StorageDescriptor": {
"Columns": [
{"Name": "order_id", "Type": "string"},
{"Name": "customer_id","Type": "bigint"},
{"Name": "order_date", "Type": "date"},
{"Name": "total_usd", "Type": "double"},
{"Name": "status", "Type": "string"},
{"Name": "region", "Type": "string"}
],
"Location": "s3://company-data-lake/governed/orders/",
"InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
"OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
}
},
"PartitionKeys": [
{"Name": "order_year", "Type": "string"},
{"Name": "order_month", "Type": "string"}
],
"TableType": "GOVERNED",
"Parameters": {
"classification": "parquet",
"GOVERNED": "true"
}
}
glue.create_table(DatabaseName='ecommerce_lake', TableInput=table_input)
print("Governed Table created.")
Transactional Reads and Writes with boto3
import boto3
import time
lf = boto3.client('lakeformation', region_name='us-east-1')
s3 = boto3.client('s3', region_name='us-east-1')
DATABASE = 'ecommerce_lake'
TABLE = 'orders_governed'
def write_governed_objects(s3_paths: list[str], transaction_id: str):
"""Register newly written S3 objects with an open Lake Formation transaction."""
objects = [{"Uri": path, "ETag": get_etag(path)} for path in s3_paths]
lf.update_table_objects(
DatabaseName=DATABASE,
TableName=TABLE,
TransactionId=transaction_id,
WriteOperations=[
{"AddObject": obj} for obj in objects
]
)
def get_etag(s3_uri: str) -> str:
bucket, key = s3_uri.replace("s3://", "").split("/", 1)
return s3.head_object(Bucket=bucket, Key=key)['ETag'].strip('"')
# ---- Transactional write: start → write → commit ----
txn = lf.start_transaction(TransactionType='WRITE')
txn_id = txn['TransactionId']
print(f"Transaction started: {txn_id}")
try:
# Write Parquet files to S3 here (e.g., using PyArrow or Glue job output)
# For demonstration, assume files already exist:
new_files = [
"s3://company-data-lake/governed/orders/order_year=2026/order_month=06/part-000.parquet",
"s3://company-data-lake/governed/orders/order_year=2026/order_month=06/part-001.parquet"
]
write_governed_objects(new_files, txn_id)
lf.commit_transaction(TransactionId=txn_id)
print("Transaction committed — data is now visible to readers.")
except Exception as e:
lf.cancel_transaction(TransactionId=txn_id)
print(f"Transaction cancelled: {e}")
raise
# ---- GDPR delete: remove specific objects in a transaction ----
delete_txn = lf.start_transaction(TransactionType='WRITE')
delete_txn_id = delete_txn['TransactionId']
# Get the current object list for the partition
objects = lf.get_table_objects(
DatabaseName=DATABASE,
TableName=TABLE,
PartitionPredicate="order_year='2026' AND order_month='05'"
)
files_to_remove = [
obj['Object']
for page in [objects]
for obj in page.get('Objects', [])
if obj['Object']['Uri'].endswith('customer_99999.parquet')
]
if files_to_remove:
lf.update_table_objects(
DatabaseName=DATABASE,
TableName=TABLE,
TransactionId=delete_txn_id,
WriteOperations=[
{"DeleteObject": f} for f in files_to_remove
]
)
lf.commit_transaction(TransactionId=delete_txn_id)
print("GDPR delete committed.")
else:
lf.cancel_transaction(TransactionId=delete_txn_id)
aws lakeformation start-query-planning with a compaction request. Compaction never blocks readers — it creates new files and atomically swaps the object list in the transaction log.Querying with Athena: Partition Projection, SQL Examples, and Cost Tips
Athena is the primary SQL interface for data in a Lake Formation-governed lake. When Lake Formation permissions are active, Athena checks both the IAM policy and the Lake Formation grant before returning results. Column and row filters are applied server-side — the analyst's compute node never receives filtered data.
Partition Projection — Eliminate Partition Metadata Calls
Standard partitioned tables require Athena to call the Glue Data Catalog to list partitions before executing a query. On tables with thousands of partitions this adds noticeable latency and Glue API calls. Partition projection encodes the partition scheme directly in the table properties so Athena computes partition paths mathematically without any metadata lookup.
-- Create a table with partition projection enabled
-- No MSCK REPAIR TABLE needed — Athena computes S3 paths from the date range
CREATE EXTERNAL TABLE orders_projected (
order_id STRING,
customer_id BIGINT,
total_usd DOUBLE,
status STRING
)
PARTITIONED BY (
order_date DATE
)
STORED AS PARQUET
LOCATION 's3://company-data-lake/curated/orders/'
TBLPROPERTIES (
-- Enable partition projection
'projection.enabled' = 'true',
'projection.order_date.type' = 'date',
'projection.order_date.range' = '2023-01-01,NOW',
'projection.order_date.format' = 'yyyy-MM-dd',
'projection.order_date.interval' = '1',
'projection.order_date.interval.unit' = 'DAYS',
-- Map partition key to S3 path
'storage.location.template' = 's3://company-data-lake/curated/orders/order_date=${order_date}/',
'parquet.compress' = 'SNAPPY',
'classification' = 'parquet'
);
Real SQL Examples on a Lake Formation Table
-- 1. Revenue by region for Q2 2026 (partition projection skips non-matching days)
SELECT
region,
COUNT(DISTINCT customer_id) AS unique_customers,
ROUND(SUM(total_usd), 2) AS revenue,
ROUND(AVG(total_usd), 2) AS avg_order_value
FROM orders_projected
WHERE order_date BETWEEN DATE '2026-04-01' AND DATE '2026-06-30'
GROUP BY region
ORDER BY revenue DESC;
-- 2. Month-over-month revenue growth using LAG window function
WITH monthly AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_usd) AS revenue
FROM orders_projected
WHERE order_date >= DATE '2025-01-01'
GROUP BY 1
)
SELECT
month,
ROUND(revenue, 2) AS revenue,
ROUND(revenue - LAG(revenue) OVER (ORDER BY month), 2) AS mom_delta,
ROUND(
100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 1
) AS mom_pct
FROM monthly
ORDER BY month;
-- 3. CTAS: create a customer summary table in Parquet for the BI dashboard
CREATE TABLE customer_ltv_summary
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location = 's3://company-data-lake/curated/customer_ltv/'
) AS
SELECT
customer_id,
COUNT(*) AS total_orders,
ROUND(SUM(total_usd), 2) AS lifetime_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date,
ROUND(AVG(total_usd), 2) AS avg_order_value,
MAX(status) AS latest_status
FROM orders_projected
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_id
HAVING COUNT(*) >= 2;
-- 4. Query a Lake Formation row-filter table (analyst with eu_region_filter sees only EU rows)
-- Even this unfiltered query transparently returns only EU rows for the EU analyst role:
SELECT status, COUNT(*) AS cnt, ROUND(SUM(total_usd), 2) AS revenue
FROM ecommerce_lake.orders_projected
GROUP BY status
ORDER BY revenue DESC;
aws athena create-work-group --name analysts --configuration '{"ResultConfiguration":{"OutputLocation":"s3://my-results/"},"BytesScannedCutoffPerQuery":107374182400}' — this caps each analyst query at 100 GB scanned, preventing accidental full-table scans that burn budget.Cross-Account Data Sharing with RAM and Lake Formation
Sharing data lake tables across AWS accounts is one of the most common enterprise requirements — a central data platform account needs to share curated tables with a dozen business unit accounts without copying data or granting S3 bucket access to each team separately. Lake Formation handles this elegantly via AWS Resource Access Manager (RAM).
Share a Glue Database Across Accounts
# In the PRODUCER account (123456789012) — share the database via RAM
aws ram create-resource-share \
--name "ecommerce-lake-share" \
--resource-arns "arn:aws:glue:us-east-1:123456789012:catalog" \
"arn:aws:glue:us-east-1:123456789012:database/ecommerce_lake" \
"arn:aws:glue:us-east-1:123456789012:table/ecommerce_lake/orders_projected" \
--principals "arn:aws:organizations::123456789012:organization/o-xxxx" \
--region us-east-1
# Grant Lake Formation permissions to the consumer account
aws lakeformation grant-permissions \
--catalog-id 123456789012 \
--principal DataLakePrincipalIdentifier=222222222222 \
--permissions SELECT DESCRIBE \
--resource '{
"Table": {
"CatalogId": "123456789012",
"DatabaseName": "ecommerce_lake",
"Name": "orders_projected"
}
}' \
--region us-east-1
# In the CONSUMER account (222222222222) — accept the RAM share and create a linked database
aws ram accept-resource-share-invitation \
--resource-share-invitation-arn arn:aws:ram:us-east-1:123456789012:resource-share-invitation/xxx \
--region us-east-1
# Create a local database that links to the producer's shared database
aws glue create-database \
--database-input '{
"Name": "shared_ecommerce",
"TargetDatabase": {
"CatalogId": "123456789012",
"DatabaseName": "ecommerce_lake"
}
}' \
--region us-east-1
# The consumer account can now query the table with Athena:
# SELECT * FROM shared_ecommerce.orders_projected WHERE order_date = DATE '2026-06-01';
# No data is copied — queries read directly from the producer account's S3 bucket
# Lake Formation enforces the producer's column and row filters on every query
salary from the consumer account's role, that filter is enforced even though the consumer is querying from a different AWS account. The producer retains full control over what each consumer account can see, at column and row granularity.Redshift Spectrum, EMR, and QuickSight Integration
Lake Formation is not just an Athena feature — it integrates with the full spectrum of AWS analytics services. The same permission grants apply regardless of which compute engine accesses the data.
Redshift Spectrum
Redshift Spectrum reads Glue Catalog tables from Redshift. When Lake Formation governs the catalog, Spectrum respects column and row filters just like Athena.
-- In Redshift: create an external schema pointing at the Lake Formation-governed catalog
CREATE EXTERNAL SCHEMA lake_ecommerce
FROM DATA CATALOG
DATABASE 'ecommerce_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Now query Lake Formation-governed tables from Redshift
-- Column and row filters are applied transparently
SELECT
o.customer_id,
o.total_usd,
c.email -- This column will be absent if the Redshift role has a column filter excluding it
FROM lake_ecommerce.orders_projected o
JOIN lake_ecommerce.customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2026-01-01'
AND o.status = 'COMPLETED';
EMR with Lake Formation
EMR clusters can use the Glue Data Catalog as the Hive Metastore. Enable Lake Formation integration on the EMR cluster to enforce fine-grained permissions on Spark SQL and Hive jobs. Add the following to your EMR cluster configuration:
{
"Classification": "spark-hive-site",
"Properties": {
"hive.metastore.client.factory.class":
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
},
{
"Classification": "core-site",
"Properties": {
"fs.s3.customAWSCredentialsProvider":
"com.amazonaws.auth.InstanceProfileCredentialsProvider"
}
}
# PySpark job on EMR — Lake Formation filters apply to Spark SQL queries
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("LakeFormationEMR") \
.enableHiveSupport() \
.getOrCreate()
# This Spark SQL query will respect Lake Formation column and row filters
# The EMR cluster's IAM role determines which columns/rows are visible
df = spark.sql("""
SELECT customer_id, order_date, total_usd, status
FROM ecommerce_lake.orders_projected
WHERE order_date >= '2026-01-01'
""")
# Write aggregated results back to the data lake
df.groupBy("status") \
.agg({"total_usd": "sum", "customer_id": "count"}) \
.write \
.mode("overwrite") \
.parquet("s3://company-data-lake/curated/order_status_summary/")
QuickSight Integration
Amazon QuickSight connects to Athena as a data source, which means it inherits Lake Formation governance automatically. Each QuickSight user maps to an IAM role; the role's Lake Formation grants determine what data they see in dashboards. Set up the integration by creating a QuickSight data source with the Athena workgroup, and ensure each QuickSight user's IAM role has the appropriate Lake Formation SELECT grants on the tables the dashboard uses.
Blueprints: One-Click Data Lake Setup
Lake Formation Blueprints automate the creation of Glue workflows for common data ingestion patterns. Instead of manually creating crawlers, Glue jobs, and triggers, a Blueprint generates the entire pipeline from a configuration wizard. There are four built-in blueprints:
| Blueprint | Source | What It Creates |
|---|---|---|
| Database snapshot | JDBC database (RDS, Redshift, on-prem) | Glue connection, full-load crawlers, Glue ETL job to S3 Parquet, Glue catalog tables |
| Incremental database | JDBC with a bookmark column (updated_at) | Same as snapshot + incremental load job using the bookmark column to fetch only new rows |
| Existing S3 | S3 path with existing files | Glue crawler + catalog table + Lake Formation permissions — registers data already in S3 |
| Log data | CloudTrail, VPC Flow Logs, ELB access logs in S3 | Pre-built schema + Glue catalog table + Athena-optimized Parquet conversion job |
# Create a Blueprint workflow from a database snapshot source
aws lakeformation start-blueprint-run \
--blueprint-name "DatabaseSnapshot" \
--parameters '{
"databaseName": "ecommerce_lake",
"sourceConnectionName": "rds-orders-connection",
"sourcePath": "orders_db/public/%",
"destinationStorageLocation": "s3://company-data-lake/raw/rds-snapshot/",
"targetPath": "s3://company-data-lake/curated/rds-parquet/",
"workflowName": "orders-db-snapshot-workflow",
"IAMRoleForGlue": "arn:aws:iam::123456789012:role/GlueBlueprintRole"
}' \
--role-arn arn:aws:iam::123456789012:role/GlueBlueprintRole \
--region us-east-1
# Monitor blueprint run status
aws lakeformation get-blueprint-run \
--blueprint-name "DatabaseSnapshot" \
--run-id \
--region us-east-1 \
--query 'BlueprintRun.{Status:Status,WorkflowName:WorkflowName,ErrorMessage:ErrorMessage}'
Complete End-to-End boto3: Lake Formation Permission Audit
In production, you will periodically need to audit what each role can access. The following script lists all Lake Formation grants for a given principal:
import boto3
import json
from typing import Generator
lf = boto3.client('lakeformation', region_name='us-east-1')
def list_permissions_for_principal(principal_arn: str) -> Generator[dict, None, None]:
"""Yield all Lake Formation permissions for the given IAM principal."""
paginator = lf.get_paginator('list_permissions')
pages = paginator.paginate(
Principal={'DataLakePrincipalIdentifier': principal_arn}
)
for page in pages:
for entry in page.get('PrincipalResourcePermissions', []):
yield entry
def format_resource(resource: dict) -> str:
if 'Database' in resource:
return f"Database: {resource['Database']['Name']}"
if 'Table' in resource:
t = resource['Table']
return f"Table: {t['DatabaseName']}.{t.get('Name', '*')}"
if 'TableWithColumns' in resource:
t = resource['TableWithColumns']
cols = t.get('ColumnNames', []) or t.get('ColumnWildcard', {}).get('ExcludedColumnNames', [])
label = "included" if 'ColumnNames' in t else "excluded"
return f"TableWithColumns: {t['DatabaseName']}.{t['Name']} ({label}: {cols})"
if 'DataCellsFilter' in resource:
f = resource['DataCellsFilter']
return f"DataCellFilter: {f['DatabaseName']}.{f['TableName']}/{f['Name']}"
return json.dumps(resource, default=str)
# Run the audit
ROLE = 'arn:aws:iam::123456789012:role/AnalystRole'
print(f"\nLake Formation permissions for {ROLE}\n{'='*60}")
for entry in list_permissions_for_principal(ROLE):
resource_str = format_resource(entry['Resource'])
permissions = ', '.join(entry['Permissions'])
grantable = ', '.join(entry.get('PermissionsWithGrantOption', []))
print(f" {resource_str}")
print(f" PERMISSIONS: {permissions}")
if grantable:
print(f" GRANTABLE: {grantable}")
print()
Frequently Asked Questions
What is the difference between Lake Formation permissions and S3 bucket policies?
S3 bucket policies control access to S3 objects at the prefix level — you can allow or deny reads on entire folders. Lake Formation permissions operate on the logical Glue Catalog layer — tables, databases, columns, and rows. When Lake Formation is active, the compute engine (Athena, Redshift Spectrum) receives temporary S3 credentials from the Lake Formation service, not from the user's IAM role directly. The S3 bucket policy should grant access to the Lake Formation service role, not to individual IAM principals. This way, S3 becomes effectively opaque to end users — they can only access data through the governed compute engines.
Can I use Lake Formation with Apache Iceberg tables?
Yes. As of 2025, Lake Formation supports Iceberg tables registered in the Glue Data Catalog. You can enforce column and row filters on Iceberg tables just like standard Parquet tables. Iceberg's own snapshot isolation and schema evolution features work independently of Lake Formation governance. The combination — Iceberg for ACID + Lake Formation for fine-grained access — is the recommended production pattern for 2026. Create the Iceberg table via Athena's CREATE TABLE ... TBLPROPERTIES ('table_type'='ICEBERG') syntax, then apply Lake Formation grants as normal.
How does Lake Formation handle Athena query result caching?
Athena's query result reuse is safe with Lake Formation because the cached result is tied to the query execution context, which includes the caller's identity. A different principal with different Lake Formation grants will not receive a cached result from a more privileged user's query — Athena checks the permissions before serving from cache. Enable result reuse per workgroup for up to 7 days; it is a significant cost saver for dashboards that run identical aggregation queries repeatedly.
What happens if I accidentally delete a Governed Table transaction?
Lake Formation retains the transaction log for Governed Tables for 7 days by default. During this window, you can query the table at a specific transaction ID (a form of time travel) to recover data. After 7 days the log is pruned, but the underlying S3 objects are only deleted when automatic compaction runs — you can still recover objects from S3 versioning if enabled. Always enable S3 versioning on data lake buckets as a last-resort safety net.
Is Lake Formation free?
Lake Formation itself has no additional charge for managing permissions on the Glue Data Catalog. You pay for the underlying resources: S3 storage, Glue Data Catalog API calls (first 1 million free per month, then $1 per million), Athena query scan costs, and Glue ETL DPU-hours. Governed Tables have a charge for storage management operations — currently $0.10 per 100,000 object-level operations (read, write, delete in the transaction log). For most organisations, the Lake Formation service charge is negligible compared to compute and storage.
How do I migrate from IAM-only access control to Lake Formation?
Migration is a one-time change per Glue database. The key steps: (1) Set CreateDatabaseDefaultPermissions to empty to remove the IAMAllowedPrincipals pseudo-group from new tables. (2) For existing tables, use aws lakeformation batch-grant-permissions to explicitly grant SELECT/DESCRIBE to the IAM roles that currently have access. (3) Test with each role in a staging environment — a missing Lake Formation grant is the most common migration failure. (4) Only after validating, remove the IAMAllowedPrincipals grant from existing tables using the console's "Upgrade IAM policies" workflow. Budget two to four hours per Glue database for a careful migration.