AWS Macie: Automated Sensitive Data Discovery and Protection

Amazon Macie is a fully managed data security service that uses machine learning and pattern matching to discover and classify sensitive data stored in Amazon S3, and to report on how well that data is protected. If your organization stores customer PII, financial records, health data, or credentials anywhere in S3 (and in most organizations it ends up there eventually), Macie tells you where it is and how exposed it is, and its findings can drive automated remediation. This guide covers how the ML-based PII detection engine works, enabling Macie at scale with CLI and Terraform across AWS Organizations, building sensitive data discovery jobs, creating custom data identifiers, understanding finding anatomy, wiring automated remediation with EventBridge and Lambda, querying discovery results with Athena, integrating with Security Hub, and managing costs intelligently.

1. What Macie Does — ML Detection, Managed Identifiers, Finding Categories
2. Enabling Macie — CLI, Terraform, Multi-Account via Organizations
3. Sensitive Data Discovery Jobs — One-Time vs Scheduled, Sampling, Terraform
4. Managed Data Identifiers — PII Types, Confidence Levels, Suppression
5. Custom Data Identifiers — Regex, Keywords, boto3 Testing
6. Finding Anatomy — Severity, JSON Structure, Affected Object Details
7. Automated Remediation — EventBridge + Lambda Auto-Tagging and Blocking
8. Sensitive Data Discovery Results — S3 Export and Athena Querying
9. Integration with Security Hub — Aggregation, Custom Insights, Suppression
10. Cost Optimization — Sampling, Bucket Scoping, 30-Day Free Trial

1. What Macie Does — ML Detection, Managed Identifiers, Finding Categories

Amazon Macie operates on a fundamentally different model from traditional data classification tools. Instead of relying purely on regex pattern matching, Macie combines machine learning models trained on large datasets of sensitive documents with a library of 130+ managed data identifiers — each representing a well-understood sensitive data type — and allows you to extend coverage with your own custom identifiers. The result is a service that can distinguish a real US Social Security Number from a random nine-digit string, recognize a credit card number embedded in a PDF metadata field, or detect an AWS secret access key buried in a JSON configuration file uploaded to a misconfigured S3 bucket.

How ML-Based Detection Works

Macie's ML models are trained to understand context, not just patterns. When Macie evaluates an S3 object, it reads the file content (supporting formats such as CSV, JSON, Parquet, PDF, Word, Excel, and plain text), applies its ML models to understand the semantic structure of the document, then runs the managed and custom data identifiers against that structured content. The ML layer is responsible for reducing false positives: for example, recognizing that 123-45-6789 in a column labeled "SSN" is likely a Social Security number, whereas the same value in a column labeled "Invoice Number" probably is not.

Finding Categories

Macie generates two types of findings:

Policy findings: Report changes to an S3 bucket's settings that reduce its security posture. Examples: a bucket that was previously private becomes publicly accessible (Policy:IAMUser/S3BucketPublic), default encryption is disabled (Policy:IAMUser/S3BucketEncryptionDisabled), or block public access settings are turned off (Policy:IAMUser/S3BlockPublicAccessDisabled). Policy findings do not require you to run a discovery job; Macie monitors bucket configurations continuously.
Sensitive data findings: Generated when a discovery job finds sensitive data in an S3 object. Examples: SensitiveData:S3Object/Personal, SensitiveData:S3Object/Financial, SensitiveData:S3Object/Credentials, SensitiveData:S3Object/Multiple. Each finding records the managed or custom data identifiers that fired, the object path, and the number of occurrences detected for each identifier.
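
For triage code, the two families can be told apart from the type string alone. The following is a minimal sketch, assuming the Family:Resource/Name layout seen in the examples above; the classify_finding_type helper is hypothetical and not part of any AWS SDK.

```python
def classify_finding_type(finding_type: str) -> dict:
    """Split a Macie finding type such as 'SensitiveData:S3Object/Personal'."""
    family, _, rest = finding_type.partition(":")
    resource, _, name = rest.partition("/")
    return {
        "family": family,
        "resource": resource,
        "name": name,
        # Only sensitive data findings are produced by discovery jobs.
        "needs_discovery_job": family == "SensitiveData",
    }


print(classify_finding_type("SensitiveData:S3Object/Credentials"))
print(classify_finding_type("Policy:IAMUser/S3BlockPublicAccessDisabled"))
```

A router built on this could send policy findings to a bucket-hardening workflow and sensitive data findings to an object-level review queue.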

Macie vs manual auditing: A single AWS account can easily have thousands of S3 buckets containing millions of objects accumulated over years of operations. Manual PII auditing at this scale is impossible. Macie can scan petabytes of data automatically and continuously, producing an auditable record of where sensitive data lives — essential for GDPR, HIPAA, PCI-DSS, and SOC 2 compliance programs.

Supported File Formats

Macie can inspect content inside: CSV, TSV, JSON, JSON Lines, Parquet, Avro, plain text (.txt, .log), HTML, XML, Microsoft Office documents (Word .docx, Excel .xlsx), PDF, and compressed archives (ZIP, GZ, TAR; Macie decompresses and inspects the contents). Objects in formats Macie cannot parse, such as images, audio, and video, are not analyzed for sensitive data. Size limits vary by format, and an object that exceeds the limit for its format is skipped rather than partially analyzed, so check the current Macie quotas before assuming very large files are covered.

2. Enabling Macie — CLI, Terraform, Multi-Account via Organizations

Macie is a regional service. You enable it per region, and it monitors S3 buckets in that region. For multi-account organizations, you designate one account as the Macie administrator — it manages Macie configuration for all member accounts and receives aggregated findings.

Enable Macie via AWS CLI

# Enable Macie in current region
aws macie2 enable-macie

# Check Macie status
aws macie2 get-macie-session \
  --query '{Status:status,ServiceRole:serviceRole,CreatedAt:createdAt}'

# List all S3 buckets Macie is monitoring
aws macie2 describe-buckets \
  --query 'buckets[*].{Name:bucketName,Public:publicAccess.effectivePermission,Encrypted:serverSideEncryption.kmsMasterKeyId,Objects:objectCount}'

Enable Macie with Terraform

resource "aws_macie2_account" "macie" {
  status                       = "ENABLED"
  finding_publishing_frequency = "FIFTEEN_MINUTES"
}

# Suppression rule: automatically archive Low severity findings to reduce noise
resource "aws_macie2_findings_filter" "archive_low_severity" {
  name        = "archive-low-severity-findings"
  description = "Automatically archives findings with Low severity"
  action      = "ARCHIVE"

  finding_criteria {
    criterion {
      field = "severity.description"
      eq    = ["Low"]
    }
  }

  depends_on = [aws_macie2_account.macie]
}

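# KMS key for encrypting sensitive data discovery results in S3.
# The key policy must also allow Macie to use the key (not shown here).
resource "aws_kms_key" "macie_findings" {
  description             = "KMS key for Macie discovery results export"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

# Store sensitive data discovery results in S3 for long-term retention
resource "aws_macie2_classification_export_configuration" "findings_export" {
  depends_on = [aws_macie2_account.macie]

  s3_destination {
    bucket_name = aws_s3_bucket.macie_findings.id
    key_prefix  = "macie-findings/"
    kms_key_arn = aws_kms_key.macie_findings.arn
  }
}

# Current account ID, used to keep the bucket name unique
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "macie_findings" {
  bucket = "my-org-macie-findings-${data.aws_caller_identity.current.account_id}"
}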

Multi-Account Setup via AWS Organizations

In an AWS Organization, designate a dedicated security account as the Macie delegated administrator. This account can create discovery jobs that span all member accounts and receives all findings in a single place.

# From the organization management account — designate delegated admin
aws organizations enable-aws-service-access \
  --service-principal macie.amazonaws.com

aws macie2 enable-organization-admin-account \
  --admin-account-id 111122223333

# From the delegated admin account — enable Macie for all org accounts
aws macie2 update-organization-configuration \
  --auto-enable

# List member accounts and their Macie status
aws macie2 list-members \
  --query 'members[*].{AccountId:accountId,Status:relationshipStatus,Email:email}'

# Enable Macie in a specific member account that opted out
aws macie2 create-member \
  --account '{"accountId":"444455556666","email":"security@example.com"}'

Auto-enable for new accounts: The --auto-enable flag means any new AWS account added to your organization automatically gets Macie enabled with the organization-level configuration. This closes the coverage gap that exists when teams spin up new sandbox accounts — a common source of undetected sensitive data exposure.

3. Sensitive Data Discovery Jobs — One-Time vs Scheduled, Sampling, Terraform

Macie discovery jobs are the mechanism for scanning S3 objects for sensitive data. You can run a one-time job (for an immediate audit) or a scheduled job (for ongoing monitoring). Jobs are scoped to specific S3 buckets, prefixes, object tags, or file extensions — giving you fine-grained control over what gets scanned and when.

Create a Discovery Job via CLI

# One-time job scanning two specific buckets
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "PII-Audit-Q2-2026" \
  --description "Quarterly PII audit of production data buckets" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {
        "accountId": "111122223333",
        "buckets": ["prod-customer-data", "prod-payment-records"]
      }
    ],
    "scoping": {
      "includes": {
        "and": [
          {
            "simpleScopeTerm": {
              "comparator": "EQ",
              "key": "OBJECT_EXTENSION",
              "values": ["csv", "json", "parquet", "txt", "pdf"]
            }
          }
        ]
      }
    }
  }' \
  --managed-data-identifier-selector ALL

# Scheduled job running daily on buckets tagged as containing PII
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --name "Daily-PII-Monitor-Tagged-Buckets" \
  --schedule-frequency '{"dailySchedule":{}}' \
  --s3-job-definition '{
    "bucketCriteria": {
      "includes": {
        "and": [
          {
            "tagCriterion": {
              "comparator": "EQ",
              "tagValues": [
                {"key": "DataClassification", "value": "PII"}
              ]
            }
          }
        ]
      }
    }
  }' \
  --managed-data-identifier-selector ALL

# Check job status
aws macie2 describe-classification-job \
  --job-id YOUR_JOB_ID \
  --query '{Status:jobStatus,Progress:statistics,CreatedAt:createdAt}'

Discovery Job with Terraform

resource "aws_macie2_classification_job" "pii_monitor" {
  job_type = "SCHEDULED"
  name     = "scheduled-pii-discovery"

  schedule_frequency {
    weekly_schedule = "MONDAY"
  }

  s3_job_definition {
    bucket_definitions {
      account_id = data.aws_caller_identity.current.account_id
      buckets    = ["prod-customer-data", "analytics-warehouse", "data-lake-raw"]
    }

    scoping {
      includes {
        and {
          simple_scope_term {
            comparator = "LT"
            key        = "OBJECT_SIZE"
            values     = ["20971520"]  # 20 MB max
          }
        }
      }
      excludes {
        and {
          simple_scope_term {
            comparator = "STARTS_WITH"
            key        = "OBJECT_KEY"
            values     = ["archive/", "backups/old/", "tmp/"]
          }
        }
      }
    }
  }

  managed_data_identifier_selector = "RECOMMENDED"

  depends_on = [aws_macie2_account.macie]
}

Sampling Strategies

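For very large buckets (millions of objects), scanning everything on every run is expensive. Macie supports sampling at the job level via the samplingPercentage parameter. Setting this to 10 makes each run analyze a random selection of roughly 10% of the eligible objects, which keeps per-run costs predictable. Sampling gives you a statistical picture of a bucket rather than full coverage: because each run of a scheduled job normally considers only objects created or changed since the previous run, objects skipped by one sample are not guaranteed to be picked up later.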

# Create a job with 25% sampling for a very large data lake bucket
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --name "DataLake-Sampled-Scan" \
  --schedule-frequency '{"weeklySchedule":{"dayOfWeek":"WEDNESDAY"}}' \
  --sampling-percentage 25 \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "111122223333",
      "buckets": ["enterprise-data-lake"]
    }]
  }' \
  --managed-data-identifier-selector RECOMMENDED

4. Managed Data Identifiers — PII Types, Confidence Levels, Suppression

Macie ships with 130+ managed data identifiers maintained and updated by AWS. Each identifier targets a specific type of sensitive data, is tuned for a specific country or global context, and has an associated confidence level. Understanding these identifiers — and knowing which to suppress for your use case — is the key to running low-noise, high-value Macie jobs.

Key Managed Identifier Categories

Category	Example Identifiers	Typical Use Case
Personal Identification	USA_SOCIAL_SECURITY_NUMBER, UK_NATIONAL_INSURANCE_NUMBER, GERMANY_NATIONAL_IDENTIFICATION_NUMBER	GDPR, CCPA compliance
Financial	CREDIT_CARD_NUMBER, CREDIT_CARD_SECURITY_CODE, BANK_ACCOUNT_NUMBER, UK_BANK_ACCOUNT_NUMBER	PCI-DSS compliance
Health	USA_NATIONAL_PROVIDER_IDENTIFIER, USA_HEALTH_INSURANCE_CLAIM_NUMBER	HIPAA compliance
Credentials	AWS_CREDENTIALS, OPENSSH_PRIVATE_KEY, PGP_PRIVATE_KEY, HTTP_BASIC_AUTH_HEADER	Security scanning
Identity Documents	USA_PASSPORT_NUMBER, UK_PASSPORT_NUMBER, CANADA_PASSPORT_NUMBER	KYC / identity compliance
Contact and Location	NAME, ADDRESS, PHONE_NUMBER, LATITUDE_LONGITUDE	Data anonymization

List and Inspect Managed Identifiers via CLI

# List all managed data identifiers
aws macie2 list-managed-data-identifiers \
  --query 'items[*].{Id:id,Category:category}' \
  --output table

# Filter for credential-type identifiers
aws macie2 list-managed-data-identifiers \
  --query 'items[?category==`CREDENTIALS`].id'

# Check whether a specific identifier ID is available
aws macie2 list-managed-data-identifiers \
  --query 'items[?id==`AWS_CREDENTIALS`]'

Suppressing Managed Identifiers

Some managed identifiers generate false positives in specific business contexts. For example, if your application stores test credit card numbers (like Stripe's test cards: 4242424242424242) in a known S3 prefix, you can exclude those identifiers from specific jobs.

# Create a job excluding specific managed identifiers
# (e.g., exclude NAME and ADDRESS to reduce noise in logs that routinely contain them).
# IDs must match those returned by list-managed-data-identifiers.
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "PCI-Audit-Reduced-Noise" \
  --managed-data-identifier-selector EXCLUDE \
  --managed-data-identifier-ids '["NAME","ADDRESS"]' \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "111122223333",
      "buckets": ["payment-processing-logs"]
    }]
  }'

RECOMMENDED vs ALL selector: The RECOMMENDED selector uses a curated subset of managed identifiers with the lowest false-positive rates — ideal for general-purpose scanning. Use ALL only when you need maximum coverage for a compliance audit and have the bandwidth to review more findings. Use INCLUDE or EXCLUDE for job-specific tuning.

5. Custom Data Identifiers — Regex, Keywords, boto3 Testing

Managed data identifiers cover globally recognized sensitive data types, but every organization has proprietary data formats that require custom detection: internal employee IDs, proprietary account numbers, project codenames, or confidential document markers. Macie custom data identifiers let you define these using regular expressions, optional keyword context requirements, and ignore words for known false positives.

Custom Identifier Anatomy

regex: The core pattern to match. Macie supports a subset of Perl Compatible Regular Expressions (PCRE) syntax. Maximum length: 512 characters.
keywords: Optional list of words, at least one of which must appear near a regex match (within maximumMatchDistance characters, 50 by default). Each keyword must be 3 to 90 characters long. Keywords dramatically reduce false positives (e.g., require the word "SSN" or "Social" near a nine-digit pattern).
ignoreWords: List of literal strings that, if the matched text contains them, cause Macie to skip the match. Use for known test values.
maximumMatchDistance: How far (in characters) from the regex match Macie looks for keywords. Default 50, maximum 300.
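
How these four settings interact is easier to see in code. The sketch below is a rough local approximation for experimenting with patterns offline, not Macie's actual matching engine: it uses Python's re module rather than Macie's PCRE subset, and it assumes a simple symmetric character window around each match. The local_match_count function and its parameter names are hypothetical.

```python
import re


def local_match_count(text, regex, keywords=(), ignore_words=(), max_distance=50):
    """Count regex matches that have a keyword nearby and are not ignored.

    Rough approximation only; Macie's real matching rules may differ.
    """
    lowered = text.lower()
    count = 0
    for match in re.finditer(regex, text):
        # Skip matches that contain a known test value.
        if any(word in match.group(0) for word in ignore_words):
            continue
        if keywords:
            # Require at least one keyword within max_distance characters.
            start = max(0, match.start() - max_distance)
            window = lowered[start:match.end() + max_distance]
            if not any(k.lower() in window for k in keywords):
                continue
        count += 1
    return count


criteria = dict(
    regex=r"EMP-[0-9]{8}",
    keywords=["employee", "staff"],
    ignore_words=["EMP-00000000"],
)
print(local_match_count("Employee EMP-12345678 joined the team", **criteria))
print(local_match_count("Badge EMP-11112222 scanned", **criteria))
```

Anything that passes this local check still deserves a run through Macie's own test API before being used in a job.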

Create Custom Identifiers via CLI

# Custom identifier for internal employee IDs (format: EMP-XXXXXXXX)
aws macie2 create-custom-data-identifier \
  --name "Internal-Employee-ID" \
  --description "Detects internal employee IDs in format EMP-XXXXXXXX" \
  --regex "EMP-[0-9]{8}" \
  --keywords '["employee","staff","payroll","personnel"]' \
  --ignore-words '["EMP-00000000","EMP-12345678"]'

# Custom identifier for API keys with org-specific prefix
aws macie2 create-custom-data-identifier \
  --name "Acme-API-Key" \
  --description "Detects Acme Corp internal API keys" \
  --regex "acme_[a-zA-Z0-9]{32}" \
  --keywords '["api_key","apikey","authorization","x-api-key"]'

# Custom identifier for proprietary contract numbers
aws macie2 create-custom-data-identifier \
  --name "Contract-Number" \
  --description "Acme Corp contract numbers: CTR-YYYY-XXXXXX" \
  --regex "CTR-20[0-9]{2}-[0-9]{6}" \
  --keywords '["contract","agreement","NDA","MSA"]'

# List all custom identifiers
aws macie2 list-custom-data-identifiers \
  --query 'items[*].{Id:id,Name:name,CreatedAt:createdAt}'

Testing Custom Identifiers with Python boto3

import boto3

macie = boto3.client('macie2', region_name='us-east-1')

def test_custom_identifier(regex: str, keywords: list[str], ignore_words: list[str],
                           sample_texts: list[str]) -> dict:
    """
    Test detection criteria against sample text to validate that they fire correctly.
    Uses Macie's TestCustomDataIdentifier API, which takes the criteria themselves
    (not an identifier ID), so no identifier or S3 scan is required.
    """
    results = {}
    for text in sample_texts:
        response = macie.test_custom_data_identifier(
            regex=regex,
            keywords=keywords,
            ignoreWords=ignore_words,
            sampleText=text  # the API limits sample text to 1,000 characters
        )
        results[text[:50]] = {
            'match_count': response['matchCount'],
            'fired': response['matchCount'] > 0
        }
    return results

def create_and_test_identifier(name: str, regex: str, keywords: list[str],
                               ignore_words: list[str], samples: list[str]) -> str:
    """Test the criteria against samples, then create the custom identifier."""

    # Test against sample data first, so a bad pattern is caught before creation
    results = test_custom_identifier(regex, keywords, ignore_words, samples)

    print(f"Test results for '{name}':")
    for sample, result in results.items():
        status = "FIRED" if result['fired'] else "no match"
        print(f"  [{status}] '{sample}...' -> {result['match_count']} matches")

    # Create the identifier
    response = macie.create_custom_data_identifier(
        name=name,
        description=f"Auto-created: {name}",
        regex=regex,
        keywords=keywords,
        ignoreWords=ignore_words
    )
    identifier_id = response['customDataIdentifierId']
    print(f"\nCreated identifier: {identifier_id}")

    return identifier_id

# Example: Test and create an employee ID identifier
employee_id_samples = [
    "Employee EMP-12345678 joined the team",         # should fire
    "Personnel record for EMP-98765432 updated",     # should fire
    "The invoice number is 12345678",                # should NOT fire (no regex match)
    "EMP-00000000 is a test value",                  # should NOT fire (ignore word)
]

create_and_test_identifier(
    name="Employee-ID-v2",
    regex=r"EMP-[0-9]{8}",
    keywords=["employee", "staff", "personnel", "hire"],  # keywords need 3+ characters
    ignore_words=["EMP-00000000"],
    samples=employee_id_samples
)

6. Finding Anatomy — Severity, JSON Structure, Affected Object Details

Every Macie finding is a structured JSON document that tells you precisely what was found, where it lives, who has access to it, and how severe the exposure is. Understanding the finding structure is essential for building automated triage and remediation pipelines.

Severity Levels

Severity	Score	Meaning
Low	1	Small number of occurrences or lower-risk data; no immediate exposure
Medium	2	Moderate number of occurrences or moderate-risk data; review recommended
High	3	Large number of occurrences, high-risk data (credentials, financial), or a publicly accessible resource; immediate review and remediation

Macie assigns only these three levels; there is no Critical severity on the Macie side.

Retrieve and Inspect a Finding via CLI

# List High severity findings
aws macie2 list-findings \
  --finding-criteria '{
    "criterion": {
      "severity.description": {
        "eq": ["High"]
      }
    }
  }' \
  --query 'findingIds'

# Get full finding details
aws macie2 get-findings \
  --finding-ids YOUR_FINDING_ID_1 YOUR_FINDING_ID_2

Finding JSON Structure

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "accountId": "111122223333",
  "region": "us-east-1",
  "type": "SensitiveData:S3Object/Credentials",
  "severity": {
    "description": "High",
    "score": 3
  },
  "title": "The S3 object contains AWS credentials.",
  "description": "The object contains 3 occurrences of AWS secret access keys.",
  "resourcesAffected": {
    "s3Bucket": {
      "name": "prod-config-bucket",
      "arn": "arn:aws:s3:::prod-config-bucket",
      "publicAccess": {
        "effectivePermission": "PUBLIC",
        "permissionConfiguration": {
          "bucketLevelPermissions": {
            "accessControlList": {"allowsPublicReadAccess": true}
          }
        }
      },
      "defaultServerSideEncryption": {
        "encryptionType": "NONE"
      },
      "tags": [{"key": "Environment", "value": "production"}]
    },
    "s3Object": {
      "key": "config/application.properties",
      "size": 4096,
      "lastModified": "2026-06-01T14:22:00Z",
      "eTag": "abc123",
      "serverSideEncryption": {"encryptionType": "NONE"},
      "publicAccess": true
    }
  },
  "classificationDetails": {
    "jobId": "abc123",
    "result": {
      "sensitiveData": [
        {
          "category": "CREDENTIALS",
          "detections": [
            {
              "count": 3,
              "type": "AWS_CREDENTIALS",
              "occurrences": {
                "lineRanges": [
                  {"start": 15, "end": 15, "startColumn": 25},
                  {"start": 23, "end": 23, "startColumn": 18}
                ]
              }
            }
          ],
          "totalCount": 3
        }
      ],
      "status": {"code": "COMPLETE"}
    }
  },
  "createdAt": "2026-06-09T08:30:00Z",
  "updatedAt": "2026-06-09T08:30:00Z"
}

Line-level precision: Notice the occurrences.lineRanges field — Macie tells you the exact line numbers and column positions where sensitive data was found within the file. This makes it practical to surgically redact only the sensitive fields rather than quarantining entire files.
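
A remediation function usually needs those positions as plain tuples. Here is a minimal sketch that walks the structure shown above; the iter_line_occurrences helper is hypothetical, and it only handles lineRanges, which apply to text-like files (other formats report positions as cells, pages, or records instead).

```python
def iter_line_occurrences(finding: dict):
    """Yield (identifier_type, start_line, start_column) for each line range."""
    result = finding.get("classificationDetails", {}).get("result", {})
    for group in result.get("sensitiveData", []):
        for detection in group.get("detections", []):
            occurrences = detection.get("occurrences") or {}
            for line_range in occurrences.get("lineRanges") or []:
                yield (
                    detection.get("type"),
                    line_range.get("start"),
                    line_range.get("startColumn"),
                )


# Trimmed-down finding with the same shape as the JSON above.
sample_finding = {
    "classificationDetails": {
        "result": {
            "sensitiveData": [{
                "category": "CREDENTIALS",
                "detections": [{
                    "type": "AWS_CREDENTIALS",
                    "count": 2,
                    "occurrences": {"lineRanges": [
                        {"start": 15, "end": 15, "startColumn": 25},
                        {"start": 23, "end": 23, "startColumn": 18},
                    ]},
                }],
            }]
        }
    }
}

for occurrence in iter_line_occurrences(sample_finding):
    print(occurrence)
```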

7. Automated Remediation — EventBridge + Lambda Auto-Tagging and Blocking

Raw findings are not enough. For any compliance or security program to scale, remediation must be automatic. The canonical pattern is: Macie finding → EventBridge rule → Lambda function → remediation action (auto-tag the object, block public access, notify via SNS, rotate the leaked credential). Here is the complete implementation.

EventBridge Rule for Macie Findings

{
  "source": ["aws.macie"],
  "detail-type": ["Macie Finding"],
  "detail": {
    "severity": {
      "description": ["High"]
    },
    "type": [
      "SensitiveData:S3Object/Credentials",
      "SensitiveData:S3Object/Financial",
      "SensitiveData:S3Object/Personal",
      "Policy:IAMUser/S3BucketPublic"
    ]
  }
}
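
Before deploying a pattern like this, it helps to check it against a sample event locally. The matcher below is a deliberately simplified sketch: it supports only exact-value lists and nested objects, while real EventBridge patterns also offer prefix, anything-but, numeric, and other operators. The matches function is hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge matching: lists are OR-ed exact values, dicts recurse."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object that matches.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {"severity": {"description": ["High"]}},
}

high_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {
        "type": "SensitiveData:S3Object/Credentials",
        "severity": {"description": "High", "score": 3},
    },
}
low_event = {**high_event, "detail": {"severity": {"description": "Low", "score": 1}}}

print(matches(PATTERN, high_event))
print(matches(PATTERN, low_event))
```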

# Create the EventBridge rule
aws events put-rule \
  --name "macie-critical-findings" \
  --description "Trigger Lambda for High severity Macie findings" \
  --event-pattern '{
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {
      "severity": {"description": ["High"]}
    }
  }' \
  }' \
  --state ENABLED

# Add Lambda as the target
aws events put-targets \
  --rule "macie-critical-findings" \
  --targets '[{
    "Id": "macie-remediation-lambda",
    "Arn": "arn:aws:lambda:us-east-1:111122223333:function:macie-auto-remediation"
  }]'
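
A rule target alone is not enough: EventBridge also needs permission to invoke the function, otherwise the rule matches but the Lambda never runs. The following is a minimal sketch using Lambda's AddPermission API through boto3; the statement ID is an arbitrary, hypothetical name, and the function and rule names are taken from the commands above.

```python
def invoke_permission_args(function_name: str, rule_arn: str) -> dict:
    """Build the arguments for Lambda's AddPermission call for one EventBridge rule."""
    return {
        "FunctionName": function_name,
        "StatementId": "allow-eventbridge-macie-rule",  # hypothetical statement ID
        "Action": "lambda:InvokeFunction",
        "Principal": "events.amazonaws.com",
        "SourceArn": rule_arn,  # restricts invocation to this specific rule
    }


def grant_invoke_permission(function_name: str, rule_arn: str) -> None:
    """Apply the permission; requires AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the helper above can be used offline

    boto3.client("lambda").add_permission(
        **invoke_permission_args(function_name, rule_arn)
    )


args = invoke_permission_args(
    "macie-auto-remediation",
    "arn:aws:events:us-east-1:111122223333:rule/macie-critical-findings",
)
print(args)
```

Calling grant_invoke_permission with the same two values applies the permission to the live function.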

Lambda Remediation Function (Python + boto3)

import boto3
import json
import os
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

s3 = boto3.client('s3')
sns = boto3.client('sns')

SNS_TOPIC_ARN = os.environ['SNS_ALERT_TOPIC_ARN']

def lambda_handler(event, context):
    """
    Automated Macie finding remediation:
    1. Tag the affected S3 object with finding metadata
    2. Block public access on the affected bucket if it's public
    3. Send SNS alert to security team
    """
    finding = event.get('detail', {})
    finding_id = finding.get('id', 'UNKNOWN')
    finding_type = finding.get('type', '')
    severity = finding.get('severity', {}).get('description', 'UNKNOWN')

    resources = finding.get('resourcesAffected', {})
    bucket_info = resources.get('s3Bucket', {})
    object_info = resources.get('s3Object', {})

    bucket_name = bucket_info.get('name', '')
    object_key = object_info.get('key', '')
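    # Treat the resource as public if either the object or its bucket is public.
    # Policy findings carry no s3Object, so the bucket-level check matters there.
    bucket_public = bucket_info.get('publicAccess', {}).get('effectivePermission') == 'PUBLIC'
    is_public = bool(object_info.get('publicAccess', False)) or bucket_public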

    logger.info(f"Processing Macie finding {finding_id}: {finding_type} [{severity}]")
    logger.info(f"Affected: s3://{bucket_name}/{object_key} (public={is_public})")

    actions_taken = []

    # 1. Tag the S3 object with Macie finding metadata
    if bucket_name and object_key:
        try:
            # Get existing tags first
            existing = s3.get_object_tagging(Bucket=bucket_name, Key=object_key)
            tags = {t['Key']: t['Value'] for t in existing.get('TagSet', [])}

            # Add Macie tags
            tags.update({
                'MacieFindingId': finding_id[:128],
                'MacieSeverity': severity,
                'MacieFindingType': finding_type.split('/')[-1][:128],
                'MacieAutoTagged': 'true'
            })

            s3.put_object_tagging(
                Bucket=bucket_name,
                Key=object_key,
                Tagging={'TagSet': [{'Key': k, 'Value': v} for k, v in tags.items()]}
            )
            actions_taken.append(f"Tagged s3://{bucket_name}/{object_key}")
            logger.info(f"Tagged object: {object_key}")
        except Exception as e:
            logger.error(f"Failed to tag object: {e}")

    # 2. Block public access on bucket if it's publicly accessible
    if is_public and bucket_name and severity == 'High':
        try:
            s3.put_public_access_block(
                Bucket=bucket_name,
                PublicAccessBlockConfiguration={
                    'BlockPublicAcls': True,
                    'IgnorePublicAcls': True,
                    'BlockPublicPolicy': True,
                    'RestrictPublicBuckets': True
                }
            )
            actions_taken.append(f"Blocked public access on bucket: {bucket_name}")
            logger.info(f"Blocked public access on: {bucket_name}")
        except Exception as e:
            logger.error(f"Failed to block public access: {e}")

    # 3. Send SNS alert
    alert_message = {
        'finding_id': finding_id,
        'finding_type': finding_type,
        'severity': severity,
        'bucket': bucket_name,
        'object': object_key,
        'object_public': is_public,
        'actions_taken': actions_taken,
        'macie_console': (
            f"https://console.aws.amazon.com/macie/home"
            f"#/findings?search=id%3D{finding_id}"
        )
    }

    try:
        sns.publish(
            TopicArn=SNS_TOPIC_ARN,
            # SNS subjects are limited to 100 characters, so truncate long bucket names
            Subject=f"[MACIE {severity}] {finding_type} in {bucket_name}"[:100],
            Message=json.dumps(alert_message, indent=2)
        )
        actions_taken.append("SNS alert sent")
    except Exception as e:
        logger.error(f"Failed to send SNS alert: {e}")

    return {
        'statusCode': 200,
        'findingId': finding_id,
        'actionsTaken': actions_taken
    }

Credential rotation: For findings of type SensitiveData:S3Object/Credentials where AWS secret access keys are detected, consider adding a step that deactivates the exposed key with iam.update_access_key(UserName=owner, AccessKeyId=extracted_key_id, Status='Inactive'). The finding does not include the key itself, so the function has to read the object and extract the access key ID; the finding's occurrences.lineRanges data tells it which lines to inspect. Automated deactivation can cut the time to remediate leaked AWS credentials from hours to seconds, but it only works for keys that belong to an account the function is allowed to administer.
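
The extraction step can be sketched as follows. This assumes the leaked file stores the access key ID alongside the secret and that the key is a long-term IAM user key (IDs beginning with AKIA); the find_access_key_ids and deactivate helpers are hypothetical, and the sample uses AWS's documented example key ID.

```python
import re

# Long-term IAM user access key IDs: "AKIA" followed by 16 uppercase letters or digits.
ACCESS_KEY_ID = re.compile(r"\b(AKIA[A-Z0-9]{16})\b")


def find_access_key_ids(text: str) -> list[str]:
    """Return distinct access key IDs in order of first appearance."""
    seen: list[str] = []
    for key_id in ACCESS_KEY_ID.findall(text):
        if key_id not in seen:
            seen.append(key_id)
    return seen


def deactivate(key_id: str) -> None:
    """Deactivate a key in the current account; needs credentials and boto3."""
    import boto3  # imported lazily so the parsing above works offline

    iam = boto3.client("iam")
    # Look up the owning user, then mark the key inactive (reversible, unlike delete).
    user_name = iam.get_access_key_last_used(AccessKeyId=key_id)["UserName"]
    iam.update_access_key(UserName=user_name, AccessKeyId=key_id, Status="Inactive")


sample_config = (
    "aws_access_key_id=AKIAIOSFODNN7EXAMPLE\n"
    "aws_secret_access_key=<redacted>\n"
    "# backup copy: AKIAIOSFODNN7EXAMPLE\n"
)
print(find_access_key_ids(sample_config))
```

Setting the key to Inactive rather than deleting it keeps the action reversible if the finding turns out to be a false positive.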

8. Sensitive Data Discovery Results — S3 Export and Athena Querying

For each S3 object that Macie evaluates, it can write a detailed discovery result record to S3 — regardless of whether sensitive data was found. This creates a comprehensive audit trail: you can prove to auditors not just what Macie found, but that Macie scanned every object in scope. These records are queryable with Amazon Athena, making it practical to analyze sensitive data distribution across your entire S3 estate.

Enable Discovery Results Export

# Create S3 bucket for discovery results (must be in same region as Macie)
aws s3api create-bucket \
  --bucket my-org-macie-discovery-results \
  --region us-east-1

# Enable server-side encryption
aws s3api put-bucket-encryption \
  --bucket my-org-macie-discovery-results \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/YOUR_KEY_ID"
      }
    }]
  }'

# Configure Macie to export discovery results to S3
aws macie2 put-classification-export-configuration \
  --configuration '{
    "s3Destination": {
      "bucketName": "my-org-macie-discovery-results",
      "keyPrefix": "results/",
      "kmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/YOUR_KEY_ID"
    }
  }'

Athena Table Definition for Discovery Results

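-- Create Athena table over the Macie discovery results in S3.
-- Column names here are a simplified sketch; compare them with an actual
-- result record before relying on this schema.
CREATE EXTERNAL TABLE macie_discovery_results (
  jobid                     STRING,
  `timestamp`               STRING,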
  s3object                  STRUCT<
    bucketarn: STRING,
    key: STRING,
    path: STRING,
    extension: STRING,
    sizecompressed: BIGINT,
    sizeuncompressed: BIGINT
  >,
  classificationresult      STRUCT<
    status: STRUCT<code: STRING>,
    sensitivedata: ARRAY<
      STRUCT<
        category: STRING,
        totalcount: BIGINT,
        detections: ARRAY<STRUCT<type: STRING, count: BIGINT>>
      >
    >,
    mimetype: STRING
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-org-macie-discovery-results/results/'
TBLPROPERTIES ('ignore.malformed.json' = 'true');

Useful Athena Queries

-- Find all S3 objects containing credentials by bucket
SELECT
  s3object.bucketarn,
  COUNT(*) AS objects_with_credentials
FROM macie_discovery_results
WHERE
  classificationresult.status.code = 'COMPLETE'
  AND EXISTS (
    SELECT 1 FROM UNNEST(classificationresult.sensitivedata) AS t(item)
    WHERE t.item.category = 'CREDENTIALS'
  )
GROUP BY s3object.bucketarn
ORDER BY objects_with_credentials DESC;

-- PII detection summary by data type across all buckets
SELECT
  detection.type,
  SUM(detection.count) AS total_occurrences,
  COUNT(DISTINCT s3object.bucketarn) AS affected_buckets
FROM macie_discovery_results,
  UNNEST(classificationresult.sensitivedata) AS t(sensitive_item),
  UNNEST(sensitive_item.detections) AS d(detection)
WHERE classificationresult.status.code = 'COMPLETE'
GROUP BY detection.type
ORDER BY total_occurrences DESC;

-- Objects with sensitive data that have not been scanned in 30 days
SELECT s3object.path, s3object.bucketarn, "timestamp"
FROM macie_discovery_results
WHERE
  classificationresult.sensitivedata IS NOT NULL
  AND cardinality(classificationresult.sensitivedata) > 0
  AND from_iso8601_timestamp("timestamp") < (NOW() - INTERVAL '30' DAY)
ORDER BY "timestamp" ASC
LIMIT 100;

9. Integration with Security Hub — Aggregation, Custom Insights, Suppression

Macie integrates natively with AWS Security Hub. Once the integration is enabled, Macie findings are forwarded to Security Hub in the AWS Security Finding Format (ASFF), where they appear alongside findings from GuardDuty, Inspector, Config, and other AWS security services. By default Macie publishes only policy findings to Security Hub; publishing sensitive data findings has to be turned on in Macie's findings publication settings. With both enabled, your security team gets a single pane of glass without needing to query each service independently.
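
Which finding categories Macie sends to Security Hub is controlled by Macie's findings publication configuration. Below is a minimal sketch of setting it with boto3's put_findings_publication_configuration; the helper names and the default region are assumptions made for illustration.

```python
def security_hub_publication_config(classification: bool = True,
                                    policy: bool = True) -> dict:
    """Build the request body for Macie's PutFindingsPublicationConfiguration."""
    return {
        "securityHubConfiguration": {
            # Sensitive data findings produced by discovery jobs
            "publishClassificationFindings": classification,
            # Bucket-level policy findings
            "publishPolicyFindings": policy,
        }
    }


def apply_publication_config(region: str = "us-east-1") -> None:
    """Apply the configuration; requires AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the builder above can be used offline

    macie = boto3.client("macie2", region_name=region)
    macie.put_findings_publication_configuration(**security_hub_publication_config())


print(security_hub_publication_config())
```

Because Macie is regional, the call has to be repeated in each region where Macie runs.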

Enable the Macie → Security Hub Integration

# Enable the integration (done from Security Hub side)
aws securityhub enable-import-findings-for-product \
  --product-arn "arn:aws:securityhub:us-east-1::product/aws/macie"

# Verify integration is active
aws securityhub list-enabled-products-for-import \
  --query 'ProductSubscriptions[?contains(@,`macie`)]'

# Get Macie findings via Security Hub (useful for cross-service dashboards)
aws securityhub get-findings \
  --filters '{
    "ProductName": [{"Value": "Amazon Macie", "Comparison": "EQUALS"}],
    "SeverityLabel": [
      {"Value": "HIGH", "Comparison": "EQUALS"},
      {"Value": "CRITICAL", "Comparison": "EQUALS"}
    ],
    "WorkflowStatus": [{"Value": "NEW", "Comparison": "EQUALS"}]
  }' \
  --sort-criteria '[{"Field":"SeverityNormalized","SortOrder":"desc"}]' \
  --max-items 20

Create a Custom Security Hub Insight for Macie

# Custom insight: Macie credential findings grouped by affected resource.
# These filters match on finding type only; they do not test whether the bucket is public.
aws securityhub create-insight \
  --name "Macie: Credential Findings by Resource" \
  --filters '{
    "ProductName": [{"Value": "Amazon Macie", "Comparison": "EQUALS"}],
    "Type": [{"Value": "Sensitive Data Identifications/Passwords/Digital Credentials/Keys", "Comparison": "PREFIX"}],
    "ResourceType": [{"Value": "AwsS3Object", "Comparison": "EQUALS"}]
  }' \
  --group-by-attribute "ResourceId"

Security Hub Automation Rules for Macie Finding Suppression

# Suppress LOW severity Macie findings from known data science buckets
aws securityhub create-automation-rule \
  --rule-name "suppress-macie-low-ds-buckets" \
  --rule-order 10 \
  --description "Suppress low-severity Macie findings from approved ML training data buckets" \
  --criteria '{
    "ProductName": [{"Value": "Amazon Macie", "Comparison": "EQUALS"}],
    "SeverityLabel": [{"Value": "LOW", "Comparison": "EQUALS"}],
    "ResourceId": [
      {"Value": "arn:aws:s3:::ml-training-data", "Comparison": "PREFIX"},
      {"Value": "arn:aws:s3:::research-datasets", "Comparison": "PREFIX"}
    ]
  }' \
  --actions '[{
    "Type": "FINDING_FIELDS_UPDATE",
    "FindingFieldsUpdate": {
      "Workflow": {"Status": "SUPPRESSED"},
      "Note": {
        "Text": "Auto-suppressed: approved ML training data bucket",
        "UpdatedBy": "security-automation"
      }
    }
  }]'
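
One thing to check before deploying a rule like this: the PREFIX comparison is a plain string-prefix test, so a prefix that stops at the bucket name also matches any other bucket whose name starts with the same characters. The sketch below emulates that comparison locally. The helper name and the example ARNs are hypothetical, and the code illustrates the matching behavior only; it does not call Security Hub.

```python
def matches_prefix(resource_id: str, prefixes: list[str]) -> bool:
    """Emulate a PREFIX string filter: true if any prefix matches."""
    return any(resource_id.startswith(prefix) for prefix in prefixes)


# Prefixes as written in the automation rule criteria.
rule_prefixes = [
    "arn:aws:s3:::ml-training-data",
    "arn:aws:s3:::research-datasets",
]

# Hypothetical resource IDs for illustration.
intended = "arn:aws:s3:::ml-training-data/2025/train.parquet"
unintended = "arn:aws:s3:::ml-training-data-exports/customers.csv"

print(matches_prefix(intended, rule_prefixes))    # object in the approved bucket
print(matches_prefix(unintended, rule_prefixes))  # different bucket, same leading name

# Appending "/" limits the match to objects inside the named buckets.
tighter_prefixes = [prefix + "/" for prefix in rule_prefixes]
print(matches_prefix(unintended, tighter_prefixes))
```

The tighter form no longer matches bucket-level resource IDs such as arn:aws:s3:::ml-training-data itself, which is usually acceptable for a rule aimed at object-level sensitive data findings.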

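Cross-region aggregation: Enable Security Hub's finding aggregation in your home region. Macie findings from every linked region are then replicated into the aggregation region, and with a delegated administrator account the findings from member accounts appear there too. This gives you a single Security Hub view for org-wide sensitive data posture reporting. Athena queries still run against the discovery results exported to S3, not against Security Hub.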

10. Cost Optimization — Sampling, Bucket Scoping, 30-Day Free Trial

Macie pricing has two main dimensions: the number of S3 buckets monitored (for policy findings and bucket metadata evaluation) and the volume of data scanned in discovery jobs (per GB of object data processed). Automated sensitive data discovery, if enabled, adds a per-object monitoring charge. Because the per-GB scan charge usually dominates, careful scoping can cut costs substantially without meaningfully reducing coverage for high-risk data.

Macie Pricing Model (2026)

Pricing Dimension	Unit	Approximate Cost (us-east-1 list price; verify on the Macie pricing page)
S3 bucket inventory + policy monitoring	Per bucket/month	~$0.10/bucket/month
Sensitive data discovery (first 50,000 GB/month)	Per GB	~$1.00/GB
Sensitive data discovery (next 450,000 GB/month)	Per GB	~$0.50/GB
Sensitive data discovery (over 500,000 GB/month)	Per GB	~$0.25/GB
Automated discovery object monitoring	Per 100,000 objects/month	~$0.01 (about $0.10 per million)
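
To see how tiered per-GB pricing plays out for a given scan volume, a small calculator helps. The sketch below is illustrative only: the function name is hypothetical, and the tier sizes and rates in ASSUMED_TIERS are assumptions that should be replaced with the current figures for your region from the Macie pricing page.

```python
def tiered_cost(gb_scanned, tiers):
    """Cost of scanning gb_scanned GB under graduated tiers.

    tiers is a list of (tier_size_gb, rate_per_gb); use None as the
    size of the final, unbounded tier.
    """
    remaining = gb_scanned
    total = 0.0
    for size, rate in tiers:
        if remaining <= 0:
            break
        # Bill only the portion of the volume that falls in this tier.
        chunk = remaining if size is None else min(remaining, size)
        total += chunk * rate
        remaining -= chunk
    return total


# ASSUMPTION: example tiers, not authoritative prices.
ASSUMED_TIERS = [(50_000, 1.00), (450_000, 0.50), (None, 0.25)]

print(tiered_cost(2_000, ASSUMED_TIERS))   # entirely within the first tier
print(tiered_cost(60_000, ASSUMED_TIERS))  # 50,000 GB at tier 1 + 10,000 GB at tier 2
```

Passing the tiers as an argument keeps the arithmetic separate from the price list, so the same function can be reused when rates change.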

Cost Optimization Strategies

# 1. Check estimated cost per account (month to date), with the free trial start date
aws macie2 get-usage-statistics \
  --time-range MONTH_TO_DATE \
  --query 'records[*].{Account:accountId,FreeTrialStart:freeTrialStartDate,Usage:usage[*].{Type:type,EstimatedCost:estimatedCost}}'

# 2. Get usage totals broken down by usage type (not by bucket)
aws macie2 get-usage-totals \
  --query 'usageTotals[*].{Type:type,EstimatedCost:estimatedCost,Currency:currency}'

# 3. Identify your largest buckets, which drive discovery cost (sizes are in bytes)
aws macie2 describe-buckets \
  --query 'reverse(sort_by(buckets, &sizeInBytes))[:10].{Bucket:bucketName,SizeBytes:sizeInBytes,ClassifiableBytes:classifiableSizeInBytes,Objects:objectCount}'

Practical Cost Reduction Techniques

Scope jobs to sensitive prefixes only. If your data lake has a raw/pii/ prefix for confirmed PII and a raw/aggregated/ prefix for anonymized data, scope jobs to raw/pii/ only. This can reduce scan volume by 70–90%.
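Use sampling for large buckets. Set --sampling-percentage 20 for buckets larger than 1 TB, so each run costs roughly 20% of a full scan. Macie picks the sampled objects at random, so sampling tells you whether a bucket is likely to contain sensitive data but does not guarantee that every object is eventually scanned: five independent 20% samples cover only about 67% of objects in expectation (1 − 0.8^5), and together they cost as much as one full scan.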
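Exclude known-clean file types. Compiled artifacts (.jar, .war, .class) and media files (.jpg, .png) rarely contain PII that Macie can detect. Log files (.log, .gz access logs) can also be excluded once you have verified that they do not capture personal data such as email addresses or request payloads. Exclude these extensions via OBJECT_EXTENSION scope criteria to avoid scanning them.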
Run monthly one-time jobs instead of continuous scheduled jobs for low-risk buckets. Reserve daily or weekly schedules for buckets you know contain PII. For archival or backup buckets, a monthly one-time job is sufficient.
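Use object age scoping for new-data-only scanning. Scope jobs to objects modified in the last 7 or 30 days. Objects that were clean in a previous scan and have not been modified since do not need to be rescanned, and any object that is overwritten gets a new last-modified date and falls back into scope.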
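
When deciding on a sampling percentage, it helps to estimate how much of a bucket repeated random sampling actually covers. The sketch below assumes each run draws an independent random sample from the same set of objects, which is a simplification: a scheduled job may only consider new or changed objects on later runs. The function name is hypothetical.

```python
def expected_coverage(sample_fraction: float, runs: int) -> float:
    """Expected fraction of objects scanned at least once.

    An object is missed by one run with probability (1 - sample_fraction),
    and by all runs with that probability raised to the number of runs.
    """
    return 1.0 - (1.0 - sample_fraction) ** runs


for runs in (1, 5, 10, 20):
    coverage = expected_coverage(0.20, runs)
    relative_cost = 0.20 * runs  # in units of one full scan
    print(f"runs={runs:2d} coverage={coverage:.3f} cost={relative_cost:.1f}x full scan")
```

Under these assumptions, five runs at 20% cost the same as one full scan but cover about two thirds of the objects, so sampling is best treated as a way to triage buckets rather than a cheaper route to full coverage.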

# Scope a job to objects modified after a fixed cutoff (here, about 30 days before
# job creation). The date is static and does not roll forward on later runs.
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --name "NewObjects-PII-Scan" \
  --schedule-frequency '{"monthlySchedule":{"dayOfMonth":1}}' \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "111122223333",
      "buckets": ["prod-customer-data"]
    }],
    "scoping": {
      "includes": {
        "and": [
          {
            "simpleScopeTerm": {
              "comparator": "GT",
              "key": "OBJECT_LAST_MODIFIED_DATE",
              "values": ["2026-05-09T00:00:00Z"]
            }
          }
        ]
      },
      "excludes": {
        "and": [
          {
            "simpleScopeTerm": {
              "comparator": "EQ",
              "key": "OBJECT_EXTENSION",
              "values": ["log","jpg","png","gif","mp4","jar","war","class"]
            }
          }
        ]
      }
    }
  }' \
  --sampling-percentage 50
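
Because the cutoff date in the job definition is a literal value, it is convenient to compute it when the job is created rather than hard-code it. The sketch below builds the scoping fragment with a cutoff relative to the current time. The function name is hypothetical, and the scope keys (OBJECT_LAST_MODIFIED_DATE, OBJECT_EXTENSION) are taken from the Macie API reference, so verify them against the version you use. The extensions are placed under excludes with EQ, which is one way to express the exclusion.

```python
import json
from datetime import datetime, timedelta, timezone


def build_scoping(days_back, excluded_extensions, now=None):
    """Build a job scoping dict: recent objects only, minus some extensions."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days_back)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "includes": {"and": [{"simpleScopeTerm": {
            "comparator": "GT",
            "key": "OBJECT_LAST_MODIFIED_DATE",
            "values": [cutoff],
        }}]},
        "excludes": {"and": [{"simpleScopeTerm": {
            "comparator": "EQ",
            "key": "OBJECT_EXTENSION",
            "values": list(excluded_extensions),
        }}]},
    }


# A fixed "now" keeps the example reproducible; omit it in real use.
scoping = build_scoping(
    30,
    ["log", "jpg", "png"],
    now=datetime(2026, 6, 8, tzinfo=timezone.utc),
)
print(json.dumps(scoping, indent=2))
```

The resulting dictionary can be embedded in the s3JobDefinition passed to create-classification-job, either through the CLI or through the boto3 macie2 client.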

30-day free trial: Macie's free trial covers S3 bucket inventory and monitoring and, where enabled, automated sensitive data discovery up to a data cap. Discovery jobs that you create and run yourself are not part of the trial and are billed as usual, so begin with a narrowly scoped or sampled job rather than a broad one. The console's Usage page shows your estimated cost based on actual usage. Use this data to decide which buckets to scope out before your trial ends, so the decision rests on empirical evidence rather than guesses.

Frequently Asked Questions

Does Macie store copies of my S3 object content?

No. Macie reads S3 objects to scan them but does not retain copies of your data. What Macie stores are findings (metadata about what was detected, in which object, at what line numbers) and, optionally, discovery result records (which contain detection counts and locations but not the actual sensitive data). Object contents are analyzed transiently and are not persisted by Macie after the scan.

How does Macie handle encrypted S3 objects?

Macie can scan objects encrypted with SSE-S3 (AES-256) and with SSE-KMS using an AWS managed key. For SSE-KMS with a customer managed key, scanning works only if the key policy allows the Macie service-linked role (AWSServiceRoleForAmazonMacie) to decrypt with that key; an S3 bucket policy alone does not grant KMS access. Objects encrypted with SSE-C (customer-provided keys) or client-side encryption cannot be scanned by Macie, because Macie does not have access to the encryption keys. Objects in SSE-C buckets appear in your Macie bucket inventory but are skipped during discovery jobs with an error status in the discovery result.

What is the difference between Macie policy findings and sensitive data findings?

Policy findings are generated continuously by Macie's bucket monitoring, without any discovery job. They alert on bucket configuration regressions — a bucket becoming public, encryption being disabled. Sensitive data findings require a discovery job to be explicitly created and run. Policy findings tell you about exposure risk; sensitive data findings tell you about data content. Both are necessary for a complete data security program.

Can Macie detect sensitive data in database exports or Parquet files?

Yes. Macie natively supports Parquet (a common format for analytics data in S3), Avro, CSV, TSV, and JSON Lines — all common formats for database exports and data lake storage. For Parquet files, Macie reads the column schema and inspects column values, making it highly effective at detecting SSNs or credit card numbers in specific columns of an analytics export.

How do I handle a finding where the sensitive data is legitimate business data?

Create a Macie suppression rule (a findings filter with the archive action) scoped to the affected bucket and finding type, or use a Security Hub automation rule to suppress the finding type for that resource ARN. If the matches are specific, known-benign strings or patterns, a Macie allow list that tells Macie to ignore that text is the better fit. Document the suppression in your security runbook with a business justification and review date. Macie suppression rules archive matching findings automatically and keep them from being published to EventBridge and Security Hub, reducing alert fatigue without losing auditability: the finding still exists in Macie but is archived.