AWS Managed Grafana and Prometheus: Observability at Scale (2026)

Modern cloud-native applications running on Kubernetes generate enormous volumes of metrics, logs, and traces. Managing a self-hosted Prometheus and Grafana stack consumes engineering time that is better spent shipping features. Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG) remove the operational burden by providing fully managed, scalable, and highly available observability infrastructure. This guide covers every layer — from workspace provisioning and EKS scrape configuration to PromQL deep-dives, dashboard building, alerting, Terraform automation, and cost planning — giving you a production-ready observability stack by the end.

1. The AWS Observability Stack: CloudWatch vs AMP vs AMG

AWS offers multiple overlapping observability services and choosing among them causes real confusion. The three you will encounter most often are Amazon CloudWatch, Amazon Managed Service for Prometheus (AMP), and Amazon Managed Grafana (AMG). Each solves a different problem at a different layer of the stack.

Amazon CloudWatch is the native AWS telemetry service. It ingests logs, metrics, and traces from virtually every AWS service automatically — EC2, Lambda, ECS, RDS, API Gateway, and hundreds more. If you need to monitor AWS infrastructure with zero configuration, CloudWatch is the default choice. Its query language (CloudWatch Metrics Insights and CloudWatch Logs Insights) is powerful but non-standard. Dashboards in CloudWatch are functional but limited compared with Grafana's panel ecosystem.

Amazon Managed Service for Prometheus (AMP) is a Prometheus-compatible, serverless metrics store. It accepts Prometheus remote_write from any instrumented workload — Kubernetes pods, EC2 instances, Lambda extensions, even on-premises servers via an AWS Distro for OpenTelemetry (ADOT) collector. The critical advantage is that it is 100% PromQL compatible: every query you wrote for self-hosted Prometheus works without modification.

Amazon Managed Grafana (AMG) is a fully managed Grafana service. AWS handles upgrades, high availability, and plugin management, and users sign in through IAM Identity Center (formerly AWS SSO) or a SAML identity provider. Grafana Enterprise plugins are available as an optional paid upgrade. AMG connects natively to AMP, CloudWatch, X-Ray, Athena, OpenSearch, and dozens of community data sources in the same workspace, making it a natural single pane of glass for a mixed AWS + Kubernetes environment.

Rule of thumb: Use CloudWatch for AWS-native resource metrics and log aggregation. Use AMP + AMG for Kubernetes application metrics, custom business metrics, and cross-account observability at scale. Many production teams run both simultaneously, with AMG querying both CloudWatch and AMP data sources in a single dashboard.

The typical data flow is: instrumented application pods expose a /metrics endpoint in Prometheus exposition format → a Prometheus server (or ADOT collector) scrapes those endpoints → remote_write forwards samples to AMP → AMG queries AMP via PromQL and renders dashboards. Alert rules are evaluated either by the AMP ruler (managed rule groups) or by the Prometheus server in the cluster; the resulting alerts are then routed by an alert manager (AMP's managed one or an in-cluster Alertmanager) to destinations such as SNS, Slack, PagerDuty, or OpsGenie.

See our Amazon CloudWatch Monitoring guide for a full walkthrough of the CloudWatch side of this observability stack, including log groups, metric filters, and composite alarms.

2. Amazon Managed Service for Prometheus (AMP): Workspaces and Setup

An AMP workspace is the top-level resource. It is a dedicated, tenant-isolated Prometheus environment with its own endpoint, retention policy (default 150 days), and IAM permissions boundary. You can have multiple workspaces per AWS account — one per environment (dev, staging, production) is the recommended pattern.

Creating a workspace takes seconds using the AWS CLI. The command below creates a production workspace and tags it for cost allocation:

# Create an AMP workspace
aws amp create-workspace \
  --alias "techoral-production" \
  --tags Environment=production,Team=platform \
  --region us-east-1

# The response includes the workspace ARN, status, and workspaceId
# Example output:
# {
#   "arn": "arn:aws:aps:us-east-1:123456789012:workspace/ws-abc123",
#   "status": { "statusCode": "CREATING" },
#   "workspaceId": "ws-abc123"
# }

# Describe the workspace to get the endpoint once status is ACTIVE
aws amp describe-workspace \
  --workspace-id ws-abc123 \
  --region us-east-1 \
  --query 'workspace.prometheusEndpoint' \
  --output text
# https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc123/

Once active, the workspace exposes two HTTP API paths that matter:

  • /api/v1/remote_write — the ingest endpoint for Prometheus remote_write
  • /api/v1/query and /api/v1/query_range — the PromQL query endpoints (used by AMG)

Authentication uses AWS Signature Version 4 (SigV4). Any request to AMP must be signed with valid AWS credentials that carry the aps:RemoteWrite (for ingest) or aps:QueryMetrics (for queries) IAM permission. The ADOT collector, the sigv4 block built into Prometheus remote_write, and the AWS-maintained aws-sigv4-proxy all handle signing transparently, so your Prometheus config never contains passwords.
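
For readers curious what "signing" involves, the sketch below shows only the SigV4 signing-key derivation step, using the standard library. It assumes the service name "aps" for AMP; the secret and date are placeholders. It is not a full request signer (canonical request and string-to-sign are omitted), and in practice the tools named above do all of this for you.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str = "aps") -> bytes:
    """Derive the SigV4 signing key: a chain of HMACs over date, region, service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    key = derive_signing_key("EXAMPLE-SECRET-NOT-REAL", "20260115", "us-east-1")
    print(len(key))  # 32 (HMAC-SHA256 digest size in bytes)
```

Because the key is scoped to a date, Region, and service, a signature made for one workspace Region cannot be replayed against another.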

IAM permissions required for remote_write: Attach a policy with aps:RemoteWrite to the IAM role used by your Prometheus pod or ADOT collector. Use IRSA (IAM Roles for Service Accounts) on EKS — never mount long-lived credentials as environment variables.

AMP also supports alert rules and recording rules stored as rule groups directly in the workspace. This moves rule evaluation off your cluster and into the managed service, which eliminates the risk of losing alerts if your Prometheus pod crashes:

# Upload a rule group to AMP
aws amp create-rule-groups-namespace \
  --workspace-id ws-abc123 \
  --name "k8s-rules" \
  --data fileb://rules.yaml \
  --region us-east-1

For multi-account architectures, you can use AMP cross-account data sharing via resource-based IAM policies, allowing a central observability account to query workspaces in every application account without replicating data.

3. Scraping EKS Metrics with the Prometheus Operator

The Prometheus Operator is the standard way to deploy and configure Prometheus on Kubernetes. It introduces custom resources — ServiceMonitor, PodMonitor, and PrometheusRule — that let you define scrape targets declaratively as Kubernetes objects rather than editing a monolithic prometheus.yaml.

Install the kube-prometheus-stack Helm chart, which bundles the Prometheus Operator, a pre-configured Prometheus instance, kube-state-metrics, node-exporter, and a set of default alert rules:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.remoteWrite[0].url="https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc123/api/v1/remote_write" \
  --set prometheus.prometheusSpec.remoteWrite[0].sigv4.region="us-east-1" \
  --set prometheus.prometheusSpec.remoteWrite[0].sigv4.roleArn="arn:aws:iam::123456789012:role/PrometheusRemoteWriteRole" \
  --set prometheus.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::123456789012:role/PrometheusRemoteWriteRole"

A ServiceMonitor tells the Prometheus Operator which Kubernetes Services to scrape. The following manifests define a ServiceMonitor for a Java Spring Boot application that exposes Actuator metrics on a Service port named http (8080 in this example), followed by a PodMonitor for workloads that have no Service:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spring-boot-app-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match Prometheus selector
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: spring-boot-app
      tier: backend
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s
      scrapeTimeout: 10s
      scheme: http
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
---
# PodMonitor for workloads that don't expose a Kubernetes Service
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-job-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: batch-processor
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 60s

The relabelings block enriches every scraped sample with Kubernetes metadata labels such as pod name, namespace, and version. This metadata is invaluable in PromQL when you want to filter metrics by deployment version or isolate a single misbehaving pod.
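
As a rough mental model of what each relabeling entry does, the sketch below imitates the default "replace" action: join the source label values, match them against an anchored regex, and write the result to the target label. The function name and defaults are illustrative assumptions; Prometheus writes the capture group as $1 where Python uses \1, and the real implementation has more actions and edge cases than shown here.

```python
import re


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement=r"\1", separator=";"):
    """Simplified model of a Prometheus relabeling rule with action=replace."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # relabel regexes are anchored at both ends
    if match is None:
        return dict(labels)  # no match: labels pass through unchanged
    updated = dict(labels)
    updated[target_label] = match.expand(replacement)
    return updated


if __name__ == "__main__":
    discovered = {
        "__meta_kubernetes_pod_name": "spring-boot-app-7d9f",
        "__meta_kubernetes_namespace": "production",
    }
    step1 = relabel_replace(discovered, ["__meta_kubernetes_pod_name"], "pod")
    step2 = relabel_replace(step1, ["__meta_kubernetes_namespace"], "namespace")
    print(step2["pod"], step2["namespace"])
```

Labels that start with a double underscore are discarded after relabeling, which is why the metadata has to be copied to plain names such as pod and namespace.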

See our Amazon EKS Kubernetes guide for cluster setup prerequisites, and our EKS Fargate guide if your workloads run in Fargate profiles where node-exporter is unavailable.

4. AMP remote_write Configuration in prometheus.yaml

If you manage your own Prometheus deployment rather than using the Operator, the raw prometheus.yaml configuration below shows how to wire up remote_write to AMP with SigV4 authentication and sensible queue tuning parameters. Correct queue tuning is critical: the default settings are too conservative for high-cardinality EKS clusters and will cause samples to back up in memory.

# prometheus.yaml — remote_write section for AMP
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: "techoral-prod-eks"
    region:  "us-east-1"
    environment: "production"

remote_write:
  - url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc123/api/v1/remote_write"
    sigv4:
      region: us-east-1
      # role_arn is optional when running on EC2/EKS with an attached IAM role
      role_arn: arn:aws:iam::123456789012:role/PrometheusRemoteWriteRole
    queue_config:
      # Samples buffered per shard; a full buffer stalls reads from the WAL
      capacity: 10000
      # Maximum number of samples per send batch
      max_samples_per_send: 5000
      # Upper bound on parallel senders (shards) to the AMP endpoint
      max_shards: 200
      # Minimum shards: keeps connections warm under low traffic
      min_shards: 5
      # Backoff window for retrying recoverable errors (5xx by default;
      # 429 only when retry_on_http_429 is enabled)
      min_backoff: 30ms
      max_backoff: 5s
    metadata_config:
      send: true
      send_interval: 1m
    # Drop low-value metrics and labels before they are sent (and billed)
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*|process_open_fds|process_max_fds"
        action: drop
      # labeldrop matches label names against regex; it takes no source_labels
      - regex: pod_template_hash
        action: labeldrop

# Example scrape config for node (kubelet) metrics via the API server proxy
scrape_configs:
  - job_name: "kubernetes-nodes"
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: "kubernetes.default.svc:443"
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: "/api/v1/nodes/${1}/proxy/metrics"

Cost control tip: The write_relabel_configs drop rules matter because AMP charges per sample ingested. Dropping noisy, low-value metrics like go_gc_* before remote_write can reduce your monthly AMP bill by roughly 10–30% in large clusters, depending on how much of your series count those metrics account for, without losing actionable signal.
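
Before deploying a drop rule it is worth checking exactly which metric names it catches. The sketch below tests the regex from the config above with Python's re.fullmatch, which mirrors the way Prometheus anchors relabel regexes at both ends. The metric name process_open_fds_ratio is a made-up example used to show the anchoring.

```python
import re

DROP_REGEX = "go_gc_.*|process_open_fds|process_max_fds"


def is_dropped(metric_name: str) -> bool:
    # Prometheus anchors relabel regexes, so only whole-name matches count.
    return re.fullmatch(DROP_REGEX, metric_name) is not None


if __name__ == "__main__":
    candidates = [
        "go_gc_duration_seconds",
        "go_goroutines",
        "process_open_fds",
        "process_open_fds_ratio",  # hypothetical name, not matched because of anchoring
    ]
    for name in candidates:
        print(f"{name}: {'drop' if is_dropped(name) else 'keep'}")
```

Running a check like this against the output of a label-values query for __name__ gives a quick preview of what the rule will remove.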

The external_labels block stamps every sample with cluster, region, and environment. These labels survive the round-trip to AMP and appear in AMG queries, enabling you to distinguish production data from staging data in a multi-cluster workspace without separate workspaces.

5. PromQL Examples: CPU, Memory, Request Rate, Error Rate

PromQL (Prometheus Query Language) is a functional query language for time-series data. It supports aggregation over labels, range vectors, subqueries, and offset modifiers. The following real-world queries cover the four golden signals — latency, traffic, errors, and saturation — applied to a Kubernetes production cluster.

CPU Utilization per Pod

# CPU usage as a percentage of the requested CPU for each pod
# This query is useful for identifying CPU-constrained pods
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{
    container!="",
    container!="POD",
    namespace=~"production|staging"
  }[5m])
)
/
sum by (namespace, pod) (
  kube_pod_container_resource_requests{
    resource="cpu",
    namespace=~"production|staging"
  }
) * 100

Memory Pressure: Containers Near Their Limit

# Percentage of memory limit consumed — alert when above 85%
sum by (namespace, pod, container) (
  container_memory_working_set_bytes{
    container!="",
    container!="POD"
  }
)
/
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="memory"}
) * 100

# Instant alert expression (use in AMP rule groups):
# (above_query) > 85

HTTP Request Rate per Service

# Requests per second per service and status code class, 5-minute window
sum by (service, namespace, status_code_class) (
  label_replace(
    rate(http_requests_total{namespace=~"production"}[5m]),
    "status_code_class",
    "$1xx",
    "code",
    "([0-9]).*"
  )
)

# Simpler version — total RPS across all status codes
sum by (service, namespace) (
  rate(http_requests_total{namespace="production"}[5m])
)
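
The rate() function used throughout these queries can be approximated by hand. The following is a simplified sketch, assuming a list of (timestamp, counter value) pairs: it sums the increases, treats any drop as a counter reset, and divides by the elapsed time. Real Prometheus also extrapolates to the edges of the range window, which this sketch deliberately leaves out.

```python
def simple_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) pairs, oldest first.

    Simplified: handles counter resets but does no window extrapolation.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A lower value means the counter restarted (e.g. pod restart): count from zero.
        increase += value if value < previous else value - previous
        previous = value
    return increase / elapsed


if __name__ == "__main__":
    steady = [(0, 0), (300, 1500)]
    with_reset = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
    print(simple_rate(steady))      # 5.0 requests per second
    print(simple_rate(with_reset))  # 190 increase over 240 seconds
```

The reset handling is why rate() should be applied to the raw counter before any sum(): summing first hides individual resets.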

Error Rate and Apdex Score

# HTTP 5xx error rate as a fraction of total traffic
sum by (service, namespace) (
  rate(http_requests_total{code=~"5..", namespace="production"}[5m])
)
/
sum by (service, namespace) (
  rate(http_requests_total{namespace="production"}[5m])
)

# Apdex score (satisfied < 300ms, tolerating < 1200ms)
# Assumes the histogram defines le="0.3" and le="1.2" buckets
(
  sum by (service) (rate(http_request_duration_seconds_bucket{le="0.3",namespace="production"}[5m]))
  +
  sum by (service) (rate(http_request_duration_seconds_bucket{le="1.2",namespace="production"}[5m]))
) / 2
/
sum by (service) (rate(http_request_duration_seconds_count{namespace="production"}[5m]))

Recording rules for performance: Expensive aggregations like the Apdex query above should be precomputed as AMP recording rules. A recording rule evaluates the query at the rule group's interval (for example every 30 or 60 seconds) and stores the result as a new series. Dashboards then load noticeably faster because AMG fetches the pre-aggregated series instead of recomputing the aggregation on every refresh.
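
The arithmetic behind the Apdex expression is easy to check by hand. Histogram buckets are cumulative, so the le="1.2" bucket already contains the satisfied requests; adding the two buckets and halving gives full credit to satisfied requests and half credit to tolerating ones. The sketch below uses made-up counts to illustrate this.

```python
def apdex(satisfied_bucket, tolerating_bucket, total):
    """Apdex from cumulative histogram buckets.

    satisfied_bucket:  requests with latency <= satisfied threshold (le="0.3")
    tolerating_bucket: requests with latency <= tolerating threshold (le="1.2"),
                       which includes the satisfied ones
    total:             all requests (the _count series)
    """
    if total == 0:
        return None
    return (satisfied_bucket + tolerating_bucket) / 2 / total


if __name__ == "__main__":
    # Hypothetical counts: 80 fast, 15 tolerable, 5 slow requests.
    score = apdex(satisfied_bucket=80, tolerating_bucket=95, total=100)
    print(score)  # 0.875
    # Same result with the textbook form: satisfied + tolerating_only / 2
    print((80 + (95 - 80) / 2) / 100)  # 0.875
```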

For node-level saturation queries — disk I/O, network throughput, open file descriptors — see the node_* metrics family exposed by node-exporter. Pair these with the CPU and memory queries above for a complete infrastructure health panel in AMG.

6. Amazon Managed Grafana (AMG): Workspace and SSO Setup

An AMG workspace is a fully managed Grafana instance that AWS operates on your behalf. It is built on open-source Grafana, ships with the AWS data source suite, offers Grafana Enterprise plugins as an optional paid upgrade, and scales automatically to handle concurrent dashboard viewers. Authentication is handled through IAM Identity Center (formerly AWS SSO) or SAML, which means your existing corporate directory (Active Directory, Okta, Google Workspace) can provide single sign-on without any Grafana LDAP configuration.

Before creating an AMG workspace, enable IAM Identity Center in your AWS account or organization. Then create the workspace:

# Create an AMG workspace with IAM Identity Center auth and service-managed permissions
aws grafana create-workspace \
  --workspace-name "techoral-observability" \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED \
  --workspace-data-sources PROMETHEUS CLOUDWATCH XRAY \
  --workspace-notification-destinations SNS \
  --tags Environment=production \
  --region us-east-1

# The response returns a workspaceId, e.g. g-abc12345
# The Grafana endpoint will be:
# https://g-abc12345.grafana-workspace.us-east-1.amazonaws.com

After the workspace is ACTIVE, assign users or groups from IAM Identity Center:

# List users in IAM Identity Center to find their IDs
aws identitystore list-users \
  --identity-store-id d-1234567890 \
  --region us-east-1

# Assign a user as an ADMIN to the AMG workspace
aws grafana update-permissions \
  --workspace-id g-abc12345 \
  --update-instruction-batch \
    'action=ADD,role=ADMIN,users=[{id=abc-user-id-123,type=SSO_USER}]' \
  --region us-east-1

# Assign a group as VIEWER (read-only dashboards)
aws grafana update-permissions \
  --workspace-id g-abc12345 \
  --update-instruction-batch \
    'action=ADD,role=VIEWER,users=[{id=grp-id-456,type=SSO_GROUP}]' \
  --region us-east-1

AMG supports three roles: ADMIN (full workspace management), EDITOR (create and edit dashboards), and VIEWER (read-only access). Map these to your organizational roles — senior SREs as ADMIN, developers as EDITOR, stakeholders as VIEWER.

The SERVICE_MANAGED permission type means AWS automatically creates the IAM role that grants AMG read access to AMP, CloudWatch, X-Ray, and other specified data sources. If you prefer explicit control, use CUSTOMER_MANAGED and create the role yourself — useful when the workspace spans multiple AWS accounts. See our IAM Roles and Policies guide for details on constructing the least-privilege policy.

7. Connecting AMG to AMP as a Data Source

AMG connects to AMP using native AWS authentication — no API keys or passwords required. The workspace's IAM role (created by SERVICE_MANAGED or by you) must carry aps:QueryMetrics on the target workspace ARN. AWS handles the SigV4 request signing internally when AMG sends PromQL queries to AMP.

To add AMP as a data source in the Grafana UI, navigate to Configuration → Data Sources → Add data source (Connections → Data sources in newer Grafana versions) and choose Amazon Managed Service for Prometheus. Enter the workspace endpoint URL:

https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc123/

Set Auth Provider to AWS SDK Default and select the correct region. Click Save & Test — a green checkmark confirms AMG can reach and query your AMP workspace.

For automated or Terraform-driven setups, the data source can be pre-configured using the Grafana HTTP API. AMG exposes the same Grafana API at https://<workspace-id>.grafana-workspace.<region>.amazonaws.com/api. Generate a service account token in the workspace and use it with the API:

# Create a service account token in AMG for API automation
# (do this once inside the Grafana UI or via Terraform grafana provider)

# Add AMP data source via Grafana API
curl -X POST \
  -H "Authorization: Bearer glsa_your_service_account_token" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "AMP-Production",
    "type": "prometheus",
    "url": "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc123/",
    "access": "proxy",
    "jsonData": {
      "authType": "default",
      "defaultRegion": "us-east-1",
      "sigV4Auth": true,
      "sigV4AuthType": "default",
      "sigV4Region": "us-east-1"
    },
    "isDefault": true
  }' \
  https://g-abc12345.grafana-workspace.us-east-1.amazonaws.com/api/datasources

You can also connect AMG to CloudWatch as a second data source in the same workspace. This allows unified dashboards that show both Kubernetes pod metrics from AMP and AWS RDS or ALB metrics from CloudWatch side by side — a significant advantage over managing separate Grafana instances for each data source type.

8. Building a Kubernetes Dashboard in Grafana

A well-structured Kubernetes observability dashboard covers four tiers: cluster health, namespace summary, workload detail (Deployment / StatefulSet), and individual pod drill-down. Grafana's variable system drives this hierarchy — selecting a namespace in the top variable filters all panels on the page to that namespace, and selecting a workload further narrows the pod-level panels.

Start from a community dashboard such as Kubernetes / Views / Pods (ID 15760 on grafana.com; confirm the title and ID on the dashboard's page before importing). Import it via Dashboards → Import → ID 15760, select your AMP data source, and it renders with zero PromQL authoring, provided kube-state-metrics and cAdvisor metrics are already being scraped.

For a custom dashboard, define Grafana template variables first. These appear as dropdowns at the top of the dashboard:

# Grafana dashboard variables (shown as key/value settings, not literal dashboard JSON)
# Add these in Dashboard Settings → Variables

# Variable 1: namespace
name: namespace
type: query
datasource: AMP-Production
query: label_values(kube_namespace_labels, namespace)
refresh: 2   # refresh on time range change
multi: false
includeAll: false
sort: 1

# Variable 2: workload
name: workload
type: query
datasource: AMP-Production
query: label_values(kube_deployment_labels{namespace="$namespace"}, deployment)
refresh: 2
multi: true
includeAll: true

# Variable 3: pod
name: pod
type: query
datasource: AMP-Production
query: label_values(kube_pod_info{namespace="$namespace",created_by_name=~"$workload.*"}, pod)
refresh: 2
multi: true
includeAll: true

With variables defined, add panels using PromQL expressions that reference $namespace, $workload, and $pod. A stat panel showing total cluster CPU utilization, a time series showing per-pod memory consumption, and a table panel listing pods sorted by CPU usage make for a complete workload view.

Alerts from dashboards: In AMG, dashboard panels support Grafana Unified Alerting. Open the panel menu → Edit → Alert tab to define a threshold. AMG evaluates the alert rule on a schedule (default every 1 minute) and sends notifications via the configured contact points, such as Slack, PagerDuty, SNS, or email. For large alert volumes, offload rule evaluation to AMP's ruler instead, which scales independently of Grafana.

Best practice: Import the kube-prometheus-stack default dashboards (shipped with the Helm chart) as the baseline. Customize and extend them rather than building from scratch. Community dashboards are battle-tested across thousands of clusters and cover corner cases — like pod restart storms and PVC capacity — that homegrown dashboards often miss.

9. Alertmanager: Routing Rules and Slack/PagerDuty Notifications

Alertmanager receives firing alerts from Prometheus and routes them to the correct notification channel based on labels such as severity, team, and environment. The routing tree is defined in alertmanager.yaml. The following configuration routes critical alerts to PagerDuty, sends production warning and info alerts to Slack, and diverts staging alerts to a lower-priority Slack channel.

# alertmanager.yaml — production configuration
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX"

# Templates for message formatting
templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  # Default receiver for unmatched alerts
  receiver: slack-warnings
  group_by: ['alertname', 'cluster', 'namespace']
  group_wait: 30s          # wait before sending first notification in a group
  group_interval: 5m       # wait before sending new notifications for an existing group
  repeat_interval: 4h      # resend unresolved alerts every 4 hours
  routes:
    # Critical alerts → PagerDuty (immediate, no grouping delay)
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 0s
      repeat_interval: 1h
      continue: false

    # Warning alerts in production → Slack #alerts channel
    - match_re:
        severity: "warning|info"
        namespace: "production"
      receiver: slack-warnings
      continue: true

    # All staging alerts → Slack #dev-noise (lower priority channel)
    - match:
        environment: staging
      receiver: slack-dev
      continue: false

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "your-pagerduty-integration-key"
        description: '{{ template "pagerduty.default.description" . }}'
        severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity }}{{ else }}critical{{ end }}'
        details:
          cluster: '{{ .CommonLabels.cluster }}'
          namespace: '{{ .CommonLabels.namespace }}'
          runbook: 'https://techoral.com/runbooks/{{ .CommonLabels.alertname }}'

  - name: slack-warnings
    slack_configs:
      - channel: '#platform-alerts'
        send_resolved: true
        icon_url: 'https://avatars3.githubusercontent.com/u/3380462'
        title: '{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }} {{ .CommonLabels.alertname }}'
        text: |-
          *Cluster:* {{ .CommonLabels.cluster }}
          *Namespace:* {{ .CommonLabels.namespace }}
          *Severity:* {{ .CommonLabels.severity }}
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
        actions:
          - type: button
            text: 'View in Grafana'
            url: '{{ .CommonAnnotations.grafana_url }}'

  - name: slack-dev
    slack_configs:
      - channel: '#dev-noise'
        send_resolved: false
        title: '[STAGING] {{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'

inhibit_rules:
  # Suppress a warning alert when the same alert is already firing as critical for that service
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace', 'service']

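Store the Alertmanager configuration in a Kubernetes Secret and reference it from the kube-prometheus-stack values. The inhibit_rules section matters in production: when the same alert is firing at both warning and critical severity for a given namespace and service, only the critical notification is sent, which keeps duplicate warnings from flooding your Slack channel and masking the root cause. To also suppress warnings from dependent services, add inhibit rules keyed on labels those alerts share with the upstream failure.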
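
Routing trees are easier to reason about when traced by hand. The sketch below is a simplified model of the three child routes in the configuration above: routes are tried in order, match_re patterns are anchored, a matching route with continue: false stops the search, and an alert that matches nothing falls back to the default receiver. The function names are illustrative, and grouping, timing, and nested routes are not modelled.

```python
import re

DEFAULT_RECEIVER = "slack-warnings"
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": False},
    {"match_re": {"severity": "warning|info", "namespace": "production"},
     "receiver": "slack-warnings", "continue": True},
    {"match": {"environment": "staging"}, "receiver": "slack-dev", "continue": False},
]


def route_matches(route, labels):
    exact_ok = all(labels.get(k) == v for k, v in route.get("match", {}).items())
    regex_ok = all(re.fullmatch(pattern, labels.get(k, "")) is not None
                   for k, pattern in route.get("match_re", {}).items())
    return exact_ok and regex_ok


def receivers_for(labels):
    """Return the receivers an alert with these labels would be sent to."""
    matched = []
    for route in ROUTES:
        if route_matches(route, labels):
            matched.append(route["receiver"])
            if not route["continue"]:
                break
    return matched or [DEFAULT_RECEIVER]


if __name__ == "__main__":
    print(receivers_for({"severity": "critical", "environment": "staging"}))
    print(receivers_for({"severity": "warning", "namespace": "production"}))
    print(receivers_for({"severity": "info", "environment": "staging"}))
```

One consequence worth noticing: because the critical route comes first and does not continue, a critical alert from staging still pages PagerDuty. If that is not intended, move the staging route above it.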

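For alert rules that complement this routing config, store them as AMP rule groups. Alerts produced by the AMP ruler are delivered to the workspace's managed alert manager rather than to an in-cluster Alertmanager, so if you rely on the in-cluster routing above, also load the same rules into the in-cluster Prometheus (for example as a PrometheusRule). Example rules for a high pod restart rate and high memory usage: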

# rules.yaml — upload to AMP with aws amp create-rule-groups-namespace
groups:
  - name: kubernetes-workload-alerts
    interval: 1m
    rules:
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 > 1
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
          description: "Container {{ $labels.container }} has restarted {{ $value | humanize }} times/min for 5 minutes."
          grafana_url: "https://g-abc12345.grafana-workspace.us-east-1.amazonaws.com/d/k8s-pods"

      - alert: HighMemoryUsage
        expr: |
          sum by (namespace, pod) (
            container_memory_working_set_bytes{container!=""}
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_limits{resource="memory"}
          ) > 0.90
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage above 90%"
          description: "Memory utilization is {{ $value | humanizePercentage }} of the configured limit."
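
The for: field in the rules above controls how long an expression must stay true before the alert fires. The sketch below models that behaviour under simplifying assumptions: one evaluation per interval, a single series, and a hypothetical alert_states helper. An alert is pending from the first true evaluation, fires once it has been continuously true for the for: duration, and resets as soon as one evaluation is false.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp_seconds, expr_is_true) evaluations to alert states."""
    states = []
    active_since = None
    for timestamp, is_true in evaluations:
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_long_enough = timestamp - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


if __name__ == "__main__":
    # interval: 1m, for: 5m, expression true on every evaluation
    steady = [(t, True) for t in range(0, 361, 60)]
    print(alert_states(steady, for_seconds=300))
    # A single false evaluation at t=120 restarts the countdown
    flapping = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
    print(alert_states(flapping, for_seconds=120))
```

This is why a short for: paired with a flapping expression produces noisy alerts, while a long for: delays detection; the 5m and 10m values above are a compromise between the two.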

10. Terraform: Provision AMP and AMG Workspaces

Infrastructure as Code is the right way to manage AMP and AMG — especially in organizations with multiple accounts and environments. The Terraform AWS provider includes full support for both services. The following HCL provisions an AMP workspace, an AMG workspace, the necessary IAM roles, and wires them together.

# main.tf — AMP + AMG Terraform configuration

terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.50"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

variable "aws_region"    { default = "us-east-1" }
variable "environment"   { default = "production" }
variable "eks_oidc_url"  { description = "EKS OIDC provider URL for IRSA, without the https:// prefix" }
variable "account_id"    { description = "AWS account ID" }

# ── AMP Workspace ──────────────────────────────────────────────────────────────
resource "aws_prometheus_workspace" "main" {
  alias = "techoral-${var.environment}"

  logging_configuration {
    log_group_arn = "${aws_cloudwatch_log_group.amp_logs.arn}:*"
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_cloudwatch_log_group" "amp_logs" {
  name              = "/aws/prometheus/techoral-${var.environment}"
  retention_in_days = 30
}

# ── IAM Role for Prometheus remote_write (IRSA) ────────────────────────────────
data "aws_iam_policy_document" "prometheus_assume_role" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::${var.account_id}:oidc-provider/${var.eks_oidc_url}"]
    }
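    condition {
      test     = "StringEquals"
      variable = "${var.eks_oidc_url}:sub"
      # Must match the namespace and name of the Prometheus service account.
      # For the Helm release in section 3 that is kube-prometheus-stack-prometheus.
      values   = ["system:serviceaccount:monitoring:kube-prometheus-stack-prometheus"]
    }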
  }
}

resource "aws_iam_role" "prometheus_remote_write" {
  name               = "PrometheusRemoteWriteRole-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.prometheus_assume_role.json
}

resource "aws_iam_role_policy" "prometheus_remote_write" {
  name = "PrometheusRemoteWritePolicy"
  role = aws_iam_role.prometheus_remote_write.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["aps:RemoteWrite", "aps:GetSeries", "aps:GetLabels", "aps:GetMetricMetadata"]
        Resource = aws_prometheus_workspace.main.arn
      }
    ]
  })
}

# ── AMG Workspace ──────────────────────────────────────────────────────────────
resource "aws_grafana_workspace" "main" {
  name                     = "techoral-observability-${var.environment}"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"

  data_sources = [
    "PROMETHEUS",
    "CLOUDWATCH",
    "XRAY",
    "AMAZON_OPENSEARCH_SERVICE"
  ]

  notification_destinations = ["SNS"]

  configuration = jsonencode({
    plugins = { pluginAdminEnabled = true }
    unifiedAlerting = { enabled = true }
  })

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# ── Outputs ────────────────────────────────────────────────────────────────────
output "amp_workspace_id" {
  value = aws_prometheus_workspace.main.id
}

output "amp_endpoint" {
  value = aws_prometheus_workspace.main.prometheus_endpoint
}

output "amg_endpoint" {
  value = "https://${aws_grafana_workspace.main.endpoint}"
}

output "prometheus_irsa_role_arn" {
  value = aws_iam_role.prometheus_remote_write.arn
}

Apply with terraform init && terraform plan && terraform apply. After apply, feed the amp_endpoint and prometheus_irsa_role_arn outputs into your Helm values for the kube-prometheus-stack deployment; the remote_write URL is the amp_endpoint value with api/v1/remote_write appended. See our AWS Terraform Guide for state backend configuration and workspace organization best practices.
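
When wiring Terraform outputs into Helm values, the workspace endpoint still needs the API path appended. The helper below is a small sketch of that step; the function names are illustrative, and the endpoint shown is the example workspace used throughout this guide.

```python
def remote_write_url(prometheus_endpoint: str) -> str:
    """Build the remote_write URL from an AMP workspace endpoint."""
    return prometheus_endpoint.rstrip("/") + "/api/v1/remote_write"


def query_url(prometheus_endpoint: str) -> str:
    """Build the instant-query URL from an AMP workspace endpoint."""
    return prometheus_endpoint.rstrip("/") + "/api/v1/query"


if __name__ == "__main__":
    endpoint = "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc123/"
    print(remote_write_url(endpoint))
    print(query_url(endpoint))
```

Stripping the trailing slash first avoids a double slash in the path regardless of how the endpoint value was produced.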

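For CDK-based infrastructure pipelines, see our AWS CDK guide. At the time of writing, the aws-aps (AMP) and aws-grafana (AMG) modules expose L1 (CloudFormation-level) constructs such as CfnWorkspace rather than higher-level L2 constructs.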

11. Cost Analysis: AMP and AMG Pricing in 2026

Understanding AMP and AMG pricing prevents bill shock on high-cardinality clusters. Both services use consumption-based pricing with no upfront costs or reserved capacity commitments.

Amazon Managed Service for Prometheus (AMP) Pricing

AMP charges across three dimensions:

  • Metrics ingestion: billed per sample ingested, on a tiered scale. The published first-tier rate has been $0.90 per 10 million samples (not per billion), with lower rates at higher monthly volumes. A typical EKS cluster with 50 pods scraped every 30 seconds generates roughly 200 million samples/day, or about 6 billion samples/month, which comes to at most ~$540/month at the first-tier rate and less once the volume tiers apply.
  • Metrics storage: $0.03 per GB-month. Samples compress well, so storage is usually a small fraction of the ingestion charge, even with the 150-day retention default.
  • Metrics queried: $0.10 per billion samples processed by queries. This is negligible for normal dashboard usage but can grow with aggressive alerting that evaluates expensive PromQL every 15 seconds.

Rates vary by Region and change over time, so confirm them on the AMP pricing page before budgeting.

The dominant cost driver is ingestion. Reducing scrape cardinality (fewer label dimensions), increasing scrape interval from 15s to 30s, and using write_relabel_configs to drop unused metrics are the three highest-leverage optimizations.

Amazon Managed Grafana (AMG) Pricing

AMG is priced per active user per month:

  • Editor/Admin users: $9.00 per active user/month
  • Viewer users: $5.00 per active user/month

An "active user" is any SSO user who logs into the workspace at least once during the billing month. Users who do not log in are not charged. A 20-person engineering team where 5 are editors and 15 are viewers pays $45 + $75 = $120/month for AMG, a fraction of the cost of operating a self-hosted Grafana Enterprise cluster on EKS.

Total observability cost estimate (mid-size production cluster):
AMP: driven almost entirely by sample volume; tens of dollars per month for a few hundred million samples, rising to a few hundred dollars per month at the 6 billion samples/month of the example above | AMG: ~$50–150/month depending on team size
Compare: self-hosted Prometheus (3-replica HA) + Grafana on EKS: $150–400/month in EC2/EBS costs alone, before the engineering time to operate it.
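
For capacity planning, the two inputs that matter most are the number of active series and the scrape interval. The sketch below estimates daily sample volume from those, and reproduces the AMG per-user arithmetic. The function names and the 70,000-series figure are illustrative assumptions; AMP rates are intentionally left out so that you plug in the current values from the pricing page.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(active_series: int, scrape_interval_seconds: int) -> int:
    """Each active series yields one sample per scrape."""
    scrapes_per_day = SECONDS_PER_DAY // scrape_interval_seconds
    return active_series * scrapes_per_day


def amg_monthly_cost(editors: int, viewers: int,
                     editor_rate: float = 9.00, viewer_rate: float = 5.00) -> float:
    """AMG cost for active users, using the per-user rates quoted above."""
    return editors * editor_rate + viewers * viewer_rate


if __name__ == "__main__":
    # Roughly 70,000 active series at a 30s interval is about 200 million samples/day.
    print(samples_per_day(70_000, 30))   # 201600000
    # Halving the interval to 15s doubles the ingested volume.
    print(samples_per_day(70_000, 15))   # 403200000
    print(amg_monthly_cost(editors=5, viewers=15))  # 120.0
```

Multiplying the daily sample count by 30 and by the current per-sample ingestion rate gives a first-order monthly AMP estimate.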

See our AWS Cost Optimization guide for broader strategies around tagging, budgets, and Savings Plans that apply to your overall AWS bill.

12. Self-Hosted vs Managed Prometheus and Grafana on EKS

The managed vs self-hosted decision is not purely about cost. It involves operational complexity, scalability ceiling, compliance requirements, and team expertise. The table below summarizes the key trade-offs.

Dimension | Self-Hosted (EKS) | AMP + AMG (Managed)
Operational overhead | High: upgrades, HA, storage management | Near-zero: AWS handles all of it
Scalability | Limited by node resources; requires Thanos/Cortex for scale | Serverless: scales to billions of samples/day automatically
PromQL compatibility | 100% native | 100% compatible: no query changes needed
Cost (small cluster) | Low: shares existing EKS nodes | Slightly higher at low scale
Cost (large cluster) | High: dedicated nodes + Thanos object storage | Lower per-sample at scale
Multi-cluster support | Complex: requires Thanos or federation | Native: multiple remote_write sources to one workspace
Authentication | Manual: LDAP, OAuth, or Grafana built-in auth | IAM Identity Center SSO: integrates with corporate directory
Data residency | Full control | Stays in AWS region: suitable for most compliance frameworks
Custom Grafana plugins | Any plugin installable | Curated list (Enterprise plugins via optional upgrade)

The recommendation for most organizations running production workloads on EKS is to adopt AMP + AMG. The operational savings are substantial, and the PromQL compatibility ensures your existing runbooks, alert rules, and dashboard queries work without modification. Teams that require on-premises data storage, specific custom plugins not available in AMG, or extremely tight cost control at very small scale may still prefer self-hosted.

For teams migrating from self-hosted Prometheus, the migration path is straightforward: add the remote_write block to your existing Prometheus config, point it at AMP, and run both systems in parallel for one week to validate data parity. Then cut AMG over to replace your Grafana instance, import existing dashboard JSON (queries carry over unchanged, though data source references may need remapping to the AMP data source), and decommission the self-hosted stack.

For further reading on the EKS cluster underpinning this observability stack, see our Amazon EKS Kubernetes guide, our EKS Fargate guide, and our AWS Security Best Practices article for hardening the IAM and network policies around your monitoring infrastructure.