Kubernetes Operators: Automating Complex Application Management (2026)

Kubernetes Operators encode operational knowledge into software, letting you manage stateful and complex applications with the same declarative approach you use for native Kubernetes workloads. This guide covers the Operator pattern from first principles, building one with the Operator SDK, leader election, OLM, and real-world operators you should know in 2026.

What Is a Kubernetes Operator?

The Operator pattern extends Kubernetes by combining two things: Custom Resource Definitions (CRDs) that teach the API server about your domain objects, and a controller that watches those objects and drives the cluster toward your declared desired state.

A native Kubernetes Deployment controller watches Deployment resources and ensures the right number of Pods are running. An Operator does exactly the same thing — but for application-specific objects like PostgreSQLCluster, KafkaCluster, or ElasticsearchIndex. The controller encodes your operational runbook: how to provision the app, handle upgrades, scale replicas, create backups, respond to node failures, and rotate credentials.

The concept was introduced by CoreOS in 2016 and is now one of the most important patterns in the Kubernetes ecosystem. Over 300 operators are listed on OperatorHub.io, covering everything from databases to monitoring stacks to machine learning platforms.

Key Insight: An Operator is just a controller + CRD. If you can write a Kubernetes controller (a control loop that watches resources and reconciles state), you can write an Operator. The Operator SDK removes the boilerplate so you can focus on business logic.
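To make the control-loop idea concrete, here is a minimal sketch in plain Go with no Kubernetes dependencies. The ClusterSpec and ClusterState types and the Reconcile function are hypothetical names chosen for illustration; a real controller is event-driven and talks to the API server instead of mutating a local struct.

```go
package main

import "fmt"

// ClusterSpec is the desired state declared by a user (hypothetical type).
type ClusterSpec struct {
	Instances int
}

// ClusterState is the state observed in the cluster (hypothetical type).
type ClusterState struct {
	RunningInstances int
}

// Reconcile compares desired and observed state and returns how many
// instances to add (positive) or remove (negative). Zero means converged.
func Reconcile(spec ClusterSpec, observed ClusterState) int {
	return spec.Instances - observed.RunningInstances
}

func main() {
	spec := ClusterSpec{Instances: 3}
	observed := ClusterState{RunningInstances: 1}

	// Loop until nothing is left to do. A real controller never exits;
	// this loop only illustrates convergence toward the declared state.
	for {
		delta := Reconcile(spec, observed)
		if delta == 0 {
			fmt.Println("converged at", observed.RunningInstances, "instances")
			return
		}
		if delta > 0 {
			observed.RunningInstances++ // stand-in for creating a Pod
		} else {
			observed.RunningInstances-- // stand-in for deleting a Pod
		}
		fmt.Println("reconciled one step; running =", observed.RunningInstances)
	}
}
```

The point of the sketch is that the loop only ever looks at the gap between declared and observed state, so it behaves the same whether it is catching up after a crash or reacting to a fresh edit.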

Operator vs Helm: When to Use Which

This is the most common question teams ask when deciding how to package an application for Kubernetes. The short answer: Helm manages installation; Operators manage ongoing lifecycle. Use Helm to install an Operator, then let the Operator manage the application.

Capability | Helm Chart | Kubernetes Operator
---------- | ---------- | -------------------
Install and upgrade app | Yes | Yes
Rollback a release | Yes (helm rollback) | Yes (if implemented in controller)
Day-2 ops: backup, failover, rebalance | No | Yes
Reactive to runtime failures | No | Yes, watches and reconciles continuously
Domain-specific validation | Limited (JSON schema only) | Full (admission webhooks)
Custom status and health reporting | No | Yes (.status subresource)
Complexity to build | Low: YAML templates | High: Go/Ansible code required
Best suited for | Stateless apps, one-time installs | Stateful apps: databases, brokers, caches

Many production teams use both: a Helm chart to deploy the Operator itself, and then the Operator manages all subsequent lifecycle operations. The Strimzi Kafka operator and the Prometheus Operator can both be installed this way.

CRD Anatomy and Status Subresource

Before writing any controller code, you define what your custom resource looks like in the API. Below is a complete CRD for a PostgreSQLCluster with OpenAPI v3 validation, a status subresource, and custom printer columns:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.techoral.com
spec:
  group: db.techoral.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      subresources:
        status: {}          # enables the /status subresource
      additionalPrinterColumns:
        - name: Instances
          type: integer
          jsonPath: .spec.instances
        - name: Phase
          type: string
          jsonPath: .status.phase
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [instances, postgresVersion]
              properties:
                instances:
                  type: integer
                  minimum: 1
                  maximum: 10
                postgresVersion:
                  type: string
                  enum: ["14", "15", "16", "17"]
                storageSize:
                  type: string
                  default: "10Gi"
                backupEnabled:
                  type: boolean
                  default: false
            status:
              type: object
              properties:
                phase:
                  type: string
                readyInstances:
                  type: integer
                primaryEndpoint:
                  type: string
  scope: Namespaced
  names:
    plural: postgresclusters
    singular: postgrescluster
    kind: PostgreSQLCluster
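    shortNames: [pgc]

Status Subresource: Enabling subresources: status: {} matters. With it, writes to the main resource endpoint ignore changes to .status, and writes to the /status endpoint ignore everything except .status. Users edit .spec, the controller reports through /status, and RBAC can grant the two permissions separately, so a kubectl edit cannot clobber controller-reported state. Note that kubectl wait --for=condition=Ready reads a .status.conditions array, which the schema above would need to define and the controller would need to populate.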
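The split between the two endpoints can be modeled in a few lines of Go. This is a simplified in-memory sketch, not API server code: UpdateMain and UpdateStatus are hypothetical functions that imitate how each endpoint treats an incoming object once the status subresource is enabled.

```go
package main

import "fmt"

// Spec and Status are trimmed-down stand-ins for the CRD's spec and status.
type Spec struct{ Instances int }
type Status struct{ Phase string }

type PostgreSQLCluster struct {
	Spec   Spec
	Status Status
}

// UpdateMain models a write to the main resource endpoint:
// the spec is stored, any status in the request is ignored.
func UpdateMain(stored, incoming PostgreSQLCluster) PostgreSQLCluster {
	stored.Spec = incoming.Spec
	return stored
}

// UpdateStatus models a write to the /status endpoint:
// the status is stored, any spec in the request is ignored.
func UpdateStatus(stored, incoming PostgreSQLCluster) PostgreSQLCluster {
	stored.Status = incoming.Status
	return stored
}

func main() {
	stored := PostgreSQLCluster{Spec: Spec{Instances: 1}, Status: Status{Phase: "Running"}}

	// A user edit that also tries to overwrite status.
	userEdit := PostgreSQLCluster{Spec: Spec{Instances: 3}, Status: Status{Phase: "Hacked"}}
	stored = UpdateMain(stored, userEdit)

	// A controller status write built from a stale copy of the spec.
	controllerWrite := PostgreSQLCluster{Spec: Spec{Instances: 1}, Status: Status{Phase: "Scaling"}}
	stored = UpdateStatus(stored, controllerWrite)

	// prints "instances=3 phase=Scaling"
	fmt.Printf("instances=%d phase=%s\n", stored.Spec.Instances, stored.Status.Phase)
}
```

Neither writer can undo the other's half of the object, which is why a stale controller write does not roll back a user's spec change.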

Building with the Operator SDK

The Operator SDK (maintained by Red Hat and the Operator Framework community) scaffolds a Go-based controller with watches, RBAC markers, and the manager wired up. You focus on the reconcile logic.

# Install Operator SDK CLI
export OPERATOR_SDK_VERSION=v1.38.0
curl -LO "https://github.com/operator-framework/operator-sdk/releases/download/${OPERATOR_SDK_VERSION}/operator-sdk_linux_amd64"
chmod +x operator-sdk_linux_amd64 && sudo mv operator-sdk_linux_amd64 /usr/local/bin/operator-sdk

# Scaffold a new Go-based operator project
mkdir postgres-operator && cd postgres-operator
operator-sdk init --domain techoral.com --repo github.com/techoral/postgres-operator

# Generate the API type and controller stub
operator-sdk create api \
  --group db \
  --version v1alpha1 \
  --kind PostgreSQLCluster \
  --resource \
  --controller

# Install CRDs into your current cluster
make install

# Run the controller locally against your cluster (for development)
make run ENABLE_WEBHOOKS=false

# Build and push the container image
make docker-build docker-push IMG=ghcr.io/techoral/postgres-operator:v0.1.0

# Deploy to cluster
make deploy IMG=ghcr.io/techoral/postgres-operator:v0.1.0

The SDK also supports Ansible-based operators (using Ansible roles/playbooks as the reconcile logic) and Helm-based operators (wrapping an existing Helm chart). Ansible operators are popular for teams that already maintain Ansible content for their application.
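For Go-based operators, the create api step produces a types file that you fill in with your spec and status fields. The sketch below shows what those structs might look like for the CRD schema shown earlier, using only the standard library. The Validate method is a hypothetical helper that restates the schema's rules in Go so they are easy to read; in a real cluster the API server enforces the OpenAPI schema, not this function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PostgreSQLClusterSpec mirrors the spec properties of the CRD schema.
type PostgreSQLClusterSpec struct {
	Instances       int    `json:"instances"`
	PostgresVersion string `json:"postgresVersion"`
	StorageSize     string `json:"storageSize,omitempty"`
	BackupEnabled   bool   `json:"backupEnabled,omitempty"`
}

// PostgreSQLClusterStatus mirrors the status properties of the CRD schema.
type PostgreSQLClusterStatus struct {
	Phase           string `json:"phase,omitempty"`
	ReadyInstances  int    `json:"readyInstances,omitempty"`
	PrimaryEndpoint string `json:"primaryEndpoint,omitempty"`
}

// Validate restates the schema constraints: instances between 1 and 10,
// and postgresVersion drawn from the enum in the CRD.
func (s PostgreSQLClusterSpec) Validate() error {
	if s.Instances < 1 || s.Instances > 10 {
		return fmt.Errorf("instances must be between 1 and 10, got %d", s.Instances)
	}
	switch s.PostgresVersion {
	case "14", "15", "16", "17":
		return nil
	}
	return fmt.Errorf("unsupported postgresVersion %q", s.PostgresVersion)
}

func main() {
	raw := []byte(`{"instances": 3, "postgresVersion": "17"}`)

	var spec PostgreSQLClusterSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		panic(err)
	}
	fmt.Println("valid:", spec.Validate() == nil)
}
```

Keeping the JSON tags identical to the property names in the CRD is what lets the same object round-trip between YAML manifests and Go code.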

The Reconcile Loop

The reconcile function is the heart of every Operator. It receives a Request containing the namespace and name of a changed object, fetches current state, compares it to the desired state in spec, and takes actions to converge them. The following shows the structure of a PostgreSQL operator reconciler:

# Simplified reconcile logic, written as an outline in comments for readability

# 1. Fetch the PostgreSQLCluster custom resource by namespace/name
# 2. If being deleted: run cleanup and remove finalizer, return
# 3. Add finalizer if not present (to handle cleanup on delete)
# 4. Ensure StatefulSet exists with correct replica count and image
#    - use CreateOrUpdate, never blindly Create
# 5. Ensure primary Service and headless Service exist
# 6. Ensure ConfigMap with postgresql.conf exists
# 7. If backupEnabled: ensure CronJob for pg_dump exists
# 8. Read actual StatefulSet readyReplicas and update .status
# 9. Return Result{RequeueAfter: 30s} for periodic health checks

# The reconciler is triggered by:
# - CREATE/UPDATE/DELETE of PostgreSQLCluster resources
# - CREATE/UPDATE/DELETE of child StatefulSets (via ownerReference watch)
# - Periodic requeue (every 30 seconds)

Idempotency Is Non-Negotiable: The reconcile function can be called dozens of times for the same object due to network retries, restarts, and watch re-syncs. Use a create-or-update helper (controller-runtime's controllerutil.CreateOrUpdate), a patch, or server-side apply instead of blindly creating resources. Treat a not-found error on a child resource as a signal to create it, not as a failure to propagate upward, and return an error only when you want the request retried with backoff.
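The outline above can be sketched as runnable Go against an in-memory stand-in for the API server. Everything here is illustrative: Store, Cluster, the finalizer name, and the createOrUpdate helper are hypothetical, and only steps 2 to 4 of the outline are covered. The write counter exists to make idempotency visible.

```go
package main

import "fmt"

// cleanupFinalizer is a hypothetical finalizer name used for illustration.
const cleanupFinalizer = "db.techoral.com/cleanup"

// Cluster is a trimmed-down stand-in for the PostgreSQLCluster resource.
type Cluster struct {
	Name       string
	Instances  int
	Deleting   bool // stands in for a non-zero deletionTimestamp
	Finalizers []string
}

// Store is an in-memory stand-in for the API server. It maps a StatefulSet
// name to its replica count and counts writes.
type Store struct {
	Replicas map[string]int
	Writes   int
}

// createOrUpdate writes only when stored state differs from desired state.
func (s *Store) createOrUpdate(name string, replicas int) string {
	current, exists := s.Replicas[name]
	if exists && current == replicas {
		return "unchanged"
	}
	s.Replicas[name] = replicas
	s.Writes++
	if exists {
		return "updated"
	}
	return "created"
}

func hasFinalizer(finalizers []string) bool {
	for _, f := range finalizers {
		if f == cleanupFinalizer {
			return true
		}
	}
	return false
}

// withoutFinalizer returns the list with the cleanup finalizer removed.
func withoutFinalizer(finalizers []string) []string {
	var kept []string
	for _, f := range finalizers {
		if f != cleanupFinalizer {
			kept = append(kept, f)
		}
	}
	return kept
}

// Reconcile handles deletion first, then ensures the finalizer and the
// StatefulSet. Calling it repeatedly with the same input changes nothing.
func Reconcile(s *Store, c *Cluster) string {
	if c.Deleting {
		delete(s.Replicas, c.Name) // cleanup of owned state
		c.Finalizers = withoutFinalizer(c.Finalizers)
		return "cleaned up"
	}
	if !hasFinalizer(c.Finalizers) {
		c.Finalizers = append(c.Finalizers, cleanupFinalizer)
	}
	return s.createOrUpdate(c.Name, c.Instances)
}

func main() {
	store := &Store{Replicas: map[string]int{}}
	cluster := &Cluster{Name: "myapp-db", Instances: 3}

	fmt.Println(Reconcile(store, cluster)) // first pass creates the StatefulSet
	fmt.Println(Reconcile(store, cluster)) // repeat pass is a no-op
	cluster.Instances = 5
	fmt.Println(Reconcile(store, cluster)) // spec change triggers an update
	cluster.Deleting = true
	fmt.Println(Reconcile(store, cluster)) // deletion runs cleanup
	fmt.Println("writes:", store.Writes)
}
```

Four reconcile calls produce only two writes, which is the property to aim for in a real reconciler: repeated invocations should cost reads, not mutations.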

Leader Election and HA

A single controller replica is a single point of failure. controller-runtime, the library underneath the Operator SDK, supports leader election through a Kubernetes Lease object. Only one replica holds the lease and executes reconciliation; standby replicas wait passively and take over once the lease expires (15 seconds by default) if the leader crashes or becomes unresponsive.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-operator
  namespace: operators
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres-operator
  template:
    metadata:
      labels:
        app: postgres-operator
    spec:
      serviceAccountName: postgres-operator
      containers:
      - name: manager
        image: ghcr.io/techoral/postgres-operator:v0.5.0
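        args:
        # --leader-elect comes from the SDK scaffold; the ID and namespace flags
        # below assume your main.go defines them and passes them to the manager.
        - --leader-elect=true
        - --leader-election-id=postgres-operator-leader
        - --leader-election-namespace=operators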
        resources:
          requests:
            cpu: 100m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 128Mi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8081
          initialDelaySeconds: 15
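The lease mechanics behind leader election can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the client-go implementation: the Lease struct, TryAcquire, the at helper, and the 15-second duration constant are all defined here for illustration, and the real mechanism also relies on optimistic concurrency when replicas race to update the Lease object.

```go
package main

import (
	"fmt"
	"time"
)

// defaultLeaseDuration is the lease length assumed by this simulation.
const defaultLeaseDuration = 15 * time.Second

// Lease is a simplified stand-in for a coordination.k8s.io Lease object.
type Lease struct {
	Holder    string
	RenewedAt time.Time
	Duration  time.Duration
}

// TryAcquire lets a candidate take or renew the lease when it is free,
// already held by that candidate, or expired. It reports whether the
// candidate is the leader after the call.
func (l *Lease) TryAcquire(candidate string, now time.Time) bool {
	expired := now.Sub(l.RenewedAt) > l.Duration
	if l.Holder == "" || l.Holder == candidate || expired {
		l.Holder = candidate
		l.RenewedAt = now
		return true
	}
	return false
}

// at returns a fixed reference time plus the given number of seconds,
// which keeps the simulation deterministic.
func at(seconds int) time.Time {
	start := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(seconds) * time.Second)
}

func main() {
	lease := &Lease{Duration: defaultLeaseDuration}

	fmt.Println("replica-a at 0s:", lease.TryAcquire("replica-a", at(0)))
	fmt.Println("replica-b at 5s:", lease.TryAcquire("replica-b", at(5)))

	// replica-a crashes and stops renewing; the lease expires after 15s.
	fmt.Println("replica-b at 20s:", lease.TryAcquire("replica-b", at(20)))
	fmt.Println("holder:", lease.Holder)
}
```

The standby fails while the leader's lease is fresh and succeeds once it has gone unrenewed for longer than the lease duration, which is why failover time is bounded by that duration rather than being instantaneous.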

Real Example: PostgreSQL Database Operator

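Once the CRD and controller are deployed, users declare what they want in plain YAML and the Operator handles the rest: creating the StatefulSet, Services, ConfigMaps, Secrets, PVCs, and backup CronJobs. The resources and backup fields below are not part of the CRD schema shown earlier; they assume the schema has been extended, because the API server prunes (or, under strict field validation, rejects) fields that a structural schema does not declare: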

apiVersion: db.techoral.com/v1alpha1
kind: PostgreSQLCluster
metadata:
  name: myapp-db
  namespace: production
spec:
  instances: 3
  postgresVersion: "17"
  storageSize: "50Gi"
  backupEnabled: true
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  backup:
    schedule: "0 2 * * *"
    retentionDays: 14
    s3Bucket: "techoral-pg-backups"
    s3Region: "us-east-1"

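After applying this resource, the printer columns and status fields surface the cluster state through kubectl (Last Backup assumes a lastBackup status property beyond the schema shown earlier):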

$ kubectl get pgc -n production
NAME        INSTANCES   PHASE     AGE
myapp-db    3           Running   4m

$ kubectl describe pgc myapp-db -n production | grep -A5 Status
Status:
  Phase:             Running
  Ready Instances:   3
  Primary Endpoint:  myapp-db-primary.production.svc.cluster.local
  Last Backup:       2026-06-06T02:00:00Z

OLM and OperatorHub

The Operator Lifecycle Manager (OLM) is a component that manages the installation, upgrade, and RBAC of Operators on a cluster. Think of it as apt or yum for Kubernetes Operators. OperatorHub.io lists hundreds of community and certified operators you can install with a single command.

# Install OLM into your cluster
curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.28.0/install.sh \
  | bash -s v0.28.0

# Install the Strimzi Kafka operator via OLM
kubectl create -f https://operatorhub.io/install/strimzi-kafka-operator.yaml

# Install Prometheus Operator via OLM
kubectl create -f https://operatorhub.io/install/prometheus.yaml

# Check all installed operators and their versions
kubectl get csv -A
kubectl get subscriptions -A

Notable operators worth knowing in 2026:

  • Strimzi: Production Kafka on Kubernetes. Manages brokers, KRaft quorum, topics, users, and MirrorMaker2 via CRDs.
  • Prometheus Operator: The foundation of kube-prometheus-stack. Manages Prometheus, Alertmanager, and ServiceMonitor resources.
  • CloudNativePG: PostgreSQL operator with automatic primary failover, backup to S3, and connection pooling via PgBouncer.
  • Rook-Ceph: Deploys and manages Ceph distributed storage clusters on bare Kubernetes nodes.
  • cert-manager: Automates TLS certificate provisioning and renewal from Let's Encrypt, Vault, and private CAs.

FAQ

Do I need to write Go to build an Operator?
No. The Operator SDK supports Ansible-based and Helm-based operators in addition to Go. Ansible operators are a good choice for teams that already have Ansible playbooks for their application lifecycle. Go gives the most flexibility and performance, but Ansible lowers the barrier significantly.
Can one Operator manage resources in multiple namespaces?
Yes. By default operators watch all namespaces. You can restrict scope using the WATCH_NAMESPACE environment variable or controller-runtime's Cache options to limit which namespaces the informers cover. OLM also supports namespace-scoped installations.
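As a rough illustration, a Go operator that follows the WATCH_NAMESPACE convention has to read and parse the variable itself before configuring its cache. The sketch below assumes a comma-separated value and treats an empty value as "all namespaces"; ParseWatchNamespaces is a hypothetical helper, and wiring the result into controller-runtime is left out.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// ParseWatchNamespaces splits a comma-separated namespace list, trimming
// whitespace and skipping empty entries. A nil result means "watch all".
func ParseWatchNamespaces(value string) []string {
	var namespaces []string
	for _, ns := range strings.Split(value, ",") {
		if ns = strings.TrimSpace(ns); ns != "" {
			namespaces = append(namespaces, ns)
		}
	}
	return namespaces
}

func main() {
	namespaces := ParseWatchNamespaces(os.Getenv("WATCH_NAMESPACE"))
	if len(namespaces) == 0 {
		fmt.Println("watching all namespaces")
		return
	}
	fmt.Println("watching namespaces:", namespaces)
}
```

Running it with WATCH_NAMESPACE unset reports cluster-wide scope; setting it to a list such as "production, staging" restricts the reported scope to those two namespaces.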
What happens if the Operator pod is down?
Existing child resources (Pods, StatefulSets, Services) continue running because Kubernetes manages them independently. Custom resources remain in the API server. When the Operator restarts, it re-lists all custom resources and reconciles each one. No state is lost; what pauses during downtime is anything the controller itself drives, such as reacting to spec changes, orchestrating failover, and refreshing .status. Backups scheduled through a CronJob, as in this guide's design, keep running.
How do I handle CRD schema breaking changes across versions?
Use conversion webhooks. Mark both v1alpha1 and v1beta1 as served: true, set v1beta1 as storage: true, set spec.conversion.strategy to Webhook in the CRD, and implement a webhook that translates between the two representations. This lets old clients keep working while the API evolves. The Operator SDK can generate the webhook scaffold for you.
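The translation logic inside such a webhook is ordinary Go. The sketch below assumes a hypothetical v1beta1 that renames instances to replicas and nests the storage size under a storage object; neither change comes from this guide, and the webhook plumbing itself is omitted. What matters is that the two functions round-trip without losing data.

```go
package main

import "fmt"

// SpecV1Alpha1 mirrors part of the v1alpha1 schema from this guide.
type SpecV1Alpha1 struct {
	Instances   int
	StorageSize string
}

// StorageV1Beta1 and SpecV1Beta1 describe a hypothetical next version:
// instances is renamed to replicas and storage size is nested.
type StorageV1Beta1 struct {
	Size string
}

type SpecV1Beta1 struct {
	Replicas int
	Storage  StorageV1Beta1
}

// ToV1Beta1 converts the old representation to the new one.
func ToV1Beta1(in SpecV1Alpha1) SpecV1Beta1 {
	return SpecV1Beta1{
		Replicas: in.Instances,
		Storage:  StorageV1Beta1{Size: in.StorageSize},
	}
}

// ToV1Alpha1 converts the new representation back to the old one.
func ToV1Alpha1(in SpecV1Beta1) SpecV1Alpha1 {
	return SpecV1Alpha1{
		Instances:   in.Replicas,
		StorageSize: in.Storage.Size,
	}
}

func main() {
	old := SpecV1Alpha1{Instances: 3, StorageSize: "50Gi"}
	converted := ToV1Beta1(old)
	roundTripped := ToV1Alpha1(converted)

	fmt.Printf("%+v\n%+v\nlossless: %v\n", converted, roundTripped, roundTripped == old)
}
```

A round-trip check like the last line is worth keeping as a unit test, since any field that fails to survive conversion will be silently dropped for clients on the older version.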
Is the Operator pattern the right tool for a stateless web app?
No. Use a Helm chart or plain Kubernetes manifests for stateless applications. Operators add significant complexity: Go code, CRD maintenance, RBAC, and webhook TLS certificates. They pay off for stateful systems with complex day-2 operations — databases, message brokers, distributed caches, and search clusters.