Kubernetes Operators: Automating Complex Application Management (2026)
Kubernetes Operators encode operational knowledge into software, letting you manage stateful and complex applications with the same declarative approach you use for native Kubernetes workloads. This guide covers the Operator pattern from first principles, building one with the Operator SDK, leader election, OLM, and real-world operators you should know in 2026.
What Is a Kubernetes Operator?
The Operator pattern extends Kubernetes by combining two things: Custom Resource Definitions (CRDs) that teach the API server about your domain objects, and a controller that watches those objects and drives the cluster toward your declared desired state.
A native Kubernetes Deployment controller watches Deployment resources and ensures the right number of Pods are running. An Operator does exactly the same thing — but for application-specific objects like PostgreSQLCluster, KafkaCluster, or ElasticsearchIndex. The controller encodes your operational runbook: how to provision the app, handle upgrades, scale replicas, create backups, respond to node failures, and rotate credentials.
The concept was introduced by CoreOS in 2016 and is now one of the most important patterns in the Kubernetes ecosystem. Over 300 operators are listed on OperatorHub.io, covering everything from databases to monitoring stacks to machine learning platforms.
Operator vs Helm: When to Use Which
This is the most common question teams ask when deciding how to package an application for Kubernetes. The short answer: Helm manages installation; Operators manage ongoing lifecycle. Use Helm to install an Operator, then let the Operator manage the application.
| Capability | Helm Chart | Kubernetes Operator |
|---|---|---|
| Install and upgrade app | Yes | Yes |
| Rollback a release | Yes (helm rollback) | Yes (if implemented in controller) |
| Day-2 ops: backup, failover, rebalance | No | Yes |
| Reactive to runtime failures | No | Yes — watches and reconciles continuously |
| Domain-specific validation | Limited (JSON schema only) | Full (admission webhooks) |
| Custom status and health reporting | No | Yes (.status subresource) |
| Complexity to build | Low — YAML templates | High — Go/Ansible code required |
| Best suited for | Stateless apps, one-time installs | Stateful apps: databases, brokers, caches |
Many production teams use both: a Helm chart to deploy the Operator itself, and then the Operator manages all subsequent lifecycle operations. Both the Strimzi Kafka operator and the Prometheus Operator can be installed this way.
CRD Anatomy and Status Subresource
Before writing any controller code, you define what your custom resource looks like in the API. Below is a complete CRD for a PostgreSQLCluster with OpenAPI v3 validation, a status subresource, and custom printer columns:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.techoral.com
spec:
  group: db.techoral.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      subresources:
        status: {}   # enables the /status subresource
      additionalPrinterColumns:
        - name: Instances
          type: integer
          jsonPath: .spec.instances
        - name: Phase
          type: string
          jsonPath: .status.phase
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [instances, postgresVersion]
              properties:
                instances:
                  type: integer
                  minimum: 1
                  maximum: 10
                postgresVersion:
                  type: string
                  enum: ["14", "15", "16", "17"]
                storageSize:
                  type: string
                  default: "10Gi"
                backupEnabled:
                  type: boolean
                  default: false
            status:
              type: object
              properties:
                phase:
                  type: string
                readyInstances:
                  type: integer
                primaryEndpoint:
                  type: string
  scope: Namespaced
  names:
    plural: postgresclusters
    singular: postgrescluster
    kind: PostgreSQLCluster
    shortNames: [pgc]
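The subresources: status: {} stanza is critical. With it enabled, writes to the main resource endpoint ignore changes to .status, and writes to the /status endpoint ignore everything except .status. Users edit .spec, the controller reports through /status, and RBAC can grant the two permissions separately, so a kubectl edit cannot clobber the controller's status updates. Note that kubectl wait --for=condition=Ready reads .status.conditions, so the controller would also need to publish a conditions array, which this minimal schema does not yet define.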
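To make the schema rules concrete, here is a minimal sketch of Go types that mirror the spec and status fields above, plus a hand-written check that applies the same default, range, and enum rules. It is an illustration only: ParseSpec is a hypothetical helper, real Operator SDK types also embed metav1.TypeMeta and metav1.ObjectMeta, and in a real cluster the API server enforces these rules from the CRD rather than your code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PostgreSQLClusterSpec mirrors .spec from the CRD schema.
type PostgreSQLClusterSpec struct {
	Instances       int    `json:"instances"`
	PostgresVersion string `json:"postgresVersion"`
	StorageSize     string `json:"storageSize,omitempty"`
	BackupEnabled   bool   `json:"backupEnabled,omitempty"`
}

// PostgreSQLClusterStatus mirrors .status, which the controller reports.
type PostgreSQLClusterStatus struct {
	Phase           string `json:"phase,omitempty"`
	ReadyInstances  int    `json:"readyInstances,omitempty"`
	PrimaryEndpoint string `json:"primaryEndpoint,omitempty"`
}

// ParseSpec decodes a JSON spec, applies the schema default for
// storageSize, and checks the range and enum constraints by hand.
// Hypothetical helper for illustration only.
func ParseSpec(raw []byte) (PostgreSQLClusterSpec, error) {
	var s PostgreSQLClusterSpec
	if err := json.Unmarshal(raw, &s); err != nil {
		return s, fmt.Errorf("decode spec: %w", err)
	}
	if s.StorageSize == "" {
		s.StorageSize = "10Gi" // same default as the schema
	}
	// A missing "instances" decodes to 0 and fails here, like "required".
	if s.Instances < 1 || s.Instances > 10 {
		return s, fmt.Errorf("instances must be between 1 and 10, got %d", s.Instances)
	}
	switch s.PostgresVersion {
	case "14", "15", "16", "17":
		// allowed by the enum
	default:
		return s, fmt.Errorf("unsupported postgresVersion %q", s.PostgresVersion)
	}
	return s, nil
}
```

Calling ParseSpec on a spec with instances set to 3 and postgresVersion "17" yields a spec whose StorageSize has been defaulted to "10Gi", while an instances value of 11 or a postgresVersion of "13" returns an error.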
Building with the Operator SDK
The Operator SDK (maintained by Red Hat and the Operator Framework community) scaffolds a Go-based controller with watches, RBAC markers, and the manager wired up. You focus on the reconcile logic.
# Install Operator SDK CLI
export OPERATOR_SDK_VERSION=v1.38.0
curl -LO "https://github.com/operator-framework/operator-sdk/releases/download/${OPERATOR_SDK_VERSION}/operator-sdk_linux_amd64"
chmod +x operator-sdk_linux_amd64 && sudo mv operator-sdk_linux_amd64 /usr/local/bin/operator-sdk
# Scaffold a new Go-based operator project
mkdir postgres-operator && cd postgres-operator
operator-sdk init --domain techoral.com --repo github.com/techoral/postgres-operator
# Generate the API type and controller stub
operator-sdk create api \
  --group db \
  --version v1alpha1 \
  --kind PostgreSQLCluster \
  --resource \
  --controller
# Install CRDs into your current cluster
make install
# Run the controller locally against your cluster (for development)
make run ENABLE_WEBHOOKS=false
# Build and push the container image
make docker-build docker-push IMG=ghcr.io/techoral/postgres-operator:v0.1.0
# Deploy to cluster
make deploy IMG=ghcr.io/techoral/postgres-operator:v0.1.0
The SDK also supports Ansible-based operators (using Ansible roles/playbooks as the reconcile logic) and Helm-based operators (wrapping an existing Helm chart). Ansible operators are popular for teams that already maintain Ansible content for their application.
The Reconcile Loop
The reconcile function is the heart of every Operator. It receives a Request containing the namespace and name of a changed object, fetches current state, compares it to the desired state in spec, and takes actions to converge them. The following shows the structure of a PostgreSQL operator reconciler:
# Simplified reconcile logic (Go pseudocode in bash comments for readability)
# 1. Fetch the PostgreSQLCluster custom resource by namespace/name
# 2. If being deleted: run cleanup and remove finalizer, return
# 3. Add finalizer if not present (to handle cleanup on delete)
# 4. Ensure StatefulSet exists with correct replica count and image
# - use CreateOrUpdate, never blindly Create
# 5. Ensure primary Service and headless Service exist
# 6. Ensure ConfigMap with postgresql.conf exists
# 7. If backupEnabled: ensure CronJob for pg_dump exists
# 8. Read actual StatefulSet readyReplicas and update .status
# 9. Return Result{RequeueAfter: 30s} for periodic health checks
# The reconciler is triggered by:
# - CREATE/UPDATE/DELETE of PostgreSQLCluster resources
# - CREATE/UPDATE/DELETE of child StatefulSets (via ownerReference watch)
# - Periodic requeue (every 30 seconds)
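Keep reconciliation idempotent: use CreateOrUpdate (the helper in controller-runtime's controllerutil package) or server-side apply instead of blindly creating resources, so that running the loop twice produces the same result as running it once. Return errors only for transient failures, since a returned error requeues the request with backoff. A not-found error when fetching a child resource should trigger creation, not be propagated upward.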
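The steps listed above can be sketched as a small, dependency-free simulation. This is not controller-runtime code: FakeAPI, Cluster, and StatefulSet are hypothetical in-memory stand-ins for the API server and its objects, the finalizer name is made up, and "Provisioning" is an assumed phase name (only "Running" appears in this article). The point is the shape of an idempotent reconcile: handle deletion first, add the finalizer once, create or correct the child, then report observed state.

```go
package main

import "errors"

// cleanupFinalizer is a hypothetical finalizer name used for illustration.
const cleanupFinalizer = "db.techoral.com/cleanup"

// Cluster stands in for a PostgreSQLCluster custom resource.
type Cluster struct {
	Name           string
	Instances      int  // desired state (.spec.instances)
	Deleting       bool // true once a deletion timestamp is set
	Finalizers     []string
	Phase          string // observed state (.status.phase)
	ReadyInstances int    // observed state (.status.readyInstances)
}

// StatefulSet stands in for the child workload owned by a Cluster.
type StatefulSet struct {
	Replicas      int
	ReadyReplicas int
}

// FakeAPI is an in-memory stand-in for the Kubernetes API server.
type FakeAPI struct {
	StatefulSets map[string]*StatefulSet
}

// Reconcile drives the fake cluster toward the state declared on c.
// Calling it repeatedly with the same input changes nothing further.
func Reconcile(api *FakeAPI, c *Cluster) error {
	if api == nil || c == nil {
		return errors.New("reconcile: nil api or cluster")
	}
	if api.StatefulSets == nil {
		api.StatefulSets = map[string]*StatefulSet{}
	}
	// On deletion: clean up children, then release the finalizer.
	if c.Deleting {
		delete(api.StatefulSets, c.Name)
		c.Finalizers = without(c.Finalizers, cleanupFinalizer)
		return nil
	}
	// Add the finalizer only if it is missing.
	if !contains(c.Finalizers, cleanupFinalizer) {
		c.Finalizers = append(c.Finalizers, cleanupFinalizer)
	}
	// Create the child if absent, otherwise correct drift in place.
	sts, ok := api.StatefulSets[c.Name]
	if !ok {
		sts = &StatefulSet{}
		api.StatefulSets[c.Name] = sts
	}
	sts.Replicas = c.Instances
	// Report observed state back through status.
	c.ReadyInstances = sts.ReadyReplicas
	if sts.ReadyReplicas >= c.Instances {
		c.Phase = "Running"
	} else {
		c.Phase = "Provisioning" // assumed phase name
	}
	return nil
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// without returns list with every occurrence of s removed.
func without(list []string, s string) []string {
	var out []string
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}
```

Running Reconcile twice on a new Cluster with Instances set to 3 leaves exactly one StatefulSet with three replicas and exactly one finalizer. In a real operator the same structure would use the controller-runtime client, owner references, and a requeue result instead of these in-memory types.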
Leader Election and HA
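A single controller replica is a single point of failure. The Operator SDK scaffold supports leader election through a Kubernetes Lease object, switched on with the --leader-elect flag. Only one replica holds the leader lock and executes reconciliation; standby replicas wait passively and take over within seconds if the leader crashes or becomes unresponsive. The scaffolded main.go defines only --leader-elect and sets the election ID in code, so the --leader-election-id and --leader-election-namespace arguments below assume you have added those flags to your manager yourself.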
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-operator
  namespace: operators
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres-operator
  template:
    metadata:
      labels:
        app: postgres-operator
    spec:
      serviceAccountName: postgres-operator
      containers:
        - name: manager
          image: ghcr.io/techoral/postgres-operator:v0.5.0
          args:
            - --leader-elect=true
            - --leader-election-id=postgres-operator-leader
            - --leader-election-namespace=operators
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
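The takeover behaviour described in this section comes down to a simple rule on the Lease: the holder keeps renewing, and a standby may claim the lease only once it has gone unrenewed for longer than its duration. The sketch below models that rule with a hypothetical in-memory Lease type and Unix-second timestamps. It deliberately leaves out what the real client-go implementation adds, namely optimistic concurrency against the API server, renew deadlines, and retry periods.

```go
package main

// Lease stands in for a coordination.k8s.io/v1 Lease object.
// Times are Unix seconds to keep the sketch dependency-free.
type Lease struct {
	Holder               string
	RenewTime            int64
	LeaseDurationSeconds int64
}

// TryAcquireOrRenew reports whether candidate holds the lease after the call.
// The current holder always renews; any other candidate succeeds only once
// the lease has gone unrenewed for longer than its duration.
func (l *Lease) TryAcquireOrRenew(candidate string, now int64) bool {
	expired := l.Holder == "" || now-l.RenewTime > l.LeaseDurationSeconds
	if l.Holder != candidate && !expired {
		return false // a live leader exists; stay on standby
	}
	l.Holder = candidate
	l.RenewTime = now
	return true
}
```

With LeaseDurationSeconds set to 15 (controller-runtime's default lease duration), a standby replica that polls every few seconds takes over roughly 15 seconds after the leader's last successful renewal, which is where the "within seconds" failover window comes from.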
Real Example: PostgreSQL Database Operator
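Once the CRD and controller are deployed, users declare what they want in plain YAML and the Operator handles everything else: creating the StatefulSet, Services, ConfigMaps, Secrets, PVCs, and backup CronJobs. The resources and backup blocks used here go beyond the minimal CRD schema defined earlier and would need to be added to it, since the API server prunes fields that a structural schema does not declare: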
apiVersion: db.techoral.com/v1alpha1
kind: PostgreSQLCluster
metadata:
  name: myapp-db
  namespace: production
spec:
  instances: 3
  postgresVersion: "17"
  storageSize: "50Gi"
  backupEnabled: true
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  backup:
    schedule: "0 2 * * *"
    retentionDays: 14
    s3Bucket: "techoral-pg-backups"
    s3Region: "us-east-1"
After applying this resource and giving the controller time to reconcile, the cluster status is visible through kubectl:
$ kubectl get pgc -n production
NAME       INSTANCES   PHASE     AGE
myapp-db   3           Running   4m
$ kubectl describe pgc myapp-db -n production | grep -A5 Status
Status:
  Phase:             Running
  Ready Instances:   3
  Primary Endpoint:  myapp-db-primary.production.svc.cluster.local
  Last Backup:       2026-06-06T02:00:00Z
OLM and OperatorHub
The Operator Lifecycle Manager (OLM) is a component that manages the installation, upgrade, and RBAC of Operators on a cluster. Think of it as apt or yum for Kubernetes Operators. OperatorHub.io lists hundreds of community and certified operators you can install with a single command.
# Install OLM into your cluster
curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.28.0/install.sh \
| bash -s v0.28.0
# Install the Strimzi Kafka operator via OLM
kubectl create -f https://operatorhub.io/install/strimzi-kafka-operator.yaml
# Install Prometheus Operator via OLM
kubectl create -f https://operatorhub.io/install/prometheus.yaml
# Check all installed operators and their versions
kubectl get csv -A
kubectl get subscriptions -A
Notable operators worth knowing in 2026:
- Strimzi — Production Kafka on Kubernetes. Manages brokers, KRaft quorum, topics, users, and MirrorMaker2 via CRDs.
- Prometheus Operator — The foundation of kube-prometheus-stack. Manages Prometheus, Alertmanager, and ServiceMonitor resources.
- CloudNativePG — Best-in-class PostgreSQL operator with automatic primary failover, backup to S3, and connection pooling via PgBouncer.
- Rook-Ceph — Deploys and manages Ceph distributed storage clusters on bare Kubernetes nodes.
- Cert-Manager — Automates TLS certificate provisioning and renewal from Let's Encrypt, Vault, and private CAs.
FAQ
- Do I need to write Go to build an Operator?
- No. The Operator SDK supports Ansible-based and Helm-based operators in addition to Go. Ansible operators are a good choice for teams that already have Ansible playbooks for their application lifecycle. Go gives the most flexibility and performance, but Ansible lowers the barrier significantly.
- Can one Operator manage resources in multiple namespaces?
- Yes. By default operators watch all namespaces. You can restrict scope using the WATCH_NAMESPACE environment variable or controller-runtime's Cache options to limit which namespaces the informers cover. OLM also supports namespace-scoped installations.
- What happens if the Operator pod is down?
- Existing child resources (Pods, StatefulSets, Services) continue running — Kubernetes manages them independently. Custom resources remain in the API server. When the Operator restarts, it re-lists all custom resources and reconciles each one. No state is lost; only ongoing day-2 operations like automated backups pause during downtime.
- How do I handle CRD schema breaking changes across versions?
- Use conversion webhooks. Mark both v1alpha1 and v1beta1 as served: true, set v1beta1 as storage: true, and implement a webhook that translates between the two representations. This lets old clients keep working while the API evolves. The Operator SDK generates the webhook scaffold for you.
- Is the Operator pattern the right tool for a stateless web app?
- No. Use a Helm chart or plain Kubernetes manifests for stateless applications. Operators add significant complexity: Go code, CRD maintenance, RBAC, and webhook TLS certificates. They pay off for stateful systems with complex day-2 operations — databases, message brokers, distributed caches, and search clusters.
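For the conversion-webhook answer in the FAQ above, the core of the work is a pair of functions that translate between versions without losing information. The sketch below shows that idea with plain structs. The v1beta1 shape is entirely hypothetical (this article never defines one): it assumes instances was renamed to replicas and the backupEnabled flag moved into a nested backup block. In a real operator the same logic would typically live in ConvertTo and ConvertFrom methods that satisfy controller-runtime's conversion interfaces.

```go
package main

// V1Alpha1Spec matches the v1alpha1 schema shown earlier in this article.
type V1Alpha1Spec struct {
	Instances       int
	PostgresVersion string
	StorageSize     string
	BackupEnabled   bool
}

// BackupSpec is part of the hypothetical v1beta1 shape.
type BackupSpec struct {
	Enabled bool
}

// V1Beta1Spec is invented purely for illustration.
type V1Beta1Spec struct {
	Replicas        int // assumed rename of Instances
	PostgresVersion string
	StorageSize     string
	Backup          BackupSpec // assumed replacement for BackupEnabled
}

// ToV1Beta1 converts an old-version spec to the storage version.
func ToV1Beta1(in V1Alpha1Spec) V1Beta1Spec {
	return V1Beta1Spec{
		Replicas:        in.Instances,
		PostgresVersion: in.PostgresVersion,
		StorageSize:     in.StorageSize,
		Backup:          BackupSpec{Enabled: in.BackupEnabled},
	}
}

// ToV1Alpha1 converts the storage version back for old clients.
func ToV1Alpha1(in V1Beta1Spec) V1Alpha1Spec {
	return V1Alpha1Spec{
		Instances:       in.Replicas,
		PostgresVersion: in.PostgresVersion,
		StorageSize:     in.StorageSize,
		BackupEnabled:   in.Backup.Enabled,
	}
}
```

The property worth testing is the round trip: converting a v1alpha1 spec to v1beta1 and back should return the original value unchanged. If a newer version adds fields the older one cannot express, a common approach is to stash them in an annotation during down-conversion so they survive the trip.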