We built our first Kubernetes Operator. Here’s what we learned.

What is an Operator?

An Operator is a custom controller that extends Kubernetes to manage complex applications.

Think of it as:

  • Custom Resource Definition (CRD) = New resource type
  • Controller = Logic to manage that resource

Why Operators?

We run PostgreSQL in Kubernetes. Managing it manually is painful:

  • Backups
  • Failover
  • Scaling
  • Upgrades

An Operator automates all of this.

How Operators Work

  1. Define a Custom Resource:
apiVersion: database.example.com/v1
kind: PostgreSQL
metadata:
  name: my-database
spec:
  version: "11"
  replicas: 3
  storage: 100Gi
  backup:
    schedule: "0 2 * * *"

  2. The Operator watches for these resources
  3. The Operator creates/updates Pods, Services, etc.
  4. The Operator handles failures, backups, etc.
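
The watch/create/repair cycle above boils down to comparing desired state with observed state and closing the gap. Here is a minimal sketch of that idea in plain Go. The Spec and Cluster types and the reconcile function are simplified stand-ins invented for this example, not the real Kubernetes API:

```go
package main

import "fmt"

// Spec is a trimmed-down stand-in for the PostgreSQL custom resource spec.
type Spec struct {
	Version  string
	Replicas int
}

// Cluster is a toy in-memory stand-in for the Kubernetes API: it only
// tracks how many database pods exist and which version they run.
type Cluster struct {
	Pods    int
	Version string
}

// reconcile compares desired and observed state and changes the cluster
// until they match. It returns a description of every action taken; an
// empty result means the cluster already matched the spec.
func reconcile(desired Spec, c *Cluster) []string {
	var actions []string
	if c.Version != desired.Version {
		actions = append(actions, fmt.Sprintf("set version %q -> %q", c.Version, desired.Version))
		c.Version = desired.Version
	}
	for c.Pods < desired.Replicas {
		actions = append(actions, fmt.Sprintf("create pod %d", c.Pods))
		c.Pods++
	}
	for c.Pods > desired.Replicas {
		c.Pods--
		actions = append(actions, fmt.Sprintf("delete pod %d", c.Pods))
	}
	return actions
}

func main() {
	desired := Spec{Version: "11", Replicas: 3}
	cluster := &Cluster{}

	// The first pass creates everything; the second pass finds nothing to do.
	fmt.Println(reconcile(desired, cluster))
	fmt.Println(reconcile(desired, cluster))

	// Simulate a pod failure; the next pass repairs it.
	cluster.Pods--
	fmt.Println(reconcile(desired, cluster))
}
```

The property to notice is idempotence: running reconcile again on a converged cluster does nothing, which is what makes it safe for Kubernetes to call it repeatedly.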

Building an Operator

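We used the Operator SDK. The commands below are from the pre-1.0 CLI; newer SDK releases replace them with operator-sdk init and operator-sdk create api: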

# Install
brew install operator-sdk

# Create project
operator-sdk new postgresql-operator --repo=github.com/myorg/postgresql-operator

# Add API
operator-sdk add api --api-version=database.example.com/v1 --kind=PostgreSQL

# Add controller
operator-sdk add controller --api-version=database.example.com/v1 --kind=PostgreSQL

The Controller Logic

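import (
    "context"
    "reflect"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    // Generated by the SDK; the exact path depends on your repo layout.
    databasev1 "github.com/myorg/postgresql-operator/pkg/apis/database/v1"
)

func (r *ReconcilePostgreSQL) Reconcile(request reconcile.Request) (reconcile.Result, error) {
    ctx := context.TODO()

    // Fetch the PostgreSQL instance
    instance := &databasev1.PostgreSQL{}
    err := r.client.Get(ctx, request.NamespacedName, instance)
    if err != nil {
        if errors.IsNotFound(err) {
            // The resource was deleted after the request was queued;
            // there is nothing left to reconcile, so don't retry.
            return reconcile.Result{}, nil
        }
        return reconcile.Result{}, err
    }

    // Build the desired StatefulSet once and reuse it below
    desired := r.statefulSetForPostgreSQL(instance)

    // Create StatefulSet if it doesn't exist
    found := &appsv1.StatefulSet{}
    err = r.client.Get(ctx, types.NamespacedName{
        Name:      instance.Name,
        Namespace: instance.Namespace,
    }, found)
    if err != nil && errors.IsNotFound(err) {
        if err = r.client.Create(ctx, desired); err != nil {
            return reconcile.Result{}, err
        }
        return reconcile.Result{Requeue: true}, nil
    } else if err != nil {
        // Any other error (e.g. API server unreachable): return it so the
        // request is retried instead of continuing with an empty object.
        return reconcile.Result{}, err
    }

    // Update StatefulSet if spec changed.
    // Caveat: the live spec includes defaults filled in by the API server,
    // so DeepEqual can report a difference even when nothing changed.
    if !reflect.DeepEqual(found.Spec, desired.Spec) {
        found.Spec = desired.Spec
        if err = r.client.Update(ctx, found); err != nil {
            return reconcile.Result{}, err
        }
    }

    // Handle backups
    if instance.Spec.Backup != nil {
        if err = r.ensureBackupCronJob(instance); err != nil {
            return reconcile.Result{}, err
        }
    }

    return reconcile.Result{}, nil
}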
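
One caveat with the reflect.DeepEqual check in Reconcile: the API server fills in defaults on the stored object, so the live spec rarely equals a freshly built one. A common workaround is to compare only the fields the operator manages. The sketch below shows the idea with a simplified stand-in type. StatefulSetSpec here is not the real appsv1 type, and needsUpdate is a hypothetical helper:

```go
package main

import (
	"fmt"
	"reflect"
)

// StatefulSetSpec is a simplified stand-in for appsv1.StatefulSetSpec.
type StatefulSetSpec struct {
	// Fields the operator sets explicitly.
	Replicas int32
	Image    string

	// Fields the API server defaults when they are left empty.
	PodManagementPolicy  string
	RevisionHistoryLimit int32
}

// needsUpdate reports whether any operator-managed field differs between
// the desired spec and the live one. Server-defaulted fields are ignored.
func needsUpdate(desired, live StatefulSetSpec) bool {
	return desired.Replicas != live.Replicas || desired.Image != live.Image
}

func main() {
	// What the operator builds from the custom resource.
	desired := StatefulSetSpec{Replicas: 3, Image: "postgres:11"}

	// What the API server returns after applying its defaults.
	live := StatefulSetSpec{
		Replicas:             3,
		Image:                "postgres:11",
		PodManagementPolicy:  "OrderedReady",
		RevisionHistoryLimit: 10,
	}

	fmt.Println("DeepEqual says equal:", reflect.DeepEqual(desired, live))
	fmt.Println("needsUpdate:", needsUpdate(desired, live))
}
```

With a whole-struct comparison the operator would issue an Update on every pass. Comparing only managed fields avoids that, at the cost of having to keep the field list in sync with what the operator actually sets.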

Features We Implemented

1. Automated Backups

apiVersion: database.example.com/v1
kind: PostgreSQL
metadata:
  name: my-database
spec:
  backup:
    schedule: "0 2 * * *"  # Daily at 2 AM
    retention: 7            # Keep 7 days
    s3Bucket: "backups"

The Operator creates a CronJob that runs pg_dump and uploads the dump to S3.
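
The retention setting implies some pruning logic: anything older than the retention window gets deleted. Here is a rough sketch of that decision, assuming retention is counted in days from the backup's creation time. The Backup type, the key layout, and the helper names are made up for this example:

```go
package main

import (
	"fmt"
	"time"
)

// Backup describes one stored dump. The Key layout is hypothetical.
type Backup struct {
	Key       string
	CreatedAt time.Time
}

// exampleNow is a fixed reference time so the example is deterministic.
var exampleNow = time.Date(2020, time.March, 10, 2, 0, 0, 0, time.UTC)

// daysBefore returns the instant n days before t.
func daysBefore(t time.Time, n int) time.Time {
	return t.AddDate(0, 0, -n)
}

// expiredBackups returns the keys of all backups created more than
// retentionDays before now. A backup exactly retentionDays old is kept.
func expiredBackups(backups []Backup, now time.Time, retentionDays int) []string {
	cutoff := daysBefore(now, retentionDays)
	var expired []string
	for _, b := range backups {
		if b.CreatedAt.Before(cutoff) {
			expired = append(expired, b.Key)
		}
	}
	return expired
}

func main() {
	backups := []Backup{
		{Key: "my-database/2020-03-01.sql.gz", CreatedAt: daysBefore(exampleNow, 9)},
		{Key: "my-database/2020-03-03.sql.gz", CreatedAt: daysBefore(exampleNow, 7)},
		{Key: "my-database/2020-03-09.sql.gz", CreatedAt: daysBefore(exampleNow, 1)},
	}
	for _, key := range expiredBackups(backups, exampleNow, 7) {
		fmt.Println("would delete", key)
	}
}
```

Keeping the pruning decision in a pure function like this makes it easy to unit test without touching S3.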

2. Automatic Failover

When the primary pod fails:

  1. Operator detects failure
  2. Promotes replica to primary
  3. Updates Service to point to new primary
  4. Creates new replica

All automatic.
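
Step 2 hides a decision: which replica to promote. One plausible rule, used here purely as an illustration, is to pick the healthy replica with the least replication lag. The Replica type and pickPromotionCandidate are hypothetical names, not part of any real operator:

```go
package main

import "fmt"

// Replica is a simplified view of one standby pod.
type Replica struct {
	Name     string
	Ready    bool
	LagBytes int64 // bytes of WAL the replica has not replayed yet
}

// pickPromotionCandidate returns the ready replica with the smallest
// replication lag. The boolean is false when no replica is ready.
func pickPromotionCandidate(replicas []Replica) (Replica, bool) {
	var best Replica
	found := false
	for _, r := range replicas {
		if !r.Ready {
			continue // never promote a replica that is not serving
		}
		if !found || r.LagBytes < best.LagBytes {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	replicas := []Replica{
		{Name: "my-database-1", Ready: true, LagBytes: 4096},
		{Name: "my-database-2", Ready: true, LagBytes: 128},
	}
	if c, ok := pickPromotionCandidate(replicas); ok {
		fmt.Println("promote", c.Name)
	} else {
		fmt.Println("no healthy replica; manual intervention needed")
	}
}
```

The "no candidate" branch matters as much as the happy path: promoting an unhealthy or badly lagging replica can lose data, so the operator should stop and alert instead.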

3. Rolling Upgrades

spec:
  version: "12"  # Upgrade from 11 to 12

Operator:

  1. Creates new pods with version 12
  2. Migrates data
  3. Switches traffic
  4. Removes old pods

Zero downtime.
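
A multi-step upgrade like this does not fit in a single reconcile pass, so the operator has to remember where it is. One way to do that, sketched below, is to store a phase in the resource's status and advance it one step per pass. The phase names mirror the four steps above but are otherwise invented for this example:

```go
package main

import "fmt"

// Phase records how far an upgrade has progressed. In a real operator it
// would live in the custom resource's status so it survives restarts.
type Phase string

const (
	PhaseCreatingNewPods  Phase = "CreatingNewPods"
	PhaseMigratingData    Phase = "MigratingData"
	PhaseSwitchingTraffic Phase = "SwitchingTraffic"
	PhaseRemovingOldPods  Phase = "RemovingOldPods"
	PhaseDone             Phase = "Done"
)

// upgradePhases lists the phases in execution order.
var upgradePhases = []Phase{
	PhaseCreatingNewPods,
	PhaseMigratingData,
	PhaseSwitchingTraffic,
	PhaseRemovingOldPods,
	PhaseDone,
}

// nextPhase returns the phase that follows current. PhaseDone and unknown
// values are returned unchanged.
func nextPhase(current Phase) Phase {
	for i, p := range upgradePhases {
		if p == current && i+1 < len(upgradePhases) {
			return upgradePhases[i+1]
		}
	}
	return current
}

func main() {
	// Each loop iteration stands in for one successful reconcile pass.
	phase := PhaseCreatingNewPods
	for phase != PhaseDone {
		fmt.Println("running step:", phase)
		phase = nextPhase(phase)
	}
	fmt.Println("upgrade finished")
}
```

Because the phase only advances after a step succeeds, an operator that crashes mid-upgrade can pick up from the recorded phase instead of starting over.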

4. Monitoring

Operator exposes Prometheus metrics:

  • Database size
  • Connection count
  • Replication lag
  • Backup status
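
For a feel of what the operator serves on its metrics endpoint, here is a small sketch that renders gauges in the Prometheus text exposition format using only the standard library. The metric names and the database label are assumptions for this example, and a real operator would normally use the Prometheus Go client library rather than formatting lines by hand:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gauge is one metric sample. The names used in main are made up.
type gauge struct {
	Name  string
	Help  string
	Value float64
}

// metricLine renders a single sample line with a database label.
// %q is enough for Kubernetes resource names, which contain no
// characters that need special escaping.
func metricLine(name, database string, value float64) string {
	return fmt.Sprintf("%s{database=%q} %s\n",
		name, database, strconv.FormatFloat(value, 'f', -1, 64))
}

// renderMetrics renders HELP, TYPE and sample lines for every gauge.
func renderMetrics(database string, gauges []gauge) string {
	var b strings.Builder
	for _, g := range gauges {
		fmt.Fprintf(&b, "# HELP %s %s\n", g.Name, g.Help)
		fmt.Fprintf(&b, "# TYPE %s gauge\n", g.Name)
		b.WriteString(metricLine(g.Name, database, g.Value))
	}
	return b.String()
}

func main() {
	fmt.Print(renderMetrics("my-database", []gauge{
		{"postgresql_database_size_bytes", "Size of the database in bytes.", 52428800},
		{"postgresql_connections", "Number of open connections.", 17},
		{"postgresql_replication_lag_seconds", "Replica lag in seconds.", 0.25},
		{"postgresql_last_backup_success", "1 if the last backup succeeded.", 1},
	}))
}
```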

Deployment

# Build operator image (pre-1.0 SDK command; newer releases use make docker-build)
operator-sdk build registry.example.com/postgresql-operator:v1.0.0

# Push image
docker push registry.example.com/postgresql-operator:v1.0.0

# Deploy CRD
kubectl apply -f deploy/crds/database.example.com_postgresqls_crd.yaml

# Deploy operator
kubectl apply -f deploy/operator.yaml

Using the Operator

apiVersion: database.example.com/v1
kind: PostgreSQL
metadata:
  name: app-database
spec:
  version: "11"
  replicas: 3
  storage: 100Gi
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"
  backup:
    schedule: "0 2 * * *"
    retention: 7
    s3Bucket: "my-backups"

Save this as app-database.yaml and apply it:

kubectl apply -f app-database.yaml

Operator creates:

  • StatefulSet (3 replicas)
  • Service (for connections)
  • PersistentVolumeClaims (100Gi each)
  • CronJob (for backups)
  • ConfigMap (for PostgreSQL config)
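
When debugging, it helps to know what these objects will be called. A StatefulSet names its pods name-ordinal and its PVCs claimTemplate-name-ordinal. The sketch below prints the names to expect, assuming the StatefulSet is named after the custom resource (as in the controller code above) and the volume claim template is called data, which is a guess for illustration:

```go
package main

import "fmt"

// childNames lists the pod and PVC names a StatefulSet generates for the
// given number of replicas, following the StatefulSet naming convention.
func childNames(stsName, claimTemplate string, replicas int) (pods, pvcs []string) {
	for i := 0; i < replicas; i++ {
		pod := fmt.Sprintf("%s-%d", stsName, i)
		pods = append(pods, pod)
		pvcs = append(pvcs, fmt.Sprintf("%s-%s", claimTemplate, pod))
	}
	return pods, pvcs
}

func main() {
	// "data" is a hypothetical volume claim template name.
	pods, pvcs := childNames("app-database", "data", 3)
	for i := range pods {
		fmt.Printf("pod %s uses pvc %s\n", pods[i], pvcs[i])
	}
}
```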

Benefits

Before Operator:

  • Manual setup (2 hours)
  • Manual backups
  • Manual failover (30 minutes downtime)
  • Manual upgrades (4 hours)

After Operator:

  • Automated setup (5 minutes)
  • Automated backups
  • Automated failover (< 1 minute)
  • Automated upgrades (30 minutes, zero downtime)

Challenges

1. Complexity

Operators are complex. Lots of edge cases to handle.

2. Testing

Testing operators is hard. We use:

  • Unit tests for controller logic
  • Integration tests with kind (Kubernetes in Docker)
  • Manual testing in staging

3. Debugging

When something goes wrong, debugging is tricky. Good logging is essential.

Popular Operators

  • Prometheus Operator: Manages Prometheus
  • Elasticsearch Operator: Manages Elasticsearch
  • Kafka Operator: Manages Kafka
  • PostgreSQL Operators: Manage PostgreSQL (Zalando’s, Crunchy Data’s)

Check OperatorHub.io for more.

Should You Build an Operator?

Yes, if:

  • Managing complex stateful applications
  • Need automation for operations tasks
  • Have time to invest in development

No, if:

  • Application is simple
  • Helm charts are sufficient
  • Don’t have Go expertise

Our Verdict

Building an Operator was worth it. It automated hours of manual work and reduced errors.

But it’s not trivial. Plan for 2-4 weeks of development.

For simple apps, stick with Helm. For complex stateful apps, consider an Operator.

Questions? Ask away!