Kubernetes Operators: Automating Complex Applications
We built our first Kubernetes Operator. Here’s what we learned.
What is an Operator?
An Operator is a custom controller that extends Kubernetes to manage complex applications.
Think of it as:
- Custom Resource Definition (CRD) = New resource type
- Controller = Logic to manage that resource
Why Operators?
We run PostgreSQL in Kubernetes. Managing it manually is painful:
- Backups
- Failover
- Scaling
- Upgrades
An Operator automates all of this.
How Operators Work
- Define a Custom Resource:
apiVersion: database.example.com/v1
kind: PostgreSQL
metadata:
  name: my-database
spec:
  version: "11"
  replicas: 3
  storage: 100Gi
  backup:
    schedule: "0 2 * * *"
- Operator watches for these resources
- Operator creates/updates Pods, Services, etc.
- Operator handles failures, backups, etc.
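The watch, compare, act cycle in the steps above can be sketched without any Kubernetes dependency. This is a toy model, not our controller: `State` and `reconcile` are hypothetical names, and a real operator would read the observed state from the API server instead of taking it as an argument.

```go
package main

import "fmt"

// State is a toy stand-in for the parts of the cluster the operator manages.
type State struct {
	Replicas int
	Version  string
}

// reconcile compares desired and observed state and returns the actions
// needed to converge. It is idempotent: no drift means no actions.
func reconcile(desired, observed State) []string {
	var actions []string
	if observed.Version != desired.Version {
		actions = append(actions, fmt.Sprintf("upgrade %s -> %s", observed.Version, desired.Version))
	}
	switch {
	case observed.Replicas < desired.Replicas:
		actions = append(actions, fmt.Sprintf("add %d replica(s)", desired.Replicas-observed.Replicas))
	case observed.Replicas > desired.Replicas:
		actions = append(actions, fmt.Sprintf("remove %d replica(s)", observed.Replicas-desired.Replicas))
	}
	return actions
}

func main() {
	desired := State{Replicas: 3, Version: "11"}
	observed := State{Replicas: 1, Version: "11"}
	fmt.Println(reconcile(desired, observed))     // one action: add two replicas
	fmt.Println(len(reconcile(desired, desired))) // 0: already converged
}
```

The key property is that running `reconcile` again after the actions are applied yields nothing to do, which is what makes it safe for the operator to re-run on every event.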
Building an Operator
We used the Operator SDK:
# Install
brew install operator-sdk
# Create project
operator-sdk new postgresql-operator --repo=github.com/myorg/postgresql-operator
# Add API
operator-sdk add api --api-version=database.example.com/v1 --kind=PostgreSQL
# Add controller
operator-sdk add controller --api-version=database.example.com/v1 --kind=PostgreSQL
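The `add api` step scaffolds Go types that you then fill in to match your resource. As a rough sketch, the spec from the YAML earlier could map to structs like the ones below. The field names and JSON tags are our assumption based on that YAML, and `parseSpec` is a hypothetical helper that only exists to show the decoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackupSpec mirrors the backup block of the custom resource.
type BackupSpec struct {
	Schedule  string `json:"schedule"`
	Retention int    `json:"retention,omitempty"`
	S3Bucket  string `json:"s3Bucket,omitempty"`
}

// PostgreSQLSpec mirrors the spec block of the custom resource.
// Backup is a pointer so that an omitted block decodes to nil.
type PostgreSQLSpec struct {
	Version  string      `json:"version"`
	Replicas int32       `json:"replicas"`
	Storage  string      `json:"storage"`
	Backup   *BackupSpec `json:"backup,omitempty"`
}

// parseSpec decodes a JSON-encoded spec into the typed struct.
func parseSpec(raw []byte) (PostgreSQLSpec, error) {
	var spec PostgreSQLSpec
	err := json.Unmarshal(raw, &spec)
	return spec, err
}

func main() {
	raw := []byte(`{"version":"11","replicas":3,"storage":"100Gi","backup":{"schedule":"0 2 * * *"}}`)
	spec, err := parseSpec(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("version %s, %d replicas, backup schedule %q\n", spec.Version, spec.Replicas, spec.Backup.Schedule)
}
```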
The Controller Logic
import (
	"context"
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Path assumes the SDK's default project layout
	databasev1 "github.com/myorg/postgresql-operator/pkg/apis/database/v1"
)

func (r *ReconcilePostgreSQL) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	ctx := context.TODO()

	// Fetch the PostgreSQL instance
	instance := &databasev1.PostgreSQL{}
	err := r.client.Get(ctx, request.NamespacedName, instance)
	if err != nil {
		if errors.IsNotFound(err) {
			// The resource was deleted after the request was queued; nothing to do
			return reconcile.Result{}, nil
		}
		return reconcile.Result{}, err
	}

	// Build the desired StatefulSet once and reuse it below
	desired := r.statefulSetForPostgreSQL(instance)

	// Create StatefulSet if it doesn't exist
	found := &appsv1.StatefulSet{}
	err = r.client.Get(ctx, types.NamespacedName{
		Name:      instance.Name,
		Namespace: instance.Namespace,
	}, found)
	if err != nil && errors.IsNotFound(err) {
		err = r.client.Create(ctx, desired)
		if err != nil {
			return reconcile.Result{}, err
		}
		return reconcile.Result{Requeue: true}, nil
	} else if err != nil {
		// Any other lookup error: return it so the request is retried
		return reconcile.Result{}, err
	}

	// Update StatefulSet if spec changed
	if !reflect.DeepEqual(found.Spec, desired.Spec) {
		found.Spec = desired.Spec
		err = r.client.Update(ctx, found)
		if err != nil {
			return reconcile.Result{}, err
		}
	}

	// Handle backups
	if instance.Spec.Backup != nil {
		err = r.ensureBackupCronJob(instance)
		if err != nil {
			return reconcile.Result{}, err
		}
	}

	return reconcile.Result{}, nil
}
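One caveat with the `reflect.DeepEqual` comparison above: the API server fills in defaults on stored objects, so a freshly built spec rarely equals the stored one and the controller can end up issuing an update on every pass. A common workaround is to compare only the fields the operator sets itself. The sketch below shows the idea with a toy `Spec` type; the fields and `needsUpdate` are hypothetical, not the real StatefulSet API.

```go
package main

import "fmt"

// Spec is a toy stand-in for a StatefulSet spec.
type Spec struct {
	Replicas      int32
	Image         string
	RevisionLimit int32 // imagine the API server defaults this field
}

// needsUpdate compares only the fields the operator sets itself, so
// server-side defaults on other fields do not trigger an update.
func needsUpdate(desired, found Spec) bool {
	return desired.Replicas != found.Replicas || desired.Image != found.Image
}

func main() {
	desired := Spec{Replicas: 3, Image: "postgres:11"}
	found := Spec{Replicas: 3, Image: "postgres:11", RevisionLimit: 10}

	fmt.Println(desired == found)            // false: the defaulted field differs
	fmt.Println(needsUpdate(desired, found)) // false: the owned fields match
}
```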
Features We Implemented
1. Automated Backups
apiVersion: database.example.com/v1
kind: PostgreSQL
metadata:
  name: my-database
spec:
  backup:
    schedule: "0 2 * * *"  # Daily at 2 AM
    retention: 7           # Keep 7 days
    s3Bucket: "backups"
The Operator creates a CronJob that runs pg_dump and uploads the dump to S3.
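The `retention: 7` setting implies some pruning logic on top of the dump itself. Here is a minimal sketch of how that could be decided, assuming each stored dump has a key and a timestamp. The `Backup` type and the `day` and `expired` helpers are hypothetical, and listing or deleting objects in S3 is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Backup is a hypothetical record of one dump stored in the bucket.
type Backup struct {
	Key     string
	TakenAt time.Time
}

// day builds a UTC timestamp at 02:00, matching the "0 2 * * *" schedule.
func day(year, month, dayOfMonth int) time.Time {
	return time.Date(year, time.Month(month), dayOfMonth, 2, 0, 0, 0, time.UTC)
}

// expired returns the keys of backups taken more than retentionDays before now.
func expired(backups []Backup, now time.Time, retentionDays int) []string {
	cutoff := now.AddDate(0, 0, -retentionDays)
	var keys []string
	for _, b := range backups {
		if b.TakenAt.Before(cutoff) {
			keys = append(keys, b.Key)
		}
	}
	return keys
}

func main() {
	now := day(2020, 1, 10)
	backups := []Backup{
		{Key: "db-2020-01-09.sql", TakenAt: day(2020, 1, 9)},
		{Key: "db-2020-01-03.sql", TakenAt: day(2020, 1, 3)},
		{Key: "db-2020-01-02.sql", TakenAt: day(2020, 1, 2)},
	}
	// Only the backup from 2020-01-02 is past the 7-day retention window.
	fmt.Println(expired(backups, now, 7))
}
```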
2. Automatic Failover
When the primary pod fails:
- Operator detects failure
- Promotes replica to primary
- Updates Service to point to new primary
- Creates new replica
All automatic.
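The "promotes replica" step hides a decision: which replica? One reasonable policy, shown here as an assumption rather than a description of our exact logic, is to pick the ready replica with the least replication lag. `Replica` and `pickCandidate` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Replica is a hypothetical view of one standby pod.
type Replica struct {
	Pod      string
	Ready    bool
	LagBytes int64 // how far the replica is behind the failed primary
}

// pickCandidate returns the ready replica with the least replication lag.
func pickCandidate(replicas []Replica) (Replica, error) {
	var best Replica
	found := false
	for _, r := range replicas {
		if !r.Ready {
			continue // never promote a replica that is not serving
		}
		if !found || r.LagBytes < best.LagBytes {
			best, found = r, true
		}
	}
	if !found {
		return Replica{}, errors.New("no ready replica to promote")
	}
	return best, nil
}

func main() {
	candidate, err := pickCandidate([]Replica{
		{Pod: "my-database-1", Ready: true, LagBytes: 4096},
		{Pod: "my-database-2", Ready: true, LagBytes: 0},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("promote", candidate.Pod)
}
```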
3. Rolling Upgrades
spec:
  version: "12"  # Upgrade from 11 to 12
Operator:
- Creates new pods with version 12
- Migrates data
- Switches traffic
- Removes old pods
Zero downtime.
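Before starting the steps above, it helps to reject version changes that cannot work. PostgreSQL has no in-place downgrade between major versions, so a guard like the following sketch can fail fast. `checkVersionChange` is a hypothetical helper, and it assumes versions are plain major numbers such as "11" and "12".

```go
package main

import (
	"fmt"
	"strconv"
)

// checkVersionChange rejects malformed versions and downgrades.
// An unchanged version is allowed and simply means there is nothing to do.
func checkVersionChange(current, target string) error {
	cur, err := strconv.Atoi(current)
	if err != nil {
		return fmt.Errorf("current version %q is not a major version number", current)
	}
	tgt, err := strconv.Atoi(target)
	if err != nil {
		return fmt.Errorf("target version %q is not a major version number", target)
	}
	if tgt < cur {
		return fmt.Errorf("downgrade from %d to %d is not supported", cur, tgt)
	}
	return nil
}

func main() {
	fmt.Println(checkVersionChange("11", "12")) // allowed, so the error is nil
	fmt.Println(checkVersionChange("12", "11")) // rejected as a downgrade
}
```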
4. Monitoring
Operator exposes Prometheus metrics:
- Database size
- Connection count
- Replication lag
- Backup status
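To give a feel for what exposing these metrics involves, here is a dependency-free sketch that renders gauges in the Prometheus text exposition format. In practice you would more likely use the official Prometheus Go client. The metric names, the `database` label, and the `Metrics`, `gauge`, `render`, and `metricsHandler` names are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// Metrics is a hypothetical snapshot of one database's health.
type Metrics struct {
	SizeBytes             float64
	Connections           float64
	ReplicationLagSeconds float64
	LastBackupOK          bool
}

// gauge renders one gauge sample in the Prometheus text format.
func gauge(name, db string, value float64) string {
	return fmt.Sprintf("# TYPE %s gauge\n%s{database=%q} %g\n", name, name, db, value)
}

// render produces the full /metrics body for one database.
func render(db string, m Metrics) string {
	backupOK := 0.0
	if m.LastBackupOK {
		backupOK = 1.0
	}
	return gauge("postgresql_size_bytes", db, m.SizeBytes) +
		gauge("postgresql_connections", db, m.Connections) +
		gauge("postgresql_replication_lag_seconds", db, m.ReplicationLagSeconds) +
		gauge("postgresql_last_backup_ok", db, backupOK)
}

// metricsHandler serves the rendered metrics; read supplies a fresh snapshot.
func metricsHandler(db string, read func() Metrics) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, render(db, read()))
	}
}

func main() {
	snapshot := Metrics{SizeBytes: 1024, Connections: 5, ReplicationLagSeconds: 0.5, LastBackupOK: true}
	fmt.Print(render("app-database", snapshot))
	// To serve it: http.Handle("/metrics", metricsHandler("app-database", readFn))
}
```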
Deployment
# Build operator image
operator-sdk build registry.example.com/postgresql-operator:v1.0.0
# Push image
docker push registry.example.com/postgresql-operator:v1.0.0
# Deploy CRD
kubectl apply -f deploy/crds/database.example.com_postgresqls_crd.yaml
# Deploy operator
kubectl apply -f deploy/operator.yaml
Using the Operator
apiVersion: database.example.com/v1
kind: PostgreSQL
metadata:
  name: app-database
spec:
  version: "11"
  replicas: 3
  storage: 100Gi
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"
  backup:
    schedule: "0 2 * * *"
    retention: 7
    s3Bucket: "my-backups"
kubectl apply -f app-database.yaml
Operator creates:
- StatefulSet (3 replicas)
- Service (for connections)
- PersistentVolumeClaims (100Gi each)
- CronJob (for backups)
- ConfigMap (for PostgreSQL config)
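The PersistentVolumeClaims in that list are created by the StatefulSet, one per replica, and Kubernetes names them `<claim template>-<StatefulSet name>-<ordinal>`. The sketch below computes those names, assuming the StatefulSet is named after the custom resource (as in the controller code) and a claim template called `data`, which is our placeholder. `pvcNames` is a hypothetical helper.

```go
package main

import "fmt"

// pvcNames returns the PVC names a StatefulSet creates for its replicas,
// following the <claimTemplate>-<statefulSet>-<ordinal> convention.
func pvcNames(claimTemplate, statefulSet string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names, fmt.Sprintf("%s-%s-%d", claimTemplate, statefulSet, i))
	}
	return names
}

func main() {
	for _, name := range pvcNames("data", "app-database", 3) {
		fmt.Println(name)
	}
}
```

Knowing the names is handy when debugging, since these claims are not deleted automatically when the StatefulSet is scaled down.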
Benefits
Before Operator:
- Manual setup (2 hours)
- Manual backups
- Manual failover (30 minutes downtime)
- Manual upgrades (4 hours)
After Operator:
- Automated setup (5 minutes)
- Automated backups
- Automated failover (< 1 minute)
- Automated upgrades (30 minutes, zero downtime)
Challenges
1. Complexity
Operators are complex. Lots of edge cases to handle.
2. Testing
Testing operators is hard. We use:
- Unit tests for controller logic
- Integration tests with kind (Kubernetes in Docker)
- Manual testing in staging
3. Debugging
When something goes wrong, debugging is tricky. Good logging is essential.
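What "good logging" meant for us in practice: every line names the resource and the reconcile step, so a failure can be traced to one object and one action. Below is a minimal sketch of that idea using only the standard `log` package. The SDK scaffolds its own logger, so treat `formatStep` and the key names as hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
)

// errNotReady is a sample failure used only for the demo below.
var errNotReady = errors.New("pod not ready")

// formatStep builds one key=value log line for a reconcile step.
func formatStep(namespace, name, step string, err error) string {
	if err != nil {
		return fmt.Sprintf("level=error resource=%s/%s step=%s err=%q", namespace, name, step, err)
	}
	return fmt.Sprintf("level=info resource=%s/%s step=%s", namespace, name, step)
}

func main() {
	logger := log.New(os.Stderr, "", log.LstdFlags)
	logger.Print(formatStep("default", "app-database", "ensure-statefulset", nil))
	logger.Print(formatStep("default", "app-database", "promote-replica", errNotReady))
}
```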
Popular Operators
- Prometheus Operator: Manages Prometheus
- Elasticsearch Operator: Manages Elasticsearch
- Kafka Operator: Manages Kafka
- PostgreSQL Operators: Zalando’s and Crunchy Data’s
Check OperatorHub.io for more.
Should You Build an Operator?
Yes, if:
- You manage complex stateful applications
- You need to automate operational tasks
- You have time to invest in development
No, if:
- Your application is simple
- Helm charts are sufficient
- You don’t have Go expertise
Our Verdict
Building an Operator was worth it. It automated hours of manual work and reduced errors.
But it’s not trivial. Plan for 2-4 weeks of development.
For simple apps, stick with Helm. For complex stateful apps, consider an Operator.
Questions? Ask away!