Running Stateful Applications on Kubernetes: StatefulSets Deep Dive
For the past year, we’ve been running all our stateless services on Kubernetes. But we kept our databases on traditional VMs because “you shouldn’t run databases in containers.”
Last quarter, I decided to challenge that assumption. We migrated our PostgreSQL and Redis instances to Kubernetes using StatefulSets. Here’s what I learned.
Why StatefulSets?
Regular Kubernetes Deployments are great for stateless apps, but they fall short for databases:
- Pods get random names - postgres-7d8f9c-xk2p9 changes on every restart
- No stable network identity - IP addresses change
- No ordered deployment - Pods start in random order
- Storage is ephemeral - Data disappears when pod dies
StatefulSets solve all of these:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:10.5
        ports:
        - containerPort: 5432
          name: postgres
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
This creates pods named postgres-0, postgres-1, postgres-2 with stable identities.
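A quick way to see the ordered, stable naming in action (just a sanity check, not part of the manifests):

# Pods come up in order: postgres-0, then postgres-1, then postgres-2
kubectl get pods -l app=postgres -w

# Deleting a pod brings it back with the same name (and, as we'll see, the same volume)
kubectl delete pod postgres-1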
Persistent Storage Setup
The biggest challenge was storage. We’re running on AWS, so I used EBS volumes via the AWS EBS CSI driver.
First, install the CSI driver:
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
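It's worth confirming the driver pods are actually running before creating any volumes (pod names and labels vary by driver version, so I just grep):

# Controller and node pods land in kube-system
kubectl get pods -n kube-system | grep ebs-csi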
Create a StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ebs
provisioner: ebs.csi.aws.com
parameters:
  type: gp2
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
The volumeBindingMode: WaitForFirstConsumer is crucial - it ensures the EBS volume is created in the same availability zone as the pod.
Then reference it in the StatefulSet:
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: fast-ebs
    resources:
      requests:
        storage: 100Gi
Each pod gets its own PersistentVolumeClaim, which creates a dedicated EBS volume.
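PVCs created from volumeClaimTemplates are named <template>-<pod>, so this StatefulSet produces data-postgres-0, data-postgres-1, and data-postgres-2, each bound to its own volume:

# One claim and one EBS-backed PersistentVolume per pod
kubectl get pvc
kubectl get pv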
Networking and Service Discovery
StatefulSets rely on a headless service for stable network identities - you create it yourself and reference it via serviceName:
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None  # Headless service
  selector:
    app: postgres
  ports:
  - port: 5432
    name: postgres
Now each pod is accessible via DNS:
postgres-0.postgres.default.svc.cluster.local
postgres-1.postgres.default.svc.cluster.local
postgres-2.postgres.default.svc.cluster.local
This is perfect for database replication where you need to address specific instances.
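To confirm the per-pod DNS records resolve, I run a throwaway pod and look one up (a quick check; any image with a working nslookup does the job):

# Each pod gets its own A record under the headless service
kubectl run -it --rm dns-check --image=busybox:1.28 --restart=Never -- \
  nslookup postgres-0.postgres.default.svc.cluster.local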
PostgreSQL Replication Setup
I set up streaming replication with one master and two replicas. The tricky part is initializing replicas from the master.
I used an init container to handle this:
initContainers:
- name: init-postgres
  image: postgres:10.5
  env:
  # pg_basebackup needs to authenticate as the replication user; this assumes
  # it uses the same password stored in postgres-secret
  - name: PGPASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-secret
        key: password
  command:
  - bash
  - "-c"
  - |
    set -ex
    # If data directory exists, skip initialization
    [[ -d /var/lib/postgresql/data/pgdata ]] && exit 0
    # postgres-0 is the master
    if [[ $HOSTNAME == "postgres-0" ]]; then
      echo "Initializing master"
      exit 0
    fi
    # Replicas: clone from master
    echo "Cloning from master"
    until pg_basebackup -h postgres-0.postgres -D /var/lib/postgresql/data/pgdata -U replication -v -P
    do
      echo "Waiting for master..."
      sleep 5
    done
  volumeMounts:
  - name: data
    mountPath: /var/lib/postgresql/data
The main container then starts with appropriate configuration:
containers:
- name: postgres
  image: postgres:10.5
  env:
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-secret
        key: password
  - name: PGDATA
    value: /var/lib/postgresql/data/pgdata
  command:
  - bash
  - "-c"
  - |
    set -ex
    # Master configuration
    if [[ $HOSTNAME == "postgres-0" ]]; then
      echo "Starting as master"
      cat >> /var/lib/postgresql/data/pgdata/postgresql.conf <<EOF
    wal_level = replica
    max_wal_senders = 3
    wal_keep_segments = 8
    EOF
      cat >> /var/lib/postgresql/data/pgdata/pg_hba.conf <<EOF
    host replication replication 0.0.0.0/0 md5
    EOF
    else
      # Replica configuration
      echo "Starting as replica"
      cat > /var/lib/postgresql/data/pgdata/recovery.conf <<EOF
    standby_mode = on
    primary_conninfo = 'host=postgres-0.postgres port=5432 user=replication password=$POSTGRES_PASSWORD'
    trigger_file = '/tmp/promote'
    EOF
    fi
    exec docker-entrypoint.sh postgres
This setup gives us:
- Automatic failover capability (promote a replica by creating /tmp/promote - see the command sketch below)
- Read replicas for scaling reads
- Data redundancy
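Promotion is a one-liner thanks to the trigger_file setting. A sketch of a manual failover (repointing clients and rebuilding the old master are separate steps):

# Touch the trigger file: the replica exits standby mode and starts accepting writes
kubectl exec postgres-1 -- touch /tmp/promote

# Verify it's no longer in recovery
kubectl exec -it postgres-1 -- bash -c \
  'PGPASSWORD=$POSTGRES_PASSWORD psql -U postgres -c "SELECT pg_is_in_recovery();"'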
Redis Cluster
For Redis, I used a different approach - Redis Cluster mode:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 6
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:5.0-alpine
        command:
        - redis-server
        - --cluster-enabled
        - "yes"
        - --cluster-config-file
        - /data/nodes.conf
        - --cluster-node-timeout
        - "5000"
        - --appendonly
        - "yes"
        ports:
        - containerPort: 6379
          name: client
        - containerPort: 16379
          name: gossip
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast-ebs
      resources:
        requests:
          storage: 10Gi
After deploying, initialize the cluster:
kubectl exec -it redis-0 -- redis-cli --cluster create \
redis-0.redis:6379 \
redis-1.redis:6379 \
redis-2.redis:6379 \
redis-3.redis:6379 \
redis-4.redis:6379 \
redis-5.redis:6379 \
--cluster-replicas 1
This creates a 3-master, 3-replica cluster with automatic sharding.
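Before pointing clients at it, I check that all 16384 slots got assigned and the replica pairing looks right:

# Should show cluster_state:ok and cluster_slots_assigned:16384
kubectl exec -it redis-0 -- redis-cli cluster info

# Lists the three masters and which master each replica follows
kubectl exec -it redis-0 -- redis-cli cluster nodes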
Backup Strategy
Running databases in Kubernetes doesn’t mean you can skip backups. I set up automated backups using CronJobs.
PostgreSQL backup:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:10.5
            command:
            - bash
            - "-c"
            - |
              BACKUP_FILE="/backup/postgres-$(date +%Y%m%d-%H%M%S).sql.gz"
              pg_dump -h postgres-0.postgres -U postgres | gzip > $BACKUP_FILE
              # Upload to S3 (assumes the AWS CLI is available in the image)
              aws s3 cp $BACKUP_FILE s3://my-backups/postgres/
              # Keep only last 7 days locally
              find /backup -name "postgres-*.sql.gz" -mtime +7 -delete
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure
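Backups are only useful if the restore path works, so I script that too. A minimal sketch, assuming you've pulled a dump from S3 (the filename here is just an example of the pattern the backup job produces):

# Replay a dump against the master
aws s3 cp s3://my-backups/postgres/postgres-20190101-020000.sql.gz .
gunzip -c postgres-20190101-020000.sql.gz | \
  kubectl exec -i postgres-0 -- bash -c 'PGPASSWORD=$POSTGRES_PASSWORD psql -U postgres'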
Monitoring and Alerts
I use Prometheus to monitor the databases. For PostgreSQL, I deployed the postgres_exporter:
- name: exporter
  image: wrouesnel/postgres_exporter:latest
  env:
  - name: DATA_SOURCE_NAME
    value: "postgresql://postgres:password@localhost:5432/postgres?sslmode=disable"
  ports:
  - containerPort: 9187
    name: metrics
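To confirm the exporter is exposing metrics before wiring it into Prometheus, a quick port-forward works (assuming the exporter runs as a sidecar in the postgres pods):

# Forward the metrics port and check that pg_up is reported
kubectl port-forward postgres-0 9187:9187 &
curl -s localhost:9187/metrics | grep ^pg_up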
Key metrics I monitor:
- Connection count
- Replication lag
- Disk usage
- Query performance
Alerting rules:
groups:
- name: postgres
  rules:
  - alert: PostgresDown
    expr: pg_up == 0
    for: 1m
    annotations:
      summary: "PostgreSQL is down"
  - alert: ReplicationLag
    expr: pg_replication_lag > 10
    for: 5m
    annotations:
      summary: "Replication lag is {{ $value }} seconds"
Lessons Learned
After three months in production:
What worked well:
- Stable identities - No more IP address chasing
- Automated provisioning - New environments spin up in minutes
- Resource limits - Better resource utilization than VMs
- Backup automation - CronJobs make this trivial
What was challenging:
- Initial setup complexity - Took 2 weeks to get right
- Storage performance - EBS IOPS limits required tuning
- Disaster recovery - Restoring from backup is slower than VMs
- Debugging - Logs are scattered across pods
What I’d do differently:
- Use an operator - something like Zalando's postgres-operator handles much of this for you
- Test failover thoroughly - We had issues during first real failover
- Monitor storage more closely - We hit IOPS limits unexpectedly
- Document runbooks - Recovery procedures are different from VMs
Should You Run Databases on Kubernetes?
Honestly, it depends:
Yes, if:
- You need rapid provisioning of database instances
- You want consistent deployment across environments
- Your team is comfortable with Kubernetes
- You have good monitoring and backup strategies
No, if:
- You need maximum performance (bare metal is still faster)
- You have a small team without Kubernetes expertise
- You’re running massive databases (multi-TB)
- You can’t afford any downtime during learning curve
For us, it’s been worth it. The operational benefits outweigh the complexity. But we spent significant time getting it right, and we still keep critical production data on managed RDS as a safety net.
Conclusion
StatefulSets make running databases on Kubernetes viable, but not trivial. You need to understand storage, networking, and database replication deeply.
If you’re considering this, start small. Run a non-critical database first, test failover scenarios thoroughly, and have a rollback plan. Don’t migrate your production database on a Friday afternoon (I learned this the hard way).
The future is probably Kubernetes operators that handle all this complexity for you. But understanding StatefulSets is still valuable - it’s the foundation everything else builds on.