Redis Cluster: High Availability and Horizontal Scaling
Our single Redis instance was the bottleneck. 100K requests/sec, 90% CPU, response time degrading. We needed to scale.
I set up Redis Cluster with 6 nodes. Now we handle 1M+ requests/sec with automatic failover and zero downtime during node failures.
The Problem
Single Redis instance:
- 100K requests/sec (maxed out)
- 90% CPU usage
- 16GB memory limit
- Single point of failure
- No horizontal scaling
We hit the wall.
Redis Cluster Overview
Features:
- Sharding: Data split across nodes
- Replication: Each master has replicas
- Automatic failover: Replica promotes to master
- No single point of failure
Minimum: 6 nodes (3 masters + 3 replicas)
Installing Redis
Redis 5.0.5:
wget http://download.redis.io/releases/redis-5.0.5.tar.gz
tar xzf redis-5.0.5.tar.gz
cd redis-5.0.5
make
sudo make install
Cluster Configuration
Create 6 config files:
redis-7000.conf:
port 7000
cluster-enabled yes
cluster-config-file nodes-7000.conf
cluster-node-timeout 5000
appendonly yes
dir /var/lib/redis/7000
Repeat for ports 7001-7005.
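To avoid hand-editing six near-identical files, you can generate them. A minimal Python sketch (assumes write access to /etc/redis and /var/lib/redis; the values match the config above, and it also creates the data directories that the dir setting points at):
# generate_configs.py: render one config per node and create its data dir
from pathlib import Path

TEMPLATE = """port {port}
cluster-enabled yes
cluster-config-file nodes-{port}.conf
cluster-node-timeout 5000
appendonly yes
dir /var/lib/redis/{port}
"""

for port in range(7000, 7006):  # ports 7000-7005
    Path(f"/var/lib/redis/{port}").mkdir(parents=True, exist_ok=True)
    Path(f"/etc/redis/redis-{port}.conf").write_text(TEMPLATE.format(port=port))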
Starting Nodes
redis-server /etc/redis/redis-7000.conf &
redis-server /etc/redis/redis-7001.conf &
redis-server /etc/redis/redis-7002.conf &
redis-server /etc/redis/redis-7003.conf &
redis-server /etc/redis/redis-7004.conf &
redis-server /etc/redis/redis-7005.conf &
Creating the Cluster
redis-cli --cluster create \
127.0.0.1:7000 \
127.0.0.1:7001 \
127.0.0.1:7002 \
127.0.0.1:7003 \
127.0.0.1:7004 \
127.0.0.1:7005 \
--cluster-replicas 1
The --cluster-replicas 1 flag pairs each master with one replica. Output:
>>> Performing hash slots allocation on 6 nodes...
Master[0] -> Slots 0 - 5460
Master[1] -> Slots 5461 - 10922
Master[2] -> Slots 10923 - 16383
Adding replica 127.0.0.1:7004 to 127.0.0.1:7000
Adding replica 127.0.0.1:7005 to 127.0.0.1:7001
Adding replica 127.0.0.1:7003 to 127.0.0.1:7002
3 masters, 3 replicas!
Hash Slots
16384 slots total, divided among masters:
- Master[0]: slots 0-5460
- Master[1]: slots 5461-10922
- Master[2]: slots 10923-16383
Key hashing:
HASH_SLOT = CRC16(key) mod 16384
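You can reproduce the slot calculation client-side. A minimal sketch (Redis uses the CRC16-CCITT/XModem variant; the hash-tag exception is covered in the Hash Tags section below):
def crc16(data: bytes) -> int:
    # CRC16-CCITT (XModem): polynomial 0x1021, initial value 0
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # If the key has a non-empty {...} section, only that part is hashed
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

print(hash_slot("user:1000"))  # 11143, the same slot as the redirect below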
Connecting to the Cluster
redis-cli -c -p 7000
-c enables cluster mode, so the client follows MOVED redirects automatically.
127.0.0.1:7000> SET user:1000 "John"
-> Redirected to slot [11143] located at 127.0.0.1:7002
OK
127.0.0.1:7002> GET user:1000
"John"
Python Client
# pip install redis-py-cluster
from rediscluster import RedisCluster
startup_nodes = [
{"host": "127.0.0.1", "port": "7000"},
{"host": "127.0.0.1", "port": "7001"},
{"host": "127.0.0.1", "port": "7002"},
]
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)
rc.set("user:1000", "John")
print(rc.get("user:1000")) # John
Client handles redirects automatically!
Hash Tags
Force keys to same slot:
# Different slots
rc.set("user:1000", "John")
rc.set("orders:1000", "Order1")
# Same slot (using hash tag)
rc.set("user:{1000}", "John")
rc.set("orders:{1000}", "Order1")
{1000} is the hash tag: only the part inside the braces is hashed, so both keys map to the same slot.
This enables multi-key operations:
rc.mget("user:{1000}", "orders:{1000}")
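Reusing the hash_slot() sketch from the Hash Slots section, you can check this locally:
# Only the "1000" inside the braces is hashed, so both keys share a slot
assert hash_slot("user:{1000}") == hash_slot("orders:{1000}")
Without a common tag, a multi-key command whose keys land on different slots is rejected with a CROSSSLOT error.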
Replication
Each master has a replica:
Master 7000 -> Replica 7004
Master 7001 -> Replica 7005
Master 7002 -> Replica 7003
Replicas sync from their masters automatically (Redis replication is asynchronous).
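A quick way to confirm a replica is attached (a sketch, assuming the redis-py package is installed):
import redis

# Ask a master for its replication state
master = redis.Redis(host="127.0.0.1", port=7000, decode_responses=True)
repl = master.info("replication")
print(repl["role"], repl["connected_slaves"])  # expect: master 1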
Automatic Failover
Simulate master failure:
redis-cli -p 7000 DEBUG SEGFAULT
The cluster detects the failure and promotes the replica. Verify from a surviving node:
redis-cli -p 7001 CLUSTER NODES
Output:
7004... master - 0 1554710400000 7 connected 0-5460
7000... master,fail - 1554710395000 1 disconnected
Replica 7004 promoted to master!
Adding Nodes
Add new master:
redis-server /etc/redis/redis-7006.conf &
redis-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7000
Rebalance slots (by default, rebalance skips masters that own no slots, so include --cluster-use-empty-masters):
redis-cli --cluster rebalance 127.0.0.1:7000 --cluster-use-empty-masters
Add replica:
redis-server /etc/redis/redis-7007.conf &
redis-cli --cluster add-node 127.0.0.1:7007 127.0.0.1:7000 \
--cluster-slave \
--cluster-master-id <master-node-id>
Removing Nodes
Remove replica:
redis-cli --cluster del-node 127.0.0.1:7000 <node-id>
Remove master (its slots must be moved away first):
# Reshard slots to other masters (interactive: prompts for slot count, target, and sources)
redis-cli --cluster reshard 127.0.0.1:7000
# Then remove
redis-cli --cluster del-node 127.0.0.1:7000 <node-id>
Monitoring
Cluster info:
redis-cli -p 7000 CLUSTER INFO
Output:
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
Node info:
redis-cli -p 7000 CLUSTER NODES
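The same checks work from code. A minimal probe (a sketch, assuming redis-py, which parses the CLUSTER INFO reply into a dict of strings):
import redis

node = redis.Redis(host="127.0.0.1", port=7000, decode_responses=True)
info = node.execute_command("CLUSTER INFO")
# Alert if the cluster is degraded or any hash slot is unassigned
if info["cluster_state"] != "ok" or int(info["cluster_slots_assigned"]) != 16384:
    raise RuntimeError("cluster unhealthy: " + info["cluster_state"])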
Production Setup
6 servers (3 masters + 3 replicas):
Server 1 (Master):
# /etc/redis/redis.conf
port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
maxmemory 8gb
maxmemory-policy allkeys-lru
bind 0.0.0.0
protected-mode no
requirepass your_password
# With requirepass set, replicas need masterauth to sync from their masters:
masterauth your_password
Servers 2-6:
# Same config on each server; the --cluster-replicas option below decides which nodes become replicas.
Create cluster:
redis-cli -a your_password --cluster create \
server1:6379 \
server2:6379 \
server3:6379 \
server4:6379 \
server5:6379 \
server6:6379 \
--cluster-replicas 1
Client Configuration
from rediscluster import RedisCluster
startup_nodes = [
{"host": "server1", "port": "6379"},
{"host": "server2", "port": "6379"},
{"host": "server3", "port": "6379"},
]
rc = RedisCluster(
startup_nodes=startup_nodes,
decode_responses=True,
password="your_password",
skip_full_coverage_check=True,
max_connections=50,
max_connections_per_node=True  # limit applies per node, not across the whole pool
)
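Example usage: a read-through cache helper on top of rc. A sketch; fetch_user_from_db is a hypothetical stand-in for your database call:
import json

def get_user(user_id):
    key = f"user:{{{user_id}}}"  # hash tag keeps a user's keys on one slot
    cached = rc.get(key)
    if cached is not None:
        return json.loads(cached)
    user = fetch_user_from_db(user_id)  # hypothetical DB lookup
    rc.set(key, json.dumps(user), ex=3600)  # cache for one hour
    return user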
Monitoring with Prometheus
Redis exporter:
docker run -d \
--name redis-exporter \
-p 9121:9121 \
oliver006/redis_exporter \
--redis.addr=redis://server1:6379
Prometheus config:
scrape_configs:
  - job_name: 'redis'
    static_configs:
      - targets:
          - server1:9121
          - server2:9121
          - server3:9121
Backup Strategy
RDB snapshots:
# Snapshot if >=1 change in 900s, >=10 changes in 300s, or >=10000 in 60s:
save 900 1
save 300 10
save 60 10000
AOF persistence:
appendonly yes
appendfsync everysec
Backup script:
#!/bin/bash
# Snapshot each master, then archive the dump file.
for port in 7000 7001 7002; do
  redis-cli -p $port BGSAVE
  # BGSAVE runs in the background; a fixed sleep is crude. In production,
  # poll LASTSAVE until its timestamp changes before copying.
  sleep 60
  cp /var/lib/redis/$port/dump.rdb /backup/redis-$port-$(date +%Y%m%d).rdb
done
Performance Tuning
Kernel settings (apply with sysctl -p after editing):
# /etc/sysctl.conf
vm.overcommit_memory = 1
net.core.somaxconn = 65535
Redis config:
tcp-backlog 511
timeout 0
tcp-keepalive 300
maxclients 10000
Results
Before (single instance):
- 100K requests/sec
- 90% CPU
- 16GB memory limit
- Single point of failure
After (cluster):
- 1M+ requests/sec (10x)
- 30% CPU per node
- 48GB total memory (3x16GB)
- Automatic failover
Lessons Learned
- Plan capacity - 3 masters minimum
- Use hash tags - For multi-key operations
- Monitor closely - Watch for slot migrations
- Test failover - Before production
- Backup regularly - RDB + AOF
Conclusion
Redis Cluster provides horizontal scaling and high availability. Essential for high-traffic applications.
Key takeaways:
- Sharding across multiple masters
- Replication for high availability
- Automatic failover
- Hash tags for multi-key ops
- Monitor and backup
Scale Redis properly. Your application will thank you.