Scaling Microservices to 10K RPS with Kubernetes - A Production Journey
Last quarter, our platform experienced explosive growth. Traffic increased from 500 RPS to over 10,000 RPS in just 8 weeks. Here’s how we scaled our Kubernetes-based microservices architecture to handle this growth without downtime.
The Challenge
Our e-commerce platform runs on a microservices architecture with 15 services:
- User service
- Product catalog
- Order processing
- Payment gateway
- Inventory management
- Notification service
- And 9 more supporting services
The initial architecture handled 500 RPS comfortably but started showing cracks at 2,000 RPS.
Initial Architecture
Cluster Setup
```yaml
# Original cluster configuration (simplified view of the managed node pool)
apiVersion: v1
kind: Cluster
metadata:
  name: production-cluster
spec:
  nodes:
    - type: n1-standard-4
      count: 5
  version: 1.22.6
```
Resources:
- 5 nodes (4 vCPUs, 15GB RAM each)
- Total capacity: 20 vCPUs, 75GB RAM
- Cost: $600/month
Service Configuration
Typical service deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: user-service:v1.2.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
Performance Bottlenecks
1. Database Connection Pool Exhaustion
At 2,500 RPS, we started seeing:
```
Error: connection pool exhausted
Active connections: 100/100
Wait time: 5000ms
```
Solution: Implemented connection pooling with PgBouncer:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    userdb = host=postgres-primary port=5432 dbname=users

    [pgbouncer]
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 25
    reserve_pool_size = 5
```
Result: Reduced database connections by 60%, improved response time by 40%.
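The ConfigMap above only holds the configuration; the pooler itself runs as its own Deployment and Service that applications connect to instead of Postgres directly. A minimal sketch of that wiring, assuming a community PgBouncer image and PgBouncer's default listen port of 6432 (both illustrative, not our exact manifests):

```yaml
# Illustrative PgBouncer Deployment + Service; image, mount path, and names are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
        - name: pgbouncer
          image: edoburu/pgbouncer   # one commonly used community image (placeholder)
          ports:
            - containerPort: 6432    # PgBouncer's default listen port
          volumeMounts:
            - name: config
              mountPath: /etc/pgbouncer   # pgbouncer.ini from the ConfigMap above
      volumes:
        - name: config
          configMap:
            name: pgbouncer-config
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
spec:
  selector:
    app: pgbouncer
  ports:
    - port: 6432
      targetPort: 6432
```

Services then point their database host at pgbouncer:6432 instead of postgres-primary:5432, so up to 1,000 client connections funnel into the 25-connection server-side pool.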
2. CPU Throttling
Monitoring revealed CPU throttling:
```
kubectl top pods -n production
NAME                  CPU(cores)   MEMORY(bytes)
user-service-abc123   980m         856Mi
user-service-def456   995m         892Mi
```
Services were hitting CPU limits, causing request queuing.
Solution: Adjusted resource limits and implemented HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
3. Network Latency
Inter-service communication was slow:
```
Average latency between services: 45ms
P95 latency: 120ms
P99 latency: 250ms
```
Solution: Implemented service mesh with Istio:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
    - user-service
  http:
    - timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
      route:
        - destination:
            host: user-service
            subset: v1
          weight: 100
```
Result: Reduced P95 latency to 35ms, P99 to 80ms.
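The `subset: v1` referenced above only resolves if a companion DestinationRule defines it. A minimal sketch of that resource, assuming the pods carry a `version: v1` label (the label value is an assumption, not taken from our manifests):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  subsets:
    - name: v1
      labels:
        version: v1   # assumed pod label backing the subset
```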
Scaling Strategy
Phase 1: Vertical Scaling (Week 1-2)
Upgraded node types:
```yaml
# New node configuration
nodes:
  - type: n1-standard-8   # doubled CPU/RAM per node
    count: 5
```
Impact:
- Capacity: 500 RPS → 1,500 RPS
- Cost: $600 → $1,200/month
- Downtime: 0 (rolling node replacement)
Phase 2: Horizontal Scaling (Week 3-4)
Added more nodes and implemented auto-scaling:
```yaml
# Cluster autoscaling bounds (simplified; in practice this is configured per
# node pool or via cluster-autoscaler flags rather than a single resource)
apiVersion: autoscaling.k8s.io/v1
kind: ClusterAutoscaler
metadata:
  name: cluster-autoscaler
spec:
  minNodes: 8
  maxNodes: 30
  scaleDownDelay: 10m
  scaleDownUnneededTime: 10m
```
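For reference, here is how those same bounds might map onto the upstream cluster-autoscaler when it runs in-cluster. This is a sketch only: the node-group name, cloud provider, image tag, and the RBAC/service account it needs are all assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # RBAC omitted for brevity
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.22.3   # tag is illustrative
          command:
            - ./cluster-autoscaler
            - --cloud-provider=gce
            - --nodes=8:30:general-pool          # min:max:node-group (name is illustrative)
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
```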
Impact:
- Capacity: 1,500 RPS → 5,000 RPS
- Cost: $1,200 → $2,400/month (average)
- Elasticity: Automatic scaling based on load
Phase 3: Optimization (Week 5-8)
Implemented caching and async processing:
```python
# Redis caching layer
import json
from functools import wraps

from redis import Redis

redis_client = Redis(host='redis-cluster', port=6379)

def cache_result(ttl=300):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            cache_key = f"{func.__name__}:{args}:{kwargs}"

            # Try cache first
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Execute function and cache the result for `ttl` seconds
            result = await func(*args, **kwargs)
            redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

@cache_result(ttl=600)
async def get_product_details(product_id: str):
    # db is the service's async MongoDB handle
    return await db.products.find_one({'id': product_id})
```
Impact:
- Cache hit rate: 75%
- Database load: Reduced by 60%
- Response time: Improved by 50%
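The decorator above assumes a `redis-cluster` hostname resolvable inside the cluster. A minimal sketch of how that name could be backed by an in-cluster Redis (names, image tag, and the single-replica setup are illustrative; a managed or replicated Redis works just as well):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-cluster          # matches the host used by the caching decorator
spec:
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1                  # illustrative; production should run a replicated setup
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine   # illustrative tag
          ports:
            - containerPort: 6379
```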
Final Architecture
Cluster Configuration
```yaml
# Production cluster (final): simplified view of the two node pools
apiVersion: v1
kind: Cluster
metadata:
  name: production-cluster
spec:
  nodePools:
    - name: general-pool
      type: n1-standard-8
      minNodes: 8
      maxNodes: 25
    - name: memory-intensive-pool
      type: n1-highmem-8
      minNodes: 2
      maxNodes: 8
      taints:
        - key: workload
          value: memory-intensive
          effect: NoSchedule
```
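Because the memory-intensive pool is tainted, only workloads that explicitly tolerate the taint are scheduled there. A sketch of the pod-spec additions such a workload would carry (the `pool` node label and the analytics-worker service are assumptions for illustration):

```yaml
# Pod spec fragment for a workload targeting the tainted pool
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: memory-intensive
      effect: NoSchedule
  nodeSelector:
    pool: memory-intensive-pool    # assumed label on that pool's nodes
  containers:
    - name: analytics-worker       # hypothetical memory-heavy service
      image: analytics-worker:v1.0.0
```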
Service Deployment Pattern
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
  labels:
    app: user-service
    version: v2.0.0
spec:
  replicas: 5
  selector:
    matchLabels:
      app: user-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: user-service
        version: v2.0.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - user-service
                topologyKey: kubernetes.io/hostname
      containers:
        - name: user-service
          image: user-service:v2.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "1Gi"
              cpu: "1000m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
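The anti-affinity rule spreads replicas across nodes; to keep enough of them available during node drains and cluster upgrades, a PodDisruptionBudget pairs naturally with this pattern. A minimal sketch (the minAvailable value is illustrative, not our production setting):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: user-service-pdb
spec:
  minAvailable: 3              # illustrative floor; tune per service
  selector:
    matchLabels:
      app: user-service
```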
Performance Metrics
Before vs After
| Metric | Before | After | Improvement |
|---|---|---|---|
| Max RPS | 500 | 10,000 | 20x |
| Avg Response Time | 250ms | 45ms | 82% |
| P95 Response Time | 800ms | 120ms | 85% |
| P99 Response Time | 2000ms | 300ms | 85% |
| Error Rate | 0.5% | 0.05% | 90% |
| CPU Utilization | 85% | 65% | Better headroom |
| Memory Utilization | 78% | 60% | Better headroom |
Cost Analysis
- Infrastructure: $600 → $2,800/month (average)
- Cost per 1K requests: $0.12 → $0.028
- Net effect: cost per request down 76%
Key Lessons Learned
1. Monitor Everything
We implemented comprehensive monitoring:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
```
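The relabel rule keeps only pods that opt in through annotations, so each service's pod template carries the standard prometheus.io annotations. A sketch of that fragment (the metrics path and port are assumptions about our services, and the path/port annotations only take effect if the scrape config has matching relabel rules):

```yaml
# Pod template fragment opting a service into scraping
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"   # assumed metrics endpoint
    prometheus.io/port: "8080"
```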
2. Plan for Failure
Implemented circuit breakers:
```python
import aiohttp
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_external_service(data):
    # Open the circuit after 5 consecutive failures; retry after 60s
    async with aiohttp.ClientSession() as session:
        async with session.post(
            'https://external-api.com/endpoint',
            json=data,
            timeout=aiohttp.ClientTimeout(total=3),
        ) as response:
            return await response.json()
```
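For calls between our own services, the same failure-isolation idea can also live in the mesh rather than application code: an Istio DestinationRule trafficPolicy acts as a service-level circuit breaker. A sketch for user-service with illustrative thresholds (not our tuned values), extending the subset definition shown earlier:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # illustrative queue limit
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5          # eject an endpoint after 5 consecutive 5xx responses
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
```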
3. Gradual Rollouts
Used canary deployments:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: user-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  progressDeadlineSeconds: 60
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
      - name: request-duration
        thresholdRange:
          max: 500
```
Conclusion
Scaling from 500 to 10,000 RPS required:
- Technical changes: Auto-scaling, caching, optimization
- Architectural improvements: Service mesh, connection pooling
- Operational maturity: Monitoring, alerting, gradual rollouts
The journey taught us that scaling is not just about adding more resources—it’s about building resilient, observable, and efficient systems.
Key takeaways:
- Start with monitoring and observability
- Identify bottlenecks before scaling
- Use horizontal scaling over vertical when possible
- Implement caching strategically
- Plan for failure at every level
- Automate everything
Our platform now handles 10,000+ RPS with room to grow to 50,000 RPS using the same architecture.