Scaling Microservices to 10K RPS with Kubernetes - A Production Journey
Last quarter, our platform experienced explosive growth. Traffic increased from 500 RPS to over 10,000 RPS in just 8 weeks. Here’s how we scaled our Kubernetes-based microservices architecture to handle this growth without downtime.
The Challenge
Our e-commerce platform runs on a microservices architecture with 15 services:
- User service
- Product catalog
- Order processing
- Payment gateway
- Inventory management
- Notification service
- And 9 more supporting services
The initial architecture handled 500 RPS comfortably but started showing cracks at 2,000 RPS.
Initial Architecture
Cluster Setup
```yaml
# Original cluster configuration (simplified view of the managed node pool)
apiVersion: v1
kind: Cluster
metadata:
  name: production-cluster
spec:
  nodes:
    - type: n1-standard-4
      count: 5
  version: 1.22.6
```
Resources:
- 5 nodes (4 vCPUs, 15GB RAM each)
- Total capacity: 20 vCPUs, 75GB RAM
- Cost: $600/month
Service Configuration
Typical service deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: user-service:v1.2.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```
Performance Bottlenecks
1. Database Connection Pool Exhaustion
At 2,500 RPS, we started seeing:
```
Error: connection pool exhausted
Active connections: 100/100
Wait time: 5000ms
```
Solution: Implemented connection pooling with PgBouncer:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    userdb = host=postgres-primary port=5432 dbname=users

    [pgbouncer]
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 25
    reserve_pool_size = 5
```
Result: Reduced database connections by 60%, improved response time by 40%.
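The ConfigMap above only holds the configuration; the pooler itself runs as its own Deployment and Service that applications connect to instead of Postgres directly. A minimal sketch of that wiring, assuming a community PgBouncer image and PgBouncer's default listen port of 6432 (both illustrative, not our exact manifests):

```yaml
# Illustrative PgBouncer Deployment + Service; image, mount path, and names are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
        - name: pgbouncer
          image: edoburu/pgbouncer   # one commonly used community image (placeholder)
          ports:
            - containerPort: 6432    # PgBouncer's default listen port
          volumeMounts:
            - name: config
              mountPath: /etc/pgbouncer   # pgbouncer.ini from the ConfigMap above
      volumes:
        - name: config
          configMap:
            name: pgbouncer-config
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
spec:
  selector:
    app: pgbouncer
  ports:
    - port: 6432
      targetPort: 6432
```

Services then point their database host at pgbouncer:6432 instead of postgres-primary:5432, so up to 1,000 client connections funnel into the 25-connection server-side pool.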
2. CPU Throttling
Monitoring revealed CPU throttling:
```
kubectl top pods -n production
NAME                  CPU(cores)   MEMORY(bytes)
user-service-abc123   980m         856Mi
user-service-def456   995m         892Mi
```
Services were hitting CPU limits, causing request queuing.
Solution: Adjusted resource limits and implemented HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
3. Network Latency
Inter-service communication was slow:
```
Average latency between services: 45ms
P95 latency: 120ms
P99 latency: 250ms
```
Solution: Implemented service mesh with Istio:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
    - user-service
  http:
    - timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
      route:
        - destination:
            host: user-service
            subset: v1
          weight: 100
```
Result: Reduced P95 latency to 35ms, P99 to 80ms.
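The `subset: v1` referenced above only resolves if a companion DestinationRule defines it. A minimal sketch of that resource, assuming the pods carry a `version: v1` label (the label value is an assumption, not taken from our manifests):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  subsets:
    - name: v1
      labels:
        version: v1   # assumed pod label backing the subset
```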
Scaling Strategy
Phase 1: Vertical Scaling (Week 1-2)
Upgraded node types:
```yaml
# New node configuration
nodes:
  - type: n1-standard-8   # doubled CPU/RAM per node
    count: 5
```
Impact:
- Capacity: 500 RPS → 1,500 RPS
- Cost: $600 → $1,200/month
- Downtime: 0 (rolling node replacement)
Phase 2: Horizontal Scaling (Week 3-4)
Added more nodes and implemented auto-scaling:
```yaml
# Cluster autoscaling bounds (simplified; in practice this is configured per
# node pool or via cluster-autoscaler flags rather than a single resource)
apiVersion: autoscaling.k8s.io/v1
kind: ClusterAutoscaler
metadata:
  name: cluster-autoscaler
spec:
  minNodes: 8
  maxNodes: 30
  scaleDownDelay: 10m
  scaleDownUnneededTime: 10m
```
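For reference, here is how those same bounds might map onto the upstream cluster-autoscaler when it runs in-cluster. This is a sketch only: the node-group name, cloud provider, image tag, and the RBAC/service account it needs are all assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # RBAC omitted for brevity
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.22.3   # tag is illustrative
          command:
            - ./cluster-autoscaler
            - --cloud-provider=gce
            - --nodes=8:30:general-pool          # min:max:node-group (name is illustrative)
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
```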
Impact:
- Capacity: 1,500 RPS → 5,000 RPS
- Cost: $1,200 → $2,400/month (average)
- Elasticity: Automatic scaling based on load
Phase 3: Optimization (Week 5-8)
Implemented caching and async processing:
```python
# Redis caching layer
import json
from functools import wraps

from redis import Redis

redis_client = Redis(host='redis-cluster', port=6379)

def cache_result(ttl=300):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            cache_key = f"{func.__name__}:{args}:{kwargs}"

            # Try cache first
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Execute function and cache the result for `ttl` seconds
            result = await func(*args, **kwargs)
            redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

@cache_result(ttl=600)
async def get_product_details(product_id: str):
    # db is the service's async MongoDB handle
    return await db.products.find_one({'id': product_id})
```
Impact:
- Cache hit rate: 75%
- Database load: Reduced by 60%
- Response time: Improved by 50%
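The decorator above assumes a `redis-cluster` hostname resolvable inside the cluster. A minimal sketch of how that name could be backed by an in-cluster Redis (names, image tag, and the single-replica setup are illustrative; a managed or replicated Redis works just as well):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-cluster          # matches the host used by the caching decorator
spec:
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1                  # illustrative; production should run a replicated setup
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine   # illustrative tag
          ports:
            - containerPort: 6379
```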
Final Architecture
Cluster Configuration
```yaml
# Production cluster (final): simplified view of the two node pools
apiVersion: v1
kind: Cluster
metadata:
  name: production-cluster
spec:
  nodePools:
    - name: general-pool
      type: n1-standard-8
      minNodes: 8
      maxNodes: 25
    - name: memory-intensive-pool
      type: n1-highmem-8
      minNodes: 2
      maxNodes: 8
      taints:
        - key: workload
          value: memory-intensive
          effect: NoSchedule
```
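Because the memory-intensive pool is tainted, only workloads that explicitly tolerate the taint are scheduled there. A sketch of the pod-spec additions such a workload would carry (the `pool` node label and the analytics-worker service are assumptions for illustration):

```yaml
# Pod spec fragment for a workload targeting the tainted pool
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: memory-intensive
      effect: NoSchedule
  nodeSelector:
    pool: memory-intensive-pool    # assumed label on that pool's nodes
  containers:
    - name: analytics-worker       # hypothetical memory-heavy service
      image: analytics-worker:v1.0.0
```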
Service Deployment Pattern
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
  labels:
    app: user-service
    version: v2.0.0
spec:
  replicas: 5
  selector:
    matchLabels:
      app: user-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: user-service
        version: v2.0.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - user-service
                topologyKey: kubernetes.io/hostname
      containers:
        - name: user-service
          image: user-service:v2.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "1Gi"
              cpu: "1000m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
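The anti-affinity rule spreads replicas across nodes; to keep enough of them available during node drains and cluster upgrades, a PodDisruptionBudget pairs naturally with this pattern. A minimal sketch (the minAvailable value is illustrative, not our production setting):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: user-service-pdb
spec:
  minAvailable: 3              # illustrative floor; tune per service
  selector:
    matchLabels:
      app: user-service
```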
Performance Metrics
Before vs After
| Metric | Before | After | Improvement |
|---|---|---|---|
| Max RPS | 500 | 10,000 | 20x |
| Avg Response Time | 250ms | 45ms | 82% |
| P95 Response Time | 800ms | 120ms | 85% |
| P99 Response Time | 2000ms | 300ms | 85% |
| Error Rate | 0.5% | 0.05% | 90% |
| CPU Utilization | 85% | 65% | Better headroom |
| Memory Utilization | 78% | 60% | Better headroom |
Cost Analysis
- Infrastructure: $600 → $2,800/month (average)
- Cost per 1K requests: $0.12 → $0.028
- Net effect: cost per request down 76%
Key Lessons Learned
1. Monitor Everything
We implemented comprehensive monitoring:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
```
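The relabel rule keeps only pods that opt in through annotations, so each service's pod template carries the standard prometheus.io annotations. A sketch of that fragment (the metrics path and port are assumptions about our services, and the path/port annotations only take effect if the scrape config has matching relabel rules):

```yaml
# Pod template fragment opting a service into scraping
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"   # assumed metrics endpoint
    prometheus.io/port: "8080"
```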
2. Plan for Failure
Implemented circuit breakers:
```python
import aiohttp
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_external_service(data):
    # Open the circuit after 5 consecutive failures; retry after 60s
    async with aiohttp.ClientSession() as session:
        async with session.post(
            'https://external-api.com/endpoint',
            json=data,
            timeout=aiohttp.ClientTimeout(total=3),
        ) as response:
            return await response.json()
```
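For calls between our own services, the same failure-isolation idea can also live in the mesh rather than application code: an Istio DestinationRule trafficPolicy acts as a service-level circuit breaker. A sketch for user-service with illustrative thresholds (not our tuned values), extending the subset definition shown earlier:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # illustrative queue limit
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5          # eject an endpoint after 5 consecutive 5xx responses
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
```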
3. Gradual Rollouts
Used canary deployments:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: user-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  progressDeadlineSeconds: 60
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
      - name: request-duration
        thresholdRange:
          max: 500
```
Conclusion
Scaling from 500 to 10,000 RPS required:
- Technical changes: Auto-scaling, caching, optimization
- Architectural improvements: Service mesh, connection pooling
- Operational maturity: Monitoring, alerting, gradual rollouts
The journey taught us that scaling is not just about adding more resources—it’s about building resilient, observable, and efficient systems.
Key takeaways:
- Start with monitoring and observability
- Identify bottlenecks before scaling
- Use horizontal scaling over vertical when possible
- Implement caching strategically
- Plan for failure at every level
- Automate everything
Our platform now handles 10,000+ RPS with room to grow to 50,000 RPS using the same architecture.