Last quarter, our platform experienced explosive growth. Traffic increased from 500 RPS to over 10,000 RPS in just 8 weeks. Here’s how we scaled our Kubernetes-based microservices architecture to handle this growth without downtime.

The Challenge

Our e-commerce platform runs on a microservices architecture with 15 services:

  • User service
  • Product catalog
  • Order processing
  • Payment gateway
  • Inventory management
  • Notification service
  • And 9 more supporting services

The initial architecture handled 500 RPS comfortably but started showing cracks around 2,000 RPS.

Initial Architecture

Cluster Setup

# Original cluster configuration
apiVersion: v1
kind: Cluster
metadata:
  name: production-cluster
spec:
  nodes:
    - type: n1-standard-4
      count: 5
  version: 1.22.6

Resources:

  • 5 nodes (4 vCPUs, 15GB RAM each)
  • Total capacity: 20 vCPUs, 75GB RAM
  • Cost: $600/month

Service Configuration

Typical service deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: user-service
        image: user-service:v1.2.0
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

Performance Bottlenecks

1. Database Connection Pool Exhaustion

At 2,500 RPS, we started seeing:

Error: connection pool exhausted
Active connections: 100/100
Wait time: 5000ms

Solution: Implemented connection pooling with PgBouncer:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    userdb = host=postgres-primary port=5432 dbname=users
    
    [pgbouncer]
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 25
    reserve_pool_size = 5

Result: Reduced database connections by 60%, improved response time by 40%.
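
On the application side, services were pointed at PgBouncer rather than at Postgres directly. A minimal sketch of that wiring, assuming PgBouncer runs behind a pgbouncer Service in the same namespace (the env var names are illustrative):

# Excerpt from a service Deployment: route database traffic through PgBouncer
env:
- name: DB_HOST
  value: "pgbouncer"   # the PgBouncer Service, not postgres-primary
- name: DB_PORT
  value: "6432"        # PgBouncer's default listen port
- name: DB_NAME
  value: "userdb"      # matches the [databases] alias in pgbouncer.ini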

2. CPU Throttling

Monitoring revealed CPU throttling:

kubectl top pods -n production

NAME                    CPU(cores)   MEMORY(bytes)
user-service-abc123     980m         856Mi
user-service-def456     995m         892Mi

Services were hitting CPU limits, causing request queuing.
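
kubectl top only shows usage near the limit; the throttling itself is visible in the cAdvisor CFS metrics. A PromQL sketch for spotting throttled pods (assuming cAdvisor metrics are scraped, as in the Prometheus setup shown later):

# Fraction of CPU periods in which containers were throttled, per pod (5m window)
sum(rate(container_cpu_cfs_throttled_periods_total{namespace="production"}[5m])) by (pod)
  /
sum(rate(container_cpu_cfs_periods_total{namespace="production"}[5m])) by (pod)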

Solution: Adjusted resource limits and implemented HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

3. Network Latency

Inter-service communication was slow:

Average latency between services: 45ms
P95 latency: 120ms
P99 latency: 250ms

Solution: Implemented service mesh with Istio:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
  - user-service
  http:
  - timeout: 3s
    retries:
      attempts: 3
      perTryTimeout: 1s
    route:
    - destination:
        host: user-service
        subset: v1
      weight: 100

Result: Reduced P95 latency to 35ms, P99 to 80ms.
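
The subset: v1 referenced above only resolves if a matching DestinationRule exists; a minimal sketch of the rule that pairs with it:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  subsets:
  - name: v1
    labels:
      version: v1    # must match the version label on the user-service pods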

Scaling Strategy

Phase 1: Vertical Scaling (Weeks 1-2)

Upgraded node types:

# New node configuration
nodes:
  - type: n1-standard-8  # Doubled CPU/RAM
    count: 5

Impact:

  • Capacity: 500 RPS → 1,500 RPS
  • Cost: $600 → $1,200/month
  • Downtime: 0 (rolling node replacement, sketched below)
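
The node swap itself followed the standard pattern: bring up a pool with the larger machine type, drain the old nodes so pods reschedule onto it, then remove the old pool. Roughly, assuming GKE and illustrative pool names:

# Bring up the bigger pool, then drain the old nodes one by one
gcloud container node-pools create standard-8-pool \
  --cluster production-cluster --machine-type n1-standard-8 --num-nodes 5

# For each node in the old pool:
kubectl cordon <old-node-name>
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data

# Once the old pool is empty, remove it
gcloud container node-pools delete default-pool --cluster production-cluster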

Phase 2: Horizontal Scaling (Weeks 3-4)

Added more nodes and implemented auto-scaling:

apiVersion: autoscaling.k8s.io/v1
kind: ClusterAutoscaler
metadata:
  name: cluster-autoscaler
spec:
  minNodes: 8
  maxNodes: 30
  scaleDownDelay: 10m
  scaleDownUnneededTime: 10m

Impact:

  • Capacity: 1,500 RPS → 5,000 RPS
  • Cost: $1,200 → $2,400/month (average)
  • Elasticity: Automatic scaling based on load
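
The ClusterAutoscaler manifest above is a simplified view of those settings; on GKE the equivalent is node pool autoscaling, enabled per pool roughly like this (flags assume GKE, pool name as in the final architecture):

gcloud container clusters update production-cluster \
  --node-pool general-pool \
  --enable-autoscaling --min-nodes 8 --max-nodes 30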

Phase 3: Optimization (Weeks 5-8)

Implemented caching and async processing:

# Redis caching layer
import json

from redis import Redis
from functools import wraps

redis_client = Redis(host='redis-cluster', port=6379)

def cache_result(ttl=300):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            cache_key = f"{func.__name__}:{args}:{kwargs}"
            
            # Try cache first
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)
            
            # Execute function
            result = await func(*args, **kwargs)
            
            # Cache result
            redis_client.setex(
                cache_key,
                ttl,
                json.dumps(result)
            )
            
            return result
        return wrapper
    return decorator

@cache_result(ttl=600)
async def get_product_details(product_id: str):
    return await db.products.find_one({'id': product_id})
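
The async-processing half of this phase moved non-critical work, such as notification sends, off the request path. A minimal sketch of the pattern using the same Redis cluster as a work queue (the queue name and the send_notification helper are illustrative):

# Simple Redis-backed work queue (sketch)
import json
from redis import Redis

redis_client = Redis(host='redis-cluster', port=6379)

def enqueue_notification(order_id: str, user_id: str):
    # Producer: called from the request path; returns immediately
    redis_client.lpush('notification-queue', json.dumps({
        'order_id': order_id,
        'user_id': user_id,
    }))

def run_notification_worker():
    # Consumer: runs in a separate worker deployment and drains the queue
    while True:
        item = redis_client.brpop('notification-queue', timeout=5)
        if item is None:
            continue
        _queue, payload = item
        job = json.loads(payload)
        send_notification(job['user_id'], job['order_id'])  # hypothetical helper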

Impact:

  • Cache hit rate: 75%
  • Database load: Reduced by 60%
  • Response time: Improved by 50%

Final Architecture

Cluster Configuration

# Production cluster (final)
apiVersion: v1
kind: Cluster
metadata:
  name: production-cluster
spec:
  nodePools:
    - name: general-pool
      type: n1-standard-8
      minNodes: 8
      maxNodes: 25
      
    - name: memory-intensive-pool
      type: n1-highmem-8
      minNodes: 2
      maxNodes: 8
      taints:
        - key: workload
          value: memory-intensive
          effect: NoSchedule
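
Workloads intended for the tainted pool opt in with a matching toleration and node selector; an illustrative pod spec excerpt for one of the memory-heavy services (the node-pool label key is GKE's, assumed here):

# Pod spec excerpt for a memory-intensive service
tolerations:
- key: workload
  value: memory-intensive
  effect: NoSchedule
nodeSelector:
  cloud.google.com/gke-nodepool: memory-intensive-pool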

Service Deployment Pattern

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
  labels:
    app: user-service
    version: v2.0.0
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: user-service
        version: v2.0.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - user-service
              topologyKey: kubernetes.io/hostname
      containers:
      - name: user-service
        image: user-service:v2.0.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "1Gi"
            cpu: "1000m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

Performance Metrics

Before vs After

Metric                 Before     After      Improvement
Max RPS                500        10,000     20x
Avg Response Time      250ms      45ms       82%
P95 Response Time      800ms      120ms      85%
P99 Response Time      2,000ms    300ms      85%
Error Rate             0.5%       0.05%      90%
CPU Utilization        85%        65%        Better headroom
Memory Utilization     78%        60%        Better headroom

Cost Analysis

  • Infrastructure: $600 → $2,800/month (average)
  • Cost per 1K requests: $0.12 → $0.028
  • Net effect: 76% reduction in per-request cost

Key Lessons Learned

1. Monitor Everything

We implemented comprehensive monitoring:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
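
With that relabel rule, only pods that opt in get scraped. Each service's pod template carries annotations along these lines (the port and path annotations follow the common convention and would need their own relabel rules, assumed here):

# Pod template metadata for a scrapeable service
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"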

2. Plan for Failure

Implemented circuit breakers:

import aiohttp
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_external_service(data):
    async with aiohttp.ClientSession() as session:
        async with session.post(
            'https://external-api.com/endpoint',
            json=data,
            timeout=aiohttp.ClientTimeout(total=3)
        ) as response:
            return await response.json()

3. Gradual Rollouts

Used canary deployments:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: user-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  progressDeadlineSeconds: 60
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
    - name: request-duration
      thresholdRange:
        max: 500

Conclusion

Scaling from 500 to 10,000 RPS required:

  • Technical changes: Auto-scaling, caching, optimization
  • Architectural improvements: Service mesh, connection pooling
  • Operational maturity: Monitoring, alerting, gradual rollouts

The journey taught us that scaling is not just about adding more resources—it’s about building resilient, observable, and efficient systems.

Key takeaways:

  1. Start with monitoring and observability
  2. Identify bottlenecks before scaling
  3. Use horizontal scaling over vertical when possible
  4. Implement caching strategically
  5. Plan for failure at every level
  6. Automate everything

Our platform now handles 10,000+ RPS with room to grow to 50,000 RPS using the same architecture.