We had 30 microservices. Monitoring required instrumenting each service. Adding metrics meant code changes, testing, deployment. It took weeks.

I deployed Istio service mesh. Got traffic metrics, tracing, and service graph for all services instantly. Zero code changes. Complete observability in one day.

The Problem

Traditional monitoring:

  • Instrument each service manually
  • Different libraries for different languages
  • Inconsistent metrics
  • No automatic service dependencies
  • Code changes for every new metric

We needed something better.

Istio Overview

Service mesh provides:

  • Traffic management: Routing, load balancing
  • Security: mTLS, authorization
  • Observability: Metrics, logs, traces

All without changing application code!

Installing Istio

Download:

curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.20.0 sh -
cd istio-1.20.0
export PATH=$PWD/bin:$PATH

Install:

istioctl install --set profile=demo -y
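
A quick sanity check that the control plane came up:

kubectl get pods -n istio-system
istioctl version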

Enable sidecar injection:

kubectl label namespace default istio-injection=enabled
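
The label only affects pods created after it is set. Anything already running needs a restart to pick up the sidecar:

kubectl rollout restart deployment deployment-name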

Automatic Metrics

Deploy app:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: web-app:latest
        ports:
        - containerPort: 8080

Istio automatically injects the Envoy sidecar and starts collecting metrics. The manifest above is completely standard Kubernetes.
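
Each pod should now show two containers (READY 2/2): the app plus the injected istio-proxy.

kubectl get pods -l app=web-app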

Available Metrics

Istio provides:

Request metrics:

istio_requests_total
istio_request_duration_milliseconds
istio_request_bytes
istio_response_bytes

TCP metrics:

istio_tcp_sent_bytes_total
istio_tcp_received_bytes_total
istio_tcp_connections_opened_total
istio_tcp_connections_closed_total
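
One caveat: every request is reported twice, once by the client sidecar and once by the server sidecar. Filter on the reporter label to avoid double counting:

rate(istio_requests_total{reporter="destination"}[5m])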

Prometheus Queries
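
These queries assume Prometheus is scraping the mesh. The Istio release bundles a demo-grade instance as an addon:

kubectl apply -f samples/addons/prometheus.yaml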

Request rate:

rate(istio_requests_total{destination_service="web-app.default.svc.cluster.local"}[5m])

Error rate:

rate(istio_requests_total{destination_service="web-app.default.svc.cluster.local",response_code=~"5.."}[5m])

Latency p95:

histogram_quantile(0.95,
  sum(rate(istio_request_duration_milliseconds_bucket{destination_service="web-app.default.svc.cluster.local"}[5m])) by (le)
)

Success rate:

sum(rate(istio_requests_total{destination_service="web-app.default.svc.cluster.local",response_code!~"5.."}[5m]))
/
sum(rate(istio_requests_total{destination_service="web-app.default.svc.cluster.local"}[5m]))
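
If you chart these constantly, precompute them. A sketch of Prometheus recording rules for the rate and success-rate queries above (the rule names are my own convention):

groups:
- name: istio-sli
  rules:
  - record: service:istio_requests:rate5m
    expr: sum(rate(istio_requests_total[5m])) by (destination_service)
  - record: service:istio_request_success:ratio5m
    expr: |
      sum(rate(istio_requests_total{response_code!~"5.."}[5m])) by (destination_service)
      /
      sum(rate(istio_requests_total[5m])) by (destination_service)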

Service Dependencies

Automatic service graph:

# Incoming traffic
sum(rate(istio_requests_total{destination_service="web-app.default.svc.cluster.local"}[5m])) by (source_app)

# Outgoing traffic
sum(rate(istio_requests_total{source_app="web-app"}[5m])) by (destination_service)

Kiali Dashboard

Visualize service mesh:

kubectl apply -f samples/addons/kiali.yaml
kubectl port-forward svc/kiali 20001:20001 -n istio-system

Access: http://localhost:20001
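
Or let istioctl handle the port-forward for you:

istioctl dashboard kiali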

Shows:

  • Service topology
  • Traffic flow
  • Health status
  • Request rates
  • Error rates

Grafana Dashboards

Istio includes pre-built dashboards:

kubectl apply -f samples/addons/grafana.yaml
kubectl port-forward svc/grafana 3000:3000 -n istio-system

Dashboards:

  • Istio Mesh Dashboard
  • Istio Service Dashboard
  • Istio Workload Dashboard
  • Istio Performance Dashboard

Distributed Tracing

Istio integrates with Jaeger:

kubectl apply -f samples/addons/jaeger.yaml
istioctl dashboard jaeger

Envoy generates spans automatically, but traces only stitch together across hops if your services forward the trace context headers on outbound requests. That is the one caveat to "zero code changes".
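
With the default Zipkin/B3 setup, these are the headers to copy from each incoming request onto outgoing calls:

x-request-id
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags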

Custom Metrics

You can still customize metrics without code changes. The Mixer-based metric and prometheus CRDs shown in older guides were removed in Istio 1.8; current releases use the Telemetry API instead. A sketch that adds extra dimensions to the standard request metrics, assuming the default prometheus provider:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-tags
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_COUNT
        mode: CLIENT_AND_SERVER
      tagOverrides:
        request_host:
          value: request.host
        destination_port:
          value: string(destination.port)

The new tags appear as labels on istio_requests_total. Depending on your Istio version, tag names may also need to be listed under meshConfig.defaultConfig.extraStatTags.
Traffic Splitting Metrics

Monitor canary deployments:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
  - web-app
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: web-app
        subset: v2
  - route:
    - destination:
        host: web-app
        subset: v1
      weight: 90
    - destination:
        host: web-app
        subset: v2
      weight: 10
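
The v1 and v2 subsets must be defined in a DestinationRule, which the snippet above omits. A minimal one, assuming the pods carry version: v1 and version: v2 labels:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2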

Query canary metrics:

# v1 traffic
rate(istio_requests_total{destination_version="v1"}[5m])

# v2 traffic
rate(istio_requests_total{destination_version="v2"}[5m])

# v2 error rate (sum both sides so the division isn't matched per response_code)
sum(rate(istio_requests_total{destination_version="v2",response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{destination_version="v2"}[5m]))

Alerting Rules

Prometheus alerts for Istio:

groups:
- name: istio
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
      /
      sum(rate(istio_requests_total[5m])) by (destination_service)
      > 0.05
    for: 5m
    annotations:
      summary: "High error rate for {{ $labels.destination_service }}"
  
  - alert: HighLatency
    expr: |
      histogram_quantile(0.95,
        sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (destination_service, le)
      ) > 1000
    for: 5m
    annotations:
      summary: "High latency for {{ $labels.destination_service }}"

mTLS Monitoring

Monitor mutual TLS:

# mTLS connections
sum(rate(istio_requests_total{connection_security_policy="mutual_tls"}[5m]))

# Non-mTLS connections
sum(rate(istio_requests_total{connection_security_policy!="mutual_tls"}[5m]))
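
A single ratio is easier to track over time:

# Share of traffic on mTLS
sum(rate(istio_requests_total{connection_security_policy="mutual_tls"}[5m]))
/
sum(rate(istio_requests_total[5m]))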

Resource Usage

Monitor Envoy sidecar resources:

# CPU usage
rate(container_cpu_usage_seconds_total{container="istio-proxy"}[5m])

# Memory usage
container_memory_working_set_bytes{container="istio-proxy"}

# Network I/O (network stats are pod-level, not per-container)
rate(container_network_transmit_bytes_total{pod=~"web-app-.*"}[5m])

Performance Impact

Measure Istio overhead:

# Latency added by the sidecar: mesh-reported p95 minus the app's own p95
# (assumes the app itself exports http_request_duration_seconds)
histogram_quantile(0.95,
  sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le)
)
-
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) * 1000

Typical overhead: 1-3ms

Troubleshooting

High latency:

# Check Envoy stats (newer distroless proxy images lack curl;
# run "pilot-agent request GET stats" in the container instead)
kubectl exec -it pod-name -c istio-proxy -- curl localhost:15000/stats

# Check config
istioctl proxy-config cluster pod-name

Missing metrics:

# Verify sidecar injection (should list the app container and istio-proxy)
kubectl get pod pod-name -o jsonpath='{.spec.containers[*].name}'

# Check Prometheus targets
kubectl port-forward svc/prometheus 9090:9090 -n istio-system
# Visit http://localhost:9090/targets
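
istioctl also ships a configuration linter that catches common problems, including missing sidecar injection:

istioctl analyze -n default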

Real-World Dashboard

Our production Grafana dashboard (trimmed to the panel queries):

{
  "panels": [
    {
      "title": "Request Rate",
      "targets": [{
        "expr": "sum(rate(istio_requests_total[5m])) by (destination_service)"
      }]
    },
    {
      "title": "Error Rate",
      "targets": [{
        "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) by (destination_service)"
      }]
    },
    {
      "title": "P95 Latency",
      "targets": [{
        "expr": "histogram_quantile(0.95, rate(istio_request_duration_milliseconds_bucket[5m]))"
      }]
    },
    {
      "title": "Service Dependencies",
      "type": "graph",
      "targets": [{
        "expr": "sum(rate(istio_requests_total[5m])) by (source_app, destination_service)"
      }]
    }
  ]
}

Results

Before:

  • Manual instrumentation
  • Weeks to add metrics
  • Inconsistent across services
  • No service dependencies

After:

  • Automatic metrics
  • Zero code changes
  • Consistent metrics
  • Complete service graph
  • Deployed in 1 day

Lessons Learned

  1. Start with Istio early - Easier than retrofitting
  2. Monitor sidecar overhead - Usually minimal
  3. Use Kiali - Visual service mesh understanding
  4. Leverage built-in dashboards - Don’t reinvent
  5. Combine with app metrics - Istio + custom metrics

Conclusion

Istio service mesh provides complete observability without code changes. Essential for microservices at scale.

Key takeaways:

  1. Automatic traffic metrics
  2. Distributed tracing
  3. Service dependency graph
  4. Zero application changes
  5. Consistent observability

Deploy Istio. Get instant observability for all your services.