Production AI Agents: Lessons from Running 10 Agents at Scale
Deployed 10 AI agents serving 100K users/day - monitoring, error handling, cost optimization, and scaling strategies that actually work
13 posts
Setting up comprehensive monitoring and alerting for production systems using Prometheus, Grafana, and Alertmanager.
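As a rough illustration of the metrics-collection side, here is a minimal sketch of instrumenting a Python service with prometheus_client so Prometheus can scrape it; the metric names and port are placeholders, not the setup from the post.

```python
# Minimal sketch: expose app metrics for Prometheus to scrape.
# Metric names and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS_TOTAL = Counter(
    "app_requests_total", "Total requests handled", ["method", "status"]
)
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently being processed")

def handle_request() -> None:
    """Simulate a request and record metrics around it."""
    IN_FLIGHT.inc()
    try:
        time.sleep(random.uniform(0.01, 0.05))  # pretend to do work
        REQUESTS_TOTAL.labels(method="GET", status="200").inc()
    finally:
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()
```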
Implementing distributed tracing and metrics collection across microservices using OpenTelemetry, Jaeger, and Prometheus.
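For the tracing side, a hedged sketch of wiring the OpenTelemetry Python SDK to export spans over OTLP (which Jaeger and the OpenTelemetry Collector can ingest); the service name and endpoint are assumptions for illustration, not the stack described in the post.

```python
# Sketch: configure the OpenTelemetry SDK to export spans via OTLP/gRPC.
# Service name and collector endpoint are placeholders.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def process_order(order_id: str) -> None:
    # Each unit of work gets its own span; attributes show up in Jaeger.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...

process_order("demo-123")
```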
Building a complete monitoring and alerting stack with Prometheus and Grafana for microservices architecture.
Built advanced Grafana dashboards - variables, annotations, alerts, and custom panels. Reduced MTTR from 30 minutes to 5.
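Annotations in particular pair well with deploys. Below is a hedged sketch of pushing a deploy annotation through Grafana's HTTP API; the Grafana URL, token, and values are placeholders rather than the dashboards described in the post.

```python
# Hedged sketch: post a deploy annotation to Grafana's /api/annotations
# endpoint so it appears on dashboards. URL and token are placeholders.
import time

import requests

GRAFANA_URL = "http://grafana.example.com"  # placeholder
API_TOKEN = "REDACTED"                      # placeholder service-account token

def annotate_deploy(text: str, tags: list[str]) -> None:
    resp = requests.post(
        f"{GRAFANA_URL}/api/annotations",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "time": int(time.time() * 1000),  # epoch milliseconds
            "tags": tags,
            "text": text,
        },
        timeout=5,
    )
    resp.raise_for_status()

annotate_deploy("Deployed v1.4.2 of the API", ["deploy", "api"])
```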
Fixed a Prometheus high-cardinality issue - reduced time series from 10M to 100K (a 99% reduction) and improved query performance 50x.
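The core of a fix like this is keeping label values bounded. A small illustration, with hypothetical metric names:

```python
# Illustrates the cardinality trade-off: labels with unbounded values
# (user IDs, raw paths) explode the number of time series, while a small
# fixed label set keeps it manageable. Names are illustrative.
from prometheus_client import Counter

# Anti-pattern: one series per user -> millions of series over time.
# requests_by_user = Counter("requests_total", "Requests", ["user_id"])

# Better: bounded label values only (a handful of endpoints and status classes).
REQUESTS = Counter(
    "http_requests_total", "HTTP requests", ["endpoint", "status_class"]
)

def record(endpoint: str, status: int) -> None:
    # Collapse raw status codes into a few classes to cap label cardinality.
    REQUESTS.labels(endpoint=endpoint, status_class=f"{status // 100}xx").inc()

record("/api/orders", 200)
record("/api/orders", 503)
```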
Monitoring Istio service mesh - traffic metrics, distributed tracing, service dependencies, and debugging microservices with zero code changes
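Because Istio's sidecars emit standard metrics such as istio_requests_total, traffic data can be pulled straight from Prometheus' HTTP API. A sketch, assuming a reachable Prometheus that scrapes the default Istio metrics:

```python
# Sketch: query Istio's standard traffic metric via the Prometheus HTTP API.
# The Prometheus URL is a placeholder; the query assumes default Istio metrics.
import requests

PROMETHEUS_URL = "http://prometheus.example.com"  # placeholder

query = (
    'sum(rate(istio_requests_total{reporter="destination"}[5m])) '
    "by (destination_service, response_code)"
)
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels, (_, value) = series["metric"], series["value"]
    print(f'{labels.get("destination_service")} {labels.get("response_code")}: {value} req/s')
```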
Building highly available Prometheus setup with Thanos - unlimited retention, global queries, and surviving datacenter failures
Configuring Alertmanager for production - routing rules, inhibition, silencing, and integrating with Slack and PagerDuty
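Silences can also be created programmatically during maintenance windows. A hedged sketch against Alertmanager's v2 API, with a placeholder URL and alert name:

```python
# Hedged sketch: create a silence through Alertmanager's v2 API for a
# planned maintenance window. URL, alert name, and creator are placeholders.
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER_URL = "http://alertmanager.example.com"  # placeholder

def silence_alert(alertname: str, hours: int, comment: str) -> str:
    now = datetime.now(timezone.utc)
    payload = {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "oncall@example.com",
        "comment": comment,
    }
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()["silenceID"]

print(silence_alert("HighErrorRate", hours=2, comment="planned node maintenance"))
```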
How we set up Prometheus and Grafana for monitoring our microservices architecture.
Setting up Elasticsearch, Logstash, and Kibana for centralized logging - collecting logs from 15 microservices and making them searchable
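Logs are far easier to search once they arrive as structured JSON rather than free text. A minimal sketch of a JSON log formatter that Logstash or Filebeat can pick up from stdout; the field and service names are illustrative.

```python
# Minimal sketch: emit structured JSON logs to stdout so the log shipper can
# index fields without grok patterns. Field names are illustrative.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "orders-service",  # placeholder service name
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders").info("order created")
```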
Creating effective Grafana dashboards with Prometheus - from basic graphs to advanced alerting and team dashboards
Setting up Prometheus to monitor 5 microservices - metrics collection, alerting, and our first production incident caught by monitoring
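Alerting on latency usually means exposing a histogram that alert rules can run histogram_quantile() over. A sketch with illustrative metric names and buckets:

```python
# Sketch: a latency histogram suitable for percentile-based alerting.
# Metric name, buckets, and port are illustrative assumptions.
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["endpoint"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

def handle(endpoint: str) -> None:
    # time() records the block's duration into the histogram buckets.
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.3))  # simulated work

if __name__ == "__main__":
    start_http_server(8000)  # scraped at /metrics
    while True:
        handle("/api/orders")
```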