Tag Archives: SRE

Fixing Unbalanced Shards in Elasticsearch: Why One Node Holds All Your Data

When running Elasticsearch in production, a common issue is shard imbalance. One node in the cluster may be out of disk space while a few other nodes have no shards on them at all. For example, here is a node holding all the shards: Node Shards Disk Used Disk % Free Space PESD222 957 329.1 GB 32% 694.2...

by Chetan Singh
Tag: SRE

15-Mar-2026

DevOps

Rolling Node Replacement: The Safest Way to Upgrade Kubernetes

Introduction What if upgrading your Kubernetes cluster required no downtime at all? Imagine if you could upgrade your Kubernetes cluster and keep everything running smoothly, with zero downtime. Sounds pretty great, right? A lot of teams worry that upgrading will mean their apps go offline, but with solid planning, it's actually...

by Chetan Singh
Tag: SRE

15-Mar-2026

DevOps

How to Centralize AWS Monitoring: A Guide to CloudWatch Cross-Account Metrics

It is painfully inefficient to check metrics across a large collection of AWS accounts (development, staging, UAT, production, etc.). This is a major time waster, not just a small irritation. In addition to wasting valuable engineering time, you run a much higher risk of missing an alert that could result in a full-blown outage every time...

by Rahul Singh
Tag: SRE

12-Mar-2026

MSP

Chaos Engineering: Simulating Network Latency using AWS FIS

Introduction Modern applications are distributed systems consisting of multiple services, containers, and infrastructure components. While this architecture improves scalability, security, and reliability, it also increases the chances of unexpected failures and downtime. Application testing methods mostly focus on application functionality, but...

by Rauf Khan
Tag: SRE

10-Mar-2026

MSP

Agentic AI in SRE: Rethinking Reliability in the Age of Autonomous Systems

Introduction For years, Site Reliability Engineering (SRE) has been built around a simple mission: keep systems reliable at scale. We measure SLOs, manage error budgets, write runbooks, respond to incidents, and automate toil wherever possible. But even with automation, most SRE work remains fundamentally reactive: Alerts wake us...

by Aasim Zaidi
Tag: SRE

09-Mar-2026

MSP

Agentic AI in SRE: Rethinking Reliability in the Age of Autonomous Systems

by Aasim Zaidi
Tag: SRE

20-Feb-2026

DevOps

I Left This AWS Task Half-Done for 2 Weeks – Here’s What It Taught Me

Introduction When you work with AWS infrastructure for some time, you realise that not all problems announce themselves with alerts or outages. Some problems stay quiet, blend into the background, and only reveal themselves later, usually when someone asks a question you can’t answer clearly. This is one such experience from my early...

by Vivek Tiwary
Tag: SRE

15-Feb-2026

DevOps

DevOps Is Not a One-Time Setup: First-Year Lessons from the Field

Introduction When teams start on their DevOps journey, the excitement is real. CI/CD pipelines, faster deployments, cloud-native tools, automation everywhere - it feels like everything is finally going to be smooth. But in reality, the first year of DevOps is rarely smooth. It’s messy, experimental, and full of learning...

by Karandeep Singh
Tag: SRE

14-Jan-2026

DevOps

Paying to Ping? We Switched to Uptime Kuma and Saved Big

Introduction We used to rely on Pingdom for uptime monitoring. It worked well: simple checks, a nice UI, and clean, reliable alerts. But one day, someone on our DevOps team casually said: "Hey, why are we paying for something that only pings URLs?" And that kicked off a big conversation. The Cost Wake-Up Pingdom wasn’t...

by Karandeep Singh
Tag: SRE

07-Aug-2025

Blogs

Tips for writing a blog

Learn how to write a caption