DevOps

How to Centralize AWS Monitoring: A Guide to CloudWatch Cross-Account Metrics

Checking metrics across a large collection of AWS accounts (development, staging, UAT, production, etc.) is painfully inefficient. This is a major time waster, not just a small irritation. In addition to wasting valuable engineering time, you run a much higher risk of missing an alert that could result in a full-blown outage every time...

by Rahul Singh
Tag: SRE
12-Mar-2026

MSP

Chaos Engineering: Simulating Network Latency using AWS FIS

Modern applications are distributed systems consisting of multiple services, containers, and infrastructure components. While this improves scalability, security, and reliability, it also increases the chances of unexpected failures and downtime. Application testing methods focus mainly on application functionality, but...

by Rauf Khan
Tag: SRE
10-Mar-2026

MSP

Agentic AI in SRE: Rethinking Reliability in the Age of Autonomous Systems

For years, Site Reliability Engineering (SRE) has been built around a simple mission: keep systems reliable at scale. We measure SLOs, manage error budgets, write runbooks, respond to incidents, and automate toil wherever possible. But even with automation, most SRE work remains fundamentally reactive: alerts wake us...

by Aasim Zaidi
Tag: SRE
09-Mar-2026

DevOps

I Left This AWS Task Half-Done for 2 Weeks – Here’s What It Taught Me

When you work with AWS infrastructure for some time, you realise that not all problems announce themselves with alerts or outages. Some problems stay quiet, blend into the background, and only reveal themselves later, usually when someone asks a question you can’t answer clearly. This is one such experience from my early...

by Vivek Tiwary
Tag: SRE
15-Feb-2026

DevOps

DevOps Is Not a One-Time Setup: First-Year Lessons from the Field

When teams start on their DevOps journey, the excitement is real. CI/CD pipelines, faster deployments, cloud-native tools, automation everywhere - it feels like everything is finally going to be smooth. But in reality, the first year of DevOps is rarely smooth. It’s messy, experimental, and full of learning. ...

by Karandeep Singh
Tag: SRE
14-Jan-2026

DevOps

Paying to Ping? We Switched to Uptime Kuma and Saved Big

We used to rely on Pingdom for uptime monitoring. It worked well: simple checks, a nice UI, and clean, reliable alerts. But one day, someone on our DevOps team casually said: "Hey, why are we paying for something that only pings URLs?" And that kicked off a big conversation. The Cost Wake-Up: Pingdom wasn’t...

by Karandeep Singh
Tag: SRE
07-Aug-2025