Read SRE blog posts at TO THE NEW Blog

Tag Archives: SRE

From Zero to Hundreds: Onboarding Your Entire AWS Fleet to Centralized CloudWatch in Under an Hour

If you’ve ever had to jump between six different AWS accounts just to figure out why one Lambda function is behaving oddly – you already know the pain. Multi-account AWS environments are great for security and governance, but they can turn basic monitoring into a logistical nightmare. The good news? AWS gives you everything you […]

Rahul Singh April 20, 2026

DevOps

Fixing Unbalanced Shards in Elasticsearch: Why One Node Holds All Your Data

When we run Elasticsearch in production, one of the common issues is an imbalance in shards. One node in the cluster may be out of disk space while a few other nodes hold no shards at all. For example, here is a node with all the shards: Node Shards Disk Used Disk % Free […]

Chetan Singh March 15, 2026

DevOps

Rolling Node Replacement: The Safest Way to Upgrade Kubernetes

Introduction What if upgrading your Kubernetes cluster required no downtime at all? Imagine if you could upgrade your Kubernetes cluster and keep everything running smoothly, with zero downtime. Sounds pretty great, right? A lot of teams worry that upgrading will mean their apps go offline, but with solid planning, it’s actually possible to have safe […]

Chetan Singh March 15, 2026

DevOps

How to Centralize AWS Monitoring: A Guide to CloudWatch Cross-Account Metrics

It is painfully inefficient to check metrics across a large collection of AWS accounts (development, staging, UAT, production, etc.). This is a major time waster, not just a small irritation. In addition to wasting valuable engineering time, you run a much higher risk of missing an alert that could result in a full-blown outage every […]

Rahul Singh March 12, 2026

MSP

Chaos Engineering: Simulating Network Latency using AWS FIS

Introduction Modern applications are distributed systems consisting of multiple services, containers, and infrastructure components. While this improves scalability, security and reliability, it also increases the chances of unexpected failures and downtime. Application testing methods mainly focus on application functionality, but they rarely test how systems behave during real-world failures such as instance crashes, network latency, […]

Rauf Khan March 10, 2026

MSP

Agentic AI in SRE: Rethinking Reliability in the Age of Autonomous Systems

Introduction For years, Site Reliability Engineering (SRE) has been built around a simple mission: keep systems reliable at scale. We measure SLOs, manage error budgets, write runbooks, respond to incidents, and automate toil wherever possible. But even with automation, most SRE work remains fundamentally reactive: Alerts wake us up. We investigate dashboards. We correlate logs […]

Aasim Zaidi March 9, 2026

MSP

Agentic AI in SRE: Rethinking Reliability in the Age of Autonomous Systems

Aasim Zaidi February 20, 2026

DevOps

I Left This AWS Task Half-Done for 2 Weeks – Here’s What It Taught Me

Introduction When you work with AWS infrastructure for some time, you realise that not all problems announce themselves with alerts or outages. Some problems stay quiet, blend into the background, and only reveal themselves later, usually when someone asks a question you can’t answer clearly. This is one such experience from my early days of working […]

Vivek Tiwary February 15, 2026

DevOps

DevOps Is Not a One-Time Setup: First-Year Lessons from the Field

Introduction When teams start on their DevOps journey, the excitement is real. CI/CD pipelines, faster deployments, cloud-native tools, automation everywhere – it feels like everything is finally going to be smooth. But in reality, the first year of DevOps is rarely smooth. It’s messy, experimental, and full of learning. At TO THE NEW, while working […]

Karandeep Singh January 14, 2026
