Introduction Modern cloud-native systems generate millions of log entries every day. But here’s the real question — are we truly extracting meaningful insights from those logs, or just storing them? AI-Powered Log Monitoring in Microsoft Azure combines Azure Monitor, Azure Log Analytics, and Azure OpenAI Service to transform raw...
Introduction We started seeing repeated OutOfMemoryError exceptions in a Spring Boot service running on Amazon ECS in EC2 mode. The impact was serious: JVM threads crashed, including SQS listeners, HTTP threads, and AWS SDK threads. Messages were retried and eventually sent to SQS Dead Letter Queues. ...
Introduction In the first part, we explored Palo Alto firewalls, their use cases, and different ways to achieve high availability in AWS. To learn more click here. In this second part, we’ll walk through a complete end-to-end setup of an Active/Passive Palo Alto HA deployment within the same Availability Zone. Architecture ...
Introduction I’ll be honest: running a high-traffic production environment on AWS is fun… until you see the cloud bill. At first, you overprovision a bit of memory “just to be safe.” Containers stay up a little longer than needed. Logs? Oh, we log everything because, you know, one day you might need it. And cross-AZ...
Introduction Large Language Models (LLMs) are transforming the way that users interact with applications, and they introduce observability challenges that require new approaches. Unlike deterministic APIs that return predictable results, LLMs have variable performance, unpredictable outputs, and complex failure modes. Observing these...
Introduction Private clusters in Google Kubernetes Engine improve security by preventing public access to the Kubernetes control plane, but this also makes remote management more difficult. This step-by-step guide walks you through how to configure Tinyproxy on a private bastion host and how to use Identity-Aware Proxy (IAP) to...
Introduction In ad-tech, logs are not “nice to have.” They are the product’s heartbeat. Every impression, every click, every bid request: everything generates logs. Multiply that by millions of requests per minute, and you’re suddenly dealing with millions of events and terabytes of logs per day. That’s exactly where one of our...
Introduction If you’ve worked in production long enough, you’ve probably heard this: “Let’s right-size the services and reduce the AWS bill.” So we do it. We check CPU and memory metrics for a week. We reduce task sizes. Costs drop. Everyone’s happy. And then… six months later, the bill increases again....
Introduction When we started with Amazon ECS on AWS Fargate, it felt simple. No EC2 to manage. No AMIs. No cluster scaling headaches. Then the number of services grew. Working with an ad-tech client for the last 5 years and running their workloads on ECS Fargate has taught us many things. Different traffic patterns. Different...
When we run Elasticsearch in production, a common issue is shard imbalance. One node in the cluster may be out of disk space while a few other nodes have no shards on them. For example, here is a node with all the shards:
Node | Shards | Disk Used | Disk % | Free Space
PESD222 | 957 | 329.1 GB | 32% | 694.2...
Introduction What if you could upgrade your Kubernetes cluster and keep everything running smoothly, with zero downtime? Sounds pretty great, right? A lot of teams worry that upgrading will mean their apps go offline, but with solid planning, it's actually...
Checking metrics across a large collection of AWS accounts (development, staging, UAT, production, etc.) is painfully inefficient. This is a major time waster, not just a small irritation. In addition to wasting valuable engineering time, you run a much higher risk of missing an alert that could result in a full-blown outage every time...