DevOps

Fixing JVM OutOfMemoryError on ECS (EC2 Based)

Introduction We started seeing repeated OutOfMemoryError exceptions in a Spring Boot service running on Amazon ECS with the EC2 launch type. The impact was serious: JVM threads crashed, including SQS listeners, HTTP threads, and AWS SDK threads. Messages were retried and eventually sent to SQS dead-letter queues. The service became unstable under load. […]

Ahmad Ali

DevOps

HA (high availability) Active/Passive Palo Alto on AWS

Introduction In the first part, we explored Palo Alto firewalls, their use cases, and different ways to achieve high availability in AWS. To learn more click here. In this second part, we’ll walk through a complete end-to-end setup of an Active/Passive Palo Alto HA deployment within the same Availability Zone. Architecture In this setup, traffic […]

DevOps

End-to-End Container Hardening on Amazon EKS (CIS Aligned Implementation)

Introduction With the growing adoption of containers and Kubernetes, securing containerized workloads has become a critical responsibility for DevOps and platform teams. Organizations running workloads on Kubernetes must ensure that their infrastructure, container images, runtime configurations, and resource governance follow security best practices. In this blog, we walk through the end-to-end container hardening approach implemented […]

DevOps

Real-World AWS Cost Optimization Strategies for High-Traffic Platforms

Introduction I’ll be honest: running a high-traffic production environment on AWS is fun… until you see the cloud bill. At first, you overprovision a bit of memory “just to be safe.” Containers stay up a little longer than needed. Logs? Oh, we log everything because, you know, one day you might need […]

DevOps

Step-by-Step Guide to Building Observability into an LLM Application

Introduction Large Language Models (LLMs) are transforming the way that users interact with applications, and they introduce observability challenges that require new approaches. Unlike deterministic APIs that return predictable results, LLMs have variable performance, unpredictable outputs, and complex failure modes. Observing these systems effectively means collecting data that captures not just the performance of LLM […]

DevOps

Securely Access Private GKE Clusters Using Tinyproxy and Identity-Aware Proxy (IAP)

Introduction Private clusters in Google Kubernetes Engine improve security by preventing public access to the Kubernetes control plane, but this also makes remote management more difficult. This step-by-step guide walks you through how to configure Tinyproxy on a private bastion host and how to use Identity-Aware Proxy (IAP) to safely access a private GKE cluster […]

DevOps

From Logstash to Fluent Bit: How We Streamlined Logging for an Ad Tech Client

Introduction In ad-tech, logs are not “nice to have.” They are the product’s heartbeat. Every impression, every click, every bid request generates logs. Multiply that by millions of requests per minute, and you’re suddenly dealing with millions of events and terabytes of logs per day. That’s exactly where one of our platforms was. […]

DevOps

Why Right-Sizing Is Not a One-Time Activity

Introduction If you’ve worked in production long enough, you’ve probably heard this: “Let’s right-size the services and reduce the AWS bill.” So we do it. We check CPU and memory metrics for a week. We reduce task sizes. Costs drop. Everyone’s happy. And then… six months later, the bill increases again. Nothing “dramatic” changed. No […]

DevOps

ECS Fargate at Scale: Lessons from Running Multiple Microservices in Production

Introduction When we started with Amazon ECS on AWS Fargate, it felt simple. No EC2 to manage. No AMIs. No cluster scaling headaches. Then the number of services grew. Working with an ad-tech client for the last 5 years and running their workload on ECS Fargate has taught us many things. Different traffic patterns. Different scaling […]
