DevOps

From Hot Brokers to S3: Optimizing Kafka Storage with Tiered Storage

Introduction If you’ve ever worked with Kafka, you know the problem: data grows fast. Every click, impression, or event adds up, and before you know it, your Kafka brokers’ disks are full. Disk is not cheap on AWS, so storing everything on broker storage is costly, and scaling up to handle growth feels […]

Python

From 48 Hours to 6 Minutes: My Journey Optimizing a Pandas Reconciliation Process for Large-Scale Data

Introduction I recently went through a wild but rewarding ride optimizing a heavy pandas-based reconciliation workflow. What started as a slow, clunky piece of code eating up over 48 hours ended up polished into a lean, mean 6-minute machine. I wanted to share my story – not as a tutorial, but as a real-world […]

DevOps

How Amazon MSK Helped Us Stop Babysitting Kafka

Introduction For years, we ran Kafka in the data centre; then we moved to AWS and started running it on EC2. However, the headaches increased along with our usage. Between broker upgrades, ZooKeeper problems, […], we began to feel as though we were spending more time managing Kafka than creating anything of value.

DevOps

AWS EMR Cost & Performance Optimization: A Business-Centric Approach

Introduction AWS Elastic MapReduce (EMR) is an AWS service for processing big data using open-source tools such as Apache Spark, Apache Flink, Trino, Hadoop, Hive, Presto, and many more. It provides a platform to run your applications without thinking much about the management of the underlying infrastructure. AWS EMR is a multipurpose, easy-to-use, highly available, […]

Data Engineering

Configuring AWS Lambda as a Kafka Producer with SASL_SSL and Kerberos/GSSAPI for Secure Communication

Kafka is a distributed streaming platform designed for real-time data pipelines, stream processing, and data integration. AWS Lambda, on the other hand, is a serverless compute service that executes your code in response to events, managing the underlying compute resources for you. In organizations where Kafka plays a central role in streaming and data integration, […]

Avinash Upreti

Data Engineering

Building Efficient Data ETL Pipelines: Key Best Practices [Part-2]

In the first part of this series on ETL data pipelines, we explored the importance of ETL processes and their core components, and discussed the different types of ETL pipelines. Now, in this second part, we will dive deeper into some of the key challenges faced when implementing data ETL pipelines, outline best practices to optimize these processes […]

Yogesh Kargeti

Data Engineering

Building Efficient Data ETL Pipelines: Anatomy of an ETL [PART-1]

In today’s data-driven world, businesses rely on timely, accurate information to make critical decisions. Data pipelines play a vital role in this process, seamlessly fetching, processing, and transferring data to centralized locations like data warehouses. These pipelines ensure the right data is available when needed, allowing organizations to analyze trends, forecast outcomes, and optimize their […]

Porush Goyal

DevOps

Unlocking Seamless Data Integration in the Cloud with Azure Data Factory

Introduction In today’s data-driven world, managing and transforming data from various sources is a cumbersome task for organizations. Azure Data Factory (ADF) stands out as an extensive and robust cloud-based ETL and data integration service that helps businesses streamline their complex data-driven workflows promptly and with ease. Azure Data Factory provides a […]

Chhavi Sharma

Big Data, Data & Analytics, DevOps

Enhancing Workflows with Apache Airflow and Docker

In today’s world, handling complex tasks and automating them is crucial. Apache Airflow is a powerful tool that helps with this. It acts like a conductor for tasks, making everything work smoothly. Running Airflow with Docker makes it even better, because the setup becomes flexible and easy to move between environments. In this blog, we’ll explain […]