Introduction If you’ve ever worked with Kafka, you know the problem: data grows fast. Every click, impression, or event adds up, and before you know it, your Kafka brokers’ disks are full. Disk is not cheap on AWS, so storing everything on broker storage is costly, and scaling up to handle growth feels […]
Introduction I recently went through a wild but rewarding ride optimizing a heavy pandas-based reconciliation workflow. What started as a slow, clunky piece of code that took over 48 hours to run ended up polished into a lean, mean 6-minute machine. I wanted to share my story – not as a tutorial, but as a real-world […]
Introduction For years, we ran Kafka in the data centre; then we moved to AWS and started running Kafka on EC2. However, the headaches grew along with our usage. Between broker upgrades, ZooKeeper problems, […] we began to feel as though we were spending more time managing Kafka than creating anything of value.
Introduction AWS Elastic MapReduce (EMR) is an AWS service for processing big data using open-source tools such as Apache Spark, Apache Flink, Trino, Hadoop, Hive, Presto, and many more. It provides a platform to run your applications without having to think much about managing the underlying infrastructure. AWS EMR is a multipurpose, easy-to-use, highly available, […]
Kafka is a distributed streaming platform designed for real-time data pipelines, stream processing, and data integration. AWS Lambda, on the other hand, is a serverless compute service that executes your code in response to events, managing the underlying compute resources for you. In organizations where Kafka plays a central role in streaming and data integration, […]
In the first part of this series on ETL data pipelines, we explored the importance of ETL processes and their core components, and discussed the different types of ETL pipelines. Now, in this second part, we will dive deeper into some of the key challenges faced when implementing ETL data pipelines, outline best practices to optimize these processes […]
In today’s data-driven world, businesses rely on timely, accurate information to make critical decisions. Data pipelines play a vital role in this process, seamlessly fetching, processing, and transferring data to centralized locations like data warehouses. These pipelines ensure the right data is available when needed, allowing organizations to analyze trends, forecast outcomes, and optimize their […]
Introduction In today’s data-driven world, managing and transforming data from various sources is a cumbersome task for organizations. Azure Data Factory (ADF) stands out as an extensive and robust cloud-based ETL and data integration service that helps businesses streamline their complex data-driven workflows quickly and easily. Azure Data Factory provides a […]
Big Data, Data & Analytics, DevOps
In today’s world, handling complex tasks and automating them is crucial. Apache Airflow is a powerful tool that helps with this. It acts like a conductor for tasks, keeping everything running smoothly. Running Airflow with Docker makes it even better, because the setup becomes flexible and easy to move between environments. In this blog, we’ll explain […]