{"id":61712,"date":"2024-05-14T02:57:39","date_gmt":"2024-05-13T21:27:39","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=61712"},"modified":"2024-07-08T10:30:01","modified_gmt":"2024-07-08T05:00:01","slug":"driving-efficiency-and-cost-reduction-kafka-migration-to-aws-msk-for-a-leading-advertising-firm","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/driving-efficiency-and-cost-reduction-kafka-migration-to-aws-msk-for-a-leading-advertising-firm\/","title":{"rendered":"Driving Efficiency and Cost Reduction: Kafka Migration to AWS MSK for a Leading Advertising Firm"},"content":{"rendered":"<h1><b>Introduction<\/b><\/h1>\n<p>In the world of data management, companies seek to streamline operations and enhance scalability. One key journey involves migrating self-managed Apache Kafka clusters from AWS EC2 to Amazon MSK. We executed such a migration for a client with zero downtime, offering insights and strategies in this blog.<\/p>\n<h2><b>Motivations Behind Migration<\/b><\/h2>\n<ol>\n<li><b> Scalability Limitations<\/b><b>: <\/b>Scaling self-managed Kafka clusters for increasing data volumes and processing demands was challenging, requiring manual intervention for deploying additional EC2 instances, configuring files, and rebalancing partitions.<\/li>\n<\/ol>\n<ol start=\"2\">\n<li><b> Operational Overhead: <\/b><span style=\"font-weight: 400;\">Self-managed Kafka system requires significant human efforts and knowledge to deploy, configure, and maintain effectively. 
Security patches, monitoring, and backup plans added to the administrative burden, diverting minds from real business requirements.<\/span><\/li>\n<\/ol>\n<ol start=\"3\">\n<li><b> Efficiency Issues: <\/b><span style=\"font-weight: 400;\">Underutilized resource utilization &amp; over-provisioned capacity, <\/span><span style=\"font-weight: 400;\">resulted in unnecessary costs and operational inefficiencies.<\/span><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<\/ol>\n<ol start=\"4\">\n<li><b> Upgrading Issues: <\/b><span style=\"font-weight: 400;\">Upgrading Kafka versions in self-managed environments was difficult and time-consuming, requiring careful planning, testing, and coordination to minimize blast radius and ensure compatibility with existing applications and infrastructure.<\/span><\/li>\n<\/ol>\n<ol start=\"5\">\n<li><b> Security and Compliance Reasons: <\/b><span style=\"font-weight: 400;\">Of course not upgrading the Kafka cluster resulted in many security &amp; compliance issues. The Kafka version was old and had many security vulnerabilities.<\/span><\/li>\n<\/ol>\n<h2><b>Existing Setup and Cost Analysis<\/b><\/h2>\n<ol>\n<li style=\"font-weight: 400;\"><b>Kafka Cluster Configurations<\/b><span style=\"font-weight: 400;\">: The setup included <strong>15<\/strong> Kafka nodes with <strong>m5.2xlarge<\/strong> instances, totaling around <strong>200 TB<\/strong> of disk space. Apache Kafka Version: 0.8.2.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Zookeeper Node Configurations<\/b><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">: 3 <\/span><\/span>Kafka Zookeeper nodes, using <strong>c5.xlarge<\/strong> instances with 100 GB disks, managed the Kafka cluster.<\/li>\n<li style=\"font-weight: 400;\"><b>Replication Factor in Kafka:<\/b><span style=\"font-weight: 400;\"> The replication factor refers to the number of copies maintained for each Kafka topic partition across the cluster. 
In our setup, each message in Kafka was triplicated across 3 brokers, maintaining RF of 3.<\/span><\/li>\n<li style=\"font-weight: 400;\"><b>Inter-AZ Data Transfer Costs:<\/b><span style=\"font-weight: 400;\">\u00a0 Architecture was spanned across 3 AZs in the US-West-2 region on AWS for high availability. However, AWS imposes inter-AZ data transfer costs for communication between Kafka brokers located in different availability zones. The architecture, designed to ensure HA, inadvertently incurred substantial inter-AZ data transfer costs, contributing significantly to monthly AWS expenses.<\/span><\/li>\n<\/ol>\n<h2><b>Problem Statement<\/b><\/h2>\n<p>Adding all these factors &amp; infrastructure was costing us monthly expenditures, totaling approximately 40,000 US Dollars. The primary cost driver was from inter-AZ data transfer charges, reflecting the substantial volume of data exchanged between Kafka brokers across availability zones. To address this issue, we explored migrating to Amazon MSK for benefits like high availability, quick scaling, and no data transfer costs between brokers across different AZs.<\/p>\n<h2><b>Open Questions<\/b><\/h2>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">What applications and other adjustments are necessary to support the new MSK cluster?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">How can we guarantee the performance of the MSK cluster meets our requirements?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">What MSK configurations are optimal for both production and non-production environments?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">How do we reset Data-pipeline offsets to &#8216;earliest&#8217; before reading from MSK?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Does MSK require any form of pre-warming before full-scale data transmission just like load 
balancers in AWS?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">What Rollback plans should be in place if issues arise post-switch to MSK?<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">While these questions might vary depending on your specific scenario, we will address them comprehensively throughout this blog.\u00a0<\/span><\/p>\n<h2><b>Migration Overview Diagram\u00a0<\/b><\/h2>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Existing Setup<br \/>\n<\/span><\/span><\/p>\n<p><div style=\"width: 923px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" class=\"\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2024\/05\/image5.png\" alt=\"Migration Overview Diagram\u00a0of our existing AWS system\" width=\"913\" height=\"518\" \/><p class=\"wp-caption-text\">Existing AWS Setup<\/p><\/div><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">During Migration<br \/>\n<\/span><\/span><\/p>\n<p><div style=\"width: 1033px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2024\/05\/image2.png\" alt=\"Overview Migration diagram during migration\" width=\"1023\" height=\"580\" \/><p class=\"wp-caption-text\">AWS account during migration<\/p><\/div><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Post-migration<br \/>\n<\/span><\/span><\/p>\n<p><div style=\"width: 1147px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2024\/05\/image6.png\" alt=\"Overview Migration diagram post migration\" width=\"1137\" height=\"645\" \/><p class=\"wp-caption-text\">AWS account post migration<\/p><\/div><\/li>\n<\/ul>\n<h2><b>Migration 
Plan<\/b><\/h2>\n<ol>\n<li><b> Provision and Configure MSK Cluster<\/b><span style=\"font-weight: 400;\">: We started by setting up the MSK cluster in lower environments like QA &amp; dev and configured it according to our requirements, ensuring optimal performance, security, and scalability. <\/span><span style=\"font-weight: 400;\">To determine the right number of brokers for your MSK cluster and understand costs, see the <\/span><a href=\"https:\/\/view.officeapps.live.com\/op\/view.aspx?src=https%3A%2F%2Fdy7oqpxkwhskb.cloudfront.net%2FMSK_Sizing_Pricing.xlsx&amp;wdOrigin=BROWSELINK\"><span style=\"font-weight: 400;\">MSK Sizing and Pricing<\/span><\/a><span style=\"font-weight: 400;\"> spreadsheet. This spreadsheet provides an estimate for sizing an MSK cluster and the associated costs of Amazon MSK compared to a similar, self-managed, EC2-based Apache Kafka cluster.<\/span><\/li>\n<li><b> Updated Kafka Secrets and Configurations<\/b><span style=\"font-weight: 400;\">: Updated AWS Secrets Manager and the configurations in applications (both consumer &amp; producer) to point to the newly provisioned MSK cluster.<\/span><\/li>\n<li><b> Performance Testing<\/b><span style=\"font-weight: 400;\">: We performed performance testing using MirrorMaker to validate the MSK cluster, evaluating its throughput, latency, and scalability under various load conditions to ensure it met our organization&#8217;s requirements and expectations.<\/span><\/li>\n<li><b> Validate Application Functionality<\/b><span style=\"font-weight: 400;\">: Thoroughly tested the functionality of deployed applications on the MSK cluster. Verified that data ingestion, processing, and communication between components functioned as expected.<\/span><\/li>\n<li><b> Monitoring MSK Cluster Performance<\/b><span style=\"font-weight: 400;\">: Continuously monitored the performance of the MSK cluster using tools like AWS CloudWatch and custom monitoring solutions. Kept an eye on key metrics such as throughput, latency, and error rates to identify any anomalies or performance issues.<\/span><\/li>\n<li><b> Troubleshooting Issues<\/b><span style=\"font-weight: 400;\">: Actively troubleshot issues that arose during the deployment and validation process.<\/span><\/li>\n<li><b> Validate Results<\/b><span style=\"font-weight: 400;\">: Validated the successful deployment and operation of applications on the MSK cluster. Verified that application logs were free of Kafka-related errors, and that data ingestion and processing rates met expectations.<\/span><\/li>\n<\/ol>\n<h2><b>Performance Testing Plan<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">We explored the possibility of using MirrorMaker to replicate data from the old Kafka cluster to the new MSK cluster. While MirrorMaker can be used for performance testing, it doesn&#8217;t replicate consumer offsets, making it unsuitable for the actual migration process.<\/span><\/p>\n<h3><b>MirrorMaker:<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Kafka MirrorMaker is a component of Apache Kafka that helps in data replication between Kafka clusters. It enables the replication of topics from one Kafka cluster to another. We used MirrorMaker to replicate data from the production-environment Kafka cluster running on EC2 to the testing-environment AWS MSK cluster and pointed the testing-environment applications to it. 
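A minimal sketch of a MirrorMaker configuration for this kind of one-way replication: a consumer config pointing at the source (EC2) cluster and a producer config pointing at the target (MSK) cluster. Broker addresses and the group id below are placeholders, and the exact property names depend on the Kafka version in use (very old consumers, for example, use `zookeeper.connect` instead of `bootstrap.servers`):

```properties
# consumer.properties - reads from the source (EC2) Kafka cluster (placeholder addresses)
bootstrap.servers=ec2-kafka-broker1:9092,ec2-kafka-broker2:9092
group.id=mirror-maker-group
auto.offset.reset=earliest

# producer.properties - writes to the target MSK cluster (placeholder bootstrap string)
bootstrap.servers=b-1.msk-cluster.example.us-west-2.amazonaws.com:9092
```

MirrorMaker is then launched with both configs, e.g. `kafka-mirror-maker.sh --consumer.config consumer.properties --producer.config producer.properties --whitelist ".*"` to mirror every topic.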
This helped us validate and test application compatibility with AWS MSK using real-time production data.<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">Check out this blog for more information related to MirrorMaker: <\/span><a href=\"https:\/\/www.tothenew.com\/blog\/mirror-maker-for-kafka-migration\/\"><span style=\"font-weight: 400;\">https:\/\/www.tothenew.com\/blog\/mirror-maker-for-kafka-migration\/<\/span><\/a><\/p>\n<p><b>Consumer Lag<\/b><span style=\"font-weight: 400;\">:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The difference between the latest offsets producers have written to a topic&#8217;s partitions and the offsets up to which consumers have read.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">We used the Consumer Offset Checker tool to calculate and monitor consumer lag for each partition of a topic &amp; plotted the lag on CloudWatch, helping identify potential bottlenecks or performance issues in consumer processing. Here\u2019s the <\/span><a href=\"https:\/\/github.com\/karannnn-exe\/Automations\/blob\/main\/python_scripts\/check_lag.py\"><span style=\"font-weight: 400;\">Python script<\/span><\/a><span style=\"font-weight: 400;\"> for calculating consumer lag.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We automated the execution of the Python script using Jenkins &amp; AWS SSM, employing a scheduled cron job for seamless integration into our workflow. This automation enabled us to execute the script at predefined intervals. Link to the <\/span><a href=\"https:\/\/github.com\/karannnn-exe\/Automations\/blob\/main\/Jenkins\/Jenkinsfile\"><span style=\"font-weight: 400;\">Jenkinsfile<\/span><\/a><span style=\"font-weight: 400;\">. 
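Under the hood, the lag check is per-partition arithmetic: lag is the partition's log-end offset minus the consumer group's committed offset. A self-contained sketch of that logic (the offset numbers here are hypothetical; the linked script fetches real values from the cluster):

```python
# Consumer lag per partition = log-end offset minus the group's committed offset.
# In the real check these values come from the brokers; here they are hypothetical.

def consumer_lag(end_offsets, committed):
    """Return {partition: lag}, clamping at zero in case a commit races ahead."""
    return {p: max(end_offsets[p] - committed.get(p, 0), 0) for p in end_offsets}

end_offsets = {0: 1500, 1: 980, 2: 2040}   # latest offset per partition (hypothetical)
committed   = {0: 1500, 1: 850, 2: 1900}   # group's committed offset per partition

lags = consumer_lag(end_offsets, committed)
total_lag = sum(lags.values())
print(lags)       # per-partition lag, e.g. partition 1 is 130 messages behind
print(total_lag)  # total lag across the topic
```

Summing the per-partition lags gives the single number plotted per topic on the CloudWatch dashboards below.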
<\/span><span style=\"font-weight: 400;\">We created a CloudWatch dashboard with consumer lag for all topics.<\/span><\/p>\n<div style=\"width: 1286px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2024\/05\/image3.png\" alt=\"CloudWatch dashboard with consumer lag for all topics\" width=\"1276\" height=\"596\" \/><p class=\"wp-caption-text\">CloudWatch dashboard 1<\/p><\/div>\n<div style=\"width: 1298px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2024\/05\/image4.png\" alt=\"CloudWatch dashboard with consumer lag for all topics\" width=\"1288\" height=\"436\" \/><p class=\"wp-caption-text\">CloudWatch dashboard 2<\/p><\/div>\n<p><span style=\"font-weight: 400;\">After conducting data verification between the production and integration environments across various timeframes, it was evident that the applications could successfully produce and consume data. 
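One of the open questions above was how to reset data-pipeline offsets to 'earliest' before reading from MSK. With stock Kafka tooling this is typically done per consumer group, while the group's consumers are stopped; the group and topic names below are placeholders:

```shell
# Dry run: show the offsets that would be set, without applying them
kafka-consumer-groups.sh --bootstrap-server <msk-bootstrap-brokers> \
  --group my-pipeline-group --topic my-topic \
  --reset-offsets --to-earliest --dry-run

# Apply the reset (the consumer group must be inactive)
kafka-consumer-groups.sh --bootstrap-server <msk-bootstrap-brokers> \
  --group my-pipeline-group --topic my-topic \
  --reset-offsets --to-earliest --execute
```

Running the dry run first is a cheap safety net before the cutover, since the reset is applied per group and cannot be undone.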
With confirmation of data integrity and functionality, we proceeded with the cutover process in the production environment.<\/span><\/p>\n<div id=\"attachment_62676\" style=\"width: 673px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-62676\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-62676 \" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2024\/06\/data.png\" alt=\"functional testing\" width=\"663\" height=\"172\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2024\/06\/data.png 995w, \/blog\/wp-ttn-blog\/uploads\/2024\/06\/data-300x78.png 300w, \/blog\/wp-ttn-blog\/uploads\/2024\/06\/data-768x199.png 768w, \/blog\/wp-ttn-blog\/uploads\/2024\/06\/data-624x162.png 624w\" sizes=\"(max-width: 663px) 100vw, 663px\" \/><p id=\"caption-attachment-62676\" class=\"wp-caption-text\">functional testing<\/p><\/div>\n<h2><b>Post-Migration &amp; Cleanup<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">After completing the migration, there are several post-migration tasks to ensure a smooth transition and clean up any residual resources:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Update application configurations and remove references to the old Kafka cluster.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Validate data integrity and consistency across applications.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Decommission old Kafka clusters and associated resources.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Monitor MSK cluster performance in production and make any necessary adjustments.<\/span><\/li>\n<\/ul>\n<h2><b>Conclusion<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">By following a structured migration plan and addressing key considerations, organizations can seamlessly transition their Kafka workloads to AWS MSK and unlock the full potential of real-time data streaming 
on the cloud. <\/span><span style=\"font-weight: 400;\">Partnering with a managed cloud services provider who can help you choose the right migration path is one way to overcome these kinds of migration challenges.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In the world of data management, companies seek to streamline operations and enhance scalability. One key journey involves migrating self-managed Apache Kafka clusters from AWS EC2 to Amazon MSK. We executed such a migration for a client with zero downtime, offering insights and strategies in this blog. Motivations Behind Migration Scalability Limitations: Scaling self-managed [&hellip;]<\/p>\n","protected":false},"author":1601,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":115},"categories":[2348],"tags":[5916,5209,1604,1703],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/61712"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1601"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=61712"}],"version-history":[{"count":7,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/61712\/revisions"}],"predecessor-version":[{"id":62851,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/61712\/revisions\/62851"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=61712"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=61712"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post
=61712"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}