{"id":72441,"date":"2025-06-04T13:50:12","date_gmt":"2025-06-04T08:20:12","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=72441"},"modified":"2025-06-04T13:54:00","modified_gmt":"2025-06-04T08:24:00","slug":"how-rack-awareness-in-amazon-msk-saved-36k-year-for-a-leading-adtech-company","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/how-rack-awareness-in-amazon-msk-saved-36k-year-for-a-leading-adtech-company\/","title":{"rendered":"How Rack Awareness in Amazon MSK Saved $36K\/Year for a Leading AdTech Company"},"content":{"rendered":"<h2><span style=\"text-decoration: underline;\">Introduction<\/span><\/h2>\n<p>Building and deploying applications that use Apache Kafka for real-time data processing is made simple with <strong>Amazon MSK (Managed Streaming for Apache Kafka)<\/strong>, a fully managed service by AWS. Rack awareness is one overlooked configuration change that can greatly increase fault tolerance and cost effectiveness, even though MSK takes care of a large portion of the infrastructure for you.<\/p>\n<div id=\"attachment_72440\" style=\"width: 802px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72440\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-72440 size-full\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-28.png\" alt=\"MSK Rack Awareness\" width=\"792\" height=\"660\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-28.png 792w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-28-300x250.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-28-768x640.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-28-624x520.png 624w\" sizes=\"(max-width: 792px) 100vw, 792px\" \/><p id=\"caption-attachment-72440\" class=\"wp-caption-text\">MSK Rack Awareness<\/p><\/div>\n<p>In this blog, we will explore what rack awareness is, how Kafka or MSK uses it, and how configuring your Kafka consumers to read from the closest replica rather 
than the partition leader can result in major cost savings. We carried out this optimization for a <strong>Global Advertising Management Platform client<\/strong>, a powerhouse in advertising and connected TVs. With a state-of-the-art Connected TV Advertising Management Platform, they needed a trusted partner to control their cloud bills. In our case, the savings were almost <strong>$100 per day<\/strong> on the cross-AZ data transfer bill. Let\u2019s get started and explore the solution.<\/p>\n<h2><span style=\"text-decoration: underline;\">What is Rack Awareness in Kafka?<\/span><\/h2>\n<p>The rack awareness feature spreads replicas of the same partition across different racks (physical racks or, on AWS, Availability Zones), which guarantees that a failure in one location won&#8217;t destroy all replicas of a partition. Letting Kafka consumers fetch data from the closest available replica additionally lowers latency and cross-AZ traffic.<\/p>\n<h2><span style=\"text-decoration: underline;\"><strong>Advantages Of Rack Awareness<\/strong><\/span><\/h2>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>High Availability<\/strong><\/span>: Even if an entire AZ fails, partitions continue to function. 
Data can be fetched from brokers in the other AZs.<\/li>\n<li><span style=\"text-decoration: underline;\"><strong>Resilience:<\/strong><\/span> No single AZ becomes a single point of failure.<\/li>\n<li><span style=\"text-decoration: underline;\"><strong>Low Data Transfer Costs<\/strong><\/span>: Consumers fetch data from replicas in the same AZ, which avoids cross-AZ data transfer charges for consumer reads.<\/li>\n<li><span style=\"text-decoration: underline;\"><strong>Performance<\/strong><\/span>: Reading data from brokers\/replicas in the same AZ results in lower latency and higher performance.\n<p><div id=\"attachment_72442\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72442\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72442\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/Screenshot-from-2025-06-01-22-47-35.png\" alt=\"Advantages of Rack Awareness\" width=\"758\" height=\"296\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/Screenshot-from-2025-06-01-22-47-35.png 758w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/Screenshot-from-2025-06-01-22-47-35-300x117.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/Screenshot-from-2025-06-01-22-47-35-624x244.png 624w\" sizes=\"(max-width: 758px) 100vw, 758px\" \/><p id=\"caption-attachment-72442\" class=\"wp-caption-text\">Advantages of Rack Awareness<\/p><\/div><\/li>\n<\/ul>\n<h2><span style=\"text-decoration: underline;\"><strong>How Amazon MSK Handles Rack Awareness<\/strong><\/span><\/h2>\n<p>Amazon MSK automatically maps brokers to different Availability Zones. Each broker has a 
broker.rack property set by MSK based on its AZ.<br \/>\nLet\u2019s take the Oregon region (us-west-2) as an example:<\/p>\n<pre>Broker 1 \u2192 us-west-2a\r\n\r\nBroker 2 \u2192 us-west-2b\r\n\r\nBroker 3 \u2192 us-west-2c\r\n\r\nBroker 4 \u2192 us-west-2a\r\n\r\n.\r\n.\r\n.\r\n.\r\n\r\nBroker 15 \u2192 us-west-2c<\/pre>\n<p>When a topic is created with a replication factor of 3, MSK ensures that the replicas are distributed across different AZs.<\/p>\n<h2><span style=\"text-decoration: underline;\"><strong>Enabling Closest Replica Fetching<\/strong><\/span><\/h2>\n<p>While MSK takes care of broker rack assignments, to fully leverage rack awareness, your Kafka consumers must be configured to fetch from the closest replica. This requires configuration changes on both the broker and consumer sides.<\/p>\n<h3><span style=\"text-decoration: underline;\">Broker Configuration<\/span><\/h3>\n<p>Set the following property in your MSK configuration:<\/p>\n<pre>replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector<\/pre>\n<div id=\"attachment_72443\" style=\"width: 672px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72443\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72443\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/msk_configuration.png\" alt=\"MSK configuration\" width=\"662\" height=\"300\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/msk_configuration.png 662w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/msk_configuration-300x136.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/msk_configuration-624x283.png 624w\" sizes=\"(max-width: 662px) 100vw, 662px\" \/><p id=\"caption-attachment-72443\" class=\"wp-caption-text\">MSK configuration<\/p><\/div>\n<p>MSK takes about 15 minutes to apply this configuration, though the exact time depends on the cluster size and data volume. 
During this time, the cluster remains fully functional, so it\u2019s completely safe to apply this change anytime.<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>Before enabling this configuration on the cluster:<\/strong><\/span><\/p>\n<div id=\"attachment_72444\" style=\"width: 1610px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72444\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72444\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-29.png\" alt=\"before applying the configuration\" width=\"1600\" height=\"101\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-29.png 1600w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-29-300x19.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-29-1024x65.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-29-768x48.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-29-1536x97.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-29-624x39.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><p id=\"caption-attachment-72444\" class=\"wp-caption-text\">Before applying the configuration<\/p><\/div>\n<p><span style=\"text-decoration: underline;\"><strong>After enabling:<br \/>\n<\/strong><\/span><\/p>\n<div id=\"attachment_72445\" style=\"width: 1610px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72445\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72445\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-30.png\" alt=\"After applying the configuration\" width=\"1600\" height=\"101\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-30.png 1600w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-30-300x19.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-30-1024x65.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-30-768x48.png 768w, 
\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-30-1536x97.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-30-624x39.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><p id=\"caption-attachment-72445\" class=\"wp-caption-text\">After applying the configuration<\/p><\/div>\n<p>We can see the property is now applied on the cluster. It enables Kafka consumers to fetch data from the closest replica instead of always contacting the partition leader, reducing cross-AZ data transfer and latency. The broker uses this selector class to find the preferred read replica for each consumer.<br \/>\nReference: <a href=\"https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/msk-configuration-properties.html\">https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/msk-configuration-properties.html<\/a><\/p>\n<h3><span style=\"text-decoration: underline;\"><strong>Consumer Configuration<\/strong><\/span><\/h3>\n<p>Set the consumer\u2019s rack ID dynamically by using EC2 Instance Metadata Service (IMDS). In our case, we run the consumer on an Amazon EMR cluster, and we retrieve the AZ ID at runtime using AWS Secrets Manager. This AZ ID is then passed to the client.rack property in the Kafka consumer configuration. MSK sets broker.rack to the AZ ID (such as usw2-az2) rather than the AZ name (such as us-west-2b), so client.rack must use the same form. For example:<\/p>\n<pre>client.rack=usw2-az2<\/pre>\n<p>This tells the Kafka client, in our case Java AWS SDK, to prefer replicas in the same AZ. 
Check out the client-side configuration: https:\/\/kafka.apache.org\/35\/documentation.html#consumerconfigs_client.rack<\/p>\n<h2><span style=\"text-decoration: underline;\"><strong>Verifying Rack Awareness<\/strong><\/span><\/h2>\n<p>You can verify the broker rack configuration using:<\/p>\n<pre>.\/kafka-configs.sh --describe --entity-type brokers --bootstrap-server &lt;broker-endpoint&gt;<\/pre>\n<p>To check that clients are reading from the closest replica, you can monitor network usage patterns or use Kafka metrics via CloudWatch dashboards to identify shifts in traffic distribution after enabling rack awareness.<br \/>\nBefore enabling this optimization, most of our consumer traffic went to the partition leader, which often resided in a different AZ.<\/p>\n<div id=\"attachment_72446\" style=\"width: 1610px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72446\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72446\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-34.png\" alt=\"Cloudwatch Metrics Before Enabling Rack Awareness\" width=\"1600\" height=\"398\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-34.png 1600w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-34-300x75.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-34-1024x255.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-34-768x191.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-34-1536x382.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-34-624x155.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><p id=\"caption-attachment-72446\" class=\"wp-caption-text\">Cloudwatch Metrics Before Enabling Rack Awareness<\/p><\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-72447\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-33.png\" alt=\"\" width=\"1600\" height=\"527\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-33.png 1600w, 
\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-33-300x99.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-33-1024x337.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-33-768x253.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-33-1536x506.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-33-624x206.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/> <img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-72448\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-32.png\" alt=\"\" width=\"1600\" height=\"528\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-32.png 1600w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-32-300x99.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-32-1024x338.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-32-768x253.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-32-1536x507.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-32-624x206.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/> <img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-72449\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-31.png\" alt=\"\" width=\"1600\" height=\"528\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-31.png 1600w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-31-300x99.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-31-1024x338.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-31-768x253.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-31-1536x507.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-31-624x206.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><\/p>\n<p>After implementing rack-aware fetching:<\/p>\n<ul>\n<li>Consumers began reading from in-AZ replicas, and we observed a different pattern in the CloudWatch metrics. 
Data flow and network packet counts were high on the brokers in the same AZ as the consumer.<\/li>\n<\/ul>\n<div id=\"attachment_72453\" style=\"width: 1610px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72453\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72453\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-38.png\" alt=\"Cloudwatch Metrics After Enabling Rack Awareness\" width=\"1600\" height=\"393\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-38.png 1600w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-38-300x74.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-38-1024x252.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-38-768x189.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-38-1536x377.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-38-624x153.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><p id=\"caption-attachment-72453\" class=\"wp-caption-text\">Cloudwatch Metrics After Enabling Rack Awareness<\/p><\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-72454\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-37.png\" alt=\"\" width=\"1600\" height=\"590\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-37.png 1600w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-37-300x111.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-37-1024x378.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-37-768x283.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-37-1536x566.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/06\/unnamed-37-624x230.png 624w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><\/p>\n<ul>\n<li>Cross-AZ data transfer costs dropped substantially. 
We now save approximately $100 per day.<\/li>\n<li>This adds up to roughly <strong>$36,500\/year<\/strong> (about $100 per day over 365 days) in cost savings for just one use case, showing the real financial benefit of deep platform optimization.<\/li>\n<\/ul>\n<h2><span style=\"text-decoration: underline;\"><strong>Best Practices<\/strong><\/span><\/h2>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Use at least 3 AZs<\/strong><\/span>: Spread brokers across 3 AZs for better fault tolerance.<\/li>\n<li><span style=\"text-decoration: underline;\"><strong>Set a proper replication factor<\/strong><\/span>: Ensure at least 3 replicas per partition.<\/li>\n<li><span style=\"text-decoration: underline;\"><strong>Enable rack-aware replica selection<\/strong><\/span>: Set replica.selector.class on the brokers and client.rack on the consumers.<\/li>\n<li><span style=\"text-decoration: underline;\"><strong>Fetch the AZ dynamically<\/strong><\/span>: Especially if using EMR or autoscaling groups.<\/li>\n<li><span style=\"text-decoration: underline;\"><strong>Monitor with CloudWatch<\/strong><\/span><strong>:<\/strong> Set up dashboards to track data transfer and broker traffic.<\/li>\n<\/ul>\n<h2><span style=\"text-decoration: underline;\">Conclusion<\/span><\/h2>\n<p>Rack awareness in Amazon MSK is not just a high availability or resilience feature; it\u2019s a powerful cost-saving mechanism when configured properly. By enabling consumers to fetch from the closest replica, organizations can reduce latency and save significantly on cross-AZ data transfer charges.<br \/>\nIf you&#8217;re using Amazon MSK or open-source Kafka running on EC2 or some other platform and haven&#8217;t looked into rack-aware replica fetching, now is the time. Small configuration changes can lead to big wins in both performance and your AWS bill. 
Partnering with a managed cloud services provider like <a href=\"https:\/\/www.tothenew.com\/\">TO THE NEW<\/a> can help you adopt the right architecture and strategies to unlock these savings and improve your Kafka workloads on AWS. Our AWS Certified Architects and DevOps Engineers are committed to saving you time and resources while enhancing business efficiency and reliability.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Building and deploying applications that use Apache Kafka for real-time data processing is made simple with Amazon MSK (Managed Streaming for Apache Kafka), a fully managed service by AWS. Rack awareness is one overlooked configuration change that can greatly increase fault tolerance and cost effectiveness, even though MSK takes care of a large portion [&hellip;]<\/p>\n","protected":false},"author":1601,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":240},"categories":[2348],"tags":[7454,7447,7442,5550,7458,7462,7431,7450,6788,5916,7456,7452,7451,7461,6731,7459,7455,7449,6961,7457,7463,7460,7446,7444,7441,7445,7443,7453,7448,2987],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/72441"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1601"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=72441"}],"version-history":[{"count":7,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/72441\/revisions"}],"predecessor-version":[{"id":72567,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/72441\/revisions\/72567"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/
media?parent=72441"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=72441"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=72441"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}