{"id":17637,"date":"2015-02-28T18:05:28","date_gmt":"2015-02-28T12:35:28","guid":{"rendered":"http:\/\/www.tothenew.com\/blog\/?p=17637"},"modified":"2024-01-02T17:49:28","modified_gmt":"2024-01-02T12:19:28","slug":"apache-flume-setup-best-practices","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/apache-flume-setup-best-practices\/","title":{"rendered":"Apache Flume : Setup &amp; Best Practices"},"content":{"rendered":"<p>Apache Flume is an open source project aimed at providing a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large volume of data. It is a complex task when moving data in large volume. We try to minimize the latency in transfer; this is achieved by specifically tweaking the configuration of Flume. First, we&#8217;ll see how to setup flume.<\/p>\n<p><em><strong>Setting Up Flume :<\/strong><\/em><\/p>\n<ol>\n<li>Download flume binary from <a title=\"Flume Download\" href=\"http:\/\/flume.apache.org\/download.html\">http:\/\/flume.apache.org\/download.html<\/a><\/li>\n<li>Extract and put the binary folder in globally accessible place. For e.g. \/usr\/local\/flume\u00a0\u00a0 (Use &#8220;<em>tar -xvf apache-flume-1.5.2-bin.tar.gz&#8221; <\/em>and <em>&#8220;mv apache-flume-1.5.2-bin \/usr\/local\/flume&#8221;<\/em>)<\/li>\n<li>Once done, set the global variables in the &#8220;<em>.bashrc<\/em>&#8221; file of the user accessing Flume.<a href=\"\/blog\/wp-ttn-blog\/uploads\/2015\/02\/Screenshot-from-2015-02-27-151944.png\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter  wp-image-17640\" src=\"\/blog\/wp-ttn-blog\/uploads\/2015\/02\/Screenshot-from-2015-02-27-151944.png\" alt=\"Flume Variables\" width=\"401\" height=\"134\" \/><\/a><\/li>\n<li>Use &#8220;<em>source .bashrc&#8221; <\/em>to set the new variables in effect. 
Test by running the command &#8220;<em>flume-ng version<\/em>&#8221;.<\/li>\n<li>Once the command shows the right version, Flume is ready to use on the current system. Please note that &#8220;<em>flume-ng<\/em>&#8221; stands for &#8220;<strong>Flume Next-Gen<\/strong>&#8221;.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p>The Flume agent takes care of the whole pipeline: it picks data up from the &#8220;<strong><em>source<\/em><\/strong>&#8221;, puts it onto the &#8220;<em><strong>channel<\/strong><\/em>&#8221;, and finally writes it to the &#8220;<em><strong>sink<\/strong><\/em>&#8221;, which is usually <em>HDFS<\/em>.<\/p>\n<img decoding=\"async\" loading=\"lazy\" class=\"aligncenter\" src=\"\/blog\/wp-ttn-blog\/uploads\/2024\/01\/DevGuide_image00.png\" alt=\"\" width=\"520\" height=\"218\" \/>\n<p><em><strong>Basic Configuration<\/strong><\/em>:<\/p>\n<blockquote><p><em>Apache Flume takes a configuration file every time it runs a task. The task is kept alive in order to listen for any change in the <strong>source<\/strong>, and must be terminated manually by the user. 
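The configuration screenshot that follows is not readable as text; as a rough plain-text sketch of such a file (the agent name matches the run command quoted later in the post, but the source type, paths, and tuning values are assumptions, not a transcription of the screenshot):

```properties
# Sketch of a flume.conf: local file source -> memory channel -> HDFS sink
source_agent.sources = test_source
source_agent.channels = mem_channel
source_agent.sinks = hdfs_sink

# Tail a local file as the source (an exec source is one common way; path assumed)
source_agent.sources.test_source.type = exec
source_agent.sources.test_source.command = tail -F /var/log/app.log
source_agent.sources.test_source.channels = mem_channel
# Timestamp interceptor, as used later in this post
source_agent.sources.test_source.interceptors = itime
source_agent.sources.test_source.interceptors.itime.type = timestamp

# Memory channel: keep capacity high and transactionCapacity low
# (see the memory-leak notes in the performance section)
source_agent.channels.mem_channel.type = memory
source_agent.channels.mem_channel.capacity = 10000
source_agent.channels.mem_channel.transactionCapacity = 100

# HDFS sink: hdfs.rollCount defaults to 10 events per file; raise it and
# zero out rollSize/rollInterval to speed up ingestion (per the tips below)
source_agent.sinks.hdfs_sink.type = hdfs
source_agent.sinks.hdfs_sink.channel = mem_channel
source_agent.sinks.hdfs_sink.hdfs.path = hdfs://localhost:8020/flume/events
source_agent.sinks.hdfs_sink.hdfs.rollCount = 10000
source_agent.sinks.hdfs_sink.hdfs.rollSize = 0
source_agent.sinks.hdfs_sink.hdfs.rollInterval = 0
```

Such a file would be started with the same flume-ng agent command shown later in the post.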
A basic configuration that reads from the &#8220;<strong>local file system&#8221;<\/strong> as the source, keeps the channel as &#8220;<\/em><strong>memory<\/strong>&#8221;, and uses &#8220;<em><strong>hdfs<\/strong><\/em>&#8221; as the sink, could be:<\/p>\n<p>&nbsp;<\/p><\/blockquote>\n<p><a href=\"\/blog\/wp-ttn-blog\/uploads\/2015\/02\/Screenshot-from-2015-02-27-1519441.png\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-17648\" src=\"\/blog\/wp-ttn-blog\/uploads\/2015\/02\/Screenshot-from-2015-02-27-1519441.png\" alt=\"Flume Config\" width=\"730\" height=\"476\" \/><\/a><\/p>\n<p>The source could be almost anything, for example an Avro client used by a Log4JAppender.<\/p>\n<p>It is worth noting that the above configuration will, by default, produce separate files of 10 records each, using the timestamp (<em><strong>an interceptor<\/strong><\/em>) as the reference point for the last update of the source.<\/p>\n<p>Save the file as flume.conf; it can then be run as:<\/p>\n<p><em><strong>flume-ng agent -f flume.conf -n source_agent<\/strong><\/em><\/p>\n<p><em><strong>Performance Measures, Issues and Comments:<\/strong><\/em><\/p>\n<ol>\n<li>Memory leaks in <strong>log4jappender<\/strong> when ingesting just 10000 records\n<ul>\n<li>Solved by making the thread sleep after every 500 records, thus decreasing the load on the channel<\/li>\n<\/ul>\n<\/li>\n<li>GC memory leak (Flume level)\n<ul>\n<li>Solved by keeping the channel&#8217;s transactionCapacity low and its capacity high enough<\/li>\n<\/ul>\n<\/li>\n<li>Avro&#8217;s optimized data serialization is not exploited when using log4jappender\n<ul>\n<li>Solved by pushing the file directly to the avro-client (built into flume-ng)<\/li>\n<\/ul>\n<\/li>\n<li>Due to memory leaks, could not write more than 12000 records from log4jappender\n<ul>\n<li>Solved in points 1 &amp; 2<\/li>\n<\/ul>\n<\/li>\n<li>Did a performance analysis of data ingestion, both directly from a file and via the avro client, of 
14000 records, when only 10 records were rolled up per file in HDFS\n<ul>\n<li>Using the Avro client &#8211;&gt; Total time &#8211; <strong>00:02:48<\/strong><\/li>\n<li>Directly from the file system &#8211;&gt; Total time &#8211; <strong>00:02:44<\/strong><\/li>\n<\/ul>\n<\/li>\n<li>When using the avro client, real-time updates to the file were not taken into account\n<ul>\n<li>Unsolved problem<\/li>\n<\/ul>\n<\/li>\n<li>When ingesting directly from a file, real-time updates were registered automatically, based on the timestamp of the last modification\n<ul>\n<li>Used the following configuration<br \/>\nsource_agent.sources.test_source.interceptors = itime<br \/>\n# http:\/\/flume.apache.org\/FlumeUserGuide.html#timestamp-interceptor<br \/>\nsource_agent.sources.test_source.interceptors.itime.type = timestamp<\/li>\n<\/ul>\n<\/li>\n<li>By default, Flume wrote at most 10 records to a single file. This decreased the ingestion rate, thus increasing the ingestion time\n<ul>\n<li>Solved by changing the rollCount property of the HDFS sink to the desired number of events\/records per file; this increased the ingestion rate, thus decreasing the ingestion time<\/li>\n<li>Also set rollSize and rollInterval to 0 so that they are not used<\/li>\n<\/ul>\n<\/li>\n<li>While controlling the channel capacity, encountered memory leaks\n<ul>\n<li>Solved by tuning rollCount (HDFS sink) and transactionCapacity (channel) so that the channel is cleared in time for more data<\/li>\n<\/ul>\n<\/li>\n<li>Did a performance analysis of ingesting 20000 records directly from a file after applying the solutions from points 8 &amp; 9\n<ul>\n<li>Total time &#8211; <strong>00:00:04<\/strong> (previously around <strong>00:02:30<\/strong>)<\/li>\n<\/ul>\n<\/li>\n<li>How to trigger a notification on an HDFS update from Flume\n<ul>\n<li>Can potentially be solved with an Oozie Coordinator<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p>This 
covers the very basics of setting up Flume and mitigating some of the common issues one encounters while using it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Flume is an open source project aimed at providing a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large volumes of data. Moving data at this scale is a complex task; we try to minimize transfer latency, which is achieved by carefully tuning Flume&#8217;s configuration. [&hellip;]<\/p>\n","protected":false},"author":158,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":8},"categories":[1395],"tags":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/17637"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/158"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=17637"}],"version-history":[{"count":1,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/17637\/revisions"}],"predecessor-version":[{"id":59899,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/17637\/revisions\/59899"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=17637"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=17637"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=17637"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}