{"id":16699,"date":"2015-01-20T14:24:47","date_gmt":"2015-01-20T08:54:47","guid":{"rendered":"http:\/\/www.tothenew.com\/blog\/?p=16699"},"modified":"2015-08-04T16:02:51","modified_gmt":"2015-08-04T10:32:51","slug":"spark-1o1-revamping-hadoop","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/spark-1o1-revamping-hadoop\/","title":{"rendered":"Spark 1O1 &#8211; Revamping Hadoop"},"content":{"rendered":"<p>Big Data has witnessed a tremendous movement and growth over the last couple of years.\u00a0As per the top\u00a0research agencies, Big Data has recently emerged as the most successful \u201claunch pad\u201d, \u00a0 giving a way to the maximum number of start-up ever. \u00a0As the space evolves further, more and more organizations of varied sizes and discrete\u00a0domains have realized the\u00a0real potential it can offer for their\u00a0business. Majority of these organizations\u00a0have already factored\u00a0Big Data\u00a0\u00a0in their strategic roadmap, as\u00a0the main stream technology.<\/p>\n<p>Hence, <em>It is no more a matter of &#8216;If&#8217; but &#8216;what&#8217;, when it comes to Big Data adoption.<\/em><\/p>\n<p>With the potential opportunities that\u00a0comes along with Big Data, it also poses a continuous push on the Big Data vendors and technology\u00a0providers &#8211;\u00a0to regularly innovate and upgrade their technology stacks. This is critical to survive\u00a0the every growing\u00a0competition, on the face of the rapidly changing <a href=\"http:\/\/www.tothenew.com\/analytics\">technology landscape of Big Data<\/a>.<\/p>\n<p>Recently, the\u00a0top vendors in the space\u00a0announced an upgrades of their respective core products\/platforms. It was primarily the\u00a0result of Spark graduating to the Apache top Level Project. 
Spark has been one of the most eagerly awaited technologies since its inception and is widely seen as a <strong>game-changer<\/strong>, pushing the boundaries of Hadoop to the next level.<\/p>\n<p>Considering the above, we decided to publish a series of blogs around Apache Spark that will serve as a quick-start guide for anyone who wants to take the plunge into <a href=\"http:\/\/www.tothenew.com\/analytics\">Spark development<\/a> and intends to grow as a Spark programmer. We will regularly add posts to this series so that it evolves logically as we move along.<\/p>\n<p><strong>BACKGROUND<\/strong><\/p>\n<p><strong>In 2009, the University of California at Berkeley&#8217;s AMPLab<\/strong> formally kicked off Spark development, with the intent to address the primary shortcomings of <a href=\"http:\/\/www.tothenew.com\/analytics\">Hadoop Development<\/a> (which I will take up in the next section).<\/p>\n<p><strong>In June 2013, Spark made its way to the Apache Incubator<\/strong>. With the obvious potential it offered, Spark instantly attracted a large group of contributors from the community, including some of the most prominent organizations in the Big Data space.<\/p>\n<p><strong>In less than a year, on 27 February 2014<\/strong>, Spark graduated from the Incubator to become an Apache Top-Level Project (TLP). Since then, Spark has gathered a lot of attention within the community and has become one of the most active projects in the Apache Hadoop ecosystem. 
It has gained massive adoption across the industry, with leading Big Data players and organizations including Spark as a core technology component of their respective products, platforms, and tools.<\/p>\n<p><strong>HADOOP vs SPARK<\/strong><\/p>\n<p>Needless to say, Hadoop was the first &#8216;concrete&#8217; revolutionary technology that laid the foundation of Big Data as it exists now. It introduced the world to a new programming paradigm that has literally changed the mindset and approach for dealing with huge data sets.<\/p>\n<p>Architecturally, Hadoop is composed of two main parts &#8211; <strong>HDFS<\/strong> (the storage layer, a distributed file system) and <strong>MapReduce<\/strong> (the compute, or data-processing, engine). This turned out to be a remarkable combination for crunching large data sets efficiently (through parallel processing) and cost-effectively (using commodity hardware).<\/p>\n<p>So where does Spark fit in? Over time, people realized that Hadoop (the &#8220;compute&#8221; layer in particular) had some evident shortcomings that constrained its adoption to certain types of use cases only. Spark can be seen as a new-generation Hadoop that takes these capabilities to a different dimension. Besides running on a cluster, it also supports a pseudo-distributed local mode, usually used only for testing or development purposes. It not only speeds up existing use cases but opens up a whole new world of use-case types that have been out of Hadoop&#8217;s reach.<\/p>\n<p>The table below shows a cross-comparison between Hadoop &amp; Spark on some of the key parameters. 
Clearly, Spark outperforms Hadoop in every department.<\/p>\n<table width=\"686\">\n<tbody>\n<tr>\n<td width=\"177\"><strong><em>PARAMETER<\/em><\/strong><\/td>\n<td width=\"212\"><strong><em>HADOOP<\/em><\/strong><\/td>\n<td><strong><em>SPARK<\/em><\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"177\"><strong><em>Performance<\/em><\/strong><\/td>\n<td width=\"212\">Major issues arise as workflows become more complex (chaining MapReduce jobs is inefficient).<\/td>\n<td>Known for its performance; benchmarks show up to 100x speed-ups over Hadoop.<\/td>\n<\/tr>\n<tr>\n<td width=\"177\"><strong><em>Machine Learning<\/em><\/strong><\/td>\n<td width=\"212\">Ill-suited to &#8216;<strong>iterative<\/strong>&#8217; algorithms, which are predominantly used for ML.<\/td>\n<td>Handles iterative and other complex algorithms efficiently.<\/td>\n<\/tr>\n<tr>\n<td width=\"177\"><strong><em>Streaming<\/em><\/strong><\/td>\n<td width=\"212\">Not designed for event streams.<\/td>\n<td>Built for near-real-time event streaming.<\/td>\n<\/tr>\n<tr>\n<td width=\"177\"><strong><em>Implementation<\/em><\/strong><\/td>\n<td width=\"212\">MapReduce is a low-level API, making implementation challenging and error-prone.<\/td>\n<td>Developer-friendly APIs &#8211; ease of implementation.<\/td>\n<\/tr>\n<tr>\n<td width=\"177\"><strong><em>Primary Memory Usage<\/em><\/strong><\/td>\n<td width=\"212\">Relies on disk I\/O between stages.<\/td>\n<td>In-memory processing engine.<\/td>\n<\/tr>\n<tr>\n<td width=\"177\"><strong><em>Analytics<\/em><\/strong><\/td>\n<td width=\"212\">Offline, batch-based analytics.<\/td>\n<td>Interactive analytics.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong><br \/>\nSPARK ECOSYSTEM<\/strong><\/p>\n<p>A primary focus, right from Spark&#8217;s inception, has been to build a unified ecosystem around the core, wherein each &#8216;component&#8217; caters to one specific area of focus. 
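The "Performance" and "Primary Memory Usage" rows of the table above can be made concrete with a toy sketch in plain Python. This is an illustration only, with invented data and names, not actual Hadoop or Spark code: it contrasts the classic MapReduce habit of materializing each stage's full output with Spark-style lazy, in-memory pipelining.

```python
# Toy word count, two ways (plain Python, purely illustrative).
from functools import reduce

words = ["spark", "hadoop", "spark", "hdfs", "spark"]

# MapReduce style: the map stage's full output is materialized
# (on a real cluster, written to disk) before the reduce stage reads it.
mapped = [(w, 1) for w in words]            # map stage, fully materialized
counts = {}
for key, value in mapped:                   # reduce stage
    counts[key] = counts.get(key, 0) + value

# Spark style (conceptually): the "map" is a lazy generator, and the whole
# pipeline runs in a single pass with intermediate data kept in memory.
pipeline = ((w, 1) for w in words)          # lazy transformation
spark_counts = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pipeline,
    {},
)

print(counts)        # {'spark': 3, 'hadoop': 1, 'hdfs': 1}
print(spark_counts)  # same result, computed in one pipelined pass
```

In a real Hadoop job the intermediate `mapped` output would hit HDFS between every pair of chained jobs, which is exactly the overhead the table attributes to "MR sequencing".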
The image below depicts the &#8216;Spark Ecosystem&#8217;. Let&#8217;s quickly run through each component, by way of introduction. Later in the series, we will have a dedicated post on each component, along with a hands-on exercise to get a better feel for the component and its capabilities.<\/p>\n<ol>\n<li><strong>Spark Streaming:<\/strong> A near-real-time stream processing system with the capability to auto-recover from all kinds of failures. It processes streams as a series of small batches (micro-batching).<\/li>\n<li><strong>Spark SQL:<\/strong> A SQL wrapper over Spark for querying structured data sets. Two important capabilities it offers are direct support for Hive queries and integration with languages such as Python, Scala, and Java. Beyond SQL queries, it can also be used efficiently for complex, interactive analytic algorithms. Please note: Apache Shark is no longer a supported project.<\/li>\n<li><strong>MLlib<\/strong>: The machine learning library for Spark, with high-quality, performant algorithms that go well beyond the two-step map-reduce process. It is interesting to note that Apache Mahout, currently one of the most stable and widely adopted ML libraries, has been working closely with Spark to extend the reach and performance of its algorithms.<\/li>\n<li><strong>GraphX<\/strong>: It unifies ETL, exploratory analysis, and iterative graph computation workflows by efficiently leveraging the underlying Spark API and RDDs.<\/li>\n<li><strong>RDD (Resilient Distributed Dataset)<\/strong>: The core abstraction around which Spark has been built. 
RDDs deserve more than a paragraph of explanation; we will talk about them in more detail in the coming blogs.<\/li>\n<\/ol>\n<p><strong>CONCLUSION<\/strong><\/p>\n<p>&#8220;<em>Considering the overall value that Spark brings, and keeping in mind the pace, nature &amp; direction of its adoption, it should certainly be considered as a substitute\/replacement for Hadoop<\/em>&#8221; &#8211; <strong>UNLESS<\/strong> one has made a conscious decision after thoroughly evaluating the trade-offs between &#8220;cost&#8221; and &#8220;expected value&#8221; (and, moreover, has no long-term plans to extend, enhance, or build on top of the existing stack).<\/p>\n<p>There are certainly forums and blogs discussing use cases where Hadoop might be a better choice than Spark, but I haven&#8217;t found much weight in those arguments &#8211; at least in the ones I&#8217;ve read so far.<\/p>\n<p>With this note, I conclude the first part of the Spark series. Stay tuned for the next part(s), coming shortly. Till then, &#8216;Happy Sparking&#8217;!<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Big Data has witnessed tremendous movement and growth over the last couple of years. According to top research agencies, Big Data has recently emerged as one of the most successful \u201claunch pads\u201d, giving rise to a record number of start-ups. 
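Since RDDs and their lazy evaluation will come up throughout this series, here is a deliberately tiny, pure-Python sketch of the idea. It is an assumption-laden illustration only (the `ToyRDD` class is invented for this post and is not Spark's actual implementation or API): transformations such as `map` and `filter` are merely recorded, and nothing executes until an action such as `collect()` is called.

```python
# A toy model of the RDD idea: lazy transformations, eager actions.
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # transformations recorded, not yet run

    def map(self, fn):
        # Returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # The action: replay the recorded pipeline over the data in one pass.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = (map if kind == "map" else filter)(fn, items)
        return list(items)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the recorded lineage does double duty: besides enabling pipelined, in-memory execution, it lets a lost partition be recomputed by replaying the transformations, which is where the "resilient" in RDD comes from.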
As the space evolves further, more and more organizations of varied sizes and diverse domains [&hellip;]<\/p>\n","protected":false},"author":153,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":4},"categories":[1395],"tags":[1515,1197,1398],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/16699"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/153"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=16699"}],"version-history":[{"count":0,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/16699\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=16699"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=16699"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=16699"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}