Spark 101 – Revamping Hadoop

20 / Jan / 2015 by Moonesh Kachroo

Big Data has witnessed tremendous movement and growth over the last couple of years. According to top research agencies, Big Data has recently emerged as the most successful "launch pad", giving way to a record number of start-ups. As the space evolves further, more and more organizations of varied sizes and across discrete domains have realized the real potential it can offer their business. The majority of these organizations have already factored Big Data into their strategic roadmap as a mainstream technology.

Hence, it is no longer a matter of 'if' but 'when', when it comes to Big Data adoption.

Along with the opportunities it brings, Big Data also puts continuous pressure on vendors and technology providers to regularly innovate and upgrade their technology stacks. This is critical to surviving the ever-growing competition in the face of Big Data's rapidly changing technology landscape.

Recently, the top vendors in the space announced upgrades of their respective core products/platforms. This was primarily the result of Spark graduating to an Apache Top-Level Project. Spark has been one of the most awaited technologies since its inception and is expected to be a game-changer, pushing the boundaries of Hadoop to the next level.

Considering the above, we decided to publish a series of blogs around Apache Spark that will serve as a quick-start guide for anyone who wants to take the plunge into Spark development and intends to grow as a Spark programmer. We will be regularly adding blogs under this series, so that it evolves logically as we move through it.


In 2009, the University of California at Berkeley's AMPLab formally kicked off Spark development, with the intent to address the primary shortcomings of Hadoop (which I will take up in the next section).

In June 2013, Spark made its way into the Apache Incubator. With the obvious potential it offered, Spark was able to instantly attract a large group of contributors from the community, and it also received contributions from the most prominent organizations in the Big Data space.

In less than a year, on 27 February 2014, Spark graduated from the Incubator to an Apache Top-Level Project (TLP). Since then, Spark has gathered a lot of attention in the community and has become one of the most active projects in the Apache Hadoop ecosystem. It has gained massive adoption across the industry, with leading Big Data players/organizations including Spark as a core technology component of their respective products/platforms/tools.


Needless to say, Hadoop was the first 'concrete' revolutionary technology that laid the foundation of Big Data as it exists now. It introduced the world to a new programming paradigm which has literally changed the mindset and approach for dealing with huge data sets.

Architecture-wise, it is composed of two main parts – HDFS (the storage, or distributed file system) and MapReduce (the compute, or data-processing engine). This turned out to be a remarkable combination for crunching large data sets in an efficient (parallel processing) and cost-effective (using commodity hardware) manner.
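To make the MapReduce paradigm concrete, here is a minimal, single-machine sketch of a word count – the canonical MapReduce example. This is plain Python, not Hadoop; the function names and the two-phase split are our own illustration of the map → shuffle/reduce flow, not any real Hadoop API.

```python
def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    # In Hadoop, mappers run in parallel across HDFS blocks.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + Reduce: group pairs by key and sum the values.
    # In Hadoop, the framework shuffles keys to reducers between the phases.
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts

lines = ["big data big ideas", "big clusters"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

The key idea is that both phases operate on independent key/value pairs, which is what lets Hadoop scale them out across a cluster.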

So where does Spark fit in? Over time, people realized that Hadoop (the "compute" part in particular) had some evident shortcomings that constrained its adoption to only certain types of use cases. Spark can be seen as a new-generation Hadoop that takes its capabilities to a different dimension. It not only brings speed to existing use cases but opens up a whole new world of use-case types that have been out of Hadoop's reach. (Spark also supports a pseudo-distributed local mode, usually used only for testing or development purposes.)

The table below shows a cross-comparison between Hadoop and Spark on some of the key parameters. Clearly, Spark outperforms Hadoop in every department.

| Parameter | Hadoop (MapReduce) | Spark |
| --- | --- | --- |
| Performance | Major issues arise as workflows become a little more complex (MR sequencing is inefficient) | Known for its performance; benchmarks show up to 100x performance compared to Hadoop |
| Machine Learning | Incapable of handling 'iterative' algorithms, which are predominantly used for ML | Efficient with all types of complex, iterative algorithms |
| Streaming | Not designed for event streams | Built for real-time event streaming |
| Implementation | MR being a low-level API makes implementation challenging and error-prone | Developer-friendly APIs; ease of implementation |
| Primary Memory Usage | Relies on disk I/O | In-memory processing engine |
| Analytics | Offline, batch-based analytics | Interactive analytics |
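The 'iterative algorithms' row deserves a concrete illustration. Below is a toy gradient-descent loop in plain Python (no Spark or Hadoop involved; the data, learning rate, and iteration count are our own made-up values). The point is structural: every iteration re-reads the same data set, so an engine that keeps the data cached in memory (as Spark does) avoids the repeated HDFS reads and job-launch overhead that a chain of MapReduce jobs would incur.

```python
# Toy data: (x, y) points with y roughly equal to 2 * x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

w = 0.0                      # single model parameter for y = w * x
for _ in range(100):         # each iteration = one full pass over the data
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= 0.05 * grad         # gradient-descent update step

print(round(w, 2))  # converges to roughly 2.04
```

A MapReduce implementation would materialize intermediate results to disk between each of the 100 passes; Spark can keep `data` pinned in cluster memory across all of them, which is where its headline speed-ups for ML come from.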


One of the primary areas of focus, right from Spark's inception, was to build a unified ecosystem around the core, wherein each 'component' caters to one specific area. The image below depicts the 'Spark Ecosystem'. Let's quickly run through each component one by one, by way of introduction. Later in the series, we will have a dedicated post on each component, along with a hands-on exercise to get a better hang of the component and its capabilities.

  1. Spark Streaming: It is Spark's real-time stream-processing component, with the capability of auto-recovery from all kinds of failures. It is a micro-batch based streaming solution.
  2. Spark SQL: It is a SQL wrapper over Spark for querying structured data sets. The two important capabilities it comes with are, first, direct support for Hive queries and, second, integration with various languages like Python, Scala and Java. Apart from SQL queries, it can efficiently be used for complex and interactive analytic algorithms. Please note: Apache Shark is no longer a supported project.
  3. MLlib: It is the machine learning library for Spark, with high-quality and performant algorithms. It goes well beyond the two-step Map–Reduce process. It is interesting to note that Apache Mahout, currently the most stable and widely adopted ML library across the globe, has been working closely with Spark to extend the reach and performance of its algorithms.
  4. GraphX: It unifies the ETL flow, exploratory analysis and iterative graph computation workflows by efficiently leveraging the underlying Spark API and RDDs.
  5. RDD (Resilient Distributed Dataset): It is the core abstraction around which Spark has been built. It requires more than a paragraph to explain how RDDs function; we will talk about them more in the coming blogs.
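One defining trait of RDDs worth previewing here is that their transformations are lazy: nothing executes until an 'action' (such as collecting results) forces the pipeline to run. Below is a rough single-machine analogy using Python generators – the function names mimic Spark's `map`/`filter`/`collect` vocabulary but are our own, not Spark's actual API.

```python
def rdd_map(data, fn):
    # Lazy transformation: builds a generator; nothing runs yet.
    return (fn(x) for x in data)

def rdd_filter(data, pred):
    # Also lazy: transformations can be chained without computing anything.
    return (x for x in data if pred(x))

def collect(data):
    # "Action": forces evaluation of the whole chained pipeline at once.
    return list(data)

nums = range(1, 6)
pipeline = rdd_filter(rdd_map(nums, lambda x: x * x), lambda x: x % 2 == 1)
print(collect(pipeline))  # [1, 9, 25]
```

In real Spark, this laziness is what lets the engine see the whole lineage of transformations before executing, so it can optimize the plan and recompute lost partitions on failure.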


Considering the overall value that Spark brings, and keeping in mind the pace, nature and direction of its adoption, it should certainly be considered as a substitute/replacement for Hadoop – UNLESS one has made a conscious decision after thoroughly evaluating the tradeoffs between "cost" and "expected value" (and, moreover, has no long-term plans to extend, enhance or build on top of the existing stack).

There are probably many forums and blogs discussing use cases where Hadoop might be a better choice than Spark, but I haven't seen real gravity in the arguments – at least in the ones I've read so far.

On this note, I end the first part of the Spark series – stay tuned for the next part(s), coming shortly. Till then, 'Happy Sparking'!


