{"id":16665,"date":"2015-01-21T12:16:36","date_gmt":"2015-01-21T06:46:36","guid":{"rendered":"http:\/\/www.tothenew.com\/blog\/?p=16665"},"modified":"2016-12-19T15:17:36","modified_gmt":"2016-12-19T09:47:36","slug":"spark-1o2-hello-world","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/spark-1o2-hello-world\/","title":{"rendered":"Spark 1o2 &#8211;  &#8220;Hello World\u201d"},"content":{"rendered":"<p><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">This is the second blog of the Spark series. This blog post include setup of Spark environment followed by a small word count program. The idea behind the blog is to get hands on in Spark setup and running simple program on Spark. If you want to know more about Spark history and it&#8217;s comparison with Hadoop, please refer <a href=\"http:\/\/www.tothenew.com\/blog\/spark-1o1-revamping-hadoop\/\">Spark 1o1.<\/a>\u00a0<\/span><\/span><\/span><\/p>\n<p><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">Please note &#8211; This is a single node, local setup. I will cover Spark cluster setup on EC2 box in upcoming blogs.\u00a0<\/span><\/span><\/span><\/p>\n<p><strong><span style=\"color: #000000\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\"><b>Setup <\/b><\/span><\/span><\/span><\/strong><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">&#8211;\u00a0Before starting Spark setup make sure Scala 2.1 or higher version and Java 1.7 or higher version must be installed on you system. 
For this blog, I have used Java 8.<\/span><\/span><\/span><\/p>\n<p><strong><span style=\"color: #000000\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\"><b>Steps &#8211;<\/b><\/span><\/span><\/span><\/strong><\/p>\n<ol>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">Spark supports loading data from HDFS, so we will install Hadoop before installing Spark. Download and untar the Hadoop distribution using the terminal.<\/span><\/span><\/span>\n<ul>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ wget  2.5.0.tar.gz<\/span><\/span><\/span><\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ tar -xvf hadoop-2.5.0.tar.gz<\/span><\/span><\/span><\/li>\n<\/ul>\n<\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">Download and untar a pre-built Spark distribution, or download the source and build it with the sbt tool.<\/span><\/span><\/span>\n<ul>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ wget http:\/\/d3kbcqa49mib13.cloudfront.net\/spark-1.0.2-bin-hadoop2.tgz<\/span><\/span><\/span><\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ tar -xvf spark-1.0.2-bin-hadoop2.tgz<\/span><\/span><\/span><\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ mv spark-1.0.2-bin-hadoop2 spark-1.0.2<\/span><\/span><\/span><\/li>\n<\/ul>\n<\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream 
Charter', serif\"><span style=\"font-size: small\">Open the .bashrc file and add environment variables for Hadoop and Spark.<\/span><\/span><\/span>\n<ul>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">export HADOOP_HOME=\/home\/ec2-user\/hadoop-2.5.0<br \/>\nexport SPARK_HOME=\/home\/ec2-user\/spark-1.0.2<br \/>\nexport PATH=$PATH:$HADOOP_HOME\/bin:$SPARK_HOME\/bin<\/span><\/span><\/span><\/li>\n<\/ul>\n<\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">To validate the setup, reload the .bashrc file and echo the PATH.<\/span><\/span><\/span>\n<ul>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ source ~\/.bashrc<\/span><\/span><\/span><\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ echo $PATH<\/span><\/span><\/span><\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p><strong><span style=\"color: #000000\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\"><b>Spark shell &#8211; <\/b><\/span><\/span><\/span><\/strong><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">We are now ready to run the Spark shell, a command-line interpreter for Spark. We\u00a0can execute arbitrary Spark syntax and interactively mine the data. 
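Once started with the commands below, the shell provides a ready-made SparkContext bound to the variable sc. For example, a quick line count of the sample file used later in this post (the snippet is a sketch; adjust the path to a file that exists on your machine):<\/span><\/span><\/span><\/p>\n<p>[java]<br \/>\n\/\/ Scala, typed at the spark-shell prompt; sc is pre-created by the shell<br \/>\nval lines = sc.textFile(&quot;\/home\/test-data\/Hello World.txt&quot;)<br \/>\nprintln(lines.count())<br \/>\n[\/java]<\/p>\n<p><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">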
\u00a0\u00a0<\/span><\/span><\/span><\/p>\n<ul>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ cd spark-1.0.2<\/span><\/span><\/span><\/li>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ bin\/spark-shell<\/span><\/span><\/span><\/li>\n<\/ul>\n<p><a href=\"\/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-15-224856.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone  wp-image-16679\" src=\"\/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-15-224856-300x168.png\" alt=\"Screenshot from 2015-01-15 22:48:56\" width=\"444\" height=\"249\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-15-224856-300x168.png 300w, \/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-15-224856-1024x575.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-15-224856.png 1366w\" sizes=\"(max-width: 444px) 100vw, 444px\" \/><\/a><\/p>\n<p><strong><span style=\"color: #000000\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\"><b>Program &#8211; \u00a0<\/b><\/span><\/span><\/span><\/strong><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">In order to keep the spirit of Hello World alive, I have recast the word count program as Hello World. 
In short, the program counts the number of occurrences of each word in a text file.<\/span><\/span><\/span><\/p>\n<p>[java]<br \/>\npackage com.intelligrape.spark;<\/p>\n<p>import java.util.Arrays;<br \/>\nimport org.apache.spark.SparkConf;<br \/>\nimport org.apache.spark.api.java.JavaPairRDD;<br \/>\nimport org.apache.spark.api.java.JavaRDD;<br \/>\nimport org.apache.spark.api.java.JavaSparkContext;<br \/>\nimport scala.Tuple2;<\/p>\n<p>\/**<br \/>\n * @author surendra.singh<br \/>\n **\/<br \/>\npublic class WordCount {<br \/>\n   public static void main(String[] args) {<\/p>\n<p>     \/\/Create a Java Spark context object by passing a SparkConf object<br \/>\n     JavaSparkContext sc = new JavaSparkContext(new SparkConf());<\/p>\n<p>     \/\/Load the sample data file containing the words using the Spark context object.<br \/>\n     \/\/Spark will read the file line by line and convert it into an RDD of String;<br \/>\n     \/\/each object in the RDD represents a single line in the data file.<br \/>\n     JavaRDD&lt;String&gt; lines = sc.textFile(&quot;\/home\/test-data\/Hello World.txt&quot;);<\/p>\n<p>     \/\/Convert the RDD of lines to an RDD of words by splitting each line on spaces.<br \/>\n     JavaRDD&lt;String&gt; words = lines.flatMap(l -&gt; Arrays.asList(l.split(&quot; &quot;)));<\/p>\n<p>     \/\/Create a tuple for each word with an initial count of &#8216;1&#8217;. 
Spark creates tuples using the Tuple2 class<br \/>\n     JavaPairRDD&lt;String, Integer&gt; tuple = words.mapToPair(w -&gt; new Tuple2&lt;&gt;(w, 1));<\/p>\n<p>     \/\/Reduce by key, adding the individual counts, then sort the result by key<br \/>\n     JavaPairRDD&lt;String, Integer&gt; count = tuple.reduceByKey((a, b) -&gt; a + b).sortByKey();<\/p>\n<p>     \/\/Print the result<br \/>\n     for (Tuple2&lt;String, Integer&gt; tuple2 : count.collect()) {<br \/>\n         System.out.println(tuple2._1() + &quot; - &quot; + tuple2._2());<br \/>\n     }<br \/>\n   }<br \/>\n}<br \/>\n[\/java]<\/p>\n<p><strong><span style=\"color: #000000\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\"><b>Submitting Job in Spark <\/b><\/span><\/span><\/span><\/strong><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">&#8211; Create a jar file for your project and execute the command below from the Spark root folder.\u00a0For single-node job submission, the master will always be &#8216;local&#8217;.\u00a0<\/span><\/span><\/span><\/p>\n<ul>\n<li><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">$ bin\/spark-submit --class com.intelligrape.spark.WordCount --master local HelloWorld.jar<\/span><\/span><\/span><\/li>\n<\/ul>\n<p><a href=\"\/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-20-194921.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-16783\" src=\"\/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-20-194921-300x168.png\" alt=\"Screenshot from 2015-01-20 19:49:21\" width=\"300\" height=\"168\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-20-194921-300x168.png 300w, \/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-20-194921-1024x575.png 1024w, 
\/blog\/wp-ttn-blog\/uploads\/2015\/01\/Screenshot-from-2015-01-20-194921.png 1366w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p><span style=\"color: #444444\"><span style=\"font-family: Georgia, 'Bitstream Charter', serif\"><span style=\"font-size: small\">This concludes the Hello World count program using Spark. In the next post we will go through a detailed architectural flow\u00a0of how Spark makes use of RDDs and DAGs.<\/span><\/span><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is the second blog of the Spark series. This post covers the setup of a Spark environment, followed by a small word count program. The idea is to get hands-on with the Spark setup and run a simple program on it. If you want to know more about Spark&#8217;s history and its comparison [&hellip;]<\/p>\n","protected":false},"author":152,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":11},"categories":[1395],"tags":[1515,1396,1607],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/16665"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/152"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=16665"}],"version-history":[{"count":0,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/16665\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=16665"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=16665"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\
/v2\/tags?post=16665"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}