#fame is India's first (and now the biggest) live-streaming app on IOS and Android platforms. This app allows people to create their own beam and go live immediately, or book a slot for future. As time passed, the operational databases of #fame kept on increasing at a great speed. As a result, the disk space utilization of database server...
Overview This blog is related to the Yarn Cluster Optimizations for executing the spark jobs on yarn cluster. In this blog post I will be discussing about the YARN Optimizations for the efficient utilization of available resources to execute the spark jobs on yarn cluster. These optimization configurations could be done either in the...
In this post, I will present a technical “deep-dive” into Spark internals, including RDD and Shared Variables. If you want to know more about Spark and Spark setup in a single node, please refer previous post of Spark series, including Spark 1O1 and Spark 1O2. Resilient Distributed Datasets (RDD) - An RDD in is primary abstraction...
This is the second blog of the Spark series. This blog post include setup of Spark environment followed by a small word count program. The idea behind the blog is to get hands on in Spark setup and running simple program on Spark. If you want to know more about Spark history and it's comparison with Hadoop, please refer Spark 1o1. ...
Big Data has witnessed a tremendous movement and growth over the last couple of years. As per the top research agencies, Big Data has recently emerged as the most successful “launch pad”, giving a way to the maximum number of start-up ever. As the space evolves further, more and more organizations of varied sizes and...
Using GroupBy and JOIN is often very challenging. Recently in one of the POCs of MEAN project, I used groupBy and join in apache spark. I had two datasets in hdfs, one for the sales and other for the product. Sales Datasets column : Sales Id, Version, Brand Name, Product Id, No of Item Purchased, Purchased Date Product...