Hive Optimizations

21 / Nov / 2014 by Mohit Garg 0 comments


The big data space has been evolving continuously and each day more technologies are added in ecosystem. Hadoop Hive is one of the technologies that has been around along. It’s give a SQL wrapper to execute Hadoop as a query language. Inherently, It’s having some of the optimizations techniques. Through this blog, I thought to share my learning in one of our project which are specific to this technologies.

I am not covering all the optimization techniques, but may be in future, I’ll write another blog which will cover rest of optimizations techniques.

Problem Statement:

From the use case, I wanted to calculate the total sale of each brand – every year.So, for this I need to apply to join between the product table and sale, need to grouping the data by Brand Name  and year.

We have 2 tables :

Product : containing columns id, brand_name, price, category and many more

Sales: containing columns like product_id, brand_name, item_purchased and purchased_date

Query to calculate the sale of each brand for each year :

select year(a.purchase_date),b.brand_name,sum(a.item_purchased*b.price) from sale a join product b on a.product_id = group by year(a.purchase_date),b.brand_name

Problem  : We are performing the join operation in large datasets, That’s why It’s taking very large amount of time to give the results.

Solution (Partitioning):  Instead of performing the operation on large datasets, We can divide our datasets into small chunks. A table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS.

CREATE EXTERNAL TABLE IF NOT EXISTS salePartition(id bigint,version tinyint,brand_name STRING,item_purchased INT,product_id BIGINT,purchase_date TIMESTAMP) PARTITIONED BY(brand string) row format delimited  fields terminated by ',' lines terminated by '\n' stored as textfile;

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

insert OVERWRITE TABLE salePartition PARTITION(brand) select id,version, brand_name,item_purchased,product_id,purchase_date,brand_name from sale;

Optimize Query on partition table :

select year(a.purchase_date),a.brand,sum(a.item_purchased*b.price) from salePartition a join product b on a.product_id = group by year(a.purchase_date),a.brand;

When we are applying join on 2 tables, It’s actually running 2 map join jobs in backend which are taking large amount of time to execute the result. We need to optimize our join operation

Solution (Optimize Join Conversion) : Hive provides some sort of parameters that we need to set before executing the queries, which will automatically optimize our results.


set = true;

set = 10000000;

Problem: Suppose If we have sorted and bucketed data, we want to optimize our execution time.

Solution (Auto Conversion to SMB(Sort-Merge-Bucket) Map join ): SMB joins are used wherever the tables are sorted and bucketed. The join boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary map-join.


set hive.optimize.bucketmapjoin = true;

set hive.optimize.bucketmapjoin.sortedmerge = true;

Database Records
Product – 436, Sale – 234808

System & Cluster Configuration4 Gb RAM, Hadoop Single Node Cluster, Hive

Solution Execution Time (sec)
Default Query 110
Partitioning 35
Optimize Join Conversion 21
Auto Conversion to SMB(Sort-Merge-Bucket) Map join 21


Please Note: This execution  used small set of test  data and as the data grows the difference is exponential.

Hope this blog will give you the better idea for hive optimization.

For complete source code:

github location:




Leave a comment -