The big data space has been evolving continuously and each day more technologies are added in ecosystem. Hadoop Hive is one of the technologies that has been around along. It’s give a SQL wrapper to execute Hadoop as a query language. Inherently, It’s having some of the optimizations techniques. Through this blog, I thought to share my learning in one of our project which are specific to this technologies.
I am not covering all the optimization techniques, but may be in future, I’ll write another blog which will cover rest of optimizations techniques.
From the use case, I wanted to calculate the total sale of each brand – every year.So, for this I need to apply to join between the product table and sale, need to grouping the data by Brand Name and year.
We have 2 tables :
Product : containing columns id, brand_name, price, category and many more
Sales: containing columns like product_id, brand_name, item_purchased and purchased_date
Query to calculate the sale of each brand for each year :
select year(a.purchase_date),b.brand_name,sum(a.item_purchased*b.price) from sale a join product b on a.product_id = b.id group by year(a.purchase_date),b.brand_name
Problem : We are performing the join operation in large datasets, That’s why It’s taking very large amount of time to give the results.
Solution (Partitioning): Instead of performing the operation on large datasets, We can divide our datasets into small chunks. A table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS salePartition(id bigint,version tinyint,brand_name STRING,item_purchased INT,product_id BIGINT,purchase_date TIMESTAMP) PARTITIONED BY(brand string) row format delimited fields terminated by ‘,’ lines terminated by ‘\n’ stored as textfile;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
insert OVERWRITE TABLE salePartition PARTITION(brand) select id,version, brand_name,item_purchased,product_id,purchase_date,brand_name from sale;
Optimize Query on partition table :
select year(a.purchase_date),a.brand,sum(a.item_purchased*b.price) from salePartition a join product b on a.product_id = b.id group by year(a.purchase_date),a.brand;
Problem: When we are applying join on 2 tables, It’s actually running 2 map join jobs in backend which are taking large amount of time to execute the result. We need to optimize our join operation
Solution (Optimize Join Conversion) : Hive provides some sort of parameters that we need to set before executing the queries, which will automatically optimize our results.
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 10000000;
Problem: Suppose If we have sorted and bucketed data, we want to optimize our execution time.
Solution (Auto Conversion to SMB(Sort-Merge-Bucket) Map join ): SMB joins are used wherever the tables are sorted and bucketed. The join boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary map-join.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
Database Records : Product – 436, Sale – 234808
System & Cluster Configuration: 4 Gb RAM, Hadoop Single Node Cluster, Hive
|Solution||Execution Time (sec)|
|Optimize Join Conversion||21|
|Auto Conversion to SMB(Sort-Merge-Bucket) Map join||21|
Please Note: This execution used small set of test data and as the data grows the difference is exponential.
Hope this blog will give you the better idea for hive optimization.
For complete source code: