In this blog post I will discuss YARN optimizations for efficient utilization of the available resources when executing Spark jobs on a YARN cluster. These optimizations can be applied either in the YARN configuration XML files, for a plain Hadoop installation, or through the web UI provided by Cloudera Manager (CDH). In this post the optimizations are done using the Cloudera Manager web UI.
To execute Spark jobs on YARN with the best utilization of the available resources, we must optimize the YARN configurations; otherwise the cluster will not give the desired throughput. We know we have resources, and we should make the best use of them. There are various configurations we need to tune. First I will describe the problems I faced with the default, unoptimized configs.
We use Spark jobs for data aggregation in our project and execute them with spark-submit on a YARN cluster. As the number of jobs grew to 8-10, they tried to execute simultaneously, all the available resources were consumed, and no job got the resources it needed, so none of them was able to finish.
In this case resource allocation was not fair, and this is when I realized it was time to play with the configs. There are several configurations and various factors that affect job execution, and we need to change them to tune our YARN cluster. These factors are:
- Container Allocation
- Memory Allocation
- Virtual Core Allocation
- Queuing Policy
- Scheduling Policy
- Number of executors for spark jobs
I will discuss the configuration optimization on the basis of the available resources, which in our case were:
- 2 worker nodes, each with 60GB RAM and 16 vCores
The configurations we are about to change can be found in the Cloudera web UI under Yarn Service >> Configuration Tab.
- Container Memory Allocation : We should set the container minimum and maximum memory depending on the jobs; in our case we changed three properties:
1. yarn.scheduler.minimum-allocation-mb = 1GB
2. yarn.scheduler.increment-allocation-mb = 256MB
3. yarn.scheduler.maximum-allocation-mb = 6GB
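To make the interplay between these three properties concrete, here is a small illustrative sketch (not YARN source code; the exact normalization order in the scheduler may differ slightly) of how a container memory request is shaped by them: rounded up to the nearest increment, then clamped between the minimum and maximum:

```python
import math

def container_allocation_mb(request_mb,
                            minimum_mb=1024,   # yarn.scheduler.minimum-allocation-mb
                            increment_mb=256,  # yarn.scheduler.increment-allocation-mb
                            maximum_mb=6144):  # yarn.scheduler.maximum-allocation-mb
    """Sketch of how the Fair Scheduler normalizes a memory request:
    round up to the nearest increment, then clamp to [minimum, maximum]."""
    rounded = math.ceil(request_mb / increment_mb) * increment_mb
    return min(max(rounded, minimum_mb), maximum_mb)
```

With our values, a request for 1500 MB is granted 1536 MB (the next 256 MB step), a tiny request is raised to the 1 GB minimum, and anything above 6 GB is capped at 6144 MB.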
- Container Virtual CPU Cores Allocation : We should configure the minimum and maximum vCores that a container can request. We changed three properties:
1. yarn.scheduler.minimum-allocation-vcores = 1
2. yarn.scheduler.increment-allocation-vcores = 1
3. yarn.scheduler.maximum-allocation-vcores = 16
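For a plain Hadoop installation without Cloudera Manager, the six scheduler properties above would be set directly in yarn-site.xml; a sketch with our values:

```xml
<!-- yarn-site.xml: container memory and vCore allocation limits
     (in CDH these are set from the YARN Configuration tab instead) -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>256</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>16</value>
</property>
```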
- Enable Ubertask Optimization : Enable ubertask optimization, which runs “sufficiently small” jobs sequentially within a single JVM. This property can be changed in the Performance category; the property name is mapreduce.job.ubertask.enable
- Enable Optimized Map-side Output Collector : Enable this property. This can improve performance of many jobs that are shuffle-intensive.
- Maximum Number of Attempts for MapReduce Jobs : The maximum number of application attempts for MapReduce jobs. The value of this parameter overrides the ApplicationMaster Maximum Attempts setting for MapReduce jobs. The property we changed is mapreduce.am.max-attempts = 5
- ApplicationMaster Maximum Attempts : The maximum number of application attempts. The property changed is yarn.resourcemanager.am.max-attempts = 5
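Outside Cloudera Manager, note that these two retry settings live in different files: the MapReduce one in mapred-site.xml and the ResourceManager one in yarn-site.xml. A sketch:

```xml
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.am.max-attempts</name>
  <value>5</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>5</value>
</property>
```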
- Scheduler Class : The class to use as the resource scheduler. We used the Fair Scheduler; the property changed is yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
- Fair Scheduler Preemption : Enable this property, which lets the scheduler preempt containers from queues running over their fair share so starved jobs get resources. The property changed for this is yarn.scheduler.fair.preemption
- Fair Scheduler Size-Based Weight : When enabled, the Fair Scheduler will assign shares to individual apps based on their size, rather than providing an equal share to all apps regardless of size. The property changed for this is yarn.scheduler.fair.sizebasedweight
- Fair Scheduler Assign Multiple Tasks : Enable this property, which allows multiple Fair Scheduler container assignments in one heartbeat; this improves cluster throughput when there are many small tasks to run. The property changed for this is yarn.scheduler.fair.assignmultiple
- Fair Scheduler Allocations : Change defaultQueueSchedulingPolicy to drf, i.e. Dominant Resource Fairness. When there are two dimensions (memory and CPU), how do you make scheduling decisions? The Dominant Resource Fairness (DRF) policy answers this: at every point, make decisions based on whichever resource is dominant/bottlenecked and create allocations based on that.
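In a plain installation this policy goes in the Fair Scheduler allocation file (fair-scheduler.xml); a minimal sketch:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Use Dominant Resource Fairness for all queues by default -->
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>
```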
- ApplicationMaster Virtual CPU Cores : The virtual CPU cores requirement for the ApplicationMaster. The property changed for this is yarn.app.mapreduce.am.resource.cpu-vcores = 8
- Container Virtual CPU Cores : Number of virtual CPU cores that can be allocated for containers. The property changed for this is yarn.nodemanager.resource.cpu-vcores = 16
- Number of Executors : If we know how many executors our Spark job needs, we can set a flag in the spark-submit command to fix the number of executors. The flag to be set for this is --num-executors 2
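Putting it together, a spark-submit invocation with an explicit executor count might look like the following; the jar name and the executor memory/core values are placeholders for illustration, not our actual job settings:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-memory 4g \
  --executor-cores 2 \
  aggregation-job.jar
```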
So these were some optimized configs that help in efficient utilization of resources. With these configs our cluster is able to execute 20-25 Spark jobs simultaneously. In fact, even if a Spark job is submitted every 2 seconds on the YARN cluster, it can be executed easily; the finish time may increase, but the jobs will not end up in a deadlock situation.
So if you have a cluster and you have resources, make full utilization of them by keeping the configs accurate and optimized according to your needs and priorities.