We have been using Druid in our project for a while and had shared our experiences during GIDS. It has given us great results as it powers our real-time dashboards, reports on running Ad campaigns and provides real-time data to make quick decisions. But like any other databases there will be times when it needs to be optimized. This is similar to how we may optimize various things in MySQL, Oracle or any other database.
Druid stores all of its data sharded by time, data sources and optionally by other dimensions. In Druid terminology this sharded data is called segments. These segments are served by Druid’s historical node. For serving all of these segments the historical node needs to open file handlers to all the segments. The problem occurs as the amount of data increases then the number of memory mapped files will also keep on increasing and there is a limit on the memory mapped files on Linux.
We had reached a point where we had more than 140k+ segments which was pushing towards this limit although we had distributed the data across historical nodes in the cluster. So we had to somehow reduce the number of segments on each node or the overall of segments in our Druid. Less number of segments has other benefits also as there would be less references in zookeeper and the coordinator node of Druid will have to balance less number of segments.
Initial Solutions Considered
- Do Not Load Old Historical Data – In case we could not come up with something good enough this was the last resort. The consideration would have been that from a business perspective do we really need to keep the data that was 2 years old or maybe even a year old? For taking this decision, a discussion with the product team would be necessary. We could keep the backup of the data in our deep storage (S3) but configure Druid to not actually load the old data. This could be done by configuring a drop rule in Druid’s coordinator console.
- Merge Smaller Segments – As Druid shards the data by timestamp and datasource we end up with many segments of really small size. This could be easily checked via Druid’s coordinator console. We had segments of 1 KB, 10 KB etc. Was this really optimal? Not really !!.
Druid’s recommendation is that all segments should be in 300 to 700 mb. So we thought that we should merge the smaller segments into larger segments.
- More Smaller Machines For Historical Nodes – This one can be used to distribute the load across machines and take care of limit on memory mapped files. But we would hit that eventually.
Exploration that Led to the Final Solution
Needless to say merging seems the best option as that takes our storage in the recommended range and for the other 2 options they may not be feasible from a business or economic perspective.
We went through Druid’s merge task but found that doing it on scale would be problematic as the details required were a bit too much for a simple thing – merge the existing segments to the recommended size. We would have to go through Sruid’s metastore to get segment metadata, add aggregations which were different for all datasources, submit merge and then ensure that the old ones were dropped via drop tasks.
After some further exploration and discussions on Druid groups we found that this is something which is already present as a configuration option in coordinator configuration but turned off by default.
This configuration if turned on would do exactly what we want. Although for this to work we need to spin up and keep Overlord and Middle manager nodes running which we had not been doing so far.
Considerations While Implementing
- While testing this configuration we found that this was a CPU intensive task and each of the merge task peons would also require enough RAM (2g is default). This meant that we needed to strike a balance between how long we could wait for our backlog of merging to be cleared vs. how big machines we were willing to use.
- If we started too many middle manager peons then overlord and middle manager’s memory usage also went up.
- Final Size of segments – Although Druid’s recommended segment size if 300 – 700 mb and by default coordinator merges it to a size around 500 mb which is in this range it may not be optimal for all cases. This was found after in-house bench-marking based on the queries that were common for our use case. The segment size to which coordinator will try and merge the smaller segments into depends on a setting that can be changed via coordinator console mergeBytesLimit. More details can be found here
So while turning this auto-segment merge feature of Druid’s coordinator console is good for stability, be aware that you will have to tune the memory usage of various nodes, decide on number of peons for merging and find the segment size that works for your use case by benchmarking for the queries that are used in your system.