One of the most challenging tasks in any microservices ecosystem is the centralized log management, and there are many open source and paid solutions available in the market. In our ecosystem, we are using ELK stack as it provides scalability and the multitenant-capable full-text search engine that easily integrates with Logstash and Kibana for centralized logging and visualizations.
In the initial days, we did the setup of elasticsearch cluster with 2 nodes and both of them were configured as master eligible data nodes to mitigate node failure. Both the nodes were using the default configuration for shards and replica i.e. shard: 5 and replica: 1. This step was perfect for handling small data set i.e. 30-40 GB logs per day, but in our case the log size increased considerably from 30-40GB per day to 200 GB per day in a matter of 2-3 months. As a result, we started facing operational issues such as low disk space, high load average, and performance issues with the elasticsearch cluster. With this experience we realized that we were unable to leverage elasticsearch scalability to the optimum. Through this blog, I will be sharing how we can leverage spot instances in the elasticsearch cluster with default settings. It is highly recommended to tweak elasticsearch settings based on your use-case and that can be done only after thorough testing.
Prerequisite: To implement highly scalable elasticsearch cluster for logging, you should have the basic understanding of how ELK stack works.
In order to scale elasticsearch efficiently, it is recommended to have a separate master, data, and client nodes. The main purpose of all the nodes is mentioned below:
- Master node: It is responsible for cluster management i.e. creating/deleting index, information about nodes in the cluster, shard allocation, and routing table.
- Data node: It is responsible for holding the actual data in the cluster and it handles operations like CRUD, search, and aggregations.
- Client node: It is responsible for request routing, handles the search reduce phase and distribute bulk indexing across data nodes.
For more details click here
After going through the blog “Designing for Scale” on elastic.co. In our case – We had to refactor Elasticsearch cluster setup as shown below:
In this setup, we should use private DNS endpoints, for master and client nodes. Using this setup helps us in scaling in/out data nodes without any changes in the logstash configuration file, as we are using client private DNS endpoints in the config. Using all the above instance types in on-demand pricing model (master nodes: t2.medium, client nodes: m3.medium and data nodes: m3.large, m4.large, c3.xlarge and c4.xlarge) will incur good monthly AWS bill, in our case, therefore, we started playing with spot instances. The only concern before starting this activity was to ensure we have one copy of data available on the on-demand instance regularly. To achieve this, we had used cluster.routing.allocation.awareness.attributes, that helped in shard routing and allocation on different rack ids.
If Elasticsearch is aware of the physical configuration of your hardware, it can ensure that the primary shard and its replica shards are spread across different physical servers, racks, or zones, to minimize the risk of losing all shard copies at the same time.
You can find the master, data, and client node configurations below, we had used ep utility (envplate utility) to replace environment variables in the configuration file at the instance startup.
- Install Elasticsearch: You can either create a base image that already has elasticsearch installed or you can use elasticsearch-userdata.sh, script that can be passed into ubuntu userdata. Script will do the following:
- Install Java
- Download and Install elasticsearch 2.3.5 from official repo
- Install Kopf plugin
- Install ep (envplate utility as discussed above)
- Create elasticsearch.yml file with hostname and IP address fields which will be the variables
- Update environment variables in the elasticsearch.yml file
- Start elasticsearch service
2. Setup Master, Data, and Client nodes:
- Master Nodes:
- Create three master nodes with the help of elasticsearch-master-userdata.sh (link)
- Update all the three private DNS with the private IPs:
- Restart elasticsearch on all the three master nodes
- Check using: curl master1.ttn.com:9200/_cat/health, the cluster should be in a green state.
- Data Node – On-demand:
- Create launch configuration with userdata as mentioned in elasticsearch-datanode-ondemand-userdata.sh (link)
- Create Auto Scaling Group with min: 1, desire: 1 and maximum: 1
- Data Node – Spot: I would highly recommend you create spot fleet for launching spot instances. As it takes care of most of the heavy lifting.
- Create request and maintain spot fleet request
- Choose multiple instance types: for eg: c4.xlarge, c3.xlarge, m4.large, m3.large the more you select more availability you get.
- Choose ubuntu 14.04 AMI
- Allocation Strategy: diversified (launch spot instances in multiple launch specifications)
- Bidding Strategy: automated (max bid price = ondemand price)
- In the userdata section use elasticsearch-datanode-spot-userdata.sh (link)
- Launch spot fleet
- Configure auto scaling based on high CPU/memory usage
Note: Running production workloads on spot instances is not recommended, as they can terminate any time AWS market price rises above your bid price.
- Client Nodes:
- Create launch configuration with userdata as mentioned in elasticsearch-clientnode-userdata.sh (link)
- Create Auto Scaling Group with min: 2, desire: 2 and maximum: 2
- Check All is Well: Once you have added all the nodes in the cluster, you can check the status of the cluster using Kopf plugin. (URL: Master_Public_IP:9200/_plugins/kopf)
- Updating logstash configuration file: Now that we have entire cluster ready and scalable, we can update our logstash configuration file with client nodes IPs/private DNS.
With this kind of highly scalable elasticsearch cluster, we are able to handle approximately 200-220 GB logs per day, and this can scale massively by upgrading instance type or an increasing number of nodes.