{"id":45778,"date":"2017-02-20T09:15:57","date_gmt":"2017-02-20T03:45:57","guid":{"rendered":"http:\/\/www.tothenew.com\/blog\/?p=45778"},"modified":"2017-02-24T13:23:34","modified_gmt":"2017-02-24T07:53:34","slug":"aws-cost-optimization-series-blog-2-infrastructure-monitoring","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/aws-cost-optimization-series-blog-2-infrastructure-monitoring\/","title":{"rendered":"AWS Cost Optimization Series | Blog 2 | Infrastructure Monitoring"},"content":{"rendered":"<p>This blog series continues a use case on\u00a0how the team at <a title=\"Product Engineering Services\" href=\"http:\/\/www.tothenew.com\/\"><strong>TO THE NEW<\/strong><\/a> reduced monthly <a href=\"http:\/\/www.tothenew.com\/blog\/aws-cost-optimization-series-blog-1-100k-to-40k-in-90-days\/\"><strong>AWS spend from $100K to $40K in 90 days for a client<\/strong><\/a>. In this blog, I explain how we leveraged infrastructure monitoring to cut costs by removing idle resources.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-45779 aligncenter\" src=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/monitoring.png\" alt=\"monitoring\" width=\"160\" height=\"160\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/monitoring.png 600w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/monitoring-150x150.png 150w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/monitoring-300x300.png 300w\" sizes=\"(max-width: 160px) 100vw, 160px\" \/><\/p>\n<p style=\"text-align: justify\"><span style=\"font-weight: 400\">Before getting deeper into any Cost Optimisation strategies, let us discuss how we chose and implemented <a title=\"Infrastructure Monitoring Tools\" href=\"http:\/\/www.tothenew.com\/blog\/infographic-7-best-devops-tools-for-agile-it\/\">Infrastructure Monitoring Tools<\/a> in our ecosystem. 
In the initial days of our engagement with the customer, we ensured the\u00a0customer had a proper monitoring\u00a0system in place, as it provides ample insight into the infrastructure. Moreover, in a microservice architecture, it helps identify problems at an early stage.<\/span><\/p>\n<p style=\"text-align: justify\"><strong> <span style=\"font-weight: 400\">Most customers use only <a title=\"AWS Cloudwatch\" href=\"http:\/\/www.tothenew.com\/blog\/grouping-together-my-metrics-using-aws-cloudwatch-dashboards\/\">AWS Cloudwatch<\/a> monitoring in their ecosystem, as it provides metrics for all the <a title=\"AWS Services\" href=\"http:\/\/www.tothenew.com\/devops-aws\">AWS services<\/a> and integrates well with other AWS services. We have custom scripts that extract data from AWS Cloudwatch and help identify underutilized and overutilized EC2, EBS, RDS, ELB etc. The only disadvantage we see with AWS Cloudwatch monitoring is that EC2 disk and memory monitoring requires setting up custom metrics. That works well if you have only a few servers, but if you have 400+ EC2 instances, setting up disk and memory scripts and custom metrics can be a costly affair. At the time of this activity, Cloudwatch retained only the last 14 days of data, which is too short a window to understand a complex microservices ecosystem, so we opted for third-party monitoring solutions like New Relic, Nagios, and Pagerduty.<\/span><\/strong><\/p>\n<ul>\n<li style=\"text-align: justify\"><strong><span style=\"font-weight: 400\"><b>Server Monitoring:<\/b> As discussed earlier, AWS Cloudwatch doesn\u2019t provide operating-system-level monitoring, so we ended up using NewRelic Servers Monitoring tools, as they provide detailed insights into CPU, network, disk I\/O, and processes. To set up the NewRelic Server Monitoring agent, you just need to install the NPI package and update your account license key. 
To make sure every system in our ecosystem has the NewRelic agent installed, configured and added to specific alert policies, we developed a custom Chef cookbook with which we manage NewRelic on all our servers.<\/span><\/strong><\/li>\n<\/ul>\n<ul>\n<li><b>Application Performance Monitoring:<\/b>\n<ul>\n<li><b>NewRelic APM:<\/b><span style=\"font-weight: 400\"> The development teams rely heavily on APM for identifying performance bottlenecks and code-level issues. APM provides us with many key metrics like response time, app server time, throughput and error rate. In the initial days we were using APM only for production applications, but now we use it for all staging and QA applications as well. It helps identify performance issues during the development life cycle. <\/span><b>Screenshot: NewRelic APM.<img decoding=\"async\" loading=\"lazy\" class=\"aligncenter  wp-image-45782\" src=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-1.25.19-PM.png\" alt=\"Screen Shot 2017-01-28 at 1.25.19 PM\" width=\"456\" height=\"388\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-1.25.19-PM.png 1442w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-1.25.19-PM-300x255.png 300w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-1.25.19-PM-1024x872.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-1.25.19-PM-624x531.png 624w\" sizes=\"(max-width: 456px) 100vw, 456px\" \/><\/b><\/li>\n<li><b>ELB Cloudwatch Metrics:<\/b><span style=\"font-weight: 400\"> We also rely on AWS ELB Cloudwatch metrics for application monitoring; they provide great insights into application latency, error rate, healthy\/unhealthy instances behind the ELB etc.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li><b>Availability Monitoring:<\/b>\n<ul>\n<li style=\"text-align: justify\"><b>URL Monitoring: <\/b><span style=\"font-weight: 400\">In our ecosystem, we have 40+ internal services that are 
consumed by internal applications and 10+ public endpoints that are accessed by end users and third-party teams. For monitoring public endpoints we use NewRelic Synthetics, which lets us monitor an endpoint from different geographic locations; for private endpoints, we use Nagios. Both monitoring tools provide weekly and monthly SLA reports, which can be shared with the appropriate teams.<\/span><\/li>\n<li style=\"text-align: justify\"><b>Service Monitoring: <\/b><span style=\"font-weight: 400\">For monitoring database clusters (MongoDB and Elasticsearch), the queuing system, and internal endpoints we use open source Nagios. We preferred Nagios over any other third-party solution because it is highly customizable and open source. If you want to integrate your Nagios with Pagerduty, you can follow this <\/span><a href=\"http:\/\/www.tothenew.com\/blog\/how-to-integrate-nagios-with-pagerduty\/\"><span style=\"font-weight: 400\">blog<\/span><\/a><span style=\"font-weight: 400\">,<\/span><span style=\"font-weight: 400\"> written by one of my colleagues.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li><strong>AWS Monitoring:<\/strong>\n<ul>\n<li style=\"text-align: justify\"><strong>Underutilized\/Idle Resources:<\/strong> We leverage our custom scripts to generate a report of all underutilized and idle resources (EC2, ELB, EBS, EIP, Snapshots, RDS, Legacy Instance Type) in the AWS account. If you want to use those scripts, you can follow the <strong>&#8220;6 Tips for AWS Cost Optimization&#8221;<\/strong> <a href=\"http:\/\/www.tothenew.com\/blog\/6-tips-for-aws-cost-optimization\/\">blog<\/a> written by my colleague.<\/li>\n<li style=\"text-align: justify\"><strong>S3:\u00a0<\/strong>We heavily use AWS S3 for backups and hosting static content. On cross-checking S3 bucket storage in Cloudwatch for the last 15 days, we found that we were adding 300 GB of backup data daily. 
To curb the growing storage cost,\u00a0we enabled a lifecycle policy (standard storage -&gt; Glacier -&gt; deletion) on the backup S3 buckets, which resulted in good cost savings.<\/li>\n<li style=\"text-align: justify\"><strong>CloudFront<\/strong>: We heavily use AWS CloudFront for serving static content from S3, but we were not using the compression feature offered by CloudFront. By enabling compression in all the CloudFront distributions, we were able to reduce data transfer out by 30%. For more details, you can check this <a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/new-gzip-compression-support-for-amazon-cloudfront\/\">blog<\/a>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p style=\"text-align: justify\"><b>How Monitoring was the key factor in determining next steps for us: <\/b><span style=\"font-weight: 400\">After having the monitoring system in place for 30 days, we gathered all the data from the different sources: NewRelic, Cloudwatch, and Nagios. The results were astonishing, as listed below:<\/span><\/p>\n<ul>\n<li style=\"text-align: justify\"><strong><span style=\"font-weight: 400\"><strong>Application Stack:<\/strong> In the application stack, we were running 222+ EC2 instances and their maximum CPU utilization was less than 35%\u00a0<img decoding=\"async\" loading=\"lazy\" class=\"aligncenter  wp-image-45783\" src=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.51.25-PM.png\" alt=\"Screen Shot 2017-01-28 at 2.51.25 PM\" width=\"333\" height=\"239\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.51.25-PM.png 684w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.51.25-PM-300x214.png 300w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.51.25-PM-624x447.png 624w\" sizes=\"(max-width: 333px) 100vw, 333px\" \/><br \/>\n<\/span><\/strong><\/li>\n<li style=\"text-align: justify\"><span style=\"font-weight: 400\"><strong>Database Stack:<\/strong> In the database stack, 
we were running 120+ EC2 instances and their maximum CPU utilization was less than 25%\u00a0<img decoding=\"async\" loading=\"lazy\" class=\"aligncenter  wp-image-45784\" src=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.53.29-PM.png\" alt=\"Screen Shot 2017-01-28 at 2.53.29 PM\" width=\"332\" height=\"228\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.53.29-PM.png 690w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.53.29-PM-300x206.png 300w, \/blog\/wp-ttn-blog\/uploads\/2017\/02\/Screen-Shot-2017-01-28-at-2.53.29-PM-624x428.png 624w\" sizes=\"(max-width: 332px) 100vw, 332px\" \/><\/span><\/li>\n<\/ul>\n<p>The next blog in this series is\u00a0<strong>&#8220;<a title=\"AWS Cost Optimization Blog 3\" href=\"http:\/\/www.tothenew.com\/blog\/aws-cost-optimization-series-blog-3-leveraging-ec2-container-service-ecs\/\">AWS Cost Optimization Series | Blog 3 | Leveraging EC2 Container Service (ECS)<\/a>&#8221;. <\/strong>In that blog, we will discuss the challenges we faced during the migration and what we did to resolve them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog series continues a use case on\u00a0how the team at TO THE NEW reduced monthly AWS spend from $100K to $40K in 90 days for a client. In this blog, I explain how we leveraged infrastructure monitoring to cut costs by removing idle resources. 
Before getting deeper into any [&hellip;]<\/p>\n","protected":false},"author":216,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":15},"categories":[1174,1],"tags":[248,1550,4425,4327,4427,4426],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/45778"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/216"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=45778"}],"version-history":[{"count":0,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/45778\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=45778"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=45778"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=45778"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}