Boosting ECS Task Monitoring with CloudWatch Input Transformer

AWS, Cloud, Cloud Managed Services, DevOps

30 / Jul / 2023 by Karandeep Singh

Introduction

In the fast-paced world of application delivery, ensuring the health and reliability of our ECS tasks is crucial. Without a reliable alerting mechanism, there’s a risk of overlooking critical task failures that can harm our production environment. Just imagine a situation where application tasks fail silently, resource constraints go unnoticed, or container failures go unattended. This can result in costly downtime and leave end users frustrated.

But fear not! In this brief article, we will walk through how we improved monitoring for our ECS cluster. By actively detecting and addressing task failures, we can minimize downtime, optimize resource utilization, and ensure a seamless experience for our end users.

Join us as we dive deep into the world of Amazon EventBridge, CloudWatch Input Transformers, and SNS (Simple Notification Service). We will unravel the step-by-step implementation of task failure alerts, illuminating the path to actionable and targeted notifications. Get ready to witness the magic unfold as we unlock the true potential of ECS task monitoring.

Problem Statement

Lack of Task Failure Alerts in ECS Cluster Impacts Production Applications.

Description: Our production ECS Cluster experienced a critical issue when tasks within our main application started failing unexpectedly. Unfortunately, we discovered that we were not receiving any alerts specifically for task failures, relying only on the 5xx alert generated by the Application Load Balancer. This absence of task failure alerts hindered our ability to promptly identify and address the root cause of the failures, prolonging the impact on the application’s capabilities.

Architecture

Understanding the Components:

1. Amazon Simple Notification Service (SNS): SNS is a fully managed pub/sub messaging service that allows you to publish, subscribe, and send messages to various endpoints. In our case, we will use SNS to send notifications whenever an ECS task failure happens.

2. Amazon EventBridge: Amazon EventBridge is a serverless event bus that makes it easy to connect different AWS services and trigger actions based on events. We will use EventBridge to capture and process ECS task failure events.

3. CloudWatch Input Transformer: The Input Transformer is a feature of Amazon EventBridge (formerly CloudWatch Events) that allows you to extract, modify, and combine fields from incoming events before sending them to targets like SNS topics or AWS Lambda functions. Using the Input Transformer, we will parse ECS task failure events and extract the details that are useful for alerting.

Prerequisites

1. In this demo, we will be utilizing the ECS Fargate Cluster. Please ensure that you have an active ECS Fargate Cluster running.

2. Additionally, you will require a task definition and an ECS service. Make sure at least one task of the service is running.

Deployment Of Alerting Mechanism:

1. In the AWS Management Console, search for “SNS” in the services search bar and click on “Simple Notification Service” when it appears.

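2. Click the “Topics” section in the left navigation pane in the SNS console, then click on “Create topic”.

3. Choose the “Standard” type, since FIFO topics do not support email subscriptions, and provide the name of your topic in the “Name” field. You can also choose a display name that helps you identify the purpose of the topic.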

4. Click on the “Create topic” button to create the SNS topic.

5. Select the topic by clicking on its name. In the topic details view, click on the “Create subscription” button. Choose the protocol as “Email” from the dropdown menu. Enter the email address where you want to receive the alerts in the “Endpoint” field, then click on “Create subscription”.

6. Check the inbox of the subscribed email address for a confirmation message from AWS SNS. Click the “Confirm subscription” link in that email to start receiving notifications.

7. Once confirmed, the subscription status will change to “Confirmed” in the SNS console.

8. Next, create the event rule. Open the CloudWatch console by visiting https://console.aws.amazon.com/cloudwatch/.

9. In the navigation pane, click on “Events” and then click on “Create rule”. (CloudWatch Events is now Amazon EventBridge, so in newer consoles the same rules appear under EventBridge > Rules.) Provide a name and an optional description for your rule. Click on “Next” to proceed with the configuration of your rule.

10. Under the creation method, select “Custom event pattern” (JSON EDITOR).

11. By default, the rule would match every ECS task state change (RUNNING, STOPPED, and PENDING), but we only want alerts for stopped tasks. Enter the following pattern in the custom event pattern tab, then click on “Next”.

{
 "detail": {
   "lastStatus": ["STOPPED"],
   "stoppedReason": [{
     "anything-but": {
       "prefix": "Scaling activity initiated by"
     }
   }]
 },
 "detail-type": ["ECS Task State Change"],
 "source": ["aws.ecs"]
}

Note: We have used prefix matching with anything-but to ignore tasks that are stopped during a deployment or by autoscaling. In those cases the stopped reason starts with “Scaling activity initiated by”, and it doesn’t make sense to be alerted for them.

In the provided custom event pattern, `anything-but` and `prefix` are used as matching rules for the `stoppedReason` field. Here’s what both of them mean:

anything-but: This is a logical operator used in EventBridge patterns. It specifies that the condition should be true for any value of the field except for the specified value or pattern. In this case, it means that the `stoppedReason` field should not have a prefix match of “Scaling activity initiated by”.

prefix: This is a comparison operator used in EventBridge patterns. It checks if the value of the field starts with the specified prefix. In the given pattern, it checks whether the `stoppedReason` field starts with the prefix “Scaling activity initiated by”. If there is a match, it will be excluded from triggering the alerts.
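To make the two operators concrete, here is a small Python sketch that emulates how the pattern above is evaluated. It is only a local approximation for illustration, not EventBridge's actual matching engine, and it covers just the operators used in this pattern. The `sample_event` helper and the deployment ID in the second example are made-up values.

```python
import json

# Event pattern copied from the rule above.
PATTERN = json.loads("""
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "lastStatus": ["STOPPED"],
    "stoppedReason": [{"anything-but": {"prefix": "Scaling activity initiated by"}}]
  }
}
""")


def value_matches(candidate, value):
    """Check one event value against one entry of a pattern array."""
    if isinstance(candidate, dict) and "anything-but" in candidate:
        rule = candidate["anything-but"]
        if isinstance(rule, dict) and "prefix" in rule:
            # anything-but + prefix: match only strings that do NOT start with the prefix.
            return isinstance(value, str) and not value.startswith(rule["prefix"])
        excluded = rule if isinstance(rule, list) else [rule]
        return value not in excluded
    return candidate == value  # plain entry: exact match


def matches(pattern, event):
    """Return True when every field named in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False  # a field named in the pattern must be present
        if isinstance(expected, dict):
            # Nested object: recurse into the same key of the event.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not any(value_matches(c, event[key]) for c in expected):
            # Array: the event value must match at least one entry.
            return False
    return True


def sample_event(last_status, stopped_reason):
    """Build a trimmed-down, hypothetical ECS Task State Change event."""
    return {
        "source": "aws.ecs",
        "detail-type": "ECS Task State Change",
        "detail": {"lastStatus": last_status, "stoppedReason": stopped_reason},
    }


# A container crash should alert; a deployment-driven stop should not.
print(matches(PATTERN, sample_event("STOPPED", "Essential container in task exited")))  # True
print(matches(PATTERN, sample_event(
    "STOPPED", "Scaling activity initiated by (deployment ecs-svc/1234567890)")))  # False
print(matches(PATTERN, sample_event("RUNNING", "")))  # False
```

The second event is dropped only because its stopped reason begins with the excluded prefix; any other stop reason on a STOPPED task passes through to the target.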

12. After configuring the custom event pattern, click on “Next” to proceed. Then, select the SNS Target for the alert. Choose the SNS Topic that you created in Step 4 as the target for the alert.

13. After selecting the SNS Target, click on “Additional settings”. This section will utilize the CloudWatch input transformer to transform the event and extract the required values in the desired format.

14. Click on “Configure input transformer” under the “Additional settings” section. In the “Target input transformer” field, enter the following input path:

{
  "TASK_ARN": "$.detail.taskArn",
  "PROBLEM": "$.detail-type",
  "STOP_CODE": "$.detail.stopCode",
  "STOPPED_REASON": "$.detail.stoppedReason",
  "STOPPED_TIME": "$.detail.stoppedAt",
  "AZ": "$.detail.availabilityZone",
  "SERVICE": "$.detail.group",
  "ECS_CLUSTER_ARN": "$.detail.clusterArn",
  "REGION": "$.region"
}

15. Under the Template section, please enter the following content:

"ECS TASK FAILURE ALERT"
"Problem: <PROBLEM>"
"Region: <REGION>"
"Availability-zone: <AZ>"
"ECS Cluster Arn: <ECS_CLUSTER_ARN>"
"Service Name: <SERVICE>"
"Task Arn: <TASK_ARN>"
"Stopped Reason: <STOPPED_REASON>"
"Stop Code: <STOP_CODE>"
"Stopped Time: <STOPPED_TIME>"

16. This template defines the format of the alert message that will be sent to the SNS topic. The placeholders enclosed in angle brackets (“<>” symbols) will be replaced with the actual values extracted from the event payload. Once you have entered this template content, click on “Next” to proceed.
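To see how the input path and the template work together, the following Python sketch reproduces the substitution locally. It is a simplified stand-in for the Input Transformer, not its real implementation: it only understands dotted paths such as `$.detail.taskArn`, and it renders a missing path as an empty string, which is an assumption of this sketch. The values in `SAMPLE_EVENT` (account ID, cluster, and service names) are hypothetical.

```python
import re

# Input path from step 14: variable name -> JSONPath into the event.
INPUT_PATHS = {
    "TASK_ARN": "$.detail.taskArn",
    "PROBLEM": "$.detail-type",
    "STOP_CODE": "$.detail.stopCode",
    "STOPPED_REASON": "$.detail.stoppedReason",
    "STOPPED_TIME": "$.detail.stoppedAt",
    "AZ": "$.detail.availabilityZone",
    "SERVICE": "$.detail.group",
    "ECS_CLUSTER_ARN": "$.detail.clusterArn",
    "REGION": "$.region",
}

# Template from step 15, one quoted string per line.
TEMPLATE = "\n".join([
    '"ECS TASK FAILURE ALERT"',
    '"Problem: <PROBLEM>"',
    '"Region: <REGION>"',
    '"Availability-zone: <AZ>"',
    '"ECS Cluster Arn: <ECS_CLUSTER_ARN>"',
    '"Service Name: <SERVICE>"',
    '"Task Arn: <TASK_ARN>"',
    '"Stopped Reason: <STOPPED_REASON>"',
    '"Stop Code: <STOP_CODE>"',
    '"Stopped Time: <STOPPED_TIME>"',
])

# Hypothetical, trimmed-down event used only for this illustration.
SAMPLE_EVENT = {
    "region": "us-west-2",
    "detail-type": "ECS Task State Change",
    "detail": {
        "taskArn": "arn:aws:ecs:us-west-2:123456789012:task/demo-cluster/0123456789abcdef",
        "clusterArn": "arn:aws:ecs:us-west-2:123456789012:cluster/demo-cluster",
        "group": "service:demo-service",
        "availabilityZone": "us-west-2a",
        "stopCode": "EssentialContainerExited",
        "stoppedReason": "Essential container in task exited",
        "stoppedAt": "2023-07-30T10:15:30.000Z",
    },
}


def resolve(path, event):
    """Follow a dotted '$.a.b' path through nested dicts."""
    node = event
    for key in path[2:].split("."):  # drop the leading "$."
        if not isinstance(node, dict) or key not in node:
            return ""  # simplification: a missing path renders as an empty string
        node = node[key]
    return node


def render(template, input_paths, event):
    """Replace each <NAME> placeholder with the value found at its input path."""
    values = {name: resolve(path, event) for name, path in input_paths.items()}
    return re.sub(r"<(\w+)>", lambda m: str(values.get(m.group(1), m.group(0))), template)


print(render(TEMPLATE, INPUT_PATHS, SAMPLE_EVENT))
```

Running it prints the ten template lines with every placeholder filled in from the sample event, which is roughly the body the SNS email will carry.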

17. You can optionally add tags to your CloudWatch rule.

18. Review everything and click on “Create rule”. It’ll look like this.

Now, whenever an ECS Fargate task fails, you will receive a notification via email, as shown in the picture below.
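One way to confirm the alert works end to end is to stop a running task on purpose and wait for the email. The sketch below does that through the ECS API. It assumes a boto3 ECS client is passed in; the function name `stop_one_task`, the default reason text, and the cluster and service names in the usage comment are hypothetical. Try it on a non-production service first, since the service will replace the stopped task.

```python
def stop_one_task(ecs, cluster, service, reason="Manual stop to test the failure alert"):
    """Stop the first running task of a service and return its ARN, or None if none is running.

    `ecs` is expected to behave like a boto3 ECS client, for example
    boto3.client("ecs"). Only list_tasks and stop_task are called.
    """
    response = ecs.list_tasks(cluster=cluster, serviceName=service, desiredStatus="RUNNING")
    task_arns = response["taskArns"]
    if not task_arns:
        return None
    # The reason text does not start with "Scaling activity initiated by",
    # so the resulting STOPPED event should pass the rule's pattern.
    ecs.stop_task(cluster=cluster, task=task_arns[0], reason=reason)
    return task_arns[0]


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   stop_one_task(boto3.client("ecs"), "demo-cluster", "demo-service")
```

If everything is wired correctly, the subscribed address should receive an alert for the stopped task shortly afterwards.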

Bonus Section
Automating the setup using Terraform

If you prefer to automate the setup process using infrastructure-as-code, Terraform can be a useful tool. With Terraform, you can declare and manage your AWS resources. Let’s see how we can implement the ECS task failure alerting mechanism using Terraform:

Step 1: Install Terraform

Download and install Terraform from the official website: https://www.terraform.io/downloads.html

Make sure to add Terraform to your system’s PATH.

Step 2: Initialize Terraform

Create a new directory for your Terraform project.
Open a terminal and navigate to the project directory.
Run the command terraform init to initialize the project. Because the directory has no configuration yet, the provider plugins are downloaded later, when terraform init is run again in Step 4.

Step 3: Create a Terraform Configuration File

Create a new file named main.tf in your project directory.
Make sure your AWS credentials are configured and have permission to create SNS and EventBridge resources before running Terraform commands.
Copy and paste the following Terraform code into main.tf.

######################### Provider Configuration ###################
provider "aws" {
 region = "us-west-2" # Replace with your desired region
}

########################## SNS #####################################

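resource "aws_sns_topic" "ecs_task_failure_sns" {
 name = "ecs_task_failure_sns"
}

# Email subscription. The address is a placeholder; AWS sends a
# confirmation email that must be accepted before alerts are delivered.
resource "aws_sns_topic_subscription" "ecs_task_failure_email" {
 topic_arn = aws_sns_topic.ecs_task_failure_sns.arn
 protocol  = "email"
 endpoint  = "you@example.com" # Replace with your email address
}

# Allow EventBridge to publish to the topic. The console adds this
# permission automatically; with Terraform it has to be declared.
data "aws_iam_policy_document" "allow_eventbridge_publish" {
 statement {
   effect    = "Allow"
   actions   = ["sns:Publish"]
   resources = [aws_sns_topic.ecs_task_failure_sns.arn]

   principals {
     type        = "Service"
     identifiers = ["events.amazonaws.com"]
   }
 }
}

resource "aws_sns_topic_policy" "ecs_task_failure_sns" {
 arn    = aws_sns_topic.ecs_task_failure_sns.arn
 policy = data.aws_iam_policy_document.allow_eventbridge_publish.json
}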

###################### ECS Task Failure CW Event Rule ##############

resource "aws_cloudwatch_event_rule" "ecs_task_failure_alert" {
 name        = "ecs_task_failure_alert_rule"
 description = "ECS Task Failure Alerts"


 event_pattern = <<EOF
{
 "source": ["aws.ecs"],
 "detail-type": ["ECS Task State Change"],
 "detail": {
   "lastStatus": ["STOPPED"],
   "stoppedReason": [{
     "anything-but": {
       "prefix": "Scaling activity initiated by"
     }
   }]
 }
}
EOF
}


resource "aws_cloudwatch_event_target" "sns" {
 rule = aws_cloudwatch_event_rule.ecs_task_failure_alert.name
 arn  = aws_sns_topic.ecs_task_failure_sns.arn
 input_transformer {
   input_paths = {
     "AZ"              = "$.detail.availabilityZone"
     "ECS_CLUSTER_ARN" = "$.detail.clusterArn"
     "PROBLEM"         = "$.detail-type"
     "REGION"          = "$.region"
     "SERVICE"         = "$.detail.group"
     "STOPPED_REASON"  = "$.detail.stoppedReason"
     "STOPPED_TIME"    = "$.detail.stoppedAt"
     "STOP_CODE"       = "$.detail.stopCode"
     "TASK_ARN"        = "$.detail.taskArn"
   }
   input_template = <<EOT
               "ECS TASK FAILURE ALERT"
               "Problem: <PROBLEM>"
               "Region: <REGION>"
               "Availability Zone: <AZ>"
               "ECS Cluster Arn: <ECS_CLUSTER_ARN>"
               "Service Name: <SERVICE>"
               "Task Arn: <TASK_ARN>"
               "Stopped Reason: <STOPPED_REASON>"
               "Stop Code: <STOP_CODE>"
               "Stopped Time: <STOPPED_TIME>"
           EOT
 }
}

Step 4: Initialize and Apply Changes

Run the command terraform init once again, now that main.tf exists, so that Terraform downloads the AWS provider plugin.
Run the command terraform apply to create the AWS resources specified in your Terraform configuration.
Terraform will prompt for confirmation. Enter yes to proceed.

Once the terraform apply command completes successfully, the ECS task failure alerting mechanism will be set up in your AWS account.

Conclusion

To conclude, Amazon SNS, EventBridge, and the CloudWatch Input Transformer together provide a complete solution for Amazon ECS task failure alerting. By combining these services, you can easily capture, parse, and deliver meaningful notifications about task failures, enabling you to maintain the availability and stability of your containerized applications. Embrace these AWS services and take advantage of their capabilities to improve the reliability and uptime of your ECS application deployments. Refer to our other blogs for further deep insights.

References

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_cwet2.html
