Monitor AWS ECS Agent & Automatically Restart Agent on Failure

30 / Apr / 2016 by Neeraj Gupta

Amazon EC2 Container Service (ECS) is a container management service that makes it easy to manage Docker containers on EC2 instances. With AWS ECS you can create task definitions to define container configuration such as memory, CPU, environment variables and mount points, and services to scale Docker containers.

AWS ECS Agent Monitoring

Use Case: In one of our projects we set up a complete QA environment on AWS ECS, and after a few days we observed that the ECS agent frequently got disconnected from the AWS ECS service. As a result, the ECS service could not communicate with the agent, so no new tasks were scheduled on the instance and the status of the existing containers was unavailable.

Note: We are using the AWS ECS-optimized AMI, i.e. Amazon Linux. If you are using an AMI based on another OS, a few steps may differ, e.g. installing the AWS CLI and fetching instance metadata.

Steps to set up the monitoring script on ECS nodes:

1. Set up an SNS topic for receiving notifications

In the AWS console, create an SNS topic and add the notification email address as a subscriber. Then confirm the subscription using the email you receive from the SNS service.
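If you prefer the command line to the console, the same step can be sketched with the AWS CLI. This is only a sketch: the function name `create_alert_topic`, the topic name and the email address are hypothetical placeholders, and it assumes the CLI is already installed and has credentials allowed to manage SNS.

```shell
#!/bin/bash
# Sketch: create an SNS topic and subscribe an email address to it.
# The function name and the example values are hypothetical.

create_alert_topic() {
    local topic_name="$1" email="$2" topic_arn
    # create-topic returns the ARN of the topic (also if it already exists).
    topic_arn=$(aws sns create-topic --name "$topic_name" \
        --output text --query 'TopicArn') || return 1
    # The recipient still has to confirm the subscription from the email SNS sends.
    aws sns subscribe --topic-arn "$topic_arn" --protocol email \
        --notification-endpoint "$email" >/dev/null || return 1
    # Print the ARN so it can be pasted into the monitoring script.
    echo "$topic_arn"
}

# Usage (hypothetical values):
# create_alert_topic ecs-agent-alerts ops@example.com
```

The printed ARN is the value the monitoring script later needs for its TOPIC variable.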

2. Install AWS CLI

Our script uses the AWS CLI (the aws ecs commands) to look up the container instance ARNs and the agent connection status.

[js]yum install -y aws-cli[/js]
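The CLI also needs to know which region to talk to. If no default region is configured on the instance, one possible approach (an assumption, not part of the original setup) is to derive it from the availability zone reported by instance metadata. The helper name `region_from_az` is hypothetical, and stripping the last character only works for standard zone names such as ap-southeast-1a.

```shell
#!/bin/bash
# Sketch: derive the region from the instance's availability zone.

# Strip the trailing zone letter, e.g. "ap-southeast-1a" -> "ap-southeast-1".
region_from_az() {
    local az="$1"
    echo "${az%?}"
}

# On an EC2 instance (shown as comments, not executed here):
# AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
# export AWS_DEFAULT_REGION=$(region_from_az "$AZ")
```

Exporting AWS_DEFAULT_REGION at the top of a script is enough for every aws call in that script to pick it up.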

3. Setup IAM policies for SNS and ECS

a. AWS SNS IAM Policy: The policy below allows the instance's IAM role to publish messages to the SNS topic we created earlier, so that we get notified when the agent fails.

[js]
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1460976768000",
      "Effect": "Allow",
      "Action": [
        "sns:GetEndpointAttributes",
        "sns:GetPlatformApplicationAttributes",
        "sns:GetSubscriptionAttributes",
        "sns:GetTopicAttributes",
        "sns:ListEndpointsByPlatformApplication",
        "sns:ListPlatformApplications",
        "sns:ListSubscriptions",
        "sns:ListSubscriptionsByTopic",
        "sns:ListTopics",
        "sns:Publish"
      ],
      "Resource": [
        "arn:aws:sns:ap-southeast-1:<aws-account-id>:<topic-name>"
      ]
    }
  ]
}
[/js]
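One way to attach such a policy to the instance role from the command line is `aws iam put-role-policy`, which adds an inline policy to a role. The sketch below assumes the policy JSON has been saved to a local file; the function name, the role name `ecsInstanceRole`, the policy name and the file name are hypothetical and should be replaced with your own.

```shell
#!/bin/bash
# Sketch: attach a policy document as an inline policy on an IAM role.
# Function name, role name, policy name and file path are hypothetical.

attach_inline_policy() {
    local role_name="$1" policy_name="$2" policy_file="$3"
    # Fail early if the policy file is missing or unreadable.
    [ -r "$policy_file" ] || { echo "cannot read $policy_file" >&2; return 1; }
    aws iam put-role-policy --role-name "$role_name" \
        --policy-name "$policy_name" \
        --policy-document "file://$policy_file"
}

# Usage (hypothetical values):
# attach_inline_policy ecsInstanceRole ecs-agent-sns-publish sns-policy.json
```

The same call works for any other policy document you want to add to the role.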

b. AWS ECS IAM Policy: The IAM policy below allows the instance's IAM role to query the AWS ECS API to list container instances and check the agent connectivity status.

[js]
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1460960788000",
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeClusters",
        "ecs:DescribeContainerInstances",
        "ecs:DescribeServices",
        "ecs:DescribeTaskDefinition",
        "ecs:DescribeTasks",
        "ecs:DiscoverPollEndpoint",
        "ecs:ListClusters",
        "ecs:ListContainerInstances",
        "ecs:ListServices",
        "ecs:ListTaskDefinitionFamilies",
        "ecs:ListTaskDefinitions",
        "ecs:ListTasks",
        "ecs:Poll"
      ],
      "Resource": [
        "arn:aws:ecs:ap-southeast-1:<aws-account-id>:cluster/<cluster-name>"
      ]
    }
  ]
}
[/js]

4. Monitoring Script

The script below checks the ECS agent's connectivity with the ECS service. It first fetches the ARNs of all container instances in the cluster and the ID of the current instance (from instance metadata). For each container instance ARN it then looks up the agent status and checks whether that container instance is the one the script is running on. If the ECS agent on the current instance is disconnected, it sends a notification and restarts the ecs service on the instance.

[js]
#!/bin/bash
# Source ecs.config to get the cluster name (ECS_CLUSTER)
source /etc/ecs/ecs.config
# ARNs of all container instances registered in the cluster
CONTAINERS_ID=$(aws ecs list-container-instances --cluster "$ECS_CLUSTER" --output text --query 'containerInstanceArns')
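# NOTE: this line was truncated in the original post; the standard EC2 instance metadata URL is assumed here
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)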
DATE=$(date +%Y-%m-%d-%H:%M)
TOPIC="arn:aws:sns:ap-southeast-1:<aws-account-id>:<topic-name>"
for container in $CONTAINERS_ID
do
    # Agent status (true/false) and EC2 instance ID of this container instance
    STATUS=$(aws ecs describe-container-instances --container-instances "$container" --cluster "$ECS_CLUSTER" --output json --query 'containerInstances[0].agentConnected')
    CHECK_INSTANCE_ID=$(aws ecs describe-container-instances --container-instances "$container" --cluster "$ECS_CLUSTER" --output text --query 'containerInstances[0].ec2InstanceId')
    # Only act on the container instance that belongs to this EC2 instance
    if [ "$INSTANCE_ID" == "$CHECK_INSTANCE_ID" ]
    then
        if [ "$STATUS" == "false" ]
        then
            echo "Agent Disconnected" "$DATE" >> /var/log/script.log
            aws sns publish --message "AWS ECS Agent Failed $INSTANCE_ID $DATE" --topic-arn "$TOPIC"
            # Restart the ECS agent (upstart job on Amazon Linux)
            sudo stop ecs
            sudo start ecs
        else
            echo "Agent Connected" "$DATE" >> /var/log/script.log
        fi
    fi
done
[/js]
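The loop above makes two describe calls per container instance. As a possible variant, the status of the current instance can be picked out of a single describe call with a JMESPath filter on `ec2InstanceId`. This is a sketch rather than the script we ran: the function name `agent_connected` is hypothetical, and it assumes the cluster is small enough for all ARNs to fit into one describe-container-instances request.

```shell
#!/bin/bash
# Sketch: check the agent status of one EC2 instance with a single describe call.
# The function name is hypothetical.

# Prints "true" or "false" for the agent on the given EC2 instance,
# or "null" if the instance is not registered in the cluster.
agent_connected() {
    local cluster="$1" instance_id="$2" arns
    arns=$(aws ecs list-container-instances --cluster "$cluster" \
        --output text --query 'containerInstanceArns') || return 1
    # No container instances at all: nothing to describe.
    [ -n "$arns" ] || { echo "null"; return 0; }
    # $arns is intentionally unquoted so each ARN becomes its own argument.
    aws ecs describe-container-instances --cluster "$cluster" \
        --container-instances $arns --output json \
        --query "containerInstances[?ec2InstanceId=='${instance_id}'].agentConnected | [0]"
}

# Usage (inside the monitoring script, after sourcing ecs.config):
# STATUS=$(agent_connected "$ECS_CLUSTER" "$INSTANCE_ID")
```

The JSON output format is kept on purpose so that the result compares cleanly against the lowercase string "false".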

5. Set up cron to run every 5 minutes

After monitoring the QA environment for more than a week, we found that the ECS agent got disconnected almost twice a day, so I chose to set up a cron job that runs every 5 minutes and writes the script's output and errors to /var/log/monitor-agent-logs.txt

[js]*/5 * * * * bash /home/ec2-user/monitor_agent.sh >> /var/log/monitor-agent-logs.txt 2>&1[/js]
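Since the goal is to bake this into an AMI, it can be convenient to install the schedule as a file under /etc/cron.d instead of editing a user crontab. The sketch below is one way to do that. The function name and the file name monitor-ecs-agent are hypothetical, running as root is an assumption (the script writes to /var/log and restarts a service), and `flock -n` is an optional addition that skips a run while the previous one is still in progress.

```shell
#!/bin/bash
# Sketch: install the schedule as a cron.d file.
# Function name, file name and lock path are hypothetical.

install_cron_entry() {
    # Target directory is a parameter so the function can be tried out safely.
    local cron_dir="${1:-/etc/cron.d}"
    cat > "$cron_dir/monitor-ecs-agent" <<'EOF'
# Check the ECS agent every 5 minutes as root; flock -n skips a run
# while the previous one still holds the lock.
*/5 * * * * root flock -n /var/run/monitor-ecs-agent.lock bash /home/ec2-user/monitor_agent.sh >> /var/log/monitor-agent-logs.txt 2>&1
EOF
    chmod 0644 "$cron_dir/monitor-ecs-agent"
}

# Usage (as root):
# install_cron_entry
```

Note that cron.d entries carry an extra user field (root here) between the schedule and the command, unlike entries in a user crontab.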

6. Create AMI and Update ECS Auto Scaling Groups Launch Configuration

Once you have created an AMI from the running instance, copy the existing launch configuration (i.e. the one created by the AWS ECS CloudFormation stack), change the AMI in the copy and save it as a new launch configuration. Then update the Auto Scaling group to use the newly created launch configuration.
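The first and last parts of this step can also be sketched with the AWS CLI. The function names and all example values below are hypothetical, and the sketch assumes the new launch configuration itself has already been created as a copy of the existing one, since its instance type, security groups and user data depend on your stack.

```shell
#!/bin/bash
# Sketch: create an AMI from a running instance and point the Auto Scaling
# group at a new launch configuration. Names and IDs are hypothetical.

# Prints the ID of the new image.
bake_ami() {
    local instance_id="$1" name="$2"
    # --no-reboot keeps the instance running, at the cost of weaker
    # file system consistency guarantees for the image.
    aws ec2 create-image --instance-id "$instance_id" --name "$name" \
        --no-reboot --output text --query 'ImageId'
}

# Switches the group to an already existing launch configuration.
switch_launch_configuration() {
    local asg_name="$1" launch_config="$2"
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "$asg_name" \
        --launch-configuration-name "$launch_config"
}

# Usage (hypothetical values):
# bake_ami i-0123456789abcdef0 ecs-node-with-agent-monitor
# switch_launch_configuration qa-ecs-asg qa-ecs-lc-v2
```

Instances that are already running keep the old launch configuration; only instances launched after the switch use the new AMI.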
