Ensuring high availability: Testing AWS availability zone failover with Fault Injection Service (FIS)
Introduction
In this blog, I test application availability when one availability zone (AZ) goes down. Each AWS Region is made up of multiple availability zones, each consisting of one or more data centers, to provide high availability and redundancy. For instance, the ap-south-1 (Mumbai) region has three availability zones: ap-south-1a, ap-south-1b, and ap-south-1c. For the experiment to be meaningful, the application must be set up as multi-AZ so that its services keep running from another zone if one zone fails. We use AWS Fault Injection Service (FIS), a service for running controlled failure experiments that verify the performance and stability of a system under adverse conditions.
Objective
This blog is about verifying that the application stays available during a disaster, so that we are prepared for one and can settle on best practices. The following topics are covered:
- Prerequisites for testing AZ failover with FIS.
- Creating the FIS experiment for AZ failover.
- Actions for executing AZ failover.
- Observations and suggestions.
Fault Injection Service
AWS FIS lets us inject pre-defined failure cases and test how the application and its services behave under them. It helps you test faults like CPU spikes, memory pressure, network failures, or total service unavailability, and observe how your application reacts. The results help us close the gaps and design a stronger architecture.
Features of AWS FIS
- Controlled experiment: You apply the failure scenario only to the specific resources you select.
- Custom and predefined faults: You can run various kinds of failures, such as EC2 instance termination and RDS cluster failover.
- Safety controls: You can define stop conditions and a duration, and you can stop the experiment at any time.
- Monitoring and logging: AWS services such as CloudWatch, CloudTrail, and X-Ray can be used for real-time visibility into the infrastructure during the experiment.
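To make the safety controls concrete: in an experiment template, stop conditions are a list of CloudWatch alarm references, and FIS stops the experiment when one of those alarms fires. The sketch below only builds that list as plain data. The helper name and the alarm ARN are placeholders invented for illustration, not values from our setup.

```python
def build_stop_conditions(alarm_arns):
    """Build the stopConditions list of an FIS experiment template.

    When no alarm is supplied, the template carries a single entry
    with source "none", meaning the experiment has no automatic stop.
    """
    if not alarm_arns:
        return [{"source": "none"}]
    return [
        {"source": "aws:cloudwatch:alarm", "value": arn}
        for arn in alarm_arns
    ]


# Placeholder ARN for illustration only (assumption, not a real alarm).
EXAMPLE_ALARM = "arn:aws:cloudwatch:ap-south-1:111122223333:alarm:app-5xx-rate"

if __name__ == "__main__":
    print(build_stop_conditions([EXAMPLE_ALARM]))
```

An alarm on an application-level signal, such as error rate, is usually a more useful stop condition than an infrastructure metric, because the experiment is expected to disturb the infrastructure.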
Prerequisites
Go through the following points to check whether the application infrastructure is ready for an AZ failover test.
- Multi-AZ configuration: Ensure all application infrastructure, such as RDS, EKS, and ECS, is configured as multi-AZ.
- Resilient services: Make sure that services such as Amazon MQ, Kafka, and ElastiCache are set up with failover.
- Service inventory: Keep a spreadsheet (for example, in Excel) of all services, with one column indicating Multi-AZ status (yes/no).
- AWS FIS service compatibility: Check that FIS supports the services you plan to test. It works with CloudWatch, DynamoDB, EBS, EC2, ECS, EKS, ElastiCache, RDS, S3, Systems Manager, and VPC.
- Cost approval: Review the FIS service cost and have it approved upfront.
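The service inventory check can be automated once the spreadsheet is exported as CSV. The following is a minimal sketch under that assumption; the column names (`service`, `resource`, `multi_az`) and the sample rows are hypothetical and should be adapted to your own sheet.

```python
import csv
import io

# Hypothetical inventory export: one row per resource, with a yes/no column.
SAMPLE_INVENTORY = """service,resource,multi_az
RDS,orders-db,yes
EKS,app-cluster,yes
Amazon MQ,broker-1,no
EC2,batch-worker,no
"""


def find_single_az_resources(csv_text):
    """Return (service, resource) pairs whose multi_az column is not 'yes'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["service"], row["resource"])
        for row in reader
        if row["multi_az"].strip().lower() != "yes"
    ]


if __name__ == "__main__":
    for service, resource in find_single_az_resources(SAMPLE_INVENTORY):
        print(f"Not ready for AZ failover: {service} / {resource}")
```

Anything this reports should be fixed or consciously excluded before the experiment, since a single-AZ resource in the failed zone will simply go down.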
Scenario
In our configuration, the application uses AWS components such as Amazon MQ, MSK, EKS, RDS, and standalone EC2 instances. We have two zones: ap-south-1a and ap-south-1b. We keep ap-south-1b up and take ap-south-1a down, so no infrastructure components run in ap-south-1a. We set up this state for each service individually so that we can observe the application's response to each one separately.

AWS Services
Steps to Set Up Experiment in FIS for AZ Failover
Step 1. Open the FIS service in the AWS console; its options appear in the left-side menu. The Scenario library contains predefined scenarios, including one named AZ Availability: Power Interruption. In our case, we want to test the AZ failover of each service individually to understand its performance and response, so we create our own template. Go to Experiment templates and choose Create experiment template.

FIS Homepage
Step 2. Enter general details such as the name and description. The experiment can be one of two types: it can target the current account only, or it can target multiple accounts if you have a dependent application in another account and want to test both in a single experiment.

Name and Description details
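For readers who prefer to keep templates in code, the general details from Step 2 correspond roughly to the following fields of an experiment template. This is a sketch under assumptions: the helper name is ours, the name is assumed to be carried as a `Name` tag, and `emptyTargetResolutionMode` is set to `fail` as a cautious choice rather than something taken from our console setup.

```python
def build_template_basics(name, description, multi_account=False):
    """Build the general-details part of an FIS experiment template."""
    return {
        "description": description,
        # Assumption: the template name is stored as a "Name" tag.
        "tags": {"Name": name},
        "experimentOptions": {
            # Single account by default; multi-account for dependent apps
            # that live in other accounts.
            "accountTargeting": "multi-account" if multi_account else "single-account",
            # Fail the experiment if a target resolves to no resources.
            "emptyTargetResolutionMode": "fail",
        },
    }


if __name__ == "__main__":
    print(build_template_basics("eks-az-failover", "Fail EKS nodes in ap-south-1a"))
```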
Step 3. Select actions and targets based on the service under test. We picked the EKS service to see how replacement nodes are created in the surviving AZ, how pods come back up, and how much downtime the application sees. Select the options below.

Fields

Action-1 details

Target-1 Details
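The exact field values from the screenshots are not reproduced in text, so the following is only a sketch of what an equivalent first action and target could look like as template data. It assumes the FIS action `aws:ec2:asg-insufficient-instance-capacity-error`, which makes launch requests from an Auto Scaling group fail in the listed zones. The action and target names, the ASG ARN, and the 30-minute duration are placeholders.

```python
def build_block_launches(asg_arn, az, duration="PT30M"):
    """Target one Auto Scaling group and block its launches in one AZ."""
    targets = {
        "eks-node-asg": {
            "resourceType": "aws:ec2:autoscaling-group",
            "resourceArns": [asg_arn],
            "selectionMode": "ALL",
        }
    }
    actions = {
        "block-launches-in-az": {
            "actionId": "aws:ec2:asg-insufficient-instance-capacity-error",
            "parameters": {
                # Zones in which launch requests should fail.
                "availabilityZoneIdentifiers": az,
                # ISO 8601 duration for how long the fault stays active.
                "duration": duration,
                # Share of launch requests to fail in those zones.
                "percentage": "100",
            },
            # The target key for this action maps to the target defined above.
            "targets": {"AutoScalingGroups": "eks-node-asg"},
        }
    }
    return {"targets": targets, "actions": actions}


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    arn = ("arn:aws:autoscaling:ap-south-1:111122223333:autoScalingGroup:"
           "example-uuid:autoScalingGroupName/eks-nodes")
    print(build_block_launches(arn, "ap-south-1a"))
```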
Step 4. Action 1 only blocks new instance launches in ap-south-1a; it does not affect instances that are already running. The instances of the chosen ASG that are running in ap-south-1a still need to be terminated separately, so we add a second action and target. Their details are given below.

Fields-2

Action-2 Details

Target-2 Details
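As with the first action, the screenshot values are not available as text, so this is a hedged sketch of a second action that terminates the running instances of one ASG in one zone. It assumes the FIS action `aws:ec2:terminate-instances`, selects instances by the `aws:autoscaling:groupName` tag that Auto Scaling puts on its instances, and narrows them with an availability zone filter. The action and target names are placeholders.

```python
def build_terminate_in_az(asg_name, az):
    """Target running instances of one ASG in one AZ and terminate them."""
    targets = {
        "asg-instances-in-az": {
            "resourceType": "aws:ec2:instance",
            # Instances launched by an ASG carry this tag automatically.
            "resourceTags": {"aws:autoscaling:groupName": asg_name},
            "filters": [
                # Keep only instances placed in the zone under test.
                {"path": "Placement.AvailabilityZone", "values": [az]},
                # Skip instances that are already stopping or terminated.
                {"path": "State.Name", "values": ["running"]},
            ],
            "selectionMode": "ALL",
        }
    }
    actions = {
        "terminate-instances-in-az": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "asg-instances-in-az"},
        }
    }
    return {"targets": targets, "actions": actions}


if __name__ == "__main__":
    print(build_terminate_in_az("eks-nodes", "ap-south-1a"))
```

Actions without a `startAfter` field start together, so in a template that also blocks launches in the same zone, the terminated instances cannot be replaced there while that fault is active.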
Step 5. Select or create an IAM role for the experiment and, on the next page, pick the S3 bucket to which FIS uploads the experiment report. You can also send experiment logs to CloudWatch Logs or S3.

Other Details
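In template form, the role and log destinations from this step map to the `roleArn` and `logConfiguration` fields. The sketch below covers only those two; the report destination is not included. The helper name, the ARNs, and the bucket name are placeholders.

```python
def build_execution_settings(role_arn, log_group_arn=None, bucket_name=None,
                             prefix="fis-logs"):
    """Build the roleArn and optional logConfiguration of an FIS template."""
    settings = {"roleArn": role_arn}
    log_config = {"logSchemaVersion": 2}
    if log_group_arn:
        log_config["cloudWatchLogsConfiguration"] = {"logGroupArn": log_group_arn}
    if bucket_name:
        log_config["s3Configuration"] = {"bucketName": bucket_name, "prefix": prefix}
    # Only attach logging when at least one destination was given.
    if len(log_config) > 1:
        settings["logConfiguration"] = log_config
    return settings


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_execution_settings(
        "arn:aws:iam::111122223333:role/fis-experiment-role",
        bucket_name="example-fis-logs-bucket",
    ))
```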
Step 6. Create the experiment template and run the experiment by clicking the Start experiment button. While it runs, track application uptime and how pods are rescheduled onto new nodes.

Final Experiment
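The same start-and-watch flow can be scripted. The following sketch assumes a client object that behaves like `boto3.client("fis")`, using its `start_experiment` and `get_experiment` calls; the client is passed in rather than created here, and the template ID in the usage comment is a placeholder. The function name and the polling interval are our own choices.

```python
import time

# States after which an experiment no longer changes.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id, poll_seconds=30, sleep=time.sleep):
    """Start an experiment from a template and poll until it finishes.

    Returns the distinct states observed, in order.
    """
    started = fis_client.start_experiment(experimentTemplateId=template_id)
    experiment_id = started["experiment"]["id"]
    seen = []
    while True:
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        # Record and print only state changes, not every poll.
        if not seen or seen[-1] != status:
            seen.append(status)
            print(f"{experiment_id}: {status}")
        if status in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   run_experiment(boto3.client("fis", region_name="ap-south-1"), "EXT-placeholder")
```

Polling the experiment state only tells you what FIS is doing. Application uptime and pod rescheduling still need to be watched separately, for example with your existing monitoring.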
You can test the other AWS services in the same way, using the Fault Injection Service to verify high availability during a disaster.
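As one example of another service, an Aurora cluster failover could be expressed with the FIS action `aws:rds:failover-db-cluster`. This is a sketch, not the configuration we ran: the action and target names and the cluster ARN are placeholders, and this action triggers a failover of the cluster rather than taking a specific zone down.

```python
def build_rds_cluster_failover(cluster_arn):
    """Target one RDS (Aurora) cluster and trigger a failover on it."""
    targets = {
        "db-cluster": {
            "resourceType": "aws:rds:cluster",
            "resourceArns": [cluster_arn],
            "selectionMode": "ALL",
        }
    }
    actions = {
        "failover-db-cluster": {
            "actionId": "aws:rds:failover-db-cluster",
            "targets": {"Clusters": "db-cluster"},
        }
    }
    return {"targets": targets, "actions": actions}


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    print(build_rds_cluster_failover(
        "arn:aws:rds:ap-south-1:111122223333:cluster:example-cluster"
    ))
```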
Conclusion
Simulating the failure of an availability zone helps in designing fault-tolerant cloud infrastructure. With the Fault Injection Service, we can verify availability before a real outage and find the gaps that would cause downtime. Exercising this kind of disaster recovery shows whether traffic is diverted cleanly to the surviving zone during a failure, which supports business continuity and reduces downtime when unexpected failures occur.