Chaos Engineering: Simulating Network Latency using AWS FIS

Introduction

Modern applications are distributed systems made up of multiple services, containers, and infrastructure components. While this architecture improves scalability, security, and reliability, it also increases the chances of unexpected failures and downtime.

Traditional application testing focuses mainly on functionality. It rarely tests how systems behave under real-world failures such as instance crashes, network latency, or service outages in a live production environment on the cloud.

Chaos Engineering solves this problem by intentionally introducing failures into systems to test their resilience and recovery capabilities before such failures occur in production.

What is Chaos Engineering?

Chaos engineering is the intentional, controlled injection of failures into a production or pre-production environment in order to understand their impact and to plan better defenses and a better incident response strategy.

Instead of waiting for failures to happen by chance, engineers deliberately run scenarios such as server crashes, network failures, and stress tests to see how the system behaves under stress or failure.

Netflix learned this lesson firsthand around the time it moved from on-premises infrastructure to the cloud: in 2008, it experienced an outage that interrupted service delivery for three days.

Netflix created Chaos Monkey, an open-source tool that triggers random incidents in IT services and infrastructure to expose weaknesses that can then be fixed or handled through automatic recovery procedures. Netflix introduced Chaos Monkey when it moved from a private data center to Amazon Web Services (AWS), in response to the unreliability it encountered in the cloud.

Types of chaos engineering experiments

Latency injection: DevOps teams intentionally create scenarios that emulate a slow or failing network connection. This includes the introduction of network delays or slower response times.
Load generation: This involves intentionally stressing the system by sending traffic well beyond normal operating levels. It helps site reliability engineers (SREs) and DevOps teams find bottlenecks in the system.
Fault injection: This involves introducing errors into the system to determine how it affects the application and other dependent systems and whether it interrupts services. Examples of fault injections include inducing disk failures, terminating processes, shutting down a host or introducing power or temperature increases.
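
As a concrete illustration of latency injection on a single Linux host, the sketch below uses the tc tool with the netem queueing discipline to delay outgoing packets. The interface name eth0, the 200 ms delay, and the DRY_RUN switch are assumptions made for this example; with DRY_RUN=1 (the default here) the script only prints the commands it would run.

```shell
# Sketch: add and remove artificial latency with tc/netem.
# IFACE, DELAY_MS, and DRY_RUN are illustrative settings, not part of any tool.
IFACE="${IFACE:-eth0}"
DELAY_MS="${DELAY_MS:-200}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Delay every packet leaving IFACE by DELAY_MS milliseconds.
inject_latency() {
  run sudo tc qdisc add dev "$IFACE" root netem delay "${DELAY_MS}ms"
}

# Remove the root qdisc again, which restores normal latency.
clear_latency() {
  run sudo tc qdisc del dev "$IFACE" root
}

inject_latency
clear_latency
```

Keeping the cleanup step next to the injection step mirrors a basic rule of chaos experiments: every fault should have a known way to be rolled back.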

Why Chaos Engineering Matters

Increased System Resilience: Resilience refers to a system’s ability to keep functioning at an acceptable level despite failures and to recover from them quickly.

Improved Incident Response: By intentionally introducing disruptions in a controlled environment, teams gain insight into how their systems break under pressure and can act on incidents proactively rather than after a delay. The failures they observe help engineering teams identify potential bottlenecks, inefficiencies, and other issues in the system.

Cultural Shift towards Proactive Reliability: Many teams treat reliability reactively, fixing problems only after they have caused an outage. Chaos Engineering flips this script by encouraging proactive reliability. Through chaos experiments, organizations are not merely responding to failures; they are actively seeking them out, understanding them, and learning from them. This encourages team members to understand and experiment with their systems.

Validation of Redundancy and Failover Mechanisms: Through chaos experiments, teams can validate the effectiveness of their redundancy and failover strategies in real-world conditions.

For instance, by simulating the failure of a database server, engineers can test whether traffic is properly rerouted to a backup system without degrading performance. Another example is injecting latency to see how the application copes with network lag and slower responses.

In this blog, we will explore Chaos Engineering on AWS and walk through a practical experiment using AWS Fault Injection Service (FIS), formerly known as AWS Fault Injection Simulator.

What is AWS Fault Injection Service (FIS)?

AWS Fault Injection Service (AWS FIS) is a managed service that enables you to perform fault injection experiments on your AWS workloads. Fault injection is based on the principles of chaos engineering. These experiments stress an application by creating disruptive events so that you can observe how your application responds. You can then use this information to improve the performance and resiliency of your applications so that they behave as expected.

With AWS FIS, you can simulate scenarios such as:

EC2 instance termination
CPU stress on instances
Network latency injection
EBS I/O disruption
API throttling
Amazon RDS instance reboot and cluster failover

Simulating Network Latency Using AWS Fault Injection Service (FIS)

In this experiment, we use AWS Fault Injection Service to inject artificial network latency into an EC2 instance.

Step 1: Launch an EC2 Instance

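Launch an EC2 instance and allow inbound ports 22 (SSH) and 80 (HTTP) in its security group. The commands in the following steps use yum, so they assume an Amazon Linux (or similar) AMI.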
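
The same two ports can also be opened from the AWS CLI. The sketch below is one way to do it; the security group ID and the admin CIDR are hypothetical placeholders, and restricting SSH to a single address is a choice made here rather than something this walkthrough requires.

```shell
# Sketch: open the two ports from the CLI instead of the console.
# SG_ID and ADMIN_CIDR are hypothetical placeholders; substitute your own.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
ADMIN_CIDR="${ADMIN_CIDR:-203.0.113.10/32}"   # address allowed to use SSH

open_demo_ports() {
  # SSH only from the admin address.
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port 22 --cidr "$ADMIN_CIDR"
  # HTTP from anywhere, so the demo page is reachable.
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port 80 --cidr 0.0.0.0/0
}

# Call open_demo_ports once AWS credentials and SG_ID are set.
```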

Step 2: Install a Web Server

Install Apache to simulate an application service.

$ sudo yum update -y

$ sudo yum install httpd -y

Start the server:

$ sudo systemctl start httpd

$ sudo systemctl enable httpd

Create a simple test page:

$ echo "Chaos Engineering Demo Application" | sudo tee /var/www/html/index.html

Step 3: Measure Normal Application Response Time

Before introducing failures, measure the application’s normal latency.

Command:

$ curl -o /dev/null -s -w "Response Time: %{time_total}s\n" http://54.92.212.143

The application responds in approximately 0.8 seconds (curl reports time_total in seconds). This is the baseline for the experiment.
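
A single request can be noisy, so it may help to take several samples and summarize them. The sketch below is one way to do that; the function names and the sample values in the example call are made up for illustration and are not measurements from this experiment.

```shell
# Sketch: sample response times several times and summarize them.

# Print one curl time_total value (seconds) per request.
sample_latency() {
  sl_url="$1"
  sl_count="${2:-5}"
  sl_i=0
  while [ "$sl_i" -lt "$sl_count" ]; do
    curl -o /dev/null -s -w '%{time_total}\n' "$sl_url"
    sl_i=$((sl_i + 1))
  done
}

# Read one number per line on stdin; print count, min, mean, and max.
summarize() {
  awk '{ v = $1 + 0
         if (NR == 1) { min = v; max = v }
         sum += v
         if (v < min) min = v
         if (v > max) max = v }
       END { if (NR) printf "n=%d min=%.3fs mean=%.3fs max=%.3fs\n", NR, min, sum / NR, max }'
}

# Example with made-up samples (not real measurements):
printf '%s\n' 0.80 0.90 1.00 | summarize
# prints: n=3 min=0.800s mean=0.900s max=1.000s
```

Against the demo instance, the two functions would be combined as: sample_latency http://54.92.212.143 10 | summarize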

Step 4: Create an IAM Role for Fault Injection

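-> Go to AWS Identity and Access Management (IAM)

-> Click Roles → Create Role

-> Select Fault Injection Service as the trusted service

-> Attach policies:
AWSFaultInjectionSimulatorNetworkAccess
AWSFaultInjectionSimulatorEC2Access
AmazonSSMManagedInstanceCore

Note: AmazonSSMManagedInstanceCore is the policy that lets the SSM Agent on an instance receive commands, so it is normally attached to the EC2 instance’s own role (instance profile). The latency fault in this experiment is delivered through AWS Systems Manager, so make sure the instance has it.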
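
For readers who prefer the CLI, the sketch below shows roughly what the console steps do: it writes a trust policy that lets the FIS service assume the role, then creates the role and attaches the two FIS managed policies. The role name and file name are hypothetical, and the service-role/ path in the policy ARNs is an assumption that should be checked in your account. AmazonSSMManagedInstanceCore is left out here on the assumption that it belongs on the instance profile.

```shell
# Sketch: create the FIS experiment role from the CLI.
# ROLE_NAME and TRUST_FILE are hypothetical names chosen for this example.
ROLE_NAME="${ROLE_NAME:-fis-latency-demo-role}"
TRUST_FILE="${TRUST_FILE:-fis-trust-policy.json}"

# Trust policy that lets the FIS service assume the role.
cat > "$TRUST_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

create_fis_role() {
  aws iam create-role --role-name "$ROLE_NAME" \
    --assume-role-policy-document "file://$TRUST_FILE"
  # Assumption: these managed policies live under the service-role/ path;
  # confirm the ARNs with `aws iam list-policies` before relying on them.
  for cf_policy in AWSFaultInjectionSimulatorNetworkAccess AWSFaultInjectionSimulatorEC2Access; do
    aws iam attach-role-policy --role-name "$ROLE_NAME" \
      --policy-arn "arn:aws:iam::aws:policy/service-role/$cf_policy"
  done
}

# Call create_fis_role once AWS credentials are configured.
```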

Step 5: Create a Fault Injection Experiment Template

Now configure the chaos experiment.

-> Open AWS Fault Injection Service

-> Click Experiment Templates

-> Click Create experiment template

Configure the Template

Target

-> Resource type: EC2 instance

-> Select the instance created earlier

Action

-> Choose AWSFIS-Run-Network-Latency (an SSM document that FIS runs through the aws:ssm:send-command action)

-> Select the IAM role created earlier

-> Save the template
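
The same template can be expressed as JSON and created from the CLI. The sketch below is a minimal version under several assumptions: the region, account ID, instance ID, role name, and file name are hypothetical placeholders, the network interface is assumed to be eth0 (newer AMIs may use a different name), and no stop condition is configured.

```shell
# Sketch: experiment template for a 200 ms, 1-minute latency fault.
# All ARNs and IDs below are hypothetical placeholders.
TEMPLATE_FILE="${TEMPLATE_FILE:-latency-template.json}"

cat > "$TEMPLATE_FILE" <<'EOF'
{
  "description": "Inject 200 ms of network latency for 1 minute",
  "targets": {
    "demoInstance": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": [
        "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0"
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "injectLatency": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
        "documentParameters": "{\"DurationSeconds\":\"60\",\"DelayMilliseconds\":\"200\",\"Interface\":\"eth0\"}",
        "duration": "PT1M"
      },
      "targets": { "Instances": "demoInstance" }
    }
  },
  "stopConditions": [ { "source": "none" } ],
  "roleArn": "arn:aws:iam::111122223333:role/fis-latency-demo-role"
}
EOF

create_template() {
  aws fis create-experiment-template --cli-input-json "file://$TEMPLATE_FILE"
}

# Call create_template once the placeholders have been replaced.
```

Keeping the template in a file makes the experiment reviewable and repeatable, which is useful when the same fault is run regularly.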

Step 6: Start the Chaos Experiment

Run the experiment from the template.

Experiment Status: Running

Target: EC2 Instance

Fault Type: Network Latency Injection

Latency: 200 ms

Duration: 1 minute

During this time, AWS FIS injects a 200 ms network delay on the EC2 instance.
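
An experiment can also be started and watched from the CLI. The sketch below assumes an existing experiment template ID (the EXT... value in the usage line is a made-up placeholder) and polls until the experiment reaches a terminal state; the function names and the 10-second polling interval are choices made for this example.

```shell
# Sketch: start an FIS experiment and poll its status.

# Start an experiment from a template ID and print the experiment ID.
start_experiment() {
  aws fis start-experiment --experiment-template-id "$1" \
    --query 'experiment.id' --output text
}

# Print the current status of an experiment (for example: running, completed).
experiment_status() {
  aws fis get-experiment --id "$1" \
    --query 'experiment.state.status' --output text
}

# Poll until the experiment reaches a terminal state.
wait_for_experiment() {
  wf_id="$1"
  while :; do
    wf_status="$(experiment_status "$wf_id")" || return 1
    [ -n "$wf_status" ] || return 1
    echo "status: $wf_status"
    case "$wf_status" in
      completed|stopped|failed|cancelled) return 0 ;;
    esac
    sleep "${POLL_SECONDS:-10}"
  done
}

# Usage (placeholder template ID):
#   id="$(start_experiment EXT0123456789abcdef)" && wait_for_experiment "$id"
```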

Step 7: Test Application During the Experiment

While the experiment is running, test the application again.

Command:

$ curl -o /dev/null -s -w "Response Time: %{time_total}s\n" http://54.92.212.143

Observation

Response time increased from 0.80 s to 1.78 s.

This confirms that the injected network latency is affecting the application.
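
To make the comparison explicit, the small sketch below computes the added delay and the slowdown factor from the two readings reported above. The function name is made up for this example. Note that a single HTTP request involves more than one packet exchange (TCP handshake, request, response), so the observed increase does not have to equal the configured 200 ms.

```shell
# Sketch: compare a baseline reading with a reading taken during the fault.
# Arguments: baseline seconds, seconds measured during the experiment.
compare_latency() {
  awk -v base="$1" -v during="$2" 'BEGIN {
    printf "added=%.2fs ratio=%.1fx\n", during - base, during / base
  }'
}

compare_latency 0.80 1.78
# prints: added=0.98s ratio=2.2x
```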

Step 8: Observe Monitoring Metrics

While the experiment is running, open Amazon CloudWatch and observe the instance metrics.

Final Result of the Experiment

This chaos engineering experiment demonstrated that:

AWS Fault Injection Service can simulate network failures safely.
Application response time increases when network latency is injected.
Monitoring tools like CloudWatch detect performance degradation.
The system recovers automatically after the experiment ends.

Such experiments help engineers validate the resilience and observability of cloud applications.

Conclusion

Chaos engineering plays an important role in making systems more reliable and resilient. The goal of modern infrastructure engineering is not to eliminate failures but to design systems that recover quickly and automatically.

Chaos Engineering helps teams proactively identify weaknesses before they impact production environments.

With AWS Fault Injection Service, organizations can safely run controlled experiments to validate the resilience of their architecture.

By regularly testing failure scenarios, teams can build distributed systems that are not only scalable but also fault-tolerant and resilient.

In the world of distributed systems, resilience isn’t proven through design diagrams — it’s proven through experiments.