Fixing JVM OutOfMemoryError on ECS (EC2 Based)

22 Mar 2026 by Ahmad Ali

Introduction

We started seeing repeated OutOfMemoryError exceptions in a Spring Boot service running on Amazon ECS in EC2 mode.
The impact of the OutOfMemoryError was serious:

  • JVM threads crashed, including SQS listeners, HTTP threads, and AWS SDK threads.
  • Messages were retried and eventually sent to SQS Dead Letter Queues.
  • The service became unstable under load.

At the same time, we had a strict set of requirements:

  • The same Docker image must be used for QA and Prod.
  • No changes to the Jenkins pipeline.
  • QA needed to remain lightweight, i.e. no JVM memory configuration, because it runs on a tiny instance.
  • Prod had to be memory-safe.

This blog outlines the exact approach we took to solve this issue effectively.

Root Cause

  1. JVM auto-sizing and containers

The application was launched with a basic entrypoint:

 java -jar app.jar

Even though Java 21 is aware of containers, auto-sizing was overly optimistic for our workload, which included:

  • Spring Boot
  • Spring Cloud AWS with SQS listeners
  • AWS SDK background threads
  • Chromium and Selenium, which use a lot of native memory

As a result:

  • JVM heap grew too large
  • Native memory had no headroom
  • The JVM encountered an OutOfMemoryError
  • Messages failed to process and went to the DLQ.
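Before adding explicit limits, it helps to confirm what auto-sizing actually chose. A quick diagnostic sketch (assuming a JDK on the PATH; the flag names are standard HotSpot ergonomics):

```shell
# Print the heap-related ergonomics the JVM picked in this environment.
# -XX:+PrintFlagsFinal dumps every flag with its final (ergonomically
# chosen) value; we keep only the ones relevant to heap sizing.
if command -v java >/dev/null 2>&1; then
  FLAGS=$(java -XX:+PrintFlagsFinal -version 2>/dev/null \
    | grep -E 'MaxHeapSize|InitialHeapSize|MaxRAMPercentage')
else
  FLAGS="java not found on PATH"
fi
echo "$FLAGS"
```

Running this inside the container (rather than on the host) is what matters, since the JVM sizes itself from the cgroup memory limit it sees.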

  2. Spring profile coupling

Spring profile selection was managed via environment variables:

spring:
  profiles:
    active: ${ENV:local}

ECS task definitions had:
ENV=live

This meant each environment resolved a different profile-specific configuration file:
QA used application-qa.yml
Prod used application-live.yml

This setup seemed fine, but we had different constraints.

Constraints

We specifically did not want:

  • Separate Dockerfiles
  • Hardcoded Spring profiles in Docker
  • Jenkins (our CI/CD tool) logic controlling runtime behavior
  • Rebuilding images for each environment

The solution also had to follow immutable image principles.

The Solution

We made two key design decisions:

  • The ENTRYPOINT must not depend on the environment.
  • Environment-specific behavior should be in ECS task definitions, not in Dockerfiles or Jenkins.

Our final Dockerfile ENTRYPOINT included a JAVA_OPTS variable:

ENTRYPOINT exec java \
  ${JAVA_OPTS} \
  -jar /app/bp-order-service.jar

The reasons this worked for us:

  • One ENTRYPOINT for all environments.
  • JVM options are injected only at runtime.
  • No Spring profile overrides.
  • Proper signal handling is in place (exec, PID 1).
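For context, a minimal Dockerfile around that ENTRYPOINT might look like the following (a sketch; the base image and jar location are illustrative, not our exact file):

```dockerfile
# Illustrative base image; any Java 21 runtime image behaves the same way.
FROM eclipse-temurin:21-jre

COPY target/bp-order-service.jar /app/bp-order-service.jar

# Shell form on purpose: the shell expands ${JAVA_OPTS} at container start,
# and `exec` replaces the shell so the JVM runs as PID 1 and receives
# SIGTERM directly from ECS during task shutdown.
ENTRYPOINT exec java \
  ${JAVA_OPTS} \
  -jar /app/bp-order-service.jar
```

The shell form is the important choice here: the exec (JSON array) form would pass the literal string `${JAVA_OPTS}` to the JVM instead of expanding it.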

Environment-Specific Configuration  

For QA (running on small resources, no tuning required)

In the QA ECS task definition, we kept JAVA_OPTS empty:

ENV=qa  
JAVA_OPTS=

The runtime command for QA became:

java -jar /app/bp-order-service.jar

That is exactly our original setup, which didn’t have any runtime tuning.

Production (memory-safe)  

In the Prod ECS task definition, we added the following variables:

ENV=live
JAVA_OPTS=-XX:MaxRAMPercentage=65.0 \
          -XX:InitialRAMPercentage=50.0 \
          -XX:MaxMetaspaceSize=512m \
          -XX:+ExitOnOutOfMemoryError \
          -XX:+UseContainerSupport
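Note that in the actual task definition JSON, JAVA_OPTS is a single string (the backslashes above are only for readability). The environment block looks roughly like this sketch:

```json
"environment": [
  { "name": "ENV", "value": "live" },
  {
    "name": "JAVA_OPTS",
    "value": "-XX:MaxRAMPercentage=65.0 -XX:InitialRAMPercentage=50.0 -XX:MaxMetaspaceSize=512m -XX:+ExitOnOutOfMemoryError -XX:+UseContainerSupport"
  }
]
```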

These options achieved the following:

  • Limits the heap to 65% of container memory.
  • Reserves 35% for native memory, which includes Chromium, threads, and buffers.
  • The JVM exits immediately on OOM.
  • There’s no partial JVM failure or DLQ storm.
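To make the budget concrete, here is the arithmetic for a 3 GB container, the size we later confirmed in production (values in bytes):

```shell
# Split a 3 GB container between heap (65%) and native memory (the rest).
CONTAINER_BYTES=3145728000            # ~3 GB, as reported by MaxRAM
HEAP_PCT=65                           # -XX:MaxRAMPercentage=65.0
HEAP_BYTES=$((CONTAINER_BYTES * HEAP_PCT / 100))
NATIVE_BYTES=$((CONTAINER_BYTES - HEAP_BYTES))
echo "heap=$HEAP_BYTES native=$NATIVE_BYTES"
# heap=2044723200 (~1.9 GiB), native=1101004800 (~1 GiB left for
# Chromium, thread stacks, metaspace and buffers)
```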

Verification in Production  

We confirmed the configuration inside a running ECS task using:

jcmd 1 VM.flags

Key confirmations included:

  • Container memory was detected correctly (MaxRAM).
  • Heap was sized as expected (MaxRAMPercentage, InitialRAMPercentage).
  • Metaspace cap was applied.
  • G1GC was active.
  • There were no hardcoded -Xmx or Spring profile overrides.

Example highlights:

MaxRAM            = 3145728000 (~3 GB)
MaxHeapSize       = 2044723200 (~65%)
InitialHeapSize   = 1572864000 (~50%)
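Those numbers can be cross-checked directly; the ratios recover the configured flags (values copied from the jcmd output above):

```shell
# Verify the observed heap sizes against the configured RAM percentages.
MAX_RAM=3145728000
MAX_HEAP=2044723200
INITIAL_HEAP=1572864000
HEAP_PCT=$((MAX_HEAP * 100 / MAX_RAM))        # expect 65
INITIAL_PCT=$((INITIAL_HEAP * 100 / MAX_RAM)) # expect 50
echo "max=${HEAP_PCT}% initial=${INITIAL_PCT}%"
```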

This confirmed that the JVM was working as designed.

What Remained Unchanged

  • Jenkins pipeline remained unchanged.
  • Image build process stayed the same.
  • Spring profile logic was not altered.
  • QA behavior remained unchanged.

It’s important to note that our fix changed runtime behavior only, not build-time behavior.

Outcome

  • OutOfMemoryError stopped occurring.
  • SQS DLQ stopped growing.
  • The service stabilized under load.
  • QA remained lightweight.
  • One Docker image continued to serve all environments.


Conclusion

We resolved production crashes by managing JVM memory at runtime through ECS, without changing the Docker image, Jenkins pipeline, or QA behavior.

Key Takeaways from this post:

  • The ENTRYPOINT should not contain environment logic.
  • JVM memory tuning should be in ECS task definitions.
  • Use percentage-based heap sizing in containers.
  • Leave room for native memory for non-JVM processes.
  • Allow ECS to handle recovery by exiting cleanly on OOM.
  • Keep CI/CD pipelines free from environment dependencies.