Fixing JVM OutOfMemoryError on ECS (EC2 Based)
Introduction
We started seeing repeated OutOfMemoryError exceptions in a Spring Boot service running on Amazon ECS in EC2 mode.
The impact of the OutOfMemoryError was serious:
- JVM threads crashed, including SQS listeners, HTTP threads, and AWS SDK threads.
- Messages were retried and eventually sent to SQS Dead Letter Queues.
- The service became unstable under load.
At the same time, we had strict requirements:
- The same Docker image must be used for QA and Prod.
- No changes to the Jenkins pipeline.
- QA needed to remain lightweight, i.e. no JVM memory configuration, because it runs on a tiny instance.
- Prod had to be memory-safe.
This blog outlines the exact approach we took to solve this issue effectively.
Root Cause
1. JVM auto-sizing and containers
The application was launched with a basic entrypoint:
java -jar app.jar
Even though Java 21 is aware of containers, auto-sizing was overly optimistic for our workload, which included:
- Spring Boot
- Spring Cloud AWS with SQS listeners
- AWS SDK background threads
- Chromium and Selenium, which use a lot of native memory
As a result,
- JVM heap grew too large
- Native memory had no headroom
- The JVM encountered an OutOfMemoryError
- Messages failed to process and went to the DLQ.
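A rough memory-budget sketch (with assumed numbers, not our measured values) shows why an aggressive heap starves native memory in a fixed-size container:

```python
# Rough container memory budget (assumed numbers, not measured values)
container_mb = 3072                  # ECS task memory limit
heap_mb = container_mb * 80 // 100   # an overly aggressive heap share
metaspace_mb = 512                   # class metadata
native_mb = container_mb - heap_mb - metaspace_mb
print(native_mb)  # ~100 MB left for Chromium, thread stacks, and buffers
```

With a headless browser in the same container, that remainder is nowhere near enough, and the native allocations fail first.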
2. Spring profile coupling
Spring profile selection was managed via environment variables:
spring:
  profiles:
    active: ${ENV:local}
ECS task definitions had:
ENV=live
This meant each environment loaded a different profile-specific configuration file:
- QA used application-qa.yml
- Prod used application-live.yml
This setup seemed fine, but we had different constraints.
Constraints
We specifically did not want:
- Separate Dockerfiles
- Hardcoded Spring profiles in Docker
- Jenkins (our CI/CD tool) logic controlling runtime behavior
- Rebuilding images for each environment
The solution also had to follow immutable image principles.
The Solution
We made the following key design decisions:
- The ENTRYPOINT must not depend on the environment.
- Environment-specific behavior should be in ECS task definitions, not in Dockerfiles or Jenkins.
Our final Dockerfile ENTRYPOINT includes a JAVA_OPTS variable:
ENTRYPOINT exec java \
${JAVA_OPTS} \
-jar /app/bp-order-service.jar
The reasons this worked for us:
- We had One ENTRYPOINT for all environments.
- JVM options are injected only at runtime.
- No Spring profile overrides.
- Proper signal handling is in place (exec, PID 1).
Environment-Specific Configuration
QA (small instance, no tuning required)
In the QA ECS task definition, we added the following variables, keeping JAVA_OPTS empty:
ENV=qa
JAVA_OPTS=
The runtime command for QA became:
java -jar /app/bp-order-service.jar
That is exactly our original setup, which had no runtime tuning.
Production (memory-safe)
In the Prod ECS task definition, we added the following variables:
ENV=live
JAVA_OPTS=-XX:MaxRAMPercentage=65.0 -XX:InitialRAMPercentage=50.0 -XX:MaxMetaspaceSize=512m -XX:+ExitOnOutOfMemoryError -XX:+UseContainerSupport
These options achieved the following:
- Limits the heap to 65% of container memory.
- Reserves 35% for native memory, which includes Chromium, threads, and buffers.
- The JVM exits immediately on OOM.
- There’s no partial JVM failure or DLQ storm.
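In the task definition JSON, this is just an environment entry on the container definition (a sketch; the container name and exact values are illustrative):

```json
{
  "name": "bp-order-service",
  "environment": [
    { "name": "ENV", "value": "live" },
    {
      "name": "JAVA_OPTS",
      "value": "-XX:MaxRAMPercentage=65.0 -XX:InitialRAMPercentage=50.0 -XX:MaxMetaspaceSize=512m -XX:+ExitOnOutOfMemoryError -XX:+UseContainerSupport"
    }
  ]
}
```

The QA task definition is identical except that ENV is qa and the JAVA_OPTS value is empty.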
Verification in Production
We confirmed the configuration inside a running ECS task using:
jcmd 1 VM.flags
Key confirmations included:
- Container memory was detected correctly (MaxRAM).
- Heap was sized as expected (MaxRAMPercentage, InitialRAMPercentage).
- Metaspace cap was applied.
- G1GC was active.
- There were no hardcoded -Xmx or Spring profile overrides.
Example highlights:
MaxRAM = 3145728000 (~3 GB)
MaxHeapSize = 2044723200 (~65%)
InitialHeapSize = 1572864000 (~50%)
This confirmed that the JVM was working as designed.
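The reported sizes line up with the configured percentages; a quick arithmetic check against the jcmd figures above:

```python
# Cross-check jcmd output against the configured RAM percentages
max_ram = 3145728000                       # MaxRAM detected from the container
assert max_ram * 65 // 100 == 2044723200   # MaxHeapSize (65%)
assert max_ram * 50 // 100 == 1572864000   # InitialHeapSize (50%)
print("heap sizing matches")
```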
What Remained Unchanged
- Jenkins pipeline remained unchanged.
- Image build process stayed the same.
- Spring profile logic was not altered.
- QA behavior remained unchanged.
Note that our fix changed runtime behavior only, not build-time behavior.
Outcome
- OutOfMemoryError stopped occurring.
- SQS DLQ stopped growing.
- The service stabilized under load.
- QA remained lightweight.
- One Docker image continued to serve all environments.
Conclusion
We resolved production crashes by managing JVM memory at runtime through ECS, without changing the Docker image, Jenkins pipeline, or QA behavior.
Key Takeaways from this post:
- The ENTRYPOINT should not contain environment logic.
- JVM memory tuning should be in ECS task definitions.
- Use percentage-based heap sizing in containers.
- Leave room for native memory for non-JVM processes.
- Allow ECS to handle recovery by exiting cleanly on OOM.
- Keep CI/CD pipelines free from environment dependencies.
