Fixing JVM OutOfMemoryError on ECS (EC2 Based)
Introduction
We started seeing repeated OutOfMemoryError exceptions in a Spring Boot service running on Amazon ECS in EC2 mode.
The impact of the OutOfMemoryError was serious:
- JVM threads crashed, including SQS listeners, HTTP threads, and AWS SDK threads.
- Messages were retried and eventually sent to SQS Dead Letter Queues.
- The service became unstable under load.
At the same time, we had strict requirements:
- The same Docker image must be used for QA and Prod.
- No changes to the Jenkins pipeline.
- QA needed to remain lightweight, i.e. no JVM memory configuration, because it runs on a tiny instance.
- Prod had to be memory-safe.
This blog outlines the exact approach we took to solve this issue effectively.
Root Cause
1. JVM auto-sizing and containers
The application was launched with a basic entrypoint:
java -jar app.jar
Even though Java 21 is aware of containers, auto-sizing was overly optimistic for our workload, which included:
- Spring Boot
- Spring Cloud AWS with SQS listeners
- AWS SDK background threads
- Chromium and Selenium, which use a lot of native memory
As a result,
- JVM heap grew too large
- Native memory had no headroom
- The JVM encountered an OutOfMemoryError
- Messages failed to process and went to the DLQ.
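A rough memory-budget sketch (with assumed numbers, not our measured values) shows why an aggressive heap starves native memory in a fixed-size container:

```python
# Rough container memory budget (assumed numbers, not measured values)
container_mb = 3072                  # ECS task memory limit
heap_mb = container_mb * 80 // 100   # an overly aggressive heap share
metaspace_mb = 512                   # class metadata
native_mb = container_mb - heap_mb - metaspace_mb
print(native_mb)  # ~100 MB left for Chromium, thread stacks, and buffers
```

With a headless browser in the same container, that remainder is nowhere near enough, and the native allocations fail first.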
2. Spring profile coupling
Spring profile selection was managed via environment variables:
spring:
  profiles:
    active: ${ENV:local}
ECS task definitions had:
ENV=live
This meant each environment loaded a different profile-specific configuration file:
- QA used application-qa.yml
- Prod used application-live.yml
This setup seemed fine, but we had different constraints.
Constraints
We specifically did not want:
- Separate Dockerfiles
- Hardcoded Spring profiles in Docker
- Jenkins (our CI/CD tool) logic controlling runtime behavior
- Rebuilding images for each environment
The solution also had to follow immutable image principles.
The Solution
We made the following key design decisions:
- The ENTRYPOINT must not depend on the environment.
- Environment-specific behavior should be in ECS task definitions, not in Dockerfiles or Jenkins.
Our final Dockerfile ENTRYPOINT includes a JAVA_OPTS variable:
ENTRYPOINT exec java \
${JAVA_OPTS} \
-jar /app/bp-order-service.jar
The reasons this worked for us:
- We had One ENTRYPOINT for all environments.
- JVM options are injected only at runtime.
- No Spring profile overrides.
- Proper signal handling is in place (exec, PID 1).
Environment-Specific Configuration
QA (small instance, no tuning required)
In the QA ECS task definition, we added the following variables, keeping JAVA_OPTS empty:
ENV=qa
JAVA_OPTS=
The runtime command for QA became:
java -jar /app/bp-order-service.jar
That is exactly our original setup, which had no runtime tuning.
Production (memory-safe)
In the Prod ECS task definition, we added the following variables:
ENV=live
JAVA_OPTS=-XX:MaxRAMPercentage=65.0 -XX:InitialRAMPercentage=50.0 -XX:MaxMetaspaceSize=512m -XX:+ExitOnOutOfMemoryError -XX:+UseContainerSupport
These options achieved the following:
- Limits the heap to 65% of container memory.
- Reserves 35% for native memory, which includes Chromium, threads, and buffers.
- The JVM exits immediately on OOM.
- There’s no partial JVM failure or DLQ storm.
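In the task definition JSON, this is just an environment entry on the container definition (a sketch; the container name and exact values are illustrative):

```json
{
  "name": "bp-order-service",
  "environment": [
    { "name": "ENV", "value": "live" },
    {
      "name": "JAVA_OPTS",
      "value": "-XX:MaxRAMPercentage=65.0 -XX:InitialRAMPercentage=50.0 -XX:MaxMetaspaceSize=512m -XX:+ExitOnOutOfMemoryError -XX:+UseContainerSupport"
    }
  ]
}
```

The QA task definition is identical except that ENV is qa and the JAVA_OPTS value is empty.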
Verification in Production
We confirmed the configuration inside a running ECS task using:
jcmd 1 VM.flags
Key confirmations included:
- Container memory was detected correctly (MaxRAM).
- Heap was sized as expected (MaxRAMPercentage, InitialRAMPercentage).
- Metaspace cap was applied.
- G1GC was active.
- There were no hardcoded -Xmx or Spring profile overrides.
Example highlights:
MaxRAM = 3145728000 (~3 GB)
MaxHeapSize = 2044723200 (~65%)
InitialHeapSize = 1572864000 (~50%)
This confirmed that the JVM was working as designed.
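The reported sizes line up with the configured percentages; a quick arithmetic check against the jcmd figures above:

```python
# Cross-check jcmd output against the configured RAM percentages
max_ram = 3145728000                       # MaxRAM detected from the container
assert max_ram * 65 // 100 == 2044723200   # MaxHeapSize (65%)
assert max_ram * 50 // 100 == 1572864000   # InitialHeapSize (50%)
print("heap sizing matches")
```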
What Remained Unchanged
- Jenkins pipeline remained unchanged.
- Image build process stayed the same.
- Spring profile logic was not altered.
- QA behavior remained unchanged.
Note that our fix changed runtime behavior only, not build-time behavior.
Outcome
- OutOfMemoryError stopped occurring.
- SQS DLQ stopped growing.
- The service stabilized under load.
- QA remained lightweight.
- One Docker image continued to serve all environments.
Conclusion
We resolved production crashes by managing JVM memory at runtime through ECS, without changing the Docker image, Jenkins pipeline, or QA behavior.
Key Takeaways from this post:
- The ENTRYPOINT should not contain environment logic.
- JVM memory tuning should be in ECS task definitions.
- Use percentage-based heap sizing in containers.
- Leave room for native memory for non-JVM processes.
- Allow ECS to handle recovery by exiting cleanly on OOM.
- Keep CI/CD pipelines free from environment dependencies.
