{"id":78123,"date":"2026-03-22T11:38:22","date_gmt":"2026-03-22T06:08:22","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=78123"},"modified":"2026-03-23T21:58:13","modified_gmt":"2026-03-23T16:28:13","slug":"fixing-jvm-outofmemoryerror-on-ecs-ec2-based","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/fixing-jvm-outofmemoryerror-on-ecs-ec2-based\/","title":{"rendered":"Fixing JVM OutOfMemoryError on ECS (EC2 Based)"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>We started seeing repeated OutOfMemoryError exceptions in a Spring Boot service running on Amazon ECS in EC2 mode.<br \/>\nThe impact of the OutOfMemoryError was serious:<\/p>\n<ul>\n<li>JVM threads crashed, including SQS listeners, HTTP threads, and AWS SDK threads.<\/li>\n<li>Messages were retried and eventually sent to SQS Dead Letter Queues.<\/li>\n<li>The service became unstable under load.<\/li>\n<\/ul>\n<p>At the same time, we also had a strict set of requirements:<\/p>\n<ul>\n<li>The same Docker image must be used for QA and Prod.<\/li>\n<li>No changes to the Jenkins pipeline.<\/li>\n<li>QA needed to remain lightweight, i.e. no JVM memory configuration, because it runs on a tiny instance.<\/li>\n<li>Prod had to be memory-safe.<\/li>\n<\/ul>\n<p>This post outlines the approach we took to solve the issue.<\/p>\n<h2>Root Cause<\/h2>\n<p><strong>1. JVM auto-sizing and containers<\/strong><\/p>\n<p>The application was launched with a basic entrypoint:<\/p>\n<pre>java -jar app.jar<\/pre>\n<p>Even though Java 21 is aware of containers, auto-sizing was overly optimistic for our workload, which included:<\/p>\n<ul>\n<li>Spring Boot<\/li>\n<li>Spring Cloud AWS with SQS listeners<\/li>\n<li>AWS SDK background threads<\/li>\n<li>Chromium and Selenium, which use a lot of native memory<\/li>\n<\/ul>\n<p>As a result:<\/p>\n<ul>\n<li>JVM heap grew too large<\/li>\n<li>Native memory had no headroom<\/li>\n<li>The JVM encountered an
OutOfMemoryError<\/li>\n<li>Messages failed to process and went to the DLQ.<\/li>\n<\/ul>\n<p><strong>2. Spring profile coupling<\/strong><\/p>\n<p>Spring profile selection was managed via an environment variable:<\/p>\n<pre>spring:\r\n  profiles:\r\n    active: ${ENV:local}<\/pre>\n<p>ECS task definitions had:<br \/>\nENV=live<\/p>\n<p>This means each environment loaded a different application YAML file:<br \/>\nQA used application-qa.yml<br \/>\nProd used application-live.yml<\/p>\n<p>This setup seemed fine, but our fix had to work within a few constraints.<\/p>\n<h2>Constraints<\/h2>\n<p>We specifically did not want:<\/p>\n<ul>\n<li>Separate Dockerfiles<\/li>\n<li>Hardcoded Spring profiles in Docker<\/li>\n<li>Logic in Jenkins (our CI\/CD tool) to control runtime behavior<\/li>\n<li>Rebuilding images for each environment<\/li>\n<\/ul>\n<p>The solution also had to follow immutable image principles.<\/p>\n<h2>The Solution<\/h2>\n<p>We made two key design decisions:<\/p>\n<ul>\n<li>The ENTRYPOINT must not depend on the environment.<\/li>\n<li>Environment-specific behavior should be in ECS task definitions, not in Dockerfiles or Jenkins.<\/li>\n<\/ul>\n<p>Our final Dockerfile ENTRYPOINT included a <strong>JAVA_OPTS<\/strong> variable:<\/p>\n<pre>ENTRYPOINT exec java \\\r\n  ${JAVA_OPTS} \\\r\n  -jar \/app\/bp-order-service.jar<\/pre>\n<p>The reasons this worked for us:<\/p>\n<ul>\n<li>There is one ENTRYPOINT for all environments.<\/li>\n<li>JVM options are injected only at runtime.<\/li>\n<li>No Spring profile overrides.<\/li>\n<li>Proper signal handling is in place: exec replaces the shell, so the JVM runs as PID 1 and receives signals directly.<\/li>\n<\/ul>\n<p><strong>Environment-Specific Configuration<\/strong><\/p>\n<p><strong>QA (small resources, no tuning required)<\/strong><\/p>\n<p>In the QA ECS task definition, we added the following variables, leaving JAVA_OPTS empty:<\/p>\n<pre>ENV=qa\r\nJAVA_OPTS=<\/pre>\n<p>The runtime command for QA became:<\/p>\n<pre>java -jar \/app\/bp-order-service.jar<\/pre>\n<p>That&#8217;s
exactly like our original setup, which didn&#8217;t have any runtime tuning.<\/p>\n<p><strong>Production (memory-safe)<\/strong><\/p>\n<p>In the Prod ECS task definition, we added the variables below (JAVA_OPTS is wrapped here for readability; it is set as a single space-separated value):<\/p>\n<pre>ENV=live\r\nJAVA_OPTS=-XX:MaxRAMPercentage=65.0 \\\r\n          -XX:InitialRAMPercentage=50.0 \\\r\n          -XX:MaxMetaspaceSize=512m \\\r\n          -XX:+ExitOnOutOfMemoryError \\\r\n          -XX:+UseContainerSupport<\/pre>\n<p>These JAVA_OPTS settings:<\/p>\n<ul>\n<li>Limit the heap to 65% of container memory.<\/li>\n<li>Leave the remaining 35% for native memory, which includes Chromium, threads, and buffers.<\/li>\n<li>Make the JVM exit immediately on OOM.<\/li>\n<li>Avoid partial JVM failures and DLQ storms.<\/li>\n<\/ul>\n<p><strong>Verification in Production<\/strong><\/p>\n<p>We confirmed the configuration inside a running ECS task using:<\/p>\n<pre>jcmd 1 VM.flags<\/pre>\n<p>Key confirmations included:<\/p>\n<ul>\n<li>Container memory was detected correctly (MaxRAM).<\/li>\n<li>Heap was sized as expected (MaxRAMPercentage, InitialRAMPercentage).<\/li>\n<li>Metaspace cap was applied.<\/li>\n<li>G1GC was active.<\/li>\n<li>There were no hardcoded -Xmx or Spring profile overrides.<\/li>\n<\/ul>\n<p>Example highlights:<\/p>\n<p>MaxRAM = 3145728000 (~3 GB)<br \/>\nMaxHeapSize = 2044723200 (~65% of MaxRAM)<br \/>\nInitialHeapSize = 1572864000 (~50% of MaxRAM)<\/p>\n<p>This confirmed that the JVM was working as designed.<\/p>\n<p><strong>What Remained Unchanged<\/strong><\/p>\n<ul>\n<li>Jenkins pipeline remained unchanged.<\/li>\n<li>Image build process stayed the same.<\/li>\n<li>Spring profile logic was not
altered.<\/li>\n<li>QA behavior remained unchanged.<\/li>\n<\/ul>\n<p>It&#8217;s important to note that our fix changed only runtime behavior, not build-time behavior.<\/p>\n<p><strong>Outcome<\/strong><\/p>\n<ul>\n<li>OutOfMemoryError stopped occurring.<\/li>\n<li>SQS DLQ stopped growing.<\/li>\n<li>The service stabilized under load.<\/li>\n<li>QA remained lightweight.<\/li>\n<li>One Docker image continued to serve all environments.<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>We resolved production crashes by managing JVM memory at runtime through ECS, using a single Docker image and without changing the Jenkins pipeline or QA behavior.<\/p>\n<p><strong>Key Takeaways from this post:<\/strong><\/p>\n<ul>\n<li>The ENTRYPOINT should not contain environment logic.<\/li>\n<li>JVM memory tuning should be in ECS task definitions.<\/li>\n<li>Use percentage-based heap sizing in containers.<\/li>\n<li>Leave room for native memory and non-JVM processes.<\/li>\n<li>Exit cleanly on OOM so that ECS can handle recovery.<\/li>\n<li>Keep CI\/CD pipelines free from environment dependencies.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Introduction We started seeing repeated OutOfMemoryError exceptions in a Spring Boot service running on Amazon ECS in EC2 mode. The impact of the OutOfMemoryError was serious: JVM threads crashed, including SQS listeners, HTTP threads, and AWS SDK threads. Messages were retried and eventually sent to SQS Dead Letter Queues. The service became unstable under load.
[&hellip;]<\/p>\n","protected":false},"author":1741,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":7},"categories":[2348],"tags":[1853,1892,4844],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/78123"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1741"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=78123"}],"version-history":[{"count":9,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/78123\/revisions"}],"predecessor-version":[{"id":78941,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/78123\/revisions\/78941"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=78123"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=78123"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=78123"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}