{"id":78958,"date":"2026-04-15T14:06:45","date_gmt":"2026-04-15T08:36:45","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=78958"},"modified":"2026-04-22T11:27:42","modified_gmt":"2026-04-22T05:57:42","slug":"building-a-python-file-compressor-with-ghostscript-and-pillow","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/building-a-python-file-compressor-with-ghostscript-and-pillow\/","title":{"rendered":"Building a Python File Compressor with Ghostscript and Pillow"},"content":{"rendered":"<h2><span style=\"font-size: 1.5rem;\">Building a Production\u2011Grade File Compression Utility in Python<\/span><\/h2>\n<article>\n<p>We built an open\u2011source Python tool that shrinks 30\u201350 MB PDFs and images down to under 2 MB \u2014 with readable output, a desktop GUI, Docker support, CLI, and a one\u2011click <code>.exe<\/code> build. This post explains how it works under the hood.<\/p>\n<h2>The Problem Nobody Talks About<\/h2>\n<p>Every organization has the same quiet pain point: <strong>oversized files<\/strong>.<\/p>\n<ul>\n<li>A 42 MB scanned contract that won\u2019t attach to an email<\/li>\n<li>A folder of 35 MB site photos that choke a web upload form<\/li>\n<li>A SharePoint library ballooning because nobody compresses before uploading<\/li>\n<\/ul>\n<p>Online compressors cap uploads, destroy quality, or charge monthly fees.<br \/>\nManual Ghostscript commands? 
Great \u2014 if you memorize flags for a living.<\/p>\n<p>We needed something that just works:<br \/>\n<strong>target a size, hit compress, done.<\/strong><\/p>\n<h2>What This Utility Does<\/h2>\n<table border=\"1\" cellspacing=\"0\" cellpadding=\"8\">\n<thead>\n<tr>\n<th>Capability<\/th>\n<th>Details<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Image compression<\/td>\n<td>JPEG, PNG, TIFF, BMP, WebP \u2192 optimized JPEG<\/td>\n<\/tr>\n<tr>\n<td>PDF compression<\/td>\n<td>Ghostscript (primary), raster pipeline, pikepdf fallback<\/td>\n<\/tr>\n<tr>\n<td>Precision targeting<\/td>\n<td>Set an exact target size (e.g. 1.8 MB)<\/td>\n<\/tr>\n<tr>\n<td>Scanned PDF intelligence<\/td>\n<td>Auto\u2011detects text vs scanned pages<\/td>\n<\/tr>\n<tr>\n<td>Batch processing<\/td>\n<td>Parallel directory compression<\/td>\n<\/tr>\n<tr>\n<td>Desktop GUI<\/td>\n<td>Tkinter-based Upload \u2192 Compress \u2192 Download workflow<\/td>\n<\/tr>\n<tr>\n<td>CLI<\/td>\n<td>Click-powered command-line interface<\/td>\n<\/tr>\n<tr>\n<td>Docker<\/td>\n<td>Preconfigured image with system dependencies<\/td>\n<\/tr>\n<tr>\n<td>Standalone .exe<\/td>\n<td>PyInstaller build for zero-install distribution<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Real\u2011World Compression Results<\/h2>\n<table border=\"1\" cellspacing=\"0\" cellpadding=\"8\">\n<thead>\n<tr>\n<th>File Type<\/th>\n<th>Input Size<\/th>\n<th>Output Size<\/th>\n<th>Reduction<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Images<\/td>\n<td>30\u201350 MB<\/td>\n<td>~500 KB<\/td>\n<td>~98%<\/td>\n<\/tr>\n<tr>\n<td>PDFs<\/td>\n<td>30\u201350 MB<\/td>\n<td>~1.8 MB<\/td>\n<td>~96%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A 42 MB scanned PDF becomes a crisp <strong>1.7 MB<\/strong> file that remains perfectly readable.<\/p>\n<h2>Architecture Overview<\/h2>\n<pre><code>\r\nsrc\/\r\n\u251c\u2500\u2500 compressor.py\r\n\u251c\u2500\u2500 image_compressor.py\r\n\u251c\u2500\u2500 pdf_compressor.py\r\n\u251c\u2500\u2500 config.py\r\n\u2514\u2500\u2500 
utils.py\r\n\r\nmain.py\r\ncompression_service.py\r\nui_app.py\r\nui_actions.py\r\nui_styles.py\r\nstate.py\r\n  <\/code><\/pre>\n<p>The key design decision:<br \/>\n<strong>one compression core<\/strong>.<br \/>\nBoth the CLI and GUI call <code>compress_core()<\/code>.<br \/>\nNo duplicated logic. No divergence.<\/p>\n<h2>How the Image Pipeline Works<\/h2>\n<ol>\n<li><strong>Load &amp; Analyze<\/strong> \u2014 dimensions, mode, alpha channel<\/li>\n<li><strong>Pre\u2011process<\/strong> \u2014 RGB conversion, alpha compositing<\/li>\n<li><strong>Resize<\/strong> \u2014 proportional downscale using LANCZOS<\/li>\n<li><strong>Optimize<\/strong> \u2014 dual\u2011axis quality + dimension tuning<\/li>\n<li><strong>Validate<\/strong> \u2014 verify size and compression ratio<\/li>\n<\/ol>\n<p>This two\u2011axis approach (quality + dimensions) enables aggressive targets<br \/>\nlike <strong>500 KB from a 40 MB image<\/strong> without turning it into mush.<\/p>\n<h2>How the PDF Pipeline Works<\/h2>\n<h3>Strategy 1: Ghostscript<\/h3>\n<pre><code>\r\n# -dNOPAUSE -dBATCH: run non-interactively and exit when done\r\ngs -sDEVICE=pdfwrite \\\r\n   -dNOPAUSE -dBATCH \\\r\n   -dCompatibilityLevel=1.4 \\\r\n   -dPDFSETTINGS=\/ebook \\\r\n   -dColorImageResolution=120 \\\r\n   -dDownsampleColorImages=true \\\r\n   -sOutputFile=output.pdf \\\r\n   input.pdf\r\n  <\/code><\/pre>\n<p>Presets and DPI values are tried in descending quality order until the target is met.<\/p>\n<h3>Strategy 2: Raster Pipeline<\/h3>\n<ul>\n<li>Render pages via pdf2image<\/li>\n<li>Enhance contrast and sharpness (OpenCV)<\/li>\n<li>Binary\u2011search JPEG quality<\/li>\n<li>Rebuild PDF via img2pdf<\/li>\n<\/ul>\n<h3>Strategy 3: Auto\u2011Detection<\/h3>\n<p>Text pages are preserved. 
Scanned pages are raster\u2011compressed.<br \/>\nOptional region\u2011aware compression preserves text while shrinking photos.<\/p>\n<h3>Strategy 4: pikepdf Fallback<\/h3>\n<p>When all else fails, basic stream compression ensures some reduction.<\/p>\n<hr \/>\n<h2>The Desktop GUI<\/h2>\n<p>A Tkinter-based interface with a simple workflow:<br \/>\n<strong>Upload \u2192 Compress \u2192 Download<\/strong><\/p>\n<ul>\n<li>Threaded background compression<\/li>\n<li>KB \/ MB target selector<\/li>\n<li>Temp file safety (no overwrites)<\/li>\n<li>Minimal state management<\/li>\n<\/ul>\n<hr \/>\n<h2>The CLI<\/h2>\n<pre><code>\r\npython main.py compress document.pdf --target-size 1800\r\npython main.py compress-dir input\/ --target-size 500 --workers 8\r\npython main.py analyze document.pdf\r\npython main.py check\r\n  <\/code><\/pre>\n<hr \/>\n<h2>Using It as a Python Library<\/h2>\n<pre><code>\r\nfrom src.compressor import FileCompressor\r\n\r\ncompressor = FileCompressor(target_size_kb=1800)\r\nresult = compressor.compress(\"input.pdf\", \"output.pdf\")\r\n  <\/code><\/pre>\n<h2>Lessons Learned<\/h2>\n<ul>\n<li>Binary search beats linear quality stepping<\/li>\n<li>Two\u2011axis optimization is essential<\/li>\n<li>Scanned PDFs need raster treatment<\/li>\n<li>Ghostscript discovery on Windows is non\u2011trivial<\/li>\n<li>Temp outputs prevent accidental data loss<\/li>\n<\/ul>\n<hr \/>\n<h2>Getting Started<\/h2>\n<pre><code>\r\npip install -r requirements.txt\r\npython main.py compress large.pdf --target-size 1800\r\n  <\/code><\/pre>\n<p>Built with Python, Ghostscript, Pillow, and a healthy frustration with file size limits.<\/p>\n<\/article>\n","protected":false},"excerpt":{"rendered":"<p>Building a Production\u2011Grade File Compression Utility in Python We built an open\u2011source Python tool that shrinks 30\u201350 MB PDFs and images down to under 2 MB \u2014 with readable output, a desktop GUI, Docker support, CLI, and a one\u2011click .exe build. 
This post explains how it works under the hood. The Problem Nobody Talks About [&hellip;]<\/p>\n","protected":false},"author":2246,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":2},"categories":[5879],"tags":[8512,8513,1358],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/78958"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/2246"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=78958"}],"version-history":[{"count":2,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/78958\/revisions"}],"predecessor-version":[{"id":79669,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/78958\/revisions\/79669"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=78958"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=78958"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=78958"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}