Building a Python File Compressor with Ghostscript and Pillow
Building a Production‑Grade File Compression Utility in Python
This post explains how the utility works under the hood.
The Problem Nobody Talks About
Every organization has the same quiet pain point: oversized files.
- A 42 MB scanned contract that won’t attach to an email
- A folder of 35 MB site photos that choke a web upload form
- A SharePoint library ballooning because nobody compresses before uploading
Online compressors cap uploads, destroy quality, or charge monthly fees.
Manual Ghostscript commands? Great — if you memorize flags for a living.
We needed something that just works: target a size, hit compress, done.
What This Utility Does
| Capability | Details |
|---|---|
| Image compression | JPEG, PNG, TIFF, BMP, WebP → optimized JPEG |
| PDF compression | Ghostscript (primary), raster pipeline, pikepdf fallback |
| Precision targeting | Set an exact target size (e.g. 1.8 MB) |
| Scanned PDF intelligence | Auto‑detects text vs scanned pages |
| Batch processing | Parallel directory compression |
| Desktop GUI | Tkinter-based Upload → Compress → Download workflow |
| CLI | Click-powered command-line interface |
| Docker | Preconfigured image with system dependencies |
| Standalone .exe | PyInstaller build for zero-install distribution |
Real‑World Compression Results
| File Type | Input Size | Output Size | Reduction |
|---|---|---|---|
| Images | 30–50 MB | ~500 KB | ~98% |
| PDFs | 30–50 MB | ~1.8 MB | ~96% |
A 42 MB scanned PDF becomes a crisp 1.7 MB file that remains perfectly readable.
Architecture Overview
src/
├── compressor.py
├── image_compressor.py
├── pdf_compressor.py
├── config.py
└── utils.py
main.py
compression_service.py
ui_app.py
ui_actions.py
ui_styles.py
state.py
The key design decision: one compression core.
Both the CLI and GUI call compress_core().
No duplicated logic. No divergence.
How the Image Pipeline Works
- Load & Analyze — dimensions, mode, alpha channel
- Pre‑process — RGB conversion, alpha compositing
- Resize — proportional downscale using LANCZOS
- Optimize — dual‑axis quality + dimension tuning
- Validate — verify size and compression ratio
This two‑axis approach (quality + dimensions) enables aggressive targets
like 500 KB from a 40 MB image without turning it into mush.
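The two-axis idea can be sketched as an outer loop over downscale factors with a binary search over JPEG quality inside it. This is a minimal illustration, not the utility's actual code: `best_quality`, `fit_to_target`, `pillow_encoder`, the scale ladder, and the 20 to 95 quality range are all assumptions, and the search assumes output size grows with quality. Alpha compositing from the pre-process step is left out for brevity.

```python
from typing import Callable, Optional, Sequence, Tuple

# encode(scale, quality) -> encoded JPEG bytes. Hypothetical interface for this sketch.
Encoder = Callable[[float, int], bytes]


def best_quality(encode: Encoder, scale: float, target_bytes: int,
                 lo: int = 20, hi: int = 95) -> Optional[Tuple[int, bytes]]:
    """Binary-search the highest quality in [lo, hi] whose output fits the target."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        data = encode(scale, mid)
        if len(data) <= target_bytes:
            best = (mid, data)   # fits: remember it and try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1         # too large: try a lower quality
    return best


def fit_to_target(encode: Encoder, target_bytes: int,
                  scales: Sequence[float] = (1.0, 0.75, 0.5, 0.35, 0.25)
                  ) -> Optional[Tuple[float, int, bytes]]:
    """Try scales from largest to smallest; return the first (scale, quality, data) that fits."""
    for scale in scales:
        found = best_quality(encode, scale, target_bytes)
        if found is not None:
            quality, data = found
            return scale, quality, data
    return None  # even the smallest scale at the lowest quality is too large


def pillow_encoder(path: str) -> Encoder:
    """Build an encoder backed by Pillow (imported lazily; needs Pillow 9.1+)."""
    from io import BytesIO
    from PIL import Image

    source = Image.open(path).convert("RGB")

    def encode(scale: float, quality: int) -> bytes:
        size = (max(1, round(source.width * scale)),
                max(1, round(source.height * scale)))
        resized = source.resize(size, Image.Resampling.LANCZOS)
        buffer = BytesIO()
        resized.save(buffer, format="JPEG", quality=quality, optimize=True)
        return buffer.getvalue()

    return encode
```

Under these assumptions, `fit_to_target(pillow_encoder("photo.png"), 500 * 1024)` would return the largest scale that can reach 500 KB, paired with the highest quality that fits at that scale. Dimensions are only sacrificed once quality alone cannot reach the target.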
How the PDF Pipeline Works
Strategy 1: Ghostscript
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.4 \
   -dPDFSETTINGS=/ebook \
   -dDownsampleColorImages=true \
   -dColorImageResolution=120 \
   -dNOPAUSE -dBATCH -dQUIET \
   -sOutputFile=output.pdf \
   input.pdf
Presets and DPI values are tried in descending quality order until the target is met.
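A rough sketch of that retry loop, assuming Ghostscript is on the PATH as `gs`. The `ATTEMPTS` ladder, `build_gs_command`, and `compress_with_ghostscript` are hypothetical names and values chosen for illustration; the flags themselves are standard Ghostscript switches, with `-dNOPAUSE`, `-dBATCH`, and `-dQUIET` added so the process runs non-interactively.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Sequence, Tuple

# (preset, dpi) pairs in descending quality order. Illustrative values only.
ATTEMPTS: Sequence[Tuple[str, int]] = (
    ("/printer", 200),
    ("/ebook", 150),
    ("/ebook", 120),
    ("/screen", 96),
    ("/screen", 72),
)


def build_gs_command(src: str, dst: str, preset: str, dpi: int,
                     gs: str = "gs") -> List[str]:
    """Assemble one Ghostscript invocation as an argument list (no shell quoting needed)."""
    return [
        gs,
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}",
        "-dDownsampleColorImages=true",
        f"-dColorImageResolution={dpi}",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_with_ghostscript(src: str, dst: str, target_bytes: int,
                              attempts: Sequence[Tuple[str, int]] = ATTEMPTS,
                              gs: str = "gs") -> Optional[Tuple[str, int]]:
    """Run attempts in order; return the first (preset, dpi) whose output fits the target."""
    for preset, dpi in attempts:
        subprocess.run(build_gs_command(src, dst, preset, dpi, gs), check=True)
        if Path(dst).stat().st_size <= target_bytes:
            return preset, dpi
    return None  # no attempt met the target; the caller can move to the next strategy
```

Stopping at the first attempt that fits keeps the highest quality the target allows, at the cost of one Ghostscript run per rung of the ladder.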
Strategy 2: Raster Pipeline
- Render pages via pdf2image
- Enhance contrast and sharpness (OpenCV)
- Binary‑search JPEG quality
- Rebuild PDF via img2pdf
Strategy 3: Auto‑Detection
Text pages are preserved. Scanned pages are raster‑compressed.
Optional region‑aware compression preserves text while shrinking photos.
Strategy 4: pikepdf Fallback
When all else fails, basic stream compression ensures some reduction.
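One way to picture how the strategies chain together is a loop that tries each in order, stops at the first result that meets the target, and otherwise keeps the smallest output seen. This is a sketch under assumptions: `Strategy`, `Outcome`, and `run_strategies` are hypothetical names, and strategies are modeled as plain functions over bytes rather than file paths to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

# A strategy takes the source bytes and a target size, and returns compressed
# bytes, or None if it cannot run (for example, a missing dependency).
Strategy = Callable[[bytes, int], Optional[bytes]]


@dataclass
class Outcome:
    strategy: str      # name of the strategy that produced the data
    data: bytes        # best output found
    met_target: bool   # True if len(data) <= target


def run_strategies(source: bytes, target_bytes: int,
                   strategies: Iterable[Tuple[str, Strategy]]) -> Outcome:
    """Try strategies in order; return the first that meets the target, else the smallest."""
    best = Outcome("original", source, len(source) <= target_bytes)
    if best.met_target:
        return best  # already small enough, nothing to do
    for name, strategy in strategies:
        try:
            data = strategy(source, target_bytes)
        except Exception:
            continue  # a failing strategy should not abort the whole chain
        if data is None:
            continue
        if len(data) <= target_bytes:
            return Outcome(name, data, True)
        if len(data) < len(best.data):
            best = Outcome(name, data, False)
    return best
```

With this shape, the fallback only has to produce something smaller than the input to be worth keeping, which matches the "some reduction" guarantee described above.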
The Desktop GUI
A Tkinter-based interface with a simple workflow:
Upload → Compress → Download
- Threaded background compression
- KB / MB target selector
- Temp file safety (no overwrites)
- Minimal state management
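The threading and temp-file points can be sketched with the standard library alone. The worker writes to a fresh temporary file and reports back through a queue, and the Tk main loop polls that queue so widgets are only touched from the UI thread. `start_compression` and `poll` are hypothetical helpers for this illustration; `compress` stands in for whatever function does the real work.

```python
import os
import queue
import tempfile
import threading
from typing import Callable


def start_compression(compress: Callable[[str, str], None], src: str,
                      results: "queue.Queue") -> threading.Thread:
    """Run compress(src, temp_path) on a worker thread and post the outcome to results."""

    def worker() -> None:
        # Write to a new temp file so the source is never overwritten.
        fd, temp_path = tempfile.mkstemp(suffix=os.path.splitext(src)[1])
        os.close(fd)
        try:
            compress(src, temp_path)
            results.put(("done", temp_path))
        except Exception as exc:
            os.remove(temp_path)  # do not leave a partial file behind
            results.put(("error", str(exc)))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def poll(root, results: "queue.Queue", on_result: Callable[[str, str], None],
         interval_ms: int = 100) -> None:
    """Check the queue from the Tk main loop; reschedule until a result arrives."""
    try:
        status, payload = results.get_nowait()
    except queue.Empty:
        root.after(interval_ms, poll, root, results, on_result, interval_ms)
    else:
        on_result(status, payload)
```

In a Tkinter app, the Compress button handler would call `start_compression(...)` followed by `poll(root, results, on_result)`, and the Download step would copy the temp file to a location the user picks.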
The CLI
python main.py compress document.pdf --target-size 1800
python main.py compress-dir input/ --target-size 500 --workers 8
python main.py analyze document.pdf
python main.py check
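What `compress-dir` with `--workers` might do internally can be sketched with `concurrent.futures`. This is an assumption about the approach, not the utility's code: `compress_dir`, `compress_one`, and the `SUPPORTED` extension set are hypothetical, and output files keep their input names for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Union

# Extensions handled in this sketch, based on the formats listed earlier.
SUPPORTED = {".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".webp"}


def compress_dir(src_dir, out_dir, compress_one: Callable[[Path, Path], None],
                 workers: int = 8) -> Dict[str, Union[int, str]]:
    """Compress every supported file in src_dir; map file name to output size or error text."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = [p for p in sorted(src_dir.iterdir()) if p.suffix.lower() in SUPPORTED]

    results: Dict[str, Union[int, str]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(compress_one, p, out_dir / p.name): p for p in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                results[path.name] = (out_dir / path.name).stat().st_size
            except Exception as exc:
                results[path.name] = f"failed: {exc}"  # one bad file does not stop the batch
    return results
```

Threads are a reasonable fit when most of the time is spent inside a Ghostscript subprocess; for CPU-bound pure-Python work, `ProcessPoolExecutor` would be the usual alternative.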
Using It as a Python Library
from src.compressor import FileCompressor

# Target size is given in kilobytes: 1800 KB is roughly 1.8 MB.
compressor = FileCompressor(target_size_kb=1800)
result = compressor.compress("input.pdf", "output.pdf")
Lessons Learned
- Binary search beats linear quality stepping
- Two‑axis optimization is essential
- Scanned PDFs need raster treatment
- Ghostscript discovery on Windows is non‑trivial
- Temp outputs prevent accidental data loss
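To illustrate the Windows discovery point: the console executables there are named `gswin64c` and `gswin32c` rather than `gs`, and the installer does not always add them to the PATH. A minimal sketch, assuming the default install layout under Program Files; `candidate_names` and `find_ghostscript` are hypothetical helpers, not the utility's actual lookup.

```python
import glob
import os
import shutil
import sys
from typing import List, Optional


def candidate_names(platform: str = sys.platform) -> List[str]:
    """Executable names to look for, in order of preference."""
    if platform == "win32":
        return ["gswin64c", "gswin32c", "gs"]
    return ["gs"]


def find_ghostscript() -> Optional[str]:
    """Return a path to a Ghostscript executable, or None if none is found."""
    # 1. Anything already on the PATH wins.
    for name in candidate_names():
        found = shutil.which(name)
        if found:
            return found

    # 2. On Windows, fall back to the default install layout,
    #    e.g. <Program Files>\gs\gs<version>\bin\gswin64c.exe (assumed layout).
    if sys.platform == "win32":
        for env in ("ProgramFiles", "ProgramFiles(x86)"):
            base = os.environ.get(env)
            if not base:
                continue
            pattern = os.path.join(base, "gs", "gs*", "bin", "gswin*c.exe")
            matches = sorted(glob.glob(pattern))
            if matches:
                return matches[-1]  # last match in sorted order
    return None
```

Returning `None` instead of raising lets the caller skip the Ghostscript strategy and fall through to the others.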
Getting Started
pip install -r requirements.txt
python main.py compress large.pdf --target-size 1800
Built with Python, Ghostscript, Pillow, and a healthy frustration with file size limits.
