Building a Python File Compressor with Ghostscript and Pillow

15 / Apr / 2026 by Shubham Agrawal

Building a Production‑Grade File Compression Utility in Python

We built an open‑source Python tool that shrinks 30–50 MB PDFs and images down to under 2 MB — with readable output, a desktop GUI, Docker support, CLI, and a one‑click .exe build. This post explains how it works under the hood.

The Problem Nobody Talks About

Every organization has the same quiet pain point: oversized files.

  • A 42 MB scanned contract that won’t attach to an email
  • A folder of 35 MB site photos that choke a web upload form
  • A SharePoint library ballooning because nobody compresses before uploading

Online compressors cap uploads, destroy quality, or charge monthly fees.
Manual Ghostscript commands? Great — if you memorize flags for a living.

We needed something that just works:
target a size, hit compress, done.

What This Utility Does

Capability | Details
Image compression | JPEG, PNG, TIFF, BMP, WebP → optimized JPEG
PDF compression | Ghostscript (primary), raster pipeline, pikepdf fallback
Precision targeting | Set an exact target size (e.g. 1.8 MB)
Scanned PDF intelligence | Auto‑detects text vs scanned pages
Batch processing | Parallel directory compression
Desktop GUI | Tkinter-based Upload → Compress → Download workflow
CLI | Click-powered command-line interface
Docker | Preconfigured image with system dependencies
Standalone .exe | PyInstaller build for zero-install distribution

Real‑World Compression Results

File Type | Input Size | Output Size | Reduction
Images | 30–50 MB | ~500 KB | ~98%
PDFs | 30–50 MB | ~1.8 MB | ~96%

A 42 MB scanned PDF becomes a crisp 1.7 MB file that remains perfectly readable.

Architecture Overview


src/
├── compressor.py
├── image_compressor.py
├── pdf_compressor.py
├── config.py
└── utils.py

main.py
compression_service.py
ui_app.py
ui_actions.py
ui_styles.py
state.py
  

The key design decision:
one compression core.
Both the CLI and GUI call compress_core().
No duplicated logic. No divergence.
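
To make the "one core, thin front ends" idea concrete, here is a minimal sketch. Only the name compress_core() comes from the tool; the signature, the CompressionResult record, the progress callback, and both adapter functions are assumptions made for illustration, and the core body is a placeholder that copies bytes instead of compressing them.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class CompressionResult:
    """Hypothetical result record shared by every front end."""
    source: Path
    output: Path
    original_bytes: int
    final_bytes: int

    @property
    def reduction_pct(self) -> float:
        if self.original_bytes == 0:
            return 0.0
        return 100.0 * (1 - self.final_bytes / self.original_bytes)


def compress_core(
    source: Path,
    output: Path,
    target_kb: int,
    on_progress: Optional[Callable[[str], None]] = None,
) -> CompressionResult:
    """Single entry point: every compression decision would live here."""
    notify = on_progress or (lambda message: None)
    notify(f"compressing {source.name} to <= {target_kb} KB")
    data = source.read_bytes()
    # Placeholder for the real image/PDF pipelines: copy the bytes unchanged.
    output.write_bytes(data)
    notify("done")
    return CompressionResult(source, output, len(data), output.stat().st_size)


def cli_compress(path: str, out: str, target_kb: int) -> str:
    """CLI adapter: progress goes to stdout, the result becomes a text line."""
    result = compress_core(Path(path), Path(out), target_kb, on_progress=print)
    return f"{result.final_bytes} bytes ({result.reduction_pct:.1f}% smaller)"


def gui_compress(path: str, out: str, target_kb: int, status: list) -> CompressionResult:
    """GUI adapter: same core, progress routed to a widget-like sink."""
    return compress_core(Path(path), Path(out), target_kb, on_progress=status.append)
```

The adapters differ only in where progress messages go and how the result is presented, which is what keeps the two front ends from drifting apart.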

How the Image Pipeline Works

  1. Load & Analyze — dimensions, mode, alpha channel
  2. Pre‑process — RGB conversion, alpha compositing
  3. Resize — proportional downscale using LANCZOS
  4. Optimize — dual‑axis quality + dimension tuning
  5. Validate — verify size and compression ratio

This two‑axis approach (quality + dimensions) enables aggressive targets
like 500 KB from a 40 MB image without turning it into mush.
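
The two‑axis search can be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual code: the function names, the scale ladder, and the quality bounds are all made up for the example. Pillow is imported lazily inside make_pillow_encoder() so the search itself can be exercised with any encoder function.

```python
import io
from typing import Callable, Optional, Sequence, Tuple

Encoder = Callable[[int, float], bytes]  # (quality, scale) -> JPEG bytes


def make_pillow_encoder(path: str) -> Encoder:
    """Load and pre-process once, then encode at any quality and scale."""
    from PIL import Image  # Image.Resampling needs Pillow 9.1 or newer.

    img = Image.open(path)
    if img.mode in ("RGBA", "LA"):
        # JPEG has no alpha channel, so composite onto a white background.
        rgba = img.convert("RGBA")
        background = Image.new("RGB", rgba.size, (255, 255, 255))
        background.paste(rgba, mask=rgba.getchannel("A"))
        img = background
    else:
        img = img.convert("RGB")

    def encode(quality: int, scale: float) -> bytes:
        size = (max(1, round(img.width * scale)), max(1, round(img.height * scale)))
        resized = img if scale == 1.0 else img.resize(size, Image.Resampling.LANCZOS)
        buffer = io.BytesIO()
        resized.save(buffer, format="JPEG", quality=quality, optimize=True)
        return buffer.getvalue()

    return encode


def fit_to_target(
    encode: Encoder,
    target_bytes: int,
    scales: Sequence[float] = (1.0, 0.85, 0.7, 0.55, 0.4, 0.3, 0.2),
    min_quality: int = 40,
    max_quality: int = 90,
) -> Optional[Tuple[bytes, int, float]]:
    """Return (data, quality, scale) for the largest scale that fits, else None."""
    for scale in scales:
        best = None
        lo, hi = min_quality, max_quality
        # Binary search for the highest quality that fits at this scale.
        while lo <= hi:
            quality = (lo + hi) // 2
            data = encode(quality, scale)
            if len(data) <= target_bytes:
                best = (data, quality, scale)
                lo = quality + 1
            else:
                hi = quality - 1
        if best is not None:
            return best
    return None
```

Dimensions are only reduced once quality alone cannot reach the target without dropping below min_quality, which is one way to avoid the "mush" that comes from pushing a full‑size image to very low quality.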

How the PDF Pipeline Works

Strategy 1: Ghostscript


gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.4 \
   -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dBATCH -dQUIET \
   -dColorImageResolution=120 \
   -dDownsampleColorImages=true \
   -sOutputFile=output.pdf \
   input.pdf
  

Presets and DPI values are tried in descending quality order until the target is met.
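
A minimal sketch of that descending ladder, driven from Python. The Ghostscript flags are standard pdfwrite options, but the ladder itself is an assumption: only /ebook at 120 DPI appears in this post, and the other preset and DPI pairs, the function names, and the injectable run parameter are illustrative.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterator, List, Optional, Tuple

# Assumed ladder, best quality first. Values are illustrative guesses.
LADDER = [
    ("/printer", (300, 200)),
    ("/ebook", (150, 120, 100)),
    ("/screen", (96, 72)),
]


def candidate_settings() -> Iterator[Tuple[str, int]]:
    """Yield (preset, dpi) pairs in descending quality order."""
    for preset, dpis in LADDER:
        for dpi in dpis:
            yield preset, dpi


def build_gs_command(src: Path, dst: Path, preset: str, dpi: int, gs: str = "gs") -> List[str]:
    """Build one pdfwrite invocation; explicit resolutions follow the preset."""
    return [
        gs,
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-dDownsampleColorImages=true",
        f"-dColorImageResolution={dpi}",
        "-dDownsampleGrayImages=true",
        f"-dGrayImageResolution={dpi}",
        f"-sOutputFile={dst}",
        str(src),
    ]


def compress_with_ghostscript(
    src: Path,
    dst: Path,
    target_bytes: int,
    run: Callable = subprocess.run,
) -> Optional[Tuple[str, int]]:
    """Return the first (preset, dpi) whose output meets the target, else None."""
    for preset, dpi in candidate_settings():
        run(build_gs_command(src, dst, preset, dpi), check=True)
        if dst.exists() and dst.stat().st_size <= target_bytes:
            return preset, dpi
    return None
```

Because the loop stops at the first setting that fits, the output keeps the highest quality on the ladder that still meets the target.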

Strategy 2: Raster Pipeline

  • Render pages via pdf2image
  • Enhance contrast and sharpness (OpenCV)
  • Binary‑search JPEG quality
  • Rebuild PDF via img2pdf
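
The raster steps above can be sketched like this. It is a simplified illustration: the OpenCV enhancement step is left out, the function names and the quality bounds are assumptions, and a single JPEG quality is applied to every page. pdf2image and img2pdf are imported lazily inside rasterize_and_rebuild() so the search can be tested with stand-in functions.

```python
from typing import Callable, List, Optional, Tuple


def search_page_quality(
    encode_pages: Callable[[int], List[bytes]],
    assemble: Callable[[List[bytes]], bytes],
    target_bytes: int,
    lo: int = 20,
    hi: int = 90,
) -> Optional[Tuple[int, bytes]]:
    """Binary-search the highest JPEG quality whose assembled PDF fits."""
    best = None
    while lo <= hi:
        quality = (lo + hi) // 2
        pdf = assemble(encode_pages(quality))
        if len(pdf) <= target_bytes:
            best = (quality, pdf)
            lo = quality + 1
        else:
            hi = quality - 1
    return best


def rasterize_and_rebuild(pdf_path: str, target_bytes: int, dpi: int = 150):
    """Render pages, search for a quality that fits, and rebuild the PDF."""
    import io

    import img2pdf
    from pdf2image import convert_from_path

    # Render once; only the JPEG encoding is repeated during the search.
    pages = [page.convert("RGB") for page in convert_from_path(pdf_path, dpi=dpi)]

    def encode_pages(quality: int) -> List[bytes]:
        encoded = []
        for page in pages:
            buffer = io.BytesIO()
            page.save(buffer, format="JPEG", quality=quality, optimize=True)
            encoded.append(buffer.getvalue())
        return encoded

    return search_page_quality(encode_pages, img2pdf.convert, target_bytes)
```

Measuring the assembled PDF rather than the sum of the JPEGs means container overhead is counted against the target too.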

Strategy 3: Auto‑Detection

Text pages are preserved. Scanned pages are raster‑compressed.
Optional region‑aware compression preserves text while shrinking photos.
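
One plausible way to make the text‑versus‑scan decision is a per‑page heuristic like the one below. Everything here is hypothetical: the PageStats fields, the thresholds, and the function names are assumptions, and gathering the measurements from a real PDF parser is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page measurements gathered by a PDF parser."""
    text_chars: int        # characters of extractable text on the page
    image_coverage: float  # fraction of the page area covered by images, 0..1


def classify_page(
    stats: PageStats,
    min_text_chars: int = 50,
    min_image_coverage: float = 0.8,
) -> str:
    """Call a page 'scanned' when it is mostly image and has almost no text."""
    if stats.text_chars < min_text_chars and stats.image_coverage >= min_image_coverage:
        return "scanned"
    return "text"


def plan_document(pages: List[PageStats]) -> List[str]:
    """Choose an action per page: rasterize scans, preserve everything else."""
    return ["rasterize" if classify_page(p) == "scanned" else "preserve" for p in pages]
```

A scan that already carries an OCR text layer would be classified as text by this rule, so a real detector would likely need extra signals.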

Strategy 4: pikepdf Fallback

When all else fails, basic stream compression ensures some reduction.
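
Here is a sketch of how a fallback chain like this could be wired. The pikepdf save options are real library parameters, but the chain runner, its name, and its "keep the smallest output that still beats the source" rule are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, Optional, Sequence, Tuple

Strategy = Tuple[str, Callable[[Path, Path], None]]  # (name, compress(src, dst))


def pikepdf_fallback(src: Path, dst: Path) -> None:
    """Lossless rewrite: recompress streams and pack objects into object streams."""
    import pikepdf  # Imported lazily so the chain runner works without it.

    with pikepdf.open(src) as pdf:
        pdf.save(
            dst,
            compress_streams=True,
            recompress_flate=True,
            object_stream_mode=pikepdf.ObjectStreamMode.generate,
        )


def run_chain(
    src: Path,
    workdir: Path,
    target_bytes: int,
    strategies: Sequence[Strategy],
) -> Optional[Tuple[str, Path]]:
    """Return the first strategy that meets the target.

    If none does, return the smallest output that is still smaller than the
    source, or None when nothing helped.
    """
    source_size = src.stat().st_size
    best = None
    best_size = source_size
    for name, strategy in strategies:
        candidate = workdir / f"{name}.pdf"
        try:
            strategy(src, candidate)
        except Exception:
            continue  # A failed strategy hands over to the next one.
        if not candidate.exists():
            continue
        size = candidate.stat().st_size
        if size <= target_bytes:
            return name, candidate
        if size < best_size:
            best, best_size = (name, candidate), size
    return best
```

With pikepdf_fallback placed last in the list, a document that defeats every lossy strategy can still come out somewhat smaller than it went in.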


The Desktop GUI

A Tkinter-based interface with a simple workflow:
Upload → Compress → Download

  • Threaded background compression
  • KB / MB target selector
  • Temp file safety (no overwrites)
  • Minimal state management
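
Threaded background compression in Tkinter usually follows a worker‑plus‑queue pattern, sketched below. This is not the actual GUI code: start_job(), build_app(), and the 100 ms polling interval are assumptions. The point it illustrates is that only the main thread touches widgets, while the worker reports back through a queue.

```python
import queue
import threading
from typing import Callable


def start_job(work: Callable[[], str], results: "queue.Queue") -> threading.Thread:
    """Run `work` off the UI thread and post ('ok', value) or ('error', message)."""
    def runner() -> None:
        try:
            results.put(("ok", work()))
        except Exception as exc:
            results.put(("error", str(exc)))

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    return thread


def build_app(work: Callable[[], str]):
    """Minimal window with one button; `work` stands in for the compression call."""
    import tkinter as tk

    root = tk.Tk()
    status = tk.StringVar(value="Idle")
    results: "queue.Queue" = queue.Queue()

    def poll() -> None:
        try:
            kind, payload = results.get_nowait()
        except queue.Empty:
            root.after(100, poll)  # Nothing yet; check again shortly.
            return
        status.set(payload if kind == "ok" else f"Failed: {payload}")
        button.config(state=tk.NORMAL)

    def on_click() -> None:
        button.config(state=tk.DISABLED)  # Prevent a second job while one runs.
        status.set("Compressing...")
        start_job(work, results)
        poll()

    button = tk.Button(root, text="Compress", command=on_click)
    button.pack(padx=20, pady=10)
    tk.Label(root, textvariable=status).pack(padx=20, pady=10)
    return root
```

Calling build_app(some_function).mainloop() would show the window; the interface stays responsive because the long‑running call never executes on the Tk event loop.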

The CLI


python main.py compress document.pdf --target-size 1800
python main.py compress-dir input/ --target-size 500 --workers 8
python main.py analyze document.pdf
python main.py check
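
The --workers option on compress-dir suggests a worker pool behind the scenes. A minimal sketch of how such a batch could be run with the standard library follows. The function name, the extension list, and the choice of threads over processes are assumptions; the per‑file compressor is passed in as a callable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict

SUPPORTED = {".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".webp"}


def compress_directory(
    input_dir: Path,
    output_dir: Path,
    compress_one: Callable[[Path, Path], int],
    workers: int = 4,
) -> Dict[str, object]:
    """Map each file name to its output size in bytes, or to the exception raised."""
    output_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(compress_one, p, output_dir / p.name): p for p in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path.name] = future.result()
            except Exception as exc:
                results[path.name] = exc  # One bad file should not stop the batch.
    return results
```

Threads are a reasonable fit when most of the time is spent waiting on an external Ghostscript process; CPU‑bound pure‑Python work would call for a process pool instead.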
  

Using It as a Python Library


from src.compressor import FileCompressor

compressor = FileCompressor(target_size_kb=1800)
result = compressor.compress("input.pdf", "output.pdf")
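
When calling the library on files you cannot afford to lose, the same temp‑file safety the GUI relies on can be applied around any compress call. The wrapper below is a hypothetical helper, not part of the tool's API; it takes the compressor as a plain callable.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def compress_safely(src: Path, dst: Path, compress: Callable[[Path, Path], None]) -> Path:
    """Write to a temp file beside `dst`, then swap it in only on success."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(suffix=dst.suffix, dir=dst.parent)
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        compress(src, tmp)
        if tmp.stat().st_size == 0:
            raise ValueError("compressor produced an empty file")
        os.replace(tmp, dst)  # Atomic when both paths are on the same filesystem.
    finally:
        tmp.unlink(missing_ok=True)  # No-op after a successful replace.
    return dst
```

Because the destination is only replaced after a complete, non‑empty output exists, a crash mid‑compression leaves the original untouched even when src and dst are the same path.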
  

Lessons Learned

  • Binary search beats linear quality stepping
  • Two‑axis optimization is essential
  • Scanned PDFs need raster treatment
  • Ghostscript discovery on Windows is non‑trivial
  • Temp outputs prevent accidental data loss

Getting Started


pip install -r requirements.txt
python main.py compress large.pdf --target-size 1800
  

Built with Python, Ghostscript, Pillow, and a healthy frustration with file size limits.
