The Hidden Work Behind Gzip: How It Compresses So Well



This content originally appeared on DEV Community and was authored by Rijul Rajesh

When I exported my CloudWatch logs to an S3 bucket, I noticed something surprising.
A 35MB log file had turned into a tiny 2.5MB .gz file. That made me wonder: how can Gzip compress plain text so efficiently without losing any data?

I decided to dig deeper into how the .gz format works and what really happens when you compress something like a CSV or a log file.

In this article, let’s explore how Gzip actually compresses a CSV file behind the scenes.

What is a .gz File

A .gz file is a compressed version of another file, created using the GNU zip (gzip) program. It uses the DEFLATE algorithm, which combines two techniques, LZ77 and Huffman coding, to shrink the original data.
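To see that a .gz file is a small wrapper around compressed data, you can peek at its first bytes. The sketch below compresses an illustrative sample in memory (the sample bytes are an assumption, not data from this article) and reads the magic number and compression method defined by the gzip format.

```python
import gzip

# Compress a small, made-up sample in memory.
sample = b"name,age,city\nJohn,25,London\n"
blob = gzip.compress(sample)

# A gzip stream starts with the magic bytes 0x1f 0x8b,
# followed by the compression method (8 means DEFLATE).
magic = blob[:2]
method = blob[2]

print(magic.hex())  # 1f8b
print(method)       # 8
```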

Gzip is particularly effective for text-based files like CSV, JSON, or HTML, which often contain repeating patterns.

Understanding CSV Files

A CSV (Comma-Separated Values) file is just plain text that stores data in a table-like structure. For example:

name,age,city
John,25,London
Jane,30,Paris
Jake,25,London

Since words like “London” or numbers like “25” appear multiple times, the file contains many repeated patterns that can be compressed efficiently.

How Gzip Compresses a CSV

When you run a command like:

gzip data.csv

it performs a few steps internally.

1. Scanning for Repetition

Gzip reads the file and looks for repeated sequences of bytes. In the CSV above, “London”, “,25,” and other recurring patterns are ideal candidates for compression.

2. LZ77 Compression

LZ77 looks for repeated sections of data and replaces them with references to where that data appeared earlier in the file.

Example:

Original data:

LONDON,LONDON,LONDON

Instead of storing “LONDON” three times, LZ77 works like this:

LONDON,<ref to "LONDON">,<ref to "LONDON">

So only the first “LONDON” is stored in full.
The next two are replaced with back-references, (distance, length) pairs that say “go back X bytes and copy Y bytes.”

This alone removes a lot of redundancy, especially in CSVs with repeating names, values, or cities.
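The back-reference idea can be sketched with a toy matcher. This is a simplified illustration, not gzip's actual implementation: the function names lz77_tokens and lz77_expand are hypothetical, the search is a slow brute-force scan, and there is no cap on match length. The 32 KB window and the minimum match length of 3 mirror DEFLATE's limits.

```python
def lz77_tokens(data: str, window: int = 32 * 1024, min_len: int = 3):
    """Greedy toy LZ77: emit literal characters or (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Try every earlier start position inside the window.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (it can overlap itself).
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_expand(tokens):
    """Rebuild the original text by replaying literals and back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # copy one character from `dist` back
        else:
            out.append(tok)
    return "".join(out)


tokens = lz77_tokens("LONDON,LONDON,LONDON")
print(tokens)
# ['L', 'O', 'N', 'D', 'O', 'N', ',', (7, 13)]
print(lz77_expand(tokens))
# LONDON,LONDON,LONDON
```

Note that this greedy matcher covers both repeats with a single reference: "go back 7 and copy 13" overlaps the text it is producing, which LZ77 allows.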

3. Huffman Coding

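After LZ77, Gzip applies Huffman coding, which reduces the number of bits needed to store the most frequent symbols. In the DEFLATE format these symbols are the literal bytes and back-references that LZ77 produced; the examples here use plain characters to keep things simple.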

Huffman coding assigns shorter binary codes to frequent characters and longer codes to rare ones.

Example:

Imagine this simplified frequency table:

Character   Frequency   Huffman Code
,           Very High   0
O           High        10
N           Medium      110
L           Low         1110
D           Low         1111

So “LONDON”, which normally takes 6 bytes (48 bits) in plain text, shrinks to 18 bits with these codes:

1110 10 110 1111 10 110

Because commas and letters that occur often take fewer bits, the overall size drops dramatically.
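To make this concrete, here is a minimal sketch that builds Huffman code lengths for the sample CSV from earlier. It works on plain characters, whereas DEFLATE codes literal/length and distance symbols, so treat it as an illustration of the principle only. The function name huffman_code_lengths is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict:
    """Return the Huffman code length in bits for each character of text."""
    freq = Counter(text)
    if len(freq) == 1:
        return {sym: 1 for sym in freq}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in tree}).
    heap = [(w, n, {sym: 0}) for n, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols move one level down.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "name,age,city\nJohn,25,London\nJane,30,Paris\nJake,25,London\n"
lengths = huffman_code_lengths(text)
total_bits = sum(lengths[ch] for ch in text)

print("comma code length:", lengths[","])
print("fixed 8-bit size:", 8 * len(text), "bits")
print("Huffman-coded size:", total_bits, "bits")
```

The comma is the most frequent character in this sample, so it ends up with one of the shortest codes, while characters that appear once get the longest.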

The Result

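After these steps, you get a compressed file named data.csv.gz. By default, gzip replaces data.csv with the compressed file; pass -k (gzip -k data.csv) if you want to keep the original as well.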

Gzip is lossless, meaning that when you decompress it, you get back the exact original file with no changes.

Compression efficiency depends on how repetitive your CSV is. Files with many repeating values compress very well. Files with random or unique data may not shrink as much.
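A quick way to check both claims is to compress highly repetitive data and random data of the same size, then compare. The inputs below are made up for illustration, and the helper name ratio is hypothetical.

```python
import gzip
import os

# Illustrative inputs: one row repeated many times vs. random bytes.
repetitive = b"Jake,25,London\n" * 10_000
random_bytes = os.urandom(len(repetitive))


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(gzip.compress(data)) / len(data)


print(f"repetitive: {ratio(repetitive):.4f}")
print(f"random:     {ratio(random_bytes):.4f}")

# Lossless: decompressing returns exactly the bytes that went in.
assert gzip.decompress(gzip.compress(repetitive)) == repetitive
```

The repetitive input shrinks to a tiny fraction of its size, while the random input stays about the same size or grows slightly, because there are no repeated patterns to reference.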

Example comparison, using the sizes from my log export:

File         Original Size   Compressed Size
log export   35 MB           2.5 MB

Decompressing a .gz File

You can easily decompress it using:

gunzip data.csv.gz

or

gzip -d data.csv.gz

Either command restores the original data.csv file and, by default, removes data.csv.gz.

Doing It in Python

If you want to compress or decompress CSV files in Python, the gzip module makes it simple:

import gzip
import shutil

# Compress
with open('data.csv', 'rb') as f_in:
    with gzip.open('data.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Decompress
with gzip.open('data.csv.gz', 'rb') as f_in:
    with open('data_restored.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

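This code does the same job as the terminal commands, letting you handle .gz files programmatically. One difference: it leaves the input files in place, whereas gzip and gunzip replace them by default.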
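You often don't need to decompress to disk at all. As a sketch, gzip.open in text mode can be combined with the csv module to write and read rows directly. The rows and the temporary file path below are assumptions made up for the example.

```python
import csv
import gzip
import os
import tempfile

# Illustrative rows and a throwaway path.
rows = [
    {"name": "John", "age": "25", "city": "London"},
    {"name": "Jane", "age": "30", "city": "Paris"},
    {"name": "Jake", "age": "25", "city": "London"},
]
path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")

# 'wt' opens the archive in text mode, so csv can write to it directly.
with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "city"])
    writer.writeheader()
    writer.writerows(rows)

# 'rt' decompresses on the fly while csv parses each line.
with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
    restored = list(csv.DictReader(f))

for row in restored:
    print(row["name"], row["city"])
```

Reading this way streams the file, which helps when the uncompressed CSV is too large to keep on disk comfortably.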

If you work with large datasets, try compressing your CSVs with Gzip. It’s simple, effective, and saves a lot of space without changing your data.


