Skip to content

Smaller World

In college, a junior was desperate. His girlfriend — a computer science major — had sent him a file he couldn't open. The extension was .rar. I hadn't seen one either. LZH, sure. ZIP, sure. But RAR was new to all of us. Someone eventually dug up a trial version of WinRAR and saved the day. How many times that 30-day trial got renewed, I couldn't say.

The history of compression formats is the history of one obsession: "it should be smaller."

ZIP went open and became the world standard. LZH dominated Japan's BBS culture but barely existed outside it. A Galápagos format. RAR had better compression ratios, but its algorithm stayed proprietary. You can decompress but not compress — a peculiar kind of freedom that has kept it alive for three decades. One went open, one stayed domestic, one let you unpack but not pack. Behind each format, a different philosophy.

Now Facebook is quietly shifting the ground with Zstandard (zstd). Comparable compression ratios to gzip, several times faster. Linux distributions have switched their package compression to zstd. Major browsers are starting to send Accept-Encoding: zstd. A crack in the wall that gzip held for over thirty years.

Push the question of what compression really is, and you hit an interesting limit.

In 1989, someone posted a compression tool called THcomp on an ASCII-net bulletin board. Short for Tower of Hanoi compress. The idea: assign a unique number to every possible file. To compress, just store the number. To decompress, reconstruct the original from the number. The ultimate compression. Any file, no matter how large, reduced to a single number.

A joke, of course. I caught the tail end of the dial-up BBS era and only know THcomp as a legend from the early days. But I love the audacity of someone in a bandwidth-starved world saying "just number everything." And it cuts to the heart of information theory. Shrink one file, and another must grow. No algorithm shrinks everything.

Compression works because real-world data is biased. Natural language has redundancy. Images have spatial correlation. Code has repetition. Find the bias, replace it with something shorter. That's all it is. But push "that's all it is" far enough, and you get a beast like zstd.

The junior and his girlfriend got married. They're both still in IT. Compression formats come and go, but a connection that decompresses cleanly tends to last.