
Compressing an Inverted Index

MygramDB's index crossed 1GB. After ingesting a million Wikipedia articles, the n-gram inverted index was eating more memory than I expected.

What is an inverted index?

Think of the index at the back of a book. "Tokyo" → p.12, p.45, p.198 — a mapping from a word to the pages where it appears. An inverted index is the same idea. "Tokyo" → [doc 1, doc 5, doc 42, doc 1003] — a mapping from a term to the IDs of documents containing it. It's called "inverted" because it reverses the natural direction: instead of document → words, it maps word → documents. This is the core data structure behind full-text search. Elasticsearch, Solr, MySQL's FULLTEXT index — virtually every system with search capability depends on it.

At search time, you look up the posting list (the list of IDs) for each term. Searching for "Tokyo Tower" means finding document IDs that appear in both the "Tokyo" list and the "Tower" list — an intersection. Simple structure. The problem is size.

What is an n-gram?

A fragment produced by sliding a window of N characters across the text. "Tokyo" becomes "To", "ok", "ky", "yo" as 2-grams (bigrams). No dictionary needed — works for any language. MygramDB uses 1-grams for kanji (single characters) and 2-grams for ASCII, a hybrid approach.
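A minimal bigram generator looks like this (ASCII only, and the function name is mine; real kanji handling needs UTF-8-aware iteration, which this sketch skips):

```cpp
#include <string>
#include <vector>

// Slide a 2-character window across ASCII text to produce bigrams.
// "Tokyo" -> "To", "ok", "ky", "yo"
std::vector<std::string> bigrams(const std::string& text) {
    std::vector<std::string> out;
    for (size_t i = 0; i + 2 <= text.size(); ++i)
        out.push_back(text.substr(i, 2));
    return out;
}
```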

With a million documents, high-frequency terms — "the" in English, の in Japanese — have posting lists with hundreds of thousands of document IDs. At 4 bytes each (uint32_t), 400,000 entries means 1.6MB. Tens of thousands of such lists exist. Store them as plain arrays and memory grows without bound.

This is a classic problem in full-text search, studied for over 30 years. MygramDB is an in-memory search engine. There's no disk to spill to. Compressing the index in memory is a matter of survival.

Delta Encoding — Sort First

Posting list IDs are sorted in ascending order. This is primarily for efficient intersection at search time, but it also enables compression. A sorted sequence of integers has one obvious property — the gap (delta) between adjacent values is smaller than the values themselves.

[10, 15, 20, 50, 51, 52] becomes [10, 5, 5, 30, 1, 1]. Keep the first value as-is, then record only "how much more than the previous." Even when the original IDs are in the hundreds of thousands, deltas typically stay in the tens to hundreds. Smaller values need fewer bits.

VByte (Variable Byte Encoding) — delta's companion

VByte is commonly paired with delta encoding. Instead of a fixed 4 bytes per integer, it uses a variable number of bytes depending on the value's magnitude.

The scheme is simple. The high bit of each byte signals "more bytes follow," leaving 7 bits for the value. Values up to 127 fit in 1 byte. Up to 16,383 in 2 bytes. Since deltas are typically small, most encode in 1–2 bytes — less than half the size of fixed 4-byte integers.

Lucene (the engine behind Elasticsearch and Solr) has used VByte variants for years. Google's early search implementation is said to have been VByte-based as well. Simple, fast to decode, decent compression. The "try this first" approach for inverted index compression, with nearly 30 years of track record.

MygramDB doesn't use VByte. It stores delta values as raw uint32_t. The compression ratio suffers, but there's a reason. More on that later.

The beauty of delta encoding is simplicity. Encoding and decoding are just addition — add the delta to the previous value. Little room for implementation bugs. For a search engine, a bug in index decoding means corrupted search results. Simplicity is a virtue.

Bitmaps — A Change of Perspective

Delta encoding has a weakness: terms that appear in most documents. If "the" appears in 800,000 out of a million documents, that's 800,000 delta values. The values may be small, but there are a lot of them. Even if every delta is 1 (the term appears in every consecutive document), storing 800,000 ones adds up.

Change the approach. Instead of listing which documents contain the term, represent presence as a sequence of bits.

With a million documents, allocate a million-bit array: bit i is 1 if the term appears in document i, and 0 if it doesn't. If the term appears in 800,000 documents, 800,000 of those bits are 1.

An intuitive way to think about bitmaps

Think of a class attendance sheet. One slot for every student — mark ○ for present, × for absent. A bitmap works the same way. One slot for every document — 1 if the term appears, 0 if it doesn't. Instead of listing the names of everyone who showed up, you keep a full roster of checks and blanks.

A million bits = 125KB. Compared to 800,000 uint32_t values (3.2MB), that's less than 1/25th the size. Better yet, intersecting two bitmaps is a single CPU bitwise AND — no merge-sorting ID lists. Search gets faster too.

But bitmaps backfire for sparse posting lists. A term appearing in only 100 out of a million documents would need a million bits (125KB) — over 300 times larger than 100 uint32_t values (400 bytes). Spending memory on a bit array that's almost all zeros defeats the purpose.

Dense lists favor bitmaps. Sparse lists favor deltas. The optimal representation depends on the density — the ratio of documents containing the term to total documents. The problem is that within a single inverted index, dense and sparse lists coexist. The posting list for "the" and the posting list for "Tokyo Skytree" need entirely different representations.

Roaring Bitmap — A 2014 Invention

Roaring Bitmap, published by Daniel Lemire et al. in 2014, provided an elegant solution to this "different densities need different representations" problem.

Roaring's core idea is the container. It splits the 32-bit integer space by the upper 16 bits and the lower 16 bits. IDs sharing the same upper 16 bits are grouped into one container. Each container can hold up to 65,536 IDs (the lower 16-bit space = 2^16 = 65,536).

For example, document IDs 0–65535 go into "container 0," IDs 65536–131071 into "container 1." ID 100 and ID 200 share container 0; ID 100000 lands in container 1.
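The split itself is two bit operations (sketch; function names are mine):

```cpp
#include <cstdint>

// Upper 16 bits select the container; lower 16 bits are the
// position stored inside it.
uint16_t container_key(uint32_t id) { return static_cast<uint16_t>(id >> 16); }
uint16_t container_low(uint32_t id) { return static_cast<uint16_t>(id & 0xFFFF); }
```

ID 100000 gives key 1 (it falls in the range 65536–131071) and low value 34464 (100000 − 65536).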

And here's the heart of Roaring — each container automatically selects from three internal representations based on its contents.

The first two are straightforward. Few IDs → Array (sorted ID list). Many IDs → Bitmap (bit array). This is the same sparse-versus-dense trade-off we've been discussing. The crossover point is 4,096 entries: 4,096 × 2 bytes = 8KB, and a bitmap is also 65,536 bits = 8KB. At exactly this count, the sizes match.

The third is Run (run-length encoding). It handles consecutive IDs — say, [100, 101, 102, ..., 199]. Instead of listing 100 individual IDs, store "start 100, length 100" in 4 bytes. This covers a pattern that neither Array nor Bitmap handles efficiently: long stretches of consecutive IDs. It shines with time-series-ordered data or batch imports that produce sequential ID ranges.

| Representation | When used         | Size                | Lookup speed           |
|----------------|-------------------|---------------------|------------------------|
| Array          | Few IDs (≤4096)   | 2 bytes × count     | Binary search O(log n) |
| Bitmap         | Many IDs (>4096)  | Fixed 8KB           | Direct access O(1)     |
| Run            | Consecutive IDs   | 4 bytes × run count | Binary search O(log n) |
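The selection rule falls out of the byte costs in the table. A simplified sketch (real Roaring implementations handle more cases, such as re-checking the choice as a container mutates):

```cpp
#include <cstddef>
#include <string>

// Pick a container representation from its cardinality and run count,
// by comparing the byte cost of each encoding.
std::string choose_container(size_t cardinality, size_t num_runs) {
    size_t array_bytes  = cardinality * 2;  // 2 bytes per 16-bit entry
    size_t bitmap_bytes = 65536 / 8;        // fixed 8KB bit array
    size_t run_bytes    = num_runs * 4;     // 4 bytes per (start, length) run
    if (run_bytes < array_bytes && run_bytes < bitmap_bytes) return "run";
    return array_bytes <= bitmap_bytes ? "array" : "bitmap";
}
```

100 scattered IDs (100 runs) pick an array; 100 consecutive IDs (1 run) pick a run; 5,000 scattered IDs cross the 4,096 threshold and pick a bitmap.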

Roaring mixes all three representations within a single data structure, at container granularity. One container might be an Array, the next a Bitmap, the one after that a Run — the optimal representation is chosen automatically based on local data characteristics.

Why Roaring became ubiquitous

Compressed bitmaps existed before Roaring. WAH (Word-Aligned Hybrid), EWAH (Enhanced WAH), Concise — all run-length encoding variants.

Roaring won because of random access speed. WAH/EWAH excel at sequential scans but are slow for point lookups like "does this contain ID 42?" — you have to decode run-lengths from the start. Roaring achieves O(log n) random access through binary search on containers, then Array/Bitmap lookup within.

Set operations (AND, OR, XOR) are fast too. They decompose into per-container operations on matching upper-16-bit groups, skipping containers that don't overlap. Bitmap-to-bitmap AND is a straight CPU bitwise AND.

Today, Lucene, Apache Spark, Redis, Elasticsearch, Pilosa, and Apache Druid all use Roaring. Daniel Lemire's CRoaring library (C/C++) includes SIMD optimizations using AVX2 and NEON for high-speed set operations.

MygramDB's Hybrid Strategy

MygramDB automatically switches between delta encoding and Roaring Bitmap per posting list. Delta-compressed lists and Roaring lists coexist within the same inverted index.

The threshold is 18% density. An n-gram's posting list is converted to Roaring when it appears in 18% or more of all documents. Below 18%, it stays as delta.

The 18% number was determined empirically. A theoretical optimum can be computed from delta array byte costs versus Roaring container overhead, but in practice it's workload-dependent. Japanese and English Wikipedia articles have quite different n-gram frequency distributions. Japanese uses kanji 1-grams, so characters like 的, 年, は tend to be high-density. English 2-grams spread more evenly. Since the optimal threshold varies by language, it's exposed as a tunable parameter (roaring_threshold).
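The decision itself is tiny. A sketch with the threshold exposed as a parameter, mirroring the tunable roaring_threshold described above (the function name is mine):

```cpp
#include <cstddef>

// Convert a posting list to Roaring once its density (fraction of
// all documents that contain the n-gram) reaches the threshold.
// 0.18 is the default discussed in the text.
bool should_use_roaring(size_t doc_freq, size_t total_docs,
                        double roaring_threshold = 0.18) {
    return static_cast<double>(doc_freq) / total_docs >= roaring_threshold;
}
```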

Hysteresis — Preventing Oscillation

Posting lists hovering near the threshold are trouble. If density bounces around 18% with each document addition or deletion, the list flips between delta and Roaring repeatedly. Each conversion means memory allocation and decode/encode cycles — unnecessary load on the CPU and memory allocator.

What is hysteresis?

A control technique that uses different thresholds for increasing and decreasing inputs. Think of a thermostat: if it turns on at 25°C and off at 25°C, the system cycles rapidly near that temperature. Set it to turn on at 26°C and off at 24°C, and it stabilizes. Offsetting the thresholds for the up-direction and down-direction — that's hysteresis.

MygramDB converts to Roaring at 18% density but doesn't convert back to delta until density drops to 9% (18% × 0.5). A density drop to 17% keeps the list as Roaring. It has to fall below 9% before switching back. This asymmetric threshold prevents oscillation near the boundary.
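The hysteresis rule can be sketched as a small state transition, built from the 18% / 9% figures above (names and defaults are illustrative):

```cpp
enum class Format { Delta, Roaring };

// Asymmetric thresholds: convert up at `up` density, but only
// convert back down below `up * down_factor`. Densities inside
// the band keep whatever format the list already has.
Format next_format(Format current, double density,
                   double up = 0.18, double down_factor = 0.5) {
    if (current == Format::Delta && density >= up)
        return Format::Roaring;
    if (current == Format::Roaring && density < up * down_factor)
        return Format::Delta;
    return current;  // inside the hysteresis band: no conversion
}
```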

RCU Optimization — Don't Stop Searching

Format conversion is heavy work. Decoding tens of thousands of IDs and re-encoding them in a different format takes time. But MygramDB is an in-memory database — it must keep answering search requests throughout. "Please wait while we optimize" is not an option.

What is RCU (Read-Copy-Update)?

A concurrency pattern. Instead of locking data to update it, you make a copy, update the copy, then atomically swap it in. Readers continue accessing the old data during the update — reads are never blocked. Widely used in the Linux kernel.

MygramDB's optimization runs in three steps.

In step ①, the optimizer takes a snapshot of the index. The lock is brief — just copying references to posting lists. In step ②, no lock is held at all. The optimizer works on the copied data while search threads continue using the original. Step ③ atomically swaps in the optimized data.

There's a catch: another thread might have updated the posting list during step ②. Each posting list carries a version counter. If the version has changed between steps ① and ③, the swap is skipped. This prevents an old snapshot from overwriting a newer write.
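The three steps with the version check, as a control-flow sketch (types and names are mine, and a real implementation swaps an atomic pointer so readers never block; this shows only the shape of the logic):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// A posting list plus a version counter bumped by every writer.
struct Postings {
    std::vector<uint32_t> ids;
    uint64_t version;
};

// Readers load `current` and keep using whatever snapshot they got.
struct Slot {
    std::shared_ptr<const Postings> current;
};

bool optimize(Slot& slot) {
    // Step 1: brief critical section -- just grab a reference
    // to the current data and remember its version.
    std::shared_ptr<const Postings> snap = slot.current;
    // Step 2: no lock held; rebuild a copy while readers keep
    // using *snap untouched.
    auto rebuilt = std::make_shared<Postings>(*snap);
    // ... re-encode rebuilt->ids into the better format here ...
    // Step 3: swap in the copy, unless a writer changed the list
    // in the meantime -- then the snapshot is stale and we skip.
    if (slot.current->version != snap->version) return false;
    slot.current = rebuilt;
    return true;
}
```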

Optimization processes n-grams in batches of 10,000. Other operations can interleave between batches, so even with hundreds of thousands of n-grams, the impact on search latency stays minimal.

Retracing the History of Compression

Inverted index compression techniques are the history of information retrieval itself.

In the 1990s, early search engines used plain integer arrays; delta encoding and VByte were introduced and dramatically reduced memory usage. The 2000s brought bit-packing techniques (Simple9/16, PForDelta) that decode integers in batches, improving both speed and compression; SIMD-accelerated variants followed. The 2010s saw Roaring Bitmap establish a new design paradigm: automatic optimization based on density.

MygramDB's hybrid strategy compresses this history into a single implementation. Delta for sparse lists (1990s wisdom), Roaring for dense lists (2014 invention). VByte was left out because storing delta values as raw uint32_t makes decoding faster and conversion to Roaring simpler. Trading memory savings (VByte could more than halve the footprint) for implementation simplicity and decode speed.

Looking back across 30 years of research, the conclusion is almost banal: the optimal compression method depends on the data. That's why Roaring is powerful. Choosing the representation after seeing the data, rather than committing to one method upfront, is almost always the right call.