Wrestling with ANN Indexes
The first time I tried to implement vector search, I wrote a naive brute-force comparison. It worked up to 100K vectors. Past a million, response times hit seconds. Of course. Computing cosine similarity across a million 768-dimensional vectors does that.
What is cosine similarity?
A measure of how closely two vectors point in the same direction. Values range from -1 (opposite) to 1 (identical). After converting text or images into vectors with hundreds of dimensions, cosine similarity is the most common way to determine "how similar" they are.
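As a formula, cosine similarity is dot(a, b) / (|a| * |b|). A minimal pure-Python sketch, no libraries assumed:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite -> -1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.0
```

Note that only direction matters: [3, 4] and [6, 8] score exactly 1.0 despite different lengths.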
Enter ANN — Approximate Nearest Neighbor. The word "approximate" says it all. Instead of guaranteeing the exact nearest neighbor, it returns "close enough" results orders of magnitude faster.
Ten Million Bottles of Wine
An analogy. Imagine a massive warehouse with ten million bottles of wine. You have one bottle and want to find the one with the closest taste. The brute-force approach is to taste all ten million. Even if you had a palate capable of distinguishing them, your life would be over.
ANN gets you to "probably around here" by organizing the warehouse cleverly. And there are several schools of thought on how to organize it.
Two, broadly.
IVF — Dividing the Warehouse into Zones
IVF (Inverted File Index) divides the warehouse into zones. Red Bordeaux here, white Burgundy there, rosé Provence over there — clustering.
At search time, you first determine which zone the target bottle is closest to, then search only that zone and its neighbors. Split ten million bottles into 1,000 zones, and you only need to check 10,000. A thousand-fold reduction in search space. Fast.
The catch: you need training upfront to decide how to divide the zones. You have to survey the data and compute cluster centroids. This is heavy computation. You can't redo it every time the data changes. FAISS is the canonical implementation.
What is k-means?
A method that automatically groups data into K clusters. It iteratively computes the "centroid" of each cluster, gathering similar data points together. In IVF, these centroids become the "representative points" for each zone.
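The core loop fits in a few lines. A pure-Python sketch of Lloyd's algorithm, with a toy deterministic initialization rather than a production seeding scheme like k-means++:

```python
def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm: assign every point to its nearest
    centroid, then move each centroid to the mean of its members."""
    # Toy initialization: pick points spread across the dataset.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(members[0])))
            if members else centroids[i]   # an empty cluster keeps its old centroid
            for i, members in enumerate(clusters)
        ]
    return centroids

# Two obvious groups; one centroid should settle near each.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
print(sorted(kmeans(points, k=2)))  # one centroid near (0, 0), one near (5, 5)
```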
HNSW — A Map Connecting the Wines
HNSW (Hierarchical Navigable Small World) builds a network — a graph — connecting wines by "similarity."
Upper layers are coarse maps. Lower layers are detailed maps. Start with the coarse map to get a rough position, descend to the lower layers for precision. Like zooming in on Google Maps from country level to street level.
The advantage of HNSW: when new wine arrives, just add a node to the graph and connect it to its neighbors. No upfront training like IVF. The trade-off: the entire graph lives in memory, and the edge information alone consumes substantial memory at scale.
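The routing idea can be sketched on a single layer: greedily hop to whichever neighbor is closer to the query, stop when no hop improves. This is a deliberate simplification; real HNSW stacks several such layers and keeps a beam of candidates (the ef parameter) to avoid getting stuck in local minima:

```python
def greedy_search(graph, vectors, entry, query):
    """Greedy routing on one graph layer. Real HNSW runs this on a
    stack of layers, coarse to fine, with a candidate beam."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))
    current = entry
    while True:
        best_neighbor = min(graph[current], key=dist, default=current)
        if dist(best_neighbor) >= dist(current):
            return current          # no neighbor is closer: stop here
        current = best_neighbor

# A tiny chain graph: node i holds the vector (i,).
vectors = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, vectors, entry=0, query=(3.2,)))  # walks 0->1->2->3, stops at 3
```

Inserting a new vector is the mirror image: run this search to find its nearest nodes, then wire edges to them. That locality is why no global training pass is needed.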
The Data Never Stops
So much for the algorithms. The problem is that in real systems, data never stops.
Vectors arrive every second while search requests keep coming. New wine shows up at the warehouse every day, and you have to keep filling orders while reorganizing the shelves.
This tension between writes and reads is where vector database design philosophies diverge.
FAISS is a library, so it hands this problem to the developer. train() and add() are cleanly separated — train once, then add incrementally. "When to reorganize the shelves is the warehouse manager's problem."
Milvus solves this with what amounts to a "loading dock" and "organized shelves" — a two-tier structure. New wine goes straight to the loading dock (Growing Segment). No labels, no shelf numbers. When a customer asks "find me something similar," the staff searches the organized shelves by map while tasting their way through the loading dock. When the dock fills up, it's sealed and the crew starts moving bottles to the organized shelves in the back. Meanwhile, new wine arrives at a fresh loading dock, and customer orders keep getting filled.
Qdrant has a similar two-tier structure. It writes to a WAL (Write-Ahead Log) first, then propagates to segments. Segments start with a plain index — shelf numbers only, nothing organized — and once enough data accumulates, an optimizer thread builds an HNSW index in the background. Searches continue against the old segment during construction. The key difference from Milvus: it's HNSW-based. No upfront training means smoother segment transitions.
Weaviate puts HNSW front and center. Since it's graph-based, new vectors become nodes on the map the instant they arrive. No loading dock needed. Everything is organized shelves. The trade-off: holding the entire graph in memory means higher memory consumption than IVF-based approaches for the same data volume. Recently, flat indexes and a dynamic mode — starting flat and switching to HNSW past a threshold — have been added.
Elasticsearch sits on top of Lucene's segment architecture. Each segment holds its own independent HNSW graph. New documents get buffered in memory, then flushed as new segments. A background merge policy periodically consolidates segments, keeping the largest segment's HNSW graph and adding vectors from the others. If you know Elasticsearch, it's the same segment merge you've always known — just with vectors now.
Line them up and a pattern emerges. Each DB's design philosophy is a different answer to the same question: where do you draw the line between training and serving?
| DB | Approach |
|---|---|
| FAISS | "That's the developer's problem" |
| Milvus / Qdrant | "Absorb it with a two-tier structure" |
| Weaviate | "Eliminate training entirely" |
| Elasticsearch | "Ride the existing segment merge" |
None of these is the right answer across the board. Data volume, update frequency, memory budget, operational capacity. There are many axes to weigh.
Building It in nvecd
I'm building a small vector search engine called nvecd. It fuses event co-occurrence data with vector similarity for real-time recommendations. Think TikTok-style feeds.
Brute Force Is Surprisingly Fast
nvecd works with 32-dimensional vectors. That's a different world from the 768 dimensions mentioned earlier (output from language models like Sentence-BERT). Lower dimensions mean dramatically cheaper computation. With SIMD optimization, even 10 million vectors came back in 28.5ms per query. 500 concurrent users, 100% success rate.
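The scan itself is trivial; the speed comes from SIMD and the low dimensionality. A pure-Python sketch of the same loop (no SIMD, illustrative only):

```python
import heapq

def brute_force_topk(vectors, query, k):
    """Linear scan: score all n vectors against the query and keep the
    k best. No index, no training; cost is always O(n * dim)."""
    def neg_dot(item):
        _, v = item
        return -sum(a * b for a, b in zip(v, query))  # larger dot product = more similar
    return [i for i, _ in heapq.nsmallest(k, enumerate(vectors), key=neg_dot)]

vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (-1.0, 0.0)]
print(brute_force_topk(vectors, query=(1.0, 0.0), k=2))  # -> [0, 2]
```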
| Data size | Per-query latency | 500 concurrent |
|---|---|---|
| 1M vectors | 3.4 ms | No issues |
| 10M vectors | 28.5 ms | Barely |
Honestly, I thought this would be fine for a while.
But 28.5ms isn't "fast." It's "barely tolerable." With 500 simultaneous queries, the thread pool starts backing up. The moment load hits 1.5x, it breaks. With growth past 10 million on the horizon, ANN was necessary.
Writing IVF Without FAISS
I could have depended on FAISS. I didn't. Heavy builds, licensing concerns, and above all — I didn't want another library I couldn't crack open.
The IVF principle is simple.
- Split vectors into clusters with k-means
- Record which vectors belong to each cluster (inverted list — in warehouse terms, "Zone A contains these wines")
- At search time, only check the clusters closest to the query
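Assuming the centroids already exist (hand-picked here instead of trained with k-means), the three steps above sketch out like this:

```python
from collections import defaultdict

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Inverted lists: cluster id -> ids of the vectors assigned to it."""
    lists = defaultdict(list)
    for vid, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda i: dist2(v, centroids[i]))
        lists[c].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=2):
    """Probe only the nprobe clusters whose centroids are nearest the
    query, then scan just the vectors in those inverted lists."""
    probed = sorted(range(len(centroids)), key=lambda i: dist2(query, centroids[i]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    return min(candidates, key=lambda vid: dist2(query, vectors[vid]))

centroids = [(0.0, 0.0), (10.0, 10.0)]   # "zone representatives"
vectors = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]
lists = build_ivf(vectors, centroids)
print(ivf_search((9.5, 9.4), vectors, centroids, lists, nprobe=1))  # -> 3
```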
On my own, maybe I couldn't have. But this is the age of AI. There was no excuse not to write it myself.
I wrote it. Tests passed. Then I plugged it into the server and hell began.
Hell #1: Deadlock
The VECSET command writes to the VectorStore, then notifies the IVF index: "new vector arrived." The notification handler tries to read from the VectorStore.
C++'s std::shared_mutex does not support recursive locking: a thread that already holds the lock and calls lock_shared() again invokes undefined behavior per the standard. It works in small tests. It silently freezes under 16-thread load testing.
The fix was simple. Copy the vector data and release the lock before sending the notification. Finding the cause was not simple.
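The shape of the fix, sketched in Python with illustrative names (not nvecd's actual API). Python's threading.Lock also hangs forever if the same thread acquires it twice, so the rule carries over: snapshot under the lock, release, then notify.

```python
import threading

class VectorStore:
    """Minimal store illustrating the fix pattern."""
    def __init__(self):
        self._lock = threading.Lock()
        self._vectors = {}
        self._listeners = []

    def subscribe(self, fn):
        self._listeners.append(fn)

    def set(self, key, vector):
        with self._lock:
            self._vectors[key] = vector
            snapshot = list(vector)      # copy while holding the lock
        # Lock released here. If a listener touched the store while we
        # still held self._lock inside the 'with' block, it would deadlock.
        for notify in self._listeners:
            notify(key, snapshot)

store = VectorStore()
received = []
store.subscribe(lambda key, vec: received.append((key, tuple(vec))))
store.set("a", [1.0, 2.0])
print(received)  # -> [('a', (1.0, 2.0))]
```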
Hell #2: Training Is Too Heavy
k-means training is expensive. Computing 1,024 clusters over 10 million vectors takes tens of seconds. If that runs on a request-handling thread, the server freezes.
The FAISS documentation casually says "call train separately." How much design judgment is packed behind that one line — I didn't understand until I implemented it myself.
I rewrote it to run training asynchronously on a background thread, falling back to brute-force search until training completes.
Hell #3: Getting the Cluster Count Wrong
The rule of thumb for IVF cluster count (nlist) is sqrt(n) — the square root of the data size. A million vectors means 1,000 clusters.
But I was training on a sample of 50K vectors. I computed n from the sample size: sqrt(50000) = 223 clusters.
Put a million vectors into 223 clusters and each cluster holds ~4,500 vectors. Probing 10 clusters at search time means scanning 4,500 × 10 = 45,000 vectors. Faster than brute force over a million, but far from the "check only 1% of the data" I'd expected.
nlist must be computed from the total data size, not the sample. Easy to lose sight of when sampling is involved.
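The arithmetic, spelled out:

```python
import math

n_total, n_sample, nprobe = 1_000_000, 50_000, 10

for label, n in (("nlist from sample", n_sample), ("nlist from total", n_total)):
    nlist = int(math.sqrt(n))        # rule of thumb: nlist ~ sqrt(n)
    per_cluster = n_total // nlist   # the full million still lands in these clusters
    print(label, nlist, per_cluster, per_cluster * nprobe)

# nlist from sample 223 4484 44840   -> ~4.5% of the data scanned per query
# nlist from total 1000 1000 10000   -> the intended 1%
```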
Hell #4: Adding One at a Time Is Too Slow
After training, I was adding vectors to the IVF index one at a time. Each add acquired and released a write lock. A million lock/unlock cycles. Load speed tanked.
On top of that, training was copying the full vector data. 10 million × 32 dimensions × 4 bytes = 1.2GB. Memory spiked and macOS silently killed the process.
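The fix for the lock churn is batching: do the expensive cluster assignment with no lock held, then take the write lock once per batch instead of once per vector. A sketch with illustrative names:

```python
import threading
from collections import defaultdict

def add_batch(index_lists, centroids, batch, lock):
    """Assign a whole batch of (vid, vector) pairs outside the lock,
    then take the write lock once to splice the results in."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    assigned = defaultdict(list)
    for vid, v in batch:                 # expensive part: no lock held
        c = min(range(len(centroids)), key=lambda i: dist2(v, centroids[i]))
        assigned[c].append(vid)
    with lock:                           # one acquire/release per batch
        for c, vids in assigned.items():
            index_lists.setdefault(c, []).extend(vids)

lock = threading.Lock()
lists = {}
centroids = [(0.0, 0.0), (10.0, 10.0)]
add_batch(lists, centroids, [(0, (0.0, 1.0)), (1, (10.0, 9.0))], lock)
print(lists)  # -> {0: [0], 1: [1]}
```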
PQ (Product Quantization) — Another answer to the memory problem
PQ (Product Quantization) is commonly paired with IVF. It compresses vectors into short codes, reducing memory usage by 10–100x. FAISS's IVF-PQ is the most widely used practical combination. nvecd didn't need it — 32 dimensions is low enough — but for 768+ dimensional vectors, it's worth considering.
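Rough numbers, assuming a 768-dimensional float32 vector and an illustrative PQ configuration of m = 32 sub-vectors with 256 centroids each (so one 1-byte code per sub-vector):

```python
# Memory per vector, before and after PQ (illustrative configuration).
dim, float_bytes = 768, 4
m = 32                        # sub-vectors; 256 centroids each -> 1 byte per code

raw = dim * float_bytes       # 3072 bytes per uncompressed vector
pq = m                        # 32 bytes per PQ code
print(raw, pq, raw // pq)     # -> 3072 32 96  (96x compression)
```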
Adopting the Two-Tier Structure
I went back to study how the production databases solve this. The common thread: separate the data structures for writes and reads.
nvecd adopted the Milvus-style two-tier approach.
Three key points.
- Writes always go to the buffer. Only a dedicated buffer lock is needed, so writes never interfere with IVF search or training
- Search hits both buffer and IVF, merging results. If the buffer is small, brute-force is fast enough
- Seal is asynchronous. Swap extracts the buffer contents (lock released immediately), then flows them into the IVF in the background. Writes and searches never stop
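The shape of those three points, sketched in Python with illustrative names (nvecd itself is C++; this is not its code). A dict of sealed vectors stands in for the IVF-indexed segment:

```python
import threading

class TwoTierIndex:
    """Buffer + sealed split: writes touch only the buffer lock, and
    seal swaps the buffer out so slow indexing runs with no lock held."""
    def __init__(self):
        self._buf_lock = threading.Lock()
        self._buffer = {}    # growing segment: brute-force searched
        self._sealed = {}    # stands in for the IVF-indexed segment

    def add(self, key, vector):
        with self._buf_lock:                       # never blocks on IVF work
            self._buffer[key] = vector

    def seal(self):
        with self._buf_lock:
            batch, self._buffer = self._buffer, {} # swap, release immediately
        self._sealed.update(batch)                 # slow indexing happens lock-free

    def search(self, query, dist):
        with self._buf_lock:
            buf = dict(self._buffer)
        candidates = {**self._sealed, **buf}       # merge both tiers
        return min(candidates, key=lambda k: dist(candidates[k], query))

idx = TwoTierIndex()
dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
idx.add("a", (0.0, 0.0))
idx.seal()                       # "a" moves to the sealed tier
idx.add("b", (5.0, 5.0))         # "b" sits in the buffer
print(idx.search((4.0, 4.0), dist))  # -> b (found via the buffer scan)
```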
Result: 1 million vectors loaded in 28 seconds, 500 concurrent users at 100% success.
| Metric | Brute-force only | IVF two-tier |
|---|---|---|
| Load speed | 128K ops/sec | 35K ops/sec |
| 500 concurrent success | 100% | 100% |
| Search p50 | 0.18 ms | 19 ms (buffer included) |
The tens of thousands of vectors still in the buffer add brute-force overhead, pushing p50 to 19ms — slower than pure brute-force at 3.4ms. But once the seal catches up and the buffer empties, it's IVF-only. Searching just 10 clusters is fast.
What I Learned
Implementing it myself, one thing became clear: ANN algorithms are straightforward to write from a textbook. k-means, IVF — the core logic fits in 100 lines.
What's hard is embedding them in a concurrent system.
- Lock granularity
- When and on which thread to run training
- Consistency between buffer and index
- Memory limits
None of this is in the papers. All of it is unavoidable in production. The reason Milvus and Qdrant have large codebases isn't algorithmic complexity. It's the sheer volume of this integration work.
Close Enough
One thing I can say with certainty. "Approximate" is good enough more often than you'd think.
100% exact nearest neighbors versus approximate nearest neighbors at 99.5% recall. Users can't tell the difference. Clinging to brute-force perfection instead of returning close-enough results at speed is the wrong call in almost every case.
Wine taste, vector space distance — "close enough" is usually enough.