Wrestling with ANN Indexes
The first time I tried to implement vector search, I wrote a naive brute-force comparison. It worked up to 100K vectors. Past a million, response times hit seconds. Of course. Computing cosine similarity across a million 768-dimensional vectors does that.
What is cosine similarity?
A measure of how closely two vectors point in the same direction. Values range from -1 (opposite) to 1 (identical). After converting text or images into vectors with hundreds of dimensions, cosine similarity is the most common way to determine "how similar" they are.
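As a formula, cosine similarity is dot(a, b) / (|a| * |b|). A minimal pure-Python sketch, no libraries assumed:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite -> -1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.0
```

Note that only direction matters: [3, 4] and [6, 8] score exactly 1.0 despite different lengths.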
Enter ANN — Approximate Nearest Neighbor. The word "approximate" says it all. Instead of guaranteeing the exact nearest neighbor, it returns "close enough" results orders of magnitude faster.
Ten Million Bottles of Wine
An analogy. Imagine a massive warehouse with ten million bottles of wine. You have one bottle and want to find the one with the closest taste. The brute-force approach is to taste all ten million. Even if you had a palate capable of distinguishing them, your life would be over.
ANN gets you to "probably around here" by organizing the warehouse cleverly. And there are several schools of thought on how to organize it.
Two, broadly.
IVF — Dividing the Warehouse into Zones
IVF (Inverted File Index) divides the warehouse into zones. Red Bordeaux here, white Burgundy there, rosé Provence over there — clustering.
At search time, you first determine which zone the target bottle is closest to, then search only that zone and its neighbors. Split ten million bottles into 1,000 zones, and you only need to check 10,000. A thousand-fold reduction in search space. Fast.
The catch: you need training upfront to decide how to divide the zones. You have to survey the data and compute cluster centroids. This is heavy computation. You can't redo it every time the data changes. FAISS is the canonical implementation.
What is k-means?
A method that automatically groups data into K clusters. It iteratively computes the "centroid" of each cluster, gathering similar data points together. In IVF, these centroids become the "representative points" for each zone.
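The core loop fits in a few lines. A pure-Python sketch of Lloyd's algorithm, with a toy deterministic initialization rather than a production seeding scheme like k-means++:

```python
def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm: assign every point to its nearest
    centroid, then move each centroid to the mean of its members."""
    # Toy initialization: pick points spread across the dataset.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(members[0])))
            if members else centroids[i]   # an empty cluster keeps its old centroid
            for i, members in enumerate(clusters)
        ]
    return centroids

# Two obvious groups; one centroid should settle near each.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
print(sorted(kmeans(points, k=2)))  # one centroid near (0, 0), one near (5, 5)
```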
HNSW — A Map Connecting the Wines
HNSW (Hierarchical Navigable Small World) builds a network — a graph — connecting wines by "similarity."
Upper layers are coarse maps. Lower layers are detailed maps. Start with the coarse map to get a rough position, descend to the lower layers for precision. Like zooming in on Google Maps from country level to street level.
The advantage of HNSW: when new wine arrives, just add a node to the graph and connect it to its neighbors. No upfront training like IVF. The trade-off: the entire graph lives in memory, and the edge information alone consumes substantial memory at scale.
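The routing idea can be sketched on a single layer: greedily hop to whichever neighbor is closer to the query, stop when no hop improves. This is a deliberate simplification; real HNSW stacks several such layers and keeps a beam of candidates (the ef parameter) to avoid getting stuck in local minima:

```python
def greedy_search(graph, vectors, entry, query):
    """Greedy routing on one graph layer. Real HNSW runs this on a
    stack of layers, coarse to fine, with a candidate beam."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))
    current = entry
    while True:
        best_neighbor = min(graph[current], key=dist, default=current)
        if dist(best_neighbor) >= dist(current):
            return current          # no neighbor is closer: stop here
        current = best_neighbor

# A tiny chain graph: node i holds the vector (i,).
vectors = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, vectors, entry=0, query=(3.2,)))  # walks 0->1->2->3, stops at 3
```

Inserting a new vector is the mirror image: run this search to find its nearest nodes, then wire edges to them. That locality is why no global training pass is needed.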
The Data Never Stops
So much for the algorithms. The problem is that in real systems, data never stops.
Vectors arrive every second while search requests keep coming. New wine shows up at the warehouse every day, and you have to keep filling orders while reorganizing the shelves.
This tension between writes and reads is where vector database design philosophies diverge.
FAISS is a library, so it hands this problem to the developer. train() and add() are cleanly separated — train once, then add incrementally. "When to reorganize the shelves is the warehouse manager's problem."
Milvus solves this with what amounts to a "loading dock" and "organized shelves" — a two-tier structure. New wine goes straight to the loading dock (Growing Segment). No labels, no shelf numbers. When a customer asks "find me something similar," the staff searches the organized shelves by map while tasting their way through the loading dock. When the dock fills up, it's sealed and the crew starts moving bottles to the organized shelves in the back. Meanwhile, new wine arrives at a fresh loading dock, and customer orders keep getting filled.
Qdrant has a similar two-tier structure. It writes to a WAL (Write-Ahead Log) first, then propagates to segments. Segments start with a plain index — shelf numbers only, nothing organized — and once enough data accumulates, an optimizer thread builds an HNSW index in the background. Searches continue against the old segment during construction. The key difference from Milvus: it's HNSW-based. No upfront training means smoother segment transitions.
Weaviate puts HNSW front and center. Since it's graph-based, new vectors become nodes on the map the instant they arrive. No loading dock needed. Everything is organized shelves. The trade-off: holding the entire graph in memory means higher memory consumption than IVF-based approaches for the same data volume. Recently, flat indexes and a dynamic mode — starting flat and switching to HNSW past a threshold — have been added.
Elasticsearch sits on top of Lucene's segment architecture. Each segment holds its own independent HNSW graph. New documents get buffered in memory, then flushed as new segments. A background merge policy periodically consolidates segments, keeping the largest segment's HNSW graph and adding vectors from the others. If you know Elasticsearch, it's the same segment merge you've always known — just with vectors now.
Line them up and a pattern emerges. Each DB's design philosophy is a different answer to the same question: where do you draw the line between training and serving?
| DB | Approach |
|---|---|
| FAISS | "That's the developer's problem" |
| Milvus / Qdrant | "Absorb it with a two-tier structure" |
| Weaviate | "Eliminate training entirely" |
| Elasticsearch | "Ride the existing segment merge" |
None of these is the right answer across the board. Data volume, update frequency, memory budget, operational capacity. There are many axes to weigh.
Building It in nvecd
I'm building a small vector search engine called nvecd. It fuses event co-occurrence data with vector similarity for real-time recommendations. Think TikTok-style feeds.
Brute Force Is Surprisingly Fast
nvecd works with 32-dimensional vectors. That's a different world from the 768 dimensions mentioned earlier (output from language models like Sentence-BERT). Lower dimensions mean dramatically cheaper computation. With SIMD optimization, even 10 million vectors came back in 28.5ms per query. 500 concurrent users, 100% success rate.
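The scan itself is trivial; the speed comes from SIMD and the low dimensionality. A pure-Python sketch of the same loop (no SIMD, illustrative only):

```python
import heapq

def brute_force_topk(vectors, query, k):
    """Linear scan: score all n vectors against the query and keep the
    k best. No index, no training; cost is always O(n * dim)."""
    def neg_dot(item):
        _, v = item
        return -sum(a * b for a, b in zip(v, query))  # larger dot product = more similar
    return [i for i, _ in heapq.nsmallest(k, enumerate(vectors), key=neg_dot)]

vectors = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (-1.0, 0.0)]
print(brute_force_topk(vectors, query=(1.0, 0.0), k=2))  # -> [0, 2]
```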
| Data size | Per-query latency | 500 concurrent |
|---|---|---|
| 1M vectors | 3.4 ms | No issues |
| 10M vectors | 28.5 ms | Barely |
Honestly, I thought this would be fine for a while.
But 28.5ms isn't "fast." It's "barely tolerable." With 500 simultaneous queries, the thread pool starts backing up. The moment load hits 1.5x, it breaks. With growth past 10 million on the horizon, ANN was necessary.
Writing IVF Without FAISS
I could have depended on FAISS. I didn't. Heavy builds, licensing concerns, and above all — I didn't want another library I couldn't crack open.
The IVF principle is simple.
- Split vectors into clusters with k-means
- Record which vectors belong to each cluster (inverted list — in warehouse terms, "Zone A contains these wines")
- At search time, only check the clusters closest to the query
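Assuming the centroids already exist (hand-picked here instead of trained with k-means), the three steps above sketch out like this:

```python
from collections import defaultdict

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Inverted lists: cluster id -> ids of the vectors assigned to it."""
    lists = defaultdict(list)
    for vid, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda i: dist2(v, centroids[i]))
        lists[c].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=2):
    """Probe only the nprobe clusters whose centroids are nearest the
    query, then scan just the vectors in those inverted lists."""
    probed = sorted(range(len(centroids)), key=lambda i: dist2(query, centroids[i]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    return min(candidates, key=lambda vid: dist2(query, vectors[vid]))

centroids = [(0.0, 0.0), (10.0, 10.0)]   # "zone representatives"
vectors = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]
lists = build_ivf(vectors, centroids)
print(ivf_search((9.5, 9.4), vectors, centroids, lists, nprobe=1))  # -> 3
```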
On my own, maybe I couldn't have. But this is the age of AI. There was no excuse not to write it myself.
I wrote it. Tests passed. Then I plugged it into the server and hell began.
Hell #1: Deadlock
The VECSET command writes to the VectorStore, then notifies the IVF index: "new vector arrived." The notification handler tries to read from the VectorStore.
C++'s std::shared_mutex does not support recursive locking: a thread that already holds the lock and calls lock_shared() again invokes undefined behavior per the standard. It works in small tests. It silently freezes under 16-thread load testing.
The fix was simple. Copy the vector data and release the lock before sending the notification. Finding the cause was not simple.
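The shape of the fix, sketched in Python with illustrative names (not nvecd's actual API). Python's threading.Lock also hangs forever if the same thread acquires it twice, so the rule carries over: snapshot under the lock, release, then notify.

```python
import threading

class VectorStore:
    """Minimal store illustrating the fix pattern."""
    def __init__(self):
        self._lock = threading.Lock()
        self._vectors = {}
        self._listeners = []

    def subscribe(self, fn):
        self._listeners.append(fn)

    def set(self, key, vector):
        with self._lock:
            self._vectors[key] = vector
            snapshot = list(vector)      # copy while holding the lock
        # Lock released here. If a listener touched the store while we
        # still held self._lock inside the 'with' block, it would deadlock.
        for notify in self._listeners:
            notify(key, snapshot)

store = VectorStore()
received = []
store.subscribe(lambda key, vec: received.append((key, tuple(vec))))
store.set("a", [1.0, 2.0])
print(received)  # -> [('a', (1.0, 2.0))]
```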
Hell #2: Training Is Too Heavy
k-means training is expensive. Computing 1,024 clusters over 10 million vectors takes tens of seconds. If that runs on a request-handling thread, the server freezes.
The FAISS documentation casually says "call train separately." How much design judgment is packed behind that one line — I didn't understand until I implemented it myself.
I rewrote it to run training asynchronously on a background thread, falling back to brute-force search until training completes.
Hell #3: Getting the Cluster Count Wrong
The rule of thumb for IVF cluster count (nlist) is sqrt(n) — the square root of the data size. A million vectors means 1,000 clusters.
But I was training on a sample of 50K vectors. I computed n from the sample size: sqrt(50000) = 223 clusters.
Put a million vectors into 223 clusters and each cluster holds ~4,500 vectors. Probing 10 clusters at search time means scanning 4,500 × 10 = 45,000 vectors. Faster than brute force over a million, but far from the "check only 1% of the data" I'd expected.
nlist must be computed from the total data size, not the sample. Easy to lose sight of when sampling is involved.
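The arithmetic, spelled out:

```python
import math

n_total, n_sample, nprobe = 1_000_000, 50_000, 10

for label, n in (("nlist from sample", n_sample), ("nlist from total", n_total)):
    nlist = int(math.sqrt(n))        # rule of thumb: nlist ~ sqrt(n)
    per_cluster = n_total // nlist   # the full million still lands in these clusters
    print(label, nlist, per_cluster, per_cluster * nprobe)

# nlist from sample 223 4484 44840   -> ~4.5% of the data scanned per query
# nlist from total 1000 1000 10000   -> the intended 1%
```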
Hell #4: Adding One at a Time Is Too Slow
After training, I was adding vectors to the IVF index one at a time. Each add acquired and released a write lock. A million lock/unlock cycles. Load speed tanked.
On top of that, training was copying the full vector data. 10 million × 32 dimensions × 4 bytes = 1.2GB. Memory spiked and macOS silently killed the process.
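The fix for the lock churn is batching: do the expensive cluster assignment with no lock held, then take the write lock once per batch instead of once per vector. A sketch with illustrative names:

```python
import threading
from collections import defaultdict

def add_batch(index_lists, centroids, batch, lock):
    """Assign a whole batch of (vid, vector) pairs outside the lock,
    then take the write lock once to splice the results in."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    assigned = defaultdict(list)
    for vid, v in batch:                 # expensive part: no lock held
        c = min(range(len(centroids)), key=lambda i: dist2(v, centroids[i]))
        assigned[c].append(vid)
    with lock:                           # one acquire/release per batch
        for c, vids in assigned.items():
            index_lists.setdefault(c, []).extend(vids)

lock = threading.Lock()
lists = {}
centroids = [(0.0, 0.0), (10.0, 10.0)]
add_batch(lists, centroids, [(0, (0.0, 1.0)), (1, (10.0, 9.0))], lock)
print(lists)  # -> {0: [0], 1: [1]}
```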
PQ (Product Quantization) — Another answer to the memory problem
PQ (Product Quantization) is commonly paired with IVF. It compresses vectors into short codes, reducing memory usage by 10–100x. FAISS's IVF-PQ is the most widely used practical combination. nvecd didn't need it — 32 dimensions is low enough — but for 768+ dimensional vectors, it's worth considering.
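Rough numbers, assuming a 768-dimensional float32 vector and an illustrative PQ configuration of m = 32 sub-vectors with 256 centroids each (so one 1-byte code per sub-vector):

```python
# Memory per vector, before and after PQ (illustrative configuration).
dim, float_bytes = 768, 4
m = 32                        # sub-vectors; 256 centroids each -> 1 byte per code

raw = dim * float_bytes       # 3072 bytes per uncompressed vector
pq = m                        # 32 bytes per PQ code
print(raw, pq, raw // pq)     # -> 3072 32 96  (96x compression)
```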
Adopting the Two-Tier Structure
I went back to study how the production databases solve this. The common thread: separate the data structures for writes and reads.
nvecd adopted the Milvus-style two-tier approach.
Three key points.
- Writes always go to the buffer. Only a dedicated buffer lock is needed, so writes never interfere with IVF search or training
- Search hits both buffer and IVF, merging results. If the buffer is small, brute-force is fast enough
- Seal is asynchronous. Swap extracts the buffer contents (lock released immediately), then flows them into the IVF in the background. Writes and searches never stop
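The shape of those three points, sketched in Python with illustrative names (nvecd itself is C++; this is not its code). A dict of sealed vectors stands in for the IVF-indexed segment:

```python
import threading

class TwoTierIndex:
    """Buffer + sealed split: writes touch only the buffer lock, and
    seal swaps the buffer out so slow indexing runs with no lock held."""
    def __init__(self):
        self._buf_lock = threading.Lock()
        self._buffer = {}    # growing segment: brute-force searched
        self._sealed = {}    # stands in for the IVF-indexed segment

    def add(self, key, vector):
        with self._buf_lock:                       # never blocks on IVF work
            self._buffer[key] = vector

    def seal(self):
        with self._buf_lock:
            batch, self._buffer = self._buffer, {} # swap, release immediately
        self._sealed.update(batch)                 # slow indexing happens lock-free

    def search(self, query, dist):
        with self._buf_lock:
            buf = dict(self._buffer)
        candidates = {**self._sealed, **buf}       # merge both tiers
        return min(candidates, key=lambda k: dist(candidates[k], query))

idx = TwoTierIndex()
dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
idx.add("a", (0.0, 0.0))
idx.seal()                       # "a" moves to the sealed tier
idx.add("b", (5.0, 5.0))         # "b" sits in the buffer
print(idx.search((4.0, 4.0), dist))  # -> b (found via the buffer scan)
```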
Result: 1 million vectors loaded in 28 seconds, 500 concurrent users at 100% success.
| Metric | Brute-force only | IVF two-tier |
|---|---|---|
| Load speed | 128K ops/sec | 35K ops/sec |
| 500 concurrent success | 100% | 100% |
| Search p50 | 0.18 ms | 19 ms (buffer included) |
The tens of thousands of vectors still in the buffer add brute-force overhead, pushing p50 to 19ms — slower than pure brute-force at 3.4ms. But once the seal catches up and the buffer empties, it's IVF-only. Searching just 10 clusters is fast.
What I Learned
Implementing it myself, one thing became clear: ANN algorithms are straightforward to write from a textbook. k-means, IVF — the core logic fits in 100 lines.
What's hard is embedding them in a concurrent system.
- Lock granularity
- When and on which thread to run training
- Consistency between buffer and index
- Memory limits
None of this is in the papers. All of it is unavoidable in production. The reason Milvus and Qdrant have large codebases isn't algorithmic complexity. It's the sheer volume of this integration work.
Close Enough
One thing I can say with certainty. "Approximate" is good enough more often than you'd think.
100% exact nearest neighbors versus approximate nearest neighbors at 99.5% recall. Users can't tell the difference. Clinging to brute-force perfection instead of returning close-enough results at speed is the wrong call in almost every case.
Wine taste, vector space distance — "close enough" is usually enough.