Search from Scratch

Site search used to be a nightmare. Normalize Japanese text with a morphological analyzer carrying a massive dictionary. Build an inverted index. Update the dictionary every time a new word appears. Stand up a search server, tune it, keep it running. Search alone was a full-time job.

I added search to this blog. There's no server. It's a static site — just HTML and JSON generated by VitePress. I wanted search to run entirely on the client side. Something I dreamed of twenty years ago.

English has always been relatively straightforward. Split on spaces, run a Porter Stemmer to reduce words to stems. "running" becomes "run," "connected" becomes "connect." A 3KB npm package. Japanese was the problem.
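To make the English case concrete, here is a minimal sketch of "split, then stem." The stemmer is a toy written for this illustration, not the Porter algorithm and not the npm package mentioned above; the names `toyStem` and `normalizeEnglish` are made up. It only knows a handful of suffixes, which is enough to reproduce the two examples.

```typescript
// Toy stemmer for illustration only. This is NOT the Porter algorithm;
// it handles just "-ing", "-ed", and a naive plural "-s".
function toyStem(word: string): string {
  for (const suffix of ["ing", "ed"]) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      const stem = word.slice(0, -suffix.length);
      const last = stem.charAt(stem.length - 1);
      const doubled = last === stem.charAt(stem.length - 2);
      // Undo consonant doubling ("runn" -> "run"), but keep doubled
      // vowels and "ll", "ss", "zz" as they are.
      return doubled && !"aeioulsz".includes(last) ? stem.slice(0, -1) : stem;
    }
  }
  // Naive plural handling: "posts" -> "post", but leave "class" alone.
  if (word.endsWith("s") && !word.endsWith("ss") && word.length >= 4) {
    return word.slice(0, -1);
  }
  return word;
}

// Lowercase, split on anything that is not a letter or digit, then stem.
function normalizeEnglish(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0)
    .map(toyStem);
}
```

Calling `normalizeEnglish("Running, connected posts")` yields `["run", "connect", "post"]`. A real stemmer covers far more cases, but the shape of the pipeline is the same.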

Japanese has no spaces between words. Search for the past tense and the dictionary form won't match. You need morphological analysis to split the text into words and reduce them to base forms. MeCab's dictionary is over 50MB. Not browser-friendly.

I've been writing a lightweight morphological analyzer in C++ as an open-source project. Compiled to WASM, 363KB gzipped. Still in active development. Feed it "hashitta" and it returns "hashiru" plus "ta," with part-of-speech tags — so you can filter out particles and auxiliaries as noise. This is exactly the use case I built it for. The day came to try it on my own blog.
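The filtering step can be sketched like this. The `Token` shape, the tag names "particle" and "auxiliary", and the function `keywords` are assumptions made for illustration; the analyzer's real output format is not shown in this post. The sample tokens are written by hand in romanized form, not produced by the analyzer.

```typescript
// Hypothetical token shape: surface form, base form, and a part-of-speech tag.
interface Token {
  surface: string;
  lemma: string;
  pos: string;
}

// Tags treated as noise for search purposes (assumed names).
const NOISE_POS = new Set<string>(["particle", "auxiliary"]);

// Keep only content-bearing tokens and return their base forms.
function keywords(tokens: Token[]): string[] {
  return tokens
    .filter((token) => !NOISE_POS.has(token.pos))
    .map((token) => token.lemma);
}

// Hand-written stand-in for the analysis of "hashitta".
const hashitta: Token[] = [
  { surface: "hashit", lemma: "hashiru", pos: "verb" },
  { surface: "ta", lemma: "ta", pos: "auxiliary" },
];
```

With these sample tokens, `keywords(hashitta)` returns `["hashiru"]`: the auxiliary is dropped and the verb is indexed under its dictionary form, so a query in either tense lands on the same key.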

The mechanism is simple. At build time, normalize every post and generate an inverted index as JSON. On the browser side, normalize the search query with the same logic and match against the index. About 1KB of index per post. A personal blog is small enough to handle entirely on the frontend. Even so, compared to twenty years ago, it feels like a different world.
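A rough sketch of that mechanism, under some assumptions: the names here (`buildIndex`, `search`, `serializeIndex`, `loadIndex`) are invented for illustration, the JSON layout is a guess rather than the format this blog uses, and `standInNormalize` is a whitespace splitter standing in for the morphological analyzer. The point is only that the same normalizer runs at build time and at query time.

```typescript
type Normalize = (text: string) => string[];
type InvertedIndex = Map<string, number[]>;

interface Post {
  id: number;
  text: string;
}

// Stand-in normalizer (assumption): lowercase and split on whitespace.
const standInNormalize: Normalize = (text) =>
  text.toLowerCase().split(/\s+/).filter((term) => term.length > 0);

// Build time: map each normalized term to the ids of posts containing it.
function buildIndex(posts: Post[], normalize: Normalize): InvertedIndex {
  const index: InvertedIndex = new Map();
  for (const post of posts) {
    // Deduplicate so each post id appears at most once per term.
    for (const term of Array.from(new Set(normalize(post.text)))) {
      const ids = index.get(term);
      if (ids) ids.push(post.id);
      else index.set(term, [post.id]);
    }
  }
  return index;
}

// The index travels to the browser as JSON (here: an array of [term, ids] pairs).
const serializeIndex = (index: InvertedIndex): string =>
  JSON.stringify(Array.from(index.entries()));
const loadIndex = (json: string): InvertedIndex =>
  new Map(JSON.parse(json) as [string, number[]][]);

// Browser side: normalize the query the same way, then intersect posting lists.
function search(index: InvertedIndex, query: string, normalize: Normalize): number[] {
  const [first, ...rest] = Array.from(new Set(normalize(query)));
  if (first === undefined) return [];
  let result = index.get(first) ?? [];
  for (const term of rest) {
    const ids = new Set(index.get(term) ?? []);
    result = result.filter((id) => ids.has(id));
  }
  return result;
}
```

This version requires every query term to match (AND semantics) and does no ranking. For a blog-sized index that is probably enough; anything bigger would want scoring and a more compact format.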

Extracting keywords from long-form Japanese, normalizing them, making them searchable. No server. For a backend engineer, that ease is something special. Self-indulgent, maybe. But I checked off a small dream.