Small Dictionary
About twenty years ago, I was building a search engine. Japanese search requires morphological analysis (splitting text into words), and I was using ChaSen, the standard tool at the time. On well-formed text, it was accurate. Internet text is not well-formed.
New idol group names, products just hitting the market, verbs coined on social media. Words that aren't in the dictionary don't exist to a morphological analyzer. Words that don't exist don't land in the index correctly. Users search and find nothing. You could update the dictionary. But the internet coins new words faster than any dictionary can keep up.
Something had been on my mind since then. Unknown words should be partially inferable from context. Kanji followed by hiragana is likely a verb. Consecutive katakana is probably a loanword or proper noun. Even without a dictionary, character patterns and grammar knowledge should get you reasonably far.
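A minimal sketch of that idea, not the tokenizer's actual code: classify each character by Unicode block, then guess from the pattern alone. The names `classify`, `guess_unknown`, and the `Guess` labels are hypothetical, and only the basic kanji block is covered.

```cpp
#include <cstddef>
#include <string>

// Coarse script classes by Unicode block. Only the basic CJK block is covered;
// extension blocks and the iteration mark are ignored in this sketch.
enum class Script { Kanji, Hiragana, Katakana, Other };

Script classify(char32_t c) {
    if (c >= 0x4E00 && c <= 0x9FFF) return Script::Kanji;     // CJK Unified Ideographs
    if (c >= 0x3041 && c <= 0x309F) return Script::Hiragana;
    if (c >= 0x30A0 && c <= 0x30FF) return Script::Katakana;  // includes the long-vowel mark
    return Script::Other;
}

// Hypothetical labels for illustration, not a real part-of-speech tag set.
enum class Guess { VerbLike, LoanwordOrProperNoun, Unknown };

// Guess what an out-of-dictionary chunk might be from its character pattern alone.
Guess guess_unknown(const std::u32string& chunk) {
    // Two or more katakana and nothing else: probably a loanword or proper noun.
    std::size_t i = 0;
    while (i < chunk.size() && classify(chunk[i]) == Script::Katakana) ++i;
    if (i == chunk.size() && i >= 2) return Guess::LoanwordOrProperNoun;

    // A kanji run followed by a hiragana run, and nothing else: possibly a verb.
    std::size_t kanji = 0;
    while (kanji < chunk.size() && classify(chunk[kanji]) == Script::Kanji) ++kanji;
    std::size_t end = kanji;
    while (end < chunk.size() && classify(chunk[end]) == Script::Hiragana) ++end;
    if (kanji > 0 && end > kanji && end == chunk.size()) return Guess::VerbLike;

    return Guess::Unknown;
}
```

With this, `guess_unknown(U"食べる")` comes back as `VerbLike` and `guess_unknown(U"スマホ")` as `LoanwordOrProperNoun`. The same kanji-plus-hiragana pattern also matches adjectives, so it is a hint and nothing more.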
Twenty years later, I built it. A Japanese tokenizer written in C++, compiled to WebAssembly. Under 400KB gzipped. Runs in the browser. No server needed.
MeCab, the current de facto standard, carries a dictionary of about 20MB when paired with IPAdic. Every word, its cost, its part of speech: all in there. Scan the text, look up every matching candidate, pick the lowest-cost path. Dictionary quality determines accuracy. A clean approach.
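The scan, look up, pick-the-cheapest loop can be sketched roughly as below. This is a simplification, not MeCab's implementation: it uses per-word costs only, where MeCab also scores the connection between adjacent entries, and it walks the whole dictionary at each position instead of using an efficient index. `best_path`, `Dict`, and the costs in the example are made up for illustration.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy dictionary: surface form -> cost (lower is better). Costs are assumed positive.
using Dict = std::map<std::u32string, int>;

// Return the cheapest segmentation of `text` into dictionary words,
// or an empty vector if the dictionary cannot cover the text.
std::vector<std::u32string> best_path(const std::u32string& text, const Dict& dict) {
    const std::size_t n = text.size();
    std::vector<long> best(n + 1, LONG_MAX);  // best[i]: cheapest cost to cover text[0..i)
    std::vector<std::size_t> from(n + 1, 0);  // start of the word that ends at i on that path
    best[0] = 0;

    for (std::size_t i = 0; i < n; ++i) {
        if (best[i] == LONG_MAX) continue;  // no path reaches position i
        for (const auto& entry : dict) {
            const std::u32string& word = entry.first;
            if (word.empty()) continue;
            if (text.compare(i, word.size(), word) != 0) continue;  // word does not start here
            const std::size_t j = i + word.size();
            const long cost = best[i] + entry.second;
            if (cost < best[j]) {
                best[j] = cost;
                from[j] = i;
            }
        }
    }

    std::vector<std::u32string> words;
    if (best[n] == LONG_MAX) return words;  // some stretch of text matched nothing
    for (std::size_t i = n; i > 0; i = from[i]) {
        words.push_back(text.substr(from[i], i - from[i]));
    }
    std::reverse(words.begin(), words.end());
    return words;
}
```

Given the invented costs 東 = 5, 京都 = 3, 東京 = 2, 都 = 3, the input 東京都 comes out as 東京 / 都 (total 5) and not 東 / 京都 (total 8). A word missing from the dictionary never becomes a candidate at all, which is why this sketch returns nothing for text it cannot cover.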
Strip the dictionary and you have to implement the grammar instead. MeCab resolves "taberarenai" (cannot eat) by looking up "tabe," "rare," and "nai" in its dictionary. Without a dictionary, you have to reason: the kanji is followed by hiragana, so this could be a verb; an ichidan verb in its irrealis form, then the potential auxiliary "rareru," itself in irrealis form, then the negation auxiliary "nai." Godan, ichidan, irregular. Euphonic changes. Connection rules. The more dictionary you cut, the more Japanese you have to understand.
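To make that reasoning concrete, here is a deliberately tiny sketch of a single rule, working on romanized text to match the examples above. It is an illustration under stated assumptions, not the tokenizer's code: `split_ichidan_rare_nai` is a hypothetical name, and it handles exactly one chain (ichidan stem, "rare," "nai").

```cpp
#include <string>
#include <vector>

// True if `s` ends with `suffix`.
bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Try to read a romanized form as: ichidan stem + "rare" (irrealis of "rareru") + "nai".
// Returns the three pieces, or an empty vector if the rule does not apply.
std::vector<std::string> split_ichidan_rare_nai(const std::string& word) {
    const std::string tail = "rarenai";
    if (!ends_with(word, tail)) return {};
    const std::string stem = word.substr(0, word.size() - tail.size());
    if (stem.empty()) return {};
    const char last = stem.back();
    if (last != 'e' && last != 'i') return {};  // ichidan stems end in -e or -i
    return {stem, "rare", "nai"};
}
```

"taberarenai" yields "tabe," "rare," "nai." "torarenai" is rejected because "to" cannot be an ichidan stem, which leaves it for a godan rule. "kirarenai" passes even though it could equally be godan "kira" plus "re" plus "nai"; telling those apart needs more than the surface form.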
Parts of speech that will never grow — particles, auxiliaries — I hardcoded. Things that don't change go in the code. Things that change go in a small dictionary. If a rule can generate a word, it doesn't go in the dictionary. The dictionary exists only for exceptions.
Over 2,000 test cases use MeCab's output as reference data. MeCab gets things wrong, though. "Nomasareta" — was made to drink — MeCab splits it as "nomasa / re / ta." Grammatically, it should be "noma / sa / re / ta." The verb irrealis "noma," causative "sa," passive "re," past "ta." MeCab absorbs the causative into the verb stem. And it's not even consistent — "yomasareta" gets split correctly. Dictionary-based analyzers are at the mercy of what's registered. When MeCab is wrong, I go with what's grammatically correct. MeCab is a reference, not something I can follow blindly.
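One way to encode "reference, not oracle" in a test suite is an override table that wins over the MeCab output. The sketch below assumes that layout; `expected_for`, `passes`, and the two-table design are hypothetical, not a description of the actual test harness.

```cpp
#include <map>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;
using Table = std::map<std::string, Tokens>;

// Expected segmentation for `input`: a hand-written override wins over the MeCab
// reference. Returns an empty vector if neither table has the input.
Tokens expected_for(const std::string& input, const Table& mecab_reference,
                    const Table& overrides) {
    const auto o = overrides.find(input);
    if (o != overrides.end()) return o->second;
    const auto r = mecab_reference.find(input);
    if (r != mecab_reference.end()) return r->second;
    return {};
}

// A test case passes when the analyzer's output equals the expected segmentation.
bool passes(const std::string& input, const Tokens& actual,
            const Table& mecab_reference, const Table& overrides) {
    return actual == expected_for(input, mecab_reference, overrides);
}
```

For "nomasareta," the reference table would hold MeCab's "nomasa / re / ta" and the override table "noma / sa / re / ta." An analyzer that outputs the second one passes; one that copies MeCab fails.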
That's when something clicked. Anyone writing a C compiler has gcc as an oracle: a standard that gives the correct output for any input. Writing tests is simple: check whether your output matches gcc's. AI was able to build a C compiler because that oracle existed.
Morphological analysis has no oracle. MeCab is close but not infallible. Whether Japanese even has a single correct way to segment is itself unclear. Is "Tokyo-to" one word or "Tokyo" plus "to"? The answer changes with context and purpose. In a domain where defining correctness is ambiguous, you can't hand it to AI in a clean room and walk away. You have to write the rules yourself and decide what's correct yourself.
Accuracy doesn't match MeCab's. A 20MB dictionary carries real weight. Japanese has too many exceptions. Ateji, dialects, orthographic variation: things rules can't catch. Choosing a small dictionary means living with the accuracy trade-off.
Still, I wanted an analyzer that doesn't say "doesn't exist" when it sees a word not in the dictionary. Shrinking the dictionary looked like subtraction. It was addition. Every byte cut is another piece of Japanese you need to understand.