Building a Tokenizer from Scratch

How do you grammatically split 食べさせられなかった (tabesaserarenakatta)?

One of many problems I ran into while building a Japanese tokenizer. The answer: 食べ / させ / られ / なかっ / た. Base form: 食べる (taberu, "to eat"). A four-layer chain of causative + passive + negative + past.

Tokenizer vs. morphological analysis

A tokenizer splits text into tokens — units for search or NLP. Morphological analysis splits text into linguistically minimal units (morphemes) and assigns parts of speech. suzume is a tokenizer, not a strict morphological analyzer. In Japanese, though, "text has no spaces, so you need to split first" means the two overlap heavily.

MeCab can handle this with a dictionary lookup. 食べさせられなかった isn't stored as-is, but the conjugation pattern data lives in the dictionary. Dictionary size: 20–50MB.

I was building a tokenizer that runs in the browser. 50MB is out of the question. So: can you recover base forms from conjugated forms using grammar rules alone, without a dictionary? That's the question suzume started from.

Dictionary Lookup vs. Grammar-Based Reverse Engineering

Two broad approaches to Japanese tokenization.

One is dictionary-based. MeCab and Kuromoji take this approach. Hundreds of thousands of headwords and conjugation patterns stored in a dictionary, matched against the text. If it's in the dictionary, instant answer. Weak against words that aren't.

The other is grammar-based. The approach suzume took. Keep the dictionary minimal (particles, auxiliaries, exception nouns — about 400KB) and reverse-engineer verbs and adjectives from grammar rules.

Take 食べさせられなかった as an example.

MeCab looks it up. suzume peels off conjugation suffixes from the end.

Strip た from the end. なかっ is the negative auxiliary ない reshaped to connect to た (its renyokei form). られ is the renyokei of られる. させ is the renyokei of させる. What's left — 食べ — is the renyokei of the ichidan verb 食べる. Base form reached without touching a dictionary.

This is suzume's core technique: reverse conjugation analysis.
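The peeling step above can be sketched in a few lines of Python. This is a toy version with a hypothetical rule list that covers only this example, not suzume's actual tables:

```python
# Minimal sketch of reverse conjugation analysis: peel auxiliary
# suffixes off the end until a verb base form can be inferred.
# The rule data here is a toy subset for one example.

AUX_SUFFIXES = [
    ("た", "past"),          # attaches to renyokei
    ("なかっ", "negative"),  # ない reshaped to connect to た
    ("られ", "passive"),     # renyokei of られる
    ("させ", "causative"),   # renyokei of させる
]

def peel(surface: str):
    """Strip auxiliary suffixes from the end, outermost first."""
    layers = []
    changed = True
    while changed:
        changed = False
        for suffix, label in AUX_SUFFIXES:
            if surface.endswith(suffix):
                layers.append(label)
                surface = surface[: -len(suffix)]
                changed = True
                break
    # What remains is treated here as an ichidan renyokei stem:
    # stem + る gives the base form (食べ -> 食べる).
    return surface + "る", list(reversed(layers))

print(peel("食べさせられなかった"))
# ('食べる', ['causative', 'passive', 'negative', 'past'])
```

The real analyzer has to decide which verb class the remaining stem belongs to; the sketch hard-codes the ichidan assumption that happens to hold for 食べる.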

Conjugation Rules Are Finite

Japanese verb conjugation looks complex, but the rules are finite.

Verbs fall into three categories. Godan (five-row: 書く kaku, 読む yomu, 飲む nomu), ichidan (one-row: 食べる taberu, 見る miru), and irregular (する suru, 来る kuru). Each has a fixed conjugation pattern.

Form | Godan 書く | Ichidan 食べる
Mizenkei (irrealis) | 書か | 食べ
Renyokei (continuative) | 書き | 食べ
Shuushikei (terminal) | 書く | 食べる
Kateikei (conditional) | 書け | 食べれ
Ishikei (volitional) | 書こ | 食べよ

Godan 書く shifts its ending across the ka-ki-ku-ke-ko row (hence "five-row"). Ichidan 食べる attaches endings directly to the stem 食べ.

Auxiliary connection rules are fixed too. れる/られる attach to the mizenkei. た attaches to the renyokei. ない also attaches to the mizenkei. Know these rules and you can, in theory, reverse any conjugation chain by peeling from the end.
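Because the patterns are finite, both the conjugation table above and the attachment rules fit in small data structures. A Python sketch with toy data (ka-row only; suzume's real tables cover every row):

```python
# Toy encoding of the finite conjugation rules. Godan verbs shift
# the final kana across their row; ichidan verbs attach endings
# directly to the stem.

KA_ROW = {"mizenkei": "か", "renyokei": "き", "shuushikei": "く",
          "kateikei": "け", "ishikei": "こ"}
ICHIDAN_END = {"mizenkei": "", "renyokei": "", "shuushikei": "る",
               "kateikei": "れ", "ishikei": "よ"}

# Which form each auxiliary attaches to (fixed rules):
ATTACHES_TO = {"れる": "mizenkei", "られる": "mizenkei",
               "ない": "mizenkei", "た": "renyokei"}

def godan_form(stem: str, form: str) -> str:
    return stem + KA_ROW[form]

def ichidan_form(stem: str, form: str) -> str:
    return stem + ICHIDAN_END[form]

print(godan_form("書", "renyokei"))        # 書き
print(ichidan_form("食べ", "shuushikei"))  # 食べる
print(godan_form("書", ATTACHES_TO["ない"]))  # 書か (then 書か + ない)
```

The reverse direction simply walks these same tables from the other end.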

In theory.

In Practice

There's a deep gap between theory and implementation.

Euphonic Changes in Godan Verbs

Godan verb renyokei forms change their sound when connecting to て or た. 書く becomes 書いた (i-onbin), 読む becomes 読んだ (hatsuonbin), 飛ぶ becomes 飛んだ (hatsuonbin). Not 書きた — 書いた.

By the rules | Actual Japanese | Type
書き + た → 書きた ✗ | 書いた ✓ | i-onbin
読み + た → 読みた ✗ | 読んだ ✓ | hatsuonbin
飛び + た → 飛びた ✗ | 飛んだ ✓ | hatsuonbin
持ち + た → 持ちた ✗ | 持った ✓ | sokuonbin

What is onbin (音便)?

Sound changes for easier pronunciation. Occurs when godan verbs connect to て or た. Ka-row (書く→書いた) and ga-row (泳ぐ→泳いだ) undergo i-onbin. Ma/ba/na-row (読む→読んだ, 飛ぶ→飛んだ) undergo hatsuonbin (nasal sound change). Ta/ra/wa-row (持つ→持った, 切る→切った) undergo sokuonbin (gemination).

To reverse this, seeing いた means checking for "i-onbin + た." Seeing んだ means checking for "hatsuonbin + た." Simple suffix matching isn't enough. I built onbin patterns into the conjugation tables as a reverse-lookup index.
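A reverse-lookup index over onbin patterns might look like this. The data below is a hand-picked subset for illustration, not suzume's actual index:

```python
# Map a changed surface ending back to candidate (renyokei kana,
# onbin type) pairs, undoing the sound change.

ONBIN_INDEX = {
    "いた": [("き", "i-onbin")],    # 書いた -> 書き + た
    "いだ": [("ぎ", "i-onbin")],    # 泳いだ -> 泳ぎ + だ
    "んだ": [("み", "hatsuonbin"), ("び", "hatsuonbin"),
             ("に", "hatsuonbin")],  # 読んだ / 飛んだ / 死んだ
    "った": [("ち", "sokuonbin"), ("り", "sokuonbin"),
             ("い", "sokuonbin")],   # 持った / 切った / 買った
}

def undo_onbin(surface: str):
    """Yield (restored renyokei, onbin type) candidates."""
    for ending, restorations in ONBIN_INDEX.items():
        if surface.endswith(ending):
            stem = surface[: -len(ending)]
            for kana, kind in restorations:
                yield stem + kana, kind

print(list(undo_onbin("書いた")))  # [('書き', 'i-onbin')]
print(list(undo_onbin("読んだ")))  # 読み / 読び / 読に, all hatsuonbin
```

Each restored renyokei is only a candidate; whether a verb with that stem actually exists is settled later by scoring.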

Same Ending, Different Conjugation

切る (kiru) is godan (切ら・切り・切る・切れ・切ろ). 見る (miru) is ichidan (見・見・見る・見れ・見よ). Verbs ending in る can be either.

Written in kanji, 切る and 着る are visually distinct. The problem is hiragana きる. suzume generates both candidates and defers to scoring. But without kanji, context alone is often not enough. Feed suzume hiragana きる and it fails to split correctly in some cases. How powerful kanji is as a signal — I didn't fully appreciate that until I implemented it.
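Candidate generation for the ambiguous case can be sketched like this. Field names are illustrative; the point is that both readings stay alive until scoring:

```python
# きる can be godan (like 切る) or ichidan (like 着る), so emit both
# analyses and let scoring choose later.

def ru_verb_candidates(surface: str) -> list:
    """Both readings of a verb whose terminal form ends in る."""
    stem = surface[:-1]
    return [
        # godan: the ending shifts across ら/り/る/れ/ろ (切ら, 切り, ...)
        {"base": surface, "class": "godan",
         "mizenkei": stem + "ら", "renyokei": stem + "り"},
        # ichidan: endings attach directly to the stem (着, 着, 着る, ...)
        {"base": surface, "class": "ichidan",
         "mizenkei": stem, "renyokei": stem},
    ]

for cand in ru_verb_candidates("きる"):
    print(cand["class"], cand["renyokei"])  # godan きり / ichidan き
```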

Auxiliary Chain Explosion

食べさせられなかった is a four-layer chain. Japanese goes longer. 食べさせられたくなかったらしい stretches to six layers: causative + passive + desiderative + negative + past + evidential.

Each layer spawns multiple interpretations, so candidate combinations grow exponentially. Trying all of them blows up.

This is where lattice + Viterbi comes in. Load all candidates into a lattice and find the minimum-cost path in one pass with the Viterbi algorithm. Combinatorial explosion reduced to computation proportional to text length times POS count.

Lattice and Viterbi

At each position in the text, enumerate all candidate morphemes starting there and build a directed graph (lattice). Assign costs to each edge (word likelihood + POS transition plausibility), then use the Viterbi algorithm to find the minimum-cost path from start to end. Dynamic programming — optimal solution without trying every combination.
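The lattice search above can be sketched in a few lines. Candidates and costs here are toy values; suzume's real lattice carries much richer node data:

```python
from collections import defaultdict

def viterbi(text, candidates, trans_cost):
    """candidates: (start, end, surface, pos, word_cost) tuples.
    Returns the minimum-cost segmentation covering the whole text."""
    by_start = defaultdict(list)
    for cand in candidates:
        by_start[cand[0]].append(cand)
    # best[i] = (cost, path, last_pos) for the best way to cover text[:i]
    best = {0: (0.0, [], "BOS")}
    for i in range(len(text)):
        if i not in best:
            continue  # position unreachable by any candidate chain
        cost, path, prev_pos = best[i]
        for start, end, surface, pos, wcost in by_start[i]:
            total = cost + wcost + trans_cost.get((prev_pos, pos), 1.0)
            if end not in best or total < best[end][0]:
                best[end] = (total, path + [surface], pos)
    return best[len(text)][1]

# たべた: the verb + aux reading should beat the one-token reading.
cands = [(0, 2, "たべ", "verb", 0.2),
         (0, 3, "たべた", "noun", 0.6),
         (2, 3, "た", "aux", 0.1)]
trans = {("BOS", "verb"): 0.0, ("BOS", "noun"): 0.0, ("verb", "aux"): -0.8}
print(viterbi("たべた", cands, trans))  # ['たべ', 'た']
```

Because `best[i]` keeps only one entry per position, the work stays linear in text length times the candidate fan-out, rather than exponential in the number of layers.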

Scoring — Defining "Natural Japanese" as Cost

Generating candidates isn't enough. Whether 切る or 着る is correct — scoring decides.

suzume's scoring combines three signals.

Word cost. POS priors. Nouns are most frequent at 0.0, verbs 0.2, adjectives 0.3. Dictionary matches get a -1.0 bonus. Single-character kanji get a +2.0 penalty (prefer longer words).

Connection cost. POS transition probabilities managed by a bigram table. "Verb renyokei → auxiliary" is -0.8 (natural connection). "Noun → noun" is +0.5 (compound nouns, slightly suppressed). suzume defines connection costs not with 14 base POS types but with ~55 ExtendedPOS categories (verb terminal form, continuative form, each auxiliary type...). Finer granularity makes it easier to reject unnatural transitions.

Conjugation confidence. How plausible the inferred base form is. If the onbin pattern matches, confidence is high. Ambiguous cases (could be godan or ichidan) lower the confidence and raise the cost.
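One way the three signals could combine into a single node cost, using the numbers quoted above. The combination formula and the confidence-to-cost mapping are illustrative guesses, not suzume's actual formula:

```python
# Sketch: fold POS prior, dictionary bonus, length penalty, and
# conjugation confidence into one word cost.

POS_PRIOR = {"noun": 0.0, "verb": 0.2, "adjective": 0.3}

def is_kanji(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"  # CJK Unified Ideographs

def word_cost(surface, pos, in_dictionary, conjugation_confidence):
    cost = POS_PRIOR.get(pos, 0.5)        # illustrative default
    if in_dictionary:
        cost += -1.0                      # dictionary-match bonus
    if len(surface) == 1 and is_kanji(surface):
        cost += 2.0                       # single-kanji penalty
    cost += 1.0 - conjugation_confidence  # low confidence raises cost
    return cost

print(word_cost("切", "verb", False, 0.5))   # 0.2 + 2.0 + 0.5 = 2.7
print(word_cost("たべ", "verb", True, 1.0))  # 0.2 - 1.0 = -0.8
```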

I started with if/else connection rules. At some point I consolidated everything into a bigram table. The pile of hand-written rules became a 55×55 numeric table. Code got simpler. Tuning became a matter of adjusting numbers.
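The before/after might look like this. Only the two costs quoted above come from suzume; the POS names and the default value are illustrative:

```python
# Before: hand-written connection rules.
def connection_cost_v1(prev, nxt):
    if prev == "verb-renyokei" and nxt == "aux":
        return -0.8
    elif prev == "noun" and nxt == "noun":
        return 0.5
    # ...dozens more branches...
    return 0.0

# After: one bigram table (55x55 in suzume; two entries here).
BIGRAM = {("verb-renyokei", "aux"): -0.8,
          ("noun", "noun"): 0.5}

def connection_cost_v2(prev, nxt):
    return BIGRAM.get((prev, nxt), 0.0)

assert connection_cost_v1("noun", "noun") == connection_cost_v2("noun", "noun")
```

Tuning then means editing table entries, not control flow.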

The Trial of "I Am a Cat"

I was looking for text to test accuracy. Out of copyright, standard Japanese, substantial length. Natsume Soseki seemed perfect.

I fed in the opening of Wagahai wa Neko de Aru ("I Am a Cat"). Splitting errors everywhere. It was Meiji-era literary Japanese.

The ござ in ございます is the renyokei of ござる, an archaic verb not in the modern conjugation tables. がお was misrecognized as an adjective stem. The boundary between を and つかんで shifted.

Each fix was implemented as a general rule, not a special case. For instance, the archaic verb pattern introduced for ござる also worked for other classical forms like 参る (mairu) and 申す (mousu).

Test cases reached 2,201. Modern prose, classical text, colloquial speech, social media, Soseki, Miyazawa Kenji. Each new test case refined the reverse conjugation rules further.

What 400KB of Grammar Can and Can't Do

The final WASM binary is about 360KB gzipped. Around 400KB with the dictionary.

The grammar-based approach works well for vocabulary that follows conjugation rules — verbs, adjectives, auxiliaries. These don't need every pattern pre-stored in a dictionary. Japanese verb conjugation is complex but finite. It can be encoded as rules.

It doesn't work for vocabulary with no rules — proper nouns, technical terms, neologisms. 東京スカイツリー won't emerge from grammar rules. That's where the core dictionary or user dictionary comes in.

Read by grammar, or read by dictionary. Humans face the same choice. Some people reach for a dictionary when they hit an unknown word. Others infer from context and word structure. suzume is the latter. It guesses wrong sometimes. But it doesn't need to carry a dictionary.