Mining Words Without a Dictionary
I once built a small search engine. It needed to handle Japanese. Here's the thing about Japanese: there are no spaces between words. English has spaces. You split on spaces, stem the words, build an index. Done. Japanese is a continuous stream of characters. To index it, you first have to figure out where one word ends and another begins. That process is called morphological analysis.
The standard tool at the time was ChaSen. It worked well on clean text. Internet text is not clean.
New idol group names, products just hitting the market, verbs coined on social media. If a word isn't in the dictionary, it doesn't exist to a morphological analyzer. It doesn't land in the index. Users search for it and find nothing.
There was an enterprise morphological analysis engine called Rosette (Basis Technology at the time, now under Babel Street). Google and other major search engines were reportedly among its customers. I inquired about licensing back then. They didn't bother replying. Enterprise-only.
Growing the dictionary improves coverage, but the internet coins new words faster than any dictionary can keep up. Maintaining one is endless work. So I started thinking. Could I infer "this is probably a word" from statistical patterns alone, without a dictionary?
PMI — Character Co-occurrence Hints at Words
PMI (Pointwise Mutual Information) measures how much more likely two events are to occur together compared to occurring independently.
Take the characters 東 and 京 in Japanese. The probability of 京 following 東 is far higher than you'd expect if the two appeared randomly. PMI quantifies that "far higher."
PMI(x, y) = log₂( P(x, y) / (P(x) × P(y)) )

An intuitive grasp of PMI
Say 東 appears in 0.3% of all characters and 京 in 0.2%. If they were unrelated, the bigram 東京 should appear at 0.003 × 0.002 = 0.0006%. In reality, 東京 (Tokyo) appears far more frequently. That gap between expectation and reality is the PMI score. The higher it is, the more likely that character sequence forms a meaningful unit — a word.
Extract character bigrams (two-character pairs) and trigrams from a large corpus, compute PMI scores. High-scoring pairs are likely parts of real words. That's the basic idea.
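As a sketch of that idea (in Python rather than the tool's actual C++, with a toy corpus invented for illustration), bigram PMI falls straight out of the formula:

```python
from collections import Counter
from math import log2

def char_pmi(text: str, x: str, y: str) -> float:
    """PMI of the character bigram (x, y): log2(P(x,y) / (P(x) * P(y)))."""
    uni = Counter(text)                  # unigram counts
    bi = Counter(zip(text, text[1:]))    # adjacent character pairs
    p_x = uni[x] / len(text)
    p_y = uni[y] / len(text)
    p_xy = bi[(x, y)] / (len(text) - 1)
    return log2(p_xy / (p_x * p_y))

corpus = "東京の東京タワーは東京都にある"
print(char_pmi(corpus, "東", "京"))  # strongly positive: 京 follows 東 every time
```

On a real corpus you would count bigrams in one streaming pass rather than recount per query, but the score itself is exactly this ratio.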
PMI's Weakness
PMI has a bias problem. Low-frequency bigrams get absurdly high scores. A rare character combination appearing once in the entire corpus has an extremely low expected probability, so its PMI spikes. In practice, it's just noise.
The countermeasures: minimum frequency filtering (only consider bigrams appearing N or more times) and normalization to NPMI (Normalized PMI) — dividing PMI by -log₂ P(x, y) — which scales values to the -1 to +1 range and dampens the low-frequency bias.
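Both countermeasures are easy to sketch in Python (a toy illustration, not the tool's C++ implementation; the corpus and default threshold are made up):

```python
from collections import Counter
from math import log2

def npmi_bigrams(text: str, min_count: int = 2) -> dict[str, float]:
    """NPMI for character bigrams seen at least `min_count` times."""
    uni = Counter(text)
    bi = Counter(zip(text, text[1:]))
    n_uni, n_bi = len(text), len(text) - 1
    scores = {}
    for (x, y), c in bi.items():
        if c < min_count:
            continue                       # frequency filter: drop rare bigrams
        p_xy = c / n_bi
        pmi = log2(p_xy / ((uni[x] / n_uni) * (uni[y] / n_uni)))
        scores[x + y] = pmi / -log2(p_xy)  # NPMI: scale into roughly [-1, +1]
    return scores

print(npmi_bigrams("東京の東京タワーは東京都にある"))
# only 東京 survives the frequency filter in this toy corpus
```

Every bigram that appears just once is filtered before it can spike the scores, and what remains is normalized so that scores are comparable across frequency bands.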
Bloom Filters — Discarding a Billion Duplicate Lines
The first problem in large-scale corpus preprocessing is duplication. Web-crawled text contains the same sentences over and over. Feed duplicates into PMI calculation and frequencies skew.
Naive deduplication means putting every line into a HashMap. For a billion lines, that eats tens of gigabytes.
What is a Bloom filter?
A data structure for fast set membership testing. It consists of a bit array and multiple hash functions. To add an element, set the bits at positions computed by each hash function. To test membership, check whether all those bits are set. If they are, the element is "probably in the set." If any bit is unset, the element is "definitely not in the set." Bit collisions cause false positives, but false negatives are structurally impossible. Proposed by Burton Howard Bloom in 1970.
In short: no false negatives, only false positives. It might wrongly say "seen this before" for a new line, but it will never say "never seen" for a line it has. Good enough for deduplication.
The memory difference is dramatic. For 10 million unique strings, a Bloom filter at 1% false positive rate takes about 12MB. A HashMap would need 160MB or more. 99% of duplicate checks happen in constant time through the Bloom filter. Only the remaining 1% hits the HashMap.
Bloom filter parameter design
Bit count: m = -n × ln(p) / ln(2)². For n=10 million, p=0.01 (1% false positive rate), that's about 96 million bits = 12MB. Optimal hash function count: k = (m/n) × ln(2), which rounds to 7 for these parameters. The implementation uses xxHash64 with double hashing (h(i) = hash1 + i × hash2) to generate multiple hash values.
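These formulas translate directly into code. A Python sketch (SHA-256 halves stand in for the xxHash64 the real tool uses; `dedupe` shows the Bloom-then-HashMap pattern from above):

```python
import hashlib
import math

class BloomFilter:
    """Bit array + k hash functions; false positives possible, false negatives not."""

    def __init__(self, n: int, p: float):
        # m = -n * ln(p) / ln(2)^2 bits; k = (m/n) * ln(2) hash functions
        self.m = max(1, int(-n * math.log(p) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # double hashing: h(i) = hash1 + i * hash2
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1  # odd, so it cycles through all m slots
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

def dedupe(lines, n=1_000_000, p=0.01):
    """Bloom filter screens first; the exact set is consulted only on a hit."""
    bf, exact, out = BloomFilter(n, p), set(), []
    for line in lines:
        if line in bf and line in exact:  # Bloom says "maybe seen" -> confirm exactly
            continue
        bf.add(line)
        exact.add(line)
        out.append(line)
    return out
```

Because the Bloom check is a handful of bit probes, the exact set is touched only on the rare "maybe seen" answers; in this sketch the exact set still holds every unique line, which matches the hybrid design described above.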
Accessor Variety — Inferring Word Boundaries Statistically
PMI measures character co-occurrence, but it struggles with variable-length words. 東京 is two characters. 東京タワー (Tokyo Tower) is five. Bigram PMI alone can't extract variable-length words whole.
Accessor Variety (AV) takes a different approach. It counts how many distinct characters appear before and after a given string.
東京 has diverse left neighbors (は, の, 大) and diverse right neighbors (タ, 駅, 都). Both left and right AV are high. Meanwhile, 京タ has almost exclusively 東 on the left and ワ on the right. Low AV. That's the interior of a word, not a boundary.
AV computation uses a Suffix Array. The SA-IS algorithm builds one in O(n) time, enabling fast lookup of any substring's occurrences and surrounding context.
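The counting itself is simple; here is a naive Python illustration using a linear scan (the actual tool gets the same counts from the Suffix Array, and the corpus below is a toy):

```python
def accessor_variety(text: str, s: str) -> tuple[int, int]:
    """(left AV, right AV): distinct characters adjacent to occurrences of s."""
    left, right = set(), set()
    i = text.find(s)
    while i != -1:
        if i > 0:
            left.add(text[i - 1])          # character just before this occurrence
        if i + len(s) < len(text):
            right.add(text[i + len(s)])    # character just after it
        i = text.find(s, i + 1)
    return len(left), len(right)

corpus = "東京の東京タワーは東京都にある"
print(accessor_variety(corpus, "東京"))  # diverse neighbors: a plausible word
print(accessor_variety(corpus, "京タ"))  # (1, 1): the interior of 東京タワー
```

Even on this tiny corpus the signal shows: 東京 picks up multiple distinct neighbors on each side, while 京タ is locked between 東 and ワ.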
Building suzume-feedmill
I combined these techniques into a C++ pipeline tool for extracting unknown word candidates from a corpus.
Pipeline stages
Normalization: ICU library handles Unicode NFKC normalization. Full-width alphanumerics become half-width, and variant characters are unified. Bloom filter and HashMap deduplication happens at this stage.
Suffix Array construction: Build a Suffix Array from the normalized text using the SA-IS algorithm. An LCP (Longest Common Prefix) array is computed simultaneously, enabling fast frequency counting of any substring.
Accessor Variety computation: Using the Suffix Array, count the number of unique characters to the left and right of each substring. Keep substrings with AV above a threshold (default 5) as word candidates.
Scoring: Compute a weighted sum of NPMI score (weight 0.4), frequency (weight 0.3), and Accessor Variety (weight 0.3). Extract top-K results using a min-heap.
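The scoring stage can be sketched like this (Python rather than the tool's C++; the candidate tuples are invented for illustration, and the feature values are assumed to be pre-normalized to comparable 0–1 scales):

```python
import heapq

def top_k(candidates, k=3, w_npmi=0.4, w_freq=0.3, w_av=0.3):
    """Weighted-sum scoring; a size-k min-heap keeps the best k in O(n log k)."""
    heap = []                              # min-heap of (score, candidate)
    for cand, npmi, freq, av in candidates:
        score = w_npmi * npmi + w_freq * freq + w_av * av
        if len(heap) < k:
            heapq.heappush(heap, (score, cand))
        elif score > heap[0][0]:           # beats the worst candidate kept so far
            heapq.heapreplace(heap, (score, cand))
    return [cand for _, cand in sorted(heap, reverse=True)]

candidates = [                             # (string, npmi, freq, av), pre-normalized
    ("東京", 0.9, 1.0, 0.8),
    ("京タ", 0.2, 0.3, 0.1),
    ("タワー", 0.7, 0.5, 0.6),
    ("の東", 0.1, 0.9, 0.2),
]
print(top_k(candidates))  # 東京, タワー, の東
```

The min-heap keeps memory bounded at k entries no matter how many candidates stream through, which matters when the candidate set comes from a multi-gigabyte corpus.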
Results
The theory was beautiful. The output was unusable.
Most "word candidates" were fragments of words, phrases with particles attached, or meaningless character sequences. Tuning the parameters (AV threshold, minimum frequency, score weights) could improve things, but every corpus needed its own tuning. What came out was not the general-purpose, dictionary-free unknown word extractor I had hoped for.
| What I expected | What I got |
|---|---|
| New words (idol names, product names) | Word fragments (ンライン, ックス) |
| Compound words (東京タワー) | Phrases with particles (東京の, タワーに) |
| Domain-specific terms | High-frequency but meaningless bigrams |
PMI and Accessor Variety show good results in academic papers on Chinese segmentation (Feng et al., 2004). But those evaluations are on clean corpora. Web-crawled Japanese text is noisy, and the noise feeds straight into the scores.
What happened next
The Suzume tokenizer I was developing in parallel shifted to a grammar-based design. Instead of growing a dictionary, Suzume encodes Japanese conjugation and connection rules directly in code. The need to "discover unknown words from a corpus" faded: when Suzume encounters an unknown word, grammar-based inference proved the more stable approach.
Is PMI Dead?
suzume-feedmill didn't deliver what I hoped. But PMI combined with Accessor Variety isn't worthless.
The problem was scope. Extracting unknown words directly from noisy web text demanded too much preprocessing. A more constrained domain — a specific forum, a company's internal documents — would have less noise and better precision.
I found no existing product that implements PMI + Accessor Variety as a standalone command-line tool. Plenty of academic papers, but none packaged into a reusable tool. Niche, but an empty space.
The code is on GitHub. C++, 24,000 lines, with tests and Docker support. Anyone is welcome to use it. Fair warning: parameter tuning takes patience.