Solutions Asia

Japanese Language Tokenization

Oct 20, 2016


What is Tokenization?

Tokenization (also called lexical or morphological analysis) is the process of breaking a stream of text into words, phrases, names, or other meaningful elements called tokens. The list of tokens can then be used in indices for search, text mining, NLP, and so on.
In languages that use spaces, such as English, tokenization may seem easy. However, even in English there are many edge cases, such as hyphenated words and contractions, and even the question of what counts as a word: should International Business Machines be tokenized as one token or three? For languages like Japanese and Chinese that do not use spaces for word breaks, it can be even more challenging. Adding to the complexity, there is often the need to stem words to their base forms. How do you tokenize something like

機械学習を使ううえでも、できあがったモデルのコンパクトさが重要になります。
("Even when using machine learning, the compactness of the finished model is important.")
to
機械 / 学習 / を / 使う / うえ / でも
できあがっ / た / モデル / の / コンパクト / さ / が / 重要 / に / なり / ます


Japanese Language Tokenization Methods and Open Source Implementations


Dictionary-based sequence prediction

Given a dictionary of words, possibly supplemented by the user, these programs find the best sequence of dictionary words that matches the input. Scoring is usually done by statistical methods such as HMMs (ChaSen) or CRFs (MeCab), but can also be hand-tuned (JUMAN). The advantages of dictionary-based methods are that they are very accurate for items in their dictionary and that they are relatively easy to customize, which can be important for names of people and organizations. They usually include part-of-speech (POS) tagging as well. The main disadvantage is that they do not handle unknown words very well. A minimal usage sketch follows the list below.
 
Kuromoji    Very easy-to-use Java port of MeCab
MeCab       C++, but can be used easily from other languages such as Java, Python, and Ruby
ChaSen      C++
JUMAN       C++, but has Python bindings via PyKNP
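
As a quick illustration, here is a minimal sketch of dictionary-based tokenization using MeCab from Python. It assumes the mecab-python3 package and a default dictionary such as IPAdic are installed; the exact token boundaries and POS tags depend on the dictionary used.

    import MeCab

    text = "機械学習を使ううえでも、できあがったモデルのコンパクトさが重要になります。"

    # -Owakati asks MeCab for space-separated surface forms only
    wakati = MeCab.Tagger("-Owakati")
    print(wakati.parse(text))
    # e.g. 機械 学習 を 使う うえ でも 、 できあがっ た モデル の コンパクト さ が 重要 に なり ます 。

    # The default output format adds one line per token with POS information
    tagger = MeCab.Tagger()
    print(tagger.parse(text))

Custom entries (for example, product or personal names) can be added through a user dictionary (MeCab's -u option).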

Machine learning word-boundary prediction methods

These programs can be smaller than the dictionary-based methods and often handle new or unknown words better, but they can be more difficult to customize, since customization requires retraining the model. The basic idea is that, for each pair of adjacent characters in the input string, the trained model predicts whether there is a word boundary between the two characters; a toy sketch of this idea follows the list below.

TinySegmenter    JavaScript and Python versions available
KyTea            C++ (includes POS tagging)
Micter           C++
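
To make the boundary-prediction idea concrete, here is a deliberately simplified, hypothetical sketch in the spirit of TinySegmenter (not its actual features or weights): a linear model scores the gap between each pair of adjacent characters from coarse character-type features, and a positive score means "insert a boundary here". Real systems learn thousands of feature weights from a hand-segmented corpus.

    def char_type(ch):
        # Very coarse character classes; real segmenters use finer-grained ones
        if "\u4e00" <= ch <= "\u9fff":
            return "KANJI"
        if "\u3040" <= ch <= "\u309f":
            return "HIRAGANA"
        if "\u30a0" <= ch <= "\u30ff":
            return "KATAKANA"
        return "OTHER"

    # Hypothetical hand-picked weights for (left type, right type) transitions
    WEIGHTS = {
        ("KANJI", "HIRAGANA"): 1.0,     # e.g. 学習 | を
        ("HIRAGANA", "KANJI"): 1.0,     # e.g. が | 重要
        ("HIRAGANA", "KATAKANA"): 1.0,  # e.g. の | コンパクト
        ("KATAKANA", "HIRAGANA"): 1.0,  # e.g. モデル | の
    }

    def segment(text, bias=-0.5):
        if not text:
            return []
        tokens, current = [], text[0]
        for left, right in zip(text, text[1:]):
            score = WEIGHTS.get((char_type(left), char_type(right)), 0.0) + bias
            if score > 0:  # boundary predicted between left and right
                tokens.append(current)
                current = right
            else:
                current += right
        tokens.append(current)
        return tokens

    print(segment("モデルのコンパクトさが重要になります"))
    # ['モデル', 'の', 'コンパクト', 'さが', '重要', 'になります'] -- crude, but shows the idea

The toy model undersegments (さが and になります stay glued together), which is exactly the kind of error that training on a large segmented corpus corrects.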

N-Gram

The N-gram method brute-force generates all contiguous sequences of N (usually two or three) characters and scores them by counting the frequency of each block. The advantage of this approach is that it is language agnostic and handles new or unexpected words better than the other approaches. The drawbacks are that it cannot be meaningfully customized, and that it has larger memory requirements and slower speed. The N-gram approach is not capable of POS tagging. N-gram tokenizers are usually included with other tools (for example, both Solr and Elasticsearch include n-gram tokenizers) rather than shipped as standalone applications. Generating the n-grams themselves is trivial, as the short sketch below shows.
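
A minimal Python sketch of character n-gram generation (the frequency counting and indexing are normally handled by the search engine):

    from collections import Counter

    def char_ngrams(text, n=2):
        # All contiguous character sequences of length n (bigrams by default)
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("機械学習"))
    # ['機械', '械学', '学習']

    # Frequency counts over a toy two-sentence "corpus"
    corpus = ["機械学習が重要", "学習モデル"]
    counts = Counter(gram for sentence in corpus for gram in char_ngrams(sentence))
    print(counts.most_common(2))
    # [('学習', 2), ('機械', 1)]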


Commercial Offerings

Rosette Text Analytics SDK from BasisTech (multilingual)


Online Services (free and commercial)

http://atilika.org/kuromoji/ (meant for kicking the tires on the Kuromoji open-source library, not for heavy use)