Solutions Asia

Natural Language Identification/Detection

Oct 24, 2016

Language Identification

In natural language processing, language identification (or language detection) is the problem of determining the natural language of a given piece of text. Much like character encoding detection, 100% reliability is not possible for short texts. Take, for example, the words ballet, date, empire, image, and menu: they are spelled exactly the same in French and English, so from one such word alone it is impossible to determine the language. The problem extends to other languages; for example, the characters for China, 中国, are the same in Simplified Chinese and Japanese. Fortunately, in most cases, and particularly with longer samples of text, it is possible to determine the language. Moreover, depending on your use case, you may be able to use additional information, such as user language preferences, to narrow down which languages to consider.
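As a rough illustration of narrowing the candidates with extra information, the sketch below filters a detector's per-language scores down to the languages a user is known to prefer. The function name narrow_candidates, the score dictionary, and the numbers in it are all hypothetical; they do not come from any of the tools listed later.

```python
def narrow_candidates(scores, preferred):
    """Keep only the preferred languages and renormalise their scores.

    scores:    {language code: score from some detector} (made-up values here)
    preferred: set of language codes taken from, e.g., user preferences
    """
    kept = {lang: s for lang, s in scores.items() if lang in preferred}
    if not kept:
        # None of the preferred languages were candidates: keep the original scores.
        return dict(scores)
    total = sum(kept.values())
    if total == 0:
        return kept
    return {lang: s / total for lang, s in kept.items()}


# A single ambiguous word such as "menu" might score equally for English and French.
scores = {"en": 0.40, "fr": 0.40, "nl": 0.20}
narrowed = narrow_candidates(scores, {"fr", "de"})
best = max(narrowed, key=narrowed.get)
print(best, narrowed)
```

With the tie between English and French broken by the preference set, the ambiguous input resolves to French. A real system would more likely use the preferences as a weighting rather than a hard filter.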

There are a number of approaches to language identification, including the naïve one of checking character ranges, but the vast majority of serious tools use n-grams. N-grams have the advantage that byte n-grams can be used when the character encoding is not known, which is often the case when you don’t know the language. Code point n-grams work better, but they require knowing the encoding. Language identifiers differ mainly in their choice of n, in how they analyze the n-gram statistics, and in which corpora they use for training.
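To make the n-gram idea concrete, here is a minimal sketch in Python. It shows the difference between byte n-grams and code point n-grams, then scores a text against per-language trigram counts with add-one smoothing in a Naive Bayes style. Everything in it is an assumption for illustration: the function names, the two one-sentence training texts, and the choice of n = 3. It is not how any particular tool listed below is implemented.

```python
import math
from collections import Counter


def codepoint_ngrams(text, n=3):
    """Slide a window of n Unicode code points over already-decoded text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def byte_ngrams(data, n=3):
    """Slide a window of n raw bytes; the encoding does not need to be known."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train(samples, n=3):
    """Build {language: Counter of code point n-grams} from {language: text}."""
    return {lang: Counter(codepoint_ngrams(text.lower(), n))
            for lang, text in samples.items()}


def identify(text, profiles, n=3):
    """Return the language whose profile gives the highest smoothed log-likelihood."""
    grams = codepoint_ngrams(text.lower(), n)
    vocab_size = len(set().union(*profiles.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen n-grams do not zero out the score.
        score = sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data; real tools train on large corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis le chien dort au soleil",
}
profiles = train(TRAINING)

english_guess = identify("the dog is in the house", profiles)
french_guess = identify("le chien dort", profiles)
print(english_guess, french_guess)

# Two code points, but six bytes in UTF-8, so the two views produce different n-grams.
print(codepoint_ngrams("中国", 2), len(byte_ngrams("中国".encode("utf-8"), 3)))
```

With training texts this small the classifier is only reliable on inputs that share many trigrams with them, which is the same short-text limitation described earlier.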

There are actually a large number of open source language detectors available, but as the quality varies widely, I have listed only a few of the better ones below. The better open source implementations are the equal of the commercial offerings in quality. Besides costing nothing, they have the advantage that, if you can invest the time, you can use your own training set to further improve the results for your particular corpus.

Open Source


Java
optimaize: used by the latest version of Tika, and a vast improvement over what was previously used.
Shuyo: Naive Bayes, code point based

C++
Compact Language Detector 2: Naive Bayes, code point based; accepts supplemental hints such as the expected language, the original document encoding, and the top-level domain of the document URL

Python
LangID: comes with a confidence score
Langdetect: port of the Shuyo language-detection library to Python

JavaScript
franc: supports 176 “languages” by default.

Commercial


Language Detection API (provides API clients for Java, Python, Ruby, PHP, C#, Crystal)
LingPipe SDK (Java)
Rosette SDK and API