Natural Language Identification/Detection
Oct 24, 2016
Language Identification
In natural language processing, language identification (or language detection) is the problem of determining the natural language of a given piece of text. Much like character encoding detection, 100% reliability is not possible for short texts. Take, for example, the words ballet, date, empire, image, and menu. They are spelled exactly the same in French and English, so from any one of these words alone it is impossible to determine the language. The problem extends to other languages: the characters for China, 中国, are the same in Simplified Chinese and Japanese. Fortunately, in most cases, and particularly with longer samples of text, it is possible to determine the language. Moreover, depending on your use case, you may be able to use additional information, such as user language preferences, to narrow down which languages to consider.
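As a rough illustration of narrowing the candidates with user preferences, here is a minimal Python sketch. The function name `pick_language`, the `allowed` parameter, and the scores themselves are all made up for the example; a real detector would supply its own scores.

```python
def pick_language(scores, allowed=None):
    """Return the best-scoring language, optionally restricted to `allowed`.

    `scores` maps language codes to detector scores (higher is better).
    `allowed` is a hypothetical set of languages taken from user preferences.
    """
    candidates = scores
    if allowed:
        restricted = {lang: score for lang, score in scores.items() if lang in allowed}
        # Fall back to the full set if none of the preferred languages were scored.
        if restricted:
            candidates = restricted
    return max(candidates, key=candidates.get)


# Invented scores for an ambiguous two-character string such as 中国.
ambiguous = {"zh": 0.51, "ja": 0.49}

print(pick_language(ambiguous))                        # zh
print(pick_language(ambiguous, allowed={"ja", "en"}))  # ja
```

With a near tie like this one, the preference list decides the outcome, which is about the best you can do for very short input.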
There are a number of approaches to language identification, including the naïve one of checking character ranges, but the vast majority of serious tools use n-grams. N-grams have the advantage that byte n-grams can be used when the character encoding is not known, which is often the case when you don’t know the language. Code point n-grams work better, but they require knowing the encoding. Where the identifiers differ is in the value they choose for n, how they analyze the statistics, and which corpora they use for training.
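To make the n-gram idea concrete, here is a small Python sketch of a trigram classifier that can work on either code points or UTF-8 bytes. It is only a toy under stated assumptions: the function names, the two one-sentence training samples, and the choice of Naive Bayes scoring with add-one smoothing are mine for illustration, not the method of any particular tool listed below.

```python
import math
from collections import Counter


def ngrams(text, n=3, use_bytes=False):
    """Return overlapping n-grams over code points, or over UTF-8 bytes."""
    text = text.lower()
    units = text.encode("utf-8") if use_bytes else text
    return [units[i:i + n] for i in range(len(units) - n + 1)]


def train(samples, n=3, use_bytes=False):
    """Build one n-gram count profile per language."""
    return {lang: Counter(ngrams(text, n, use_bytes)) for lang, text in samples.items()}


def classify(text, profiles, n=3, use_bytes=False):
    """Return (language, log-probability) pairs, best first."""
    vocab = set().union(*profiles.values())
    scores = {}
    for lang, counts in profiles.items():
        # Add-one smoothing so unseen n-grams do not zero out the score.
        total = sum(counts.values()) + len(vocab) + 1
        scores[lang] = sum(
            math.log((counts[gram] + 1) / total)
            for gram in ngrams(text, n, use_bytes)
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Tiny, made-up training samples; real tools train on large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat dort dans la maison",
}

profiles = train(SAMPLES)
print(classify("the dog sleeps in the house", profiles)[0][0])
print(classify("le chat dort dans la maison", profiles)[0][0])

# The same code works on raw bytes when the encoding is unknown.
byte_profiles = train(SAMPLES, use_bytes=True)
print(classify("the dog sleeps in the house", byte_profiles, use_bytes=True)[0][0])
```

The log-probabilities double as a crude confidence measure: when the top two scores are close, as they would be for a single shared word like "menu", the result should not be trusted.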
There are a large number of open source language detectors available, but as the quality varies widely I have listed only a few of the better ones below. The better open source implementations are the equal of the commercial offerings in quality. Beyond cost, they have one further advantage over commercial implementations: if you can invest the time, you can use your own training set to further improve the results for your particular corpus.
Open Source
Java
Shuyo language-detection: Naive Bayes, code point
C++
Compact Language Detector 2: Naive Bayes, code point; allows supplemental information such as expected language, original document encoding, and document URL top-level domain name
Python
LangID: comes with a confidence score
Langdetect: port of the Shuyo language-detection library to Python
JavaScript
franc: supports 176 “languages” by default
Commercial
Language Detection API (provides API clients for Java, Python, Ruby, PHP, C#, Crystal)
LingPipe SDK (Java)
Rosette SDK and API
MeaningCloud API (n-gram)