Natural Language Identification/Detection

In natural language processing, language identification (or language detection) is the problem of determining the natural language of a given piece of text. As with character encoding detection, 100% reliability is not possible for short texts. Take, for example, the words ballet, date, empire, image, and menu. They are spelled exactly the same in French and English, so from the one word alone it is impossible to determine the language. The problem extends to other languages: the characters for China, 中国, are the same in Simplified Chinese and Japanese. Fortunately, in most cases, and particularly with longer samples of text, it is possible to determine the language. Moreover, depending on your use case, you may be able to use additional information, such as user language preferences, to narrow down which languages to consider.
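
As a rough illustration of narrowing the candidates with extra information, the sketch below takes per-language scores from some detector and restricts them to the languages a user is known to prefer. The function name `pick_language` is hypothetical, and the scores are made-up numbers for an ambiguous word like "menu", not output from any real tool.

```python
def pick_language(scores, preferred=None):
    """Return the best-scoring language, restricted to `preferred` when possible.

    `scores` maps a language code to a detector score (higher is better).
    `preferred` is an optional set of language codes, e.g. from user settings.
    """
    if preferred:
        narrowed = {lang: s for lang, s in scores.items() if lang in preferred}
        if narrowed:  # fall back to all candidates if nothing matches
            scores = narrowed
    return max(scores, key=scores.get)


# Illustrative (invented) scores for a single ambiguous word.
ambiguous = {"en": 0.34, "fr": 0.36, "nl": 0.30}

print(pick_language(ambiguous))                          # no extra information
print(pick_language(ambiguous, preferred={"en", "de"}))  # user prefers en or de
```

The fallback matters: if none of the preferred languages is among the candidates, ignoring the preference is usually safer than returning nothing.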

There are a number of approaches to language identification, including the naïve one (checking character ranges), but the vast majority of serious tools use n-grams. N-grams have the advantage that byte n-grams can be used when the character encoding is not known, which is often the case when you don’t know the language. Codepoint n-grams work better, but they require knowing the encoding. Almost all language identifiers use some variation on n-grams; they differ in the value they choose for n, how they analyze the statistics, and which corpora they use for training.
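
To make the n-gram idea concrete, here is a minimal sketch of a trigram classifier using add-one-smoothed naive Bayes with a uniform prior. It is not the implementation of any tool listed below. The names `ngrams`, `train`, and `identify` are hypothetical, and the two training sentences are toy data standing in for a real corpus.

```python
import math
from collections import Counter


def ngrams(seq, n=3):
    """Overlapping n-grams of a str (codepoint n-grams) or bytes (byte n-grams)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def train(samples, n=3):
    """Count n-grams per language; `samples` maps language code to training text."""
    return {lang: Counter(ngrams(f" {text.lower()} ", n))
            for lang, text in samples.items()}


def identify(text, model, n=3):
    """Return the language giving the text the highest smoothed log-likelihood."""
    vocab = len(set().union(*model.values()))  # distinct n-grams seen in training
    grams = ngrams(f" {text.lower()} ", n)
    scores = {}
    for lang, counts in model.items():
        total = sum(counts.values())
        # Add-one smoothing so an unseen n-gram does not zero out the score.
        scores[lang] = sum(math.log((counts[g] + 1) / (total + vocab))
                           for g in grams)
    return max(scores, key=scores.get)


# Toy training data (an assumption for illustration; real tools use large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
}
MODEL = train(SAMPLES)

print(identify("the dog sleeps in the house", MODEL))
print(identify("le chat dort dans la maison", MODEL))

# The same helper yields byte n-grams when given bytes instead of str.
print(len(ngrams("中国", 2)), len(ngrams("中国".encode("utf-8"), 2)))
```

Because `ngrams` only slices its input, the same training and scoring code could run on encoded bytes when the encoding is unknown, at the cost of n-grams that split multi-byte characters.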

There are actually a large number of open source language detectors available, but as the quality varies widely, I have listed only a few of the better ones below. The better open source implementations are equal in quality to the commercial offerings. Besides cost, they have the advantage that, if you can invest the time, you can use your own training set to further improve the results for your particular corpus.

Open Source

Java

optimaize used by the latest version of Tika and a vast improvement over what was previously used.

Shuyo Naive Bayes code point

C

Compact Language Detector 2 Naive Bayes code point; allows supplemental information such as expected language, original document encoding, document URL top-level domain name

Python

LangID comes with confidence score

Langdetect Port of Shuyo language-detection library to Python

JavaScript

franc supports 176 “languages” by default.

Commercial

Language Detection API (provides API clients for Java, Python, Ruby, PHP, C#, Crystal)

LingPipe SDK (Java)

Rosette SDK and API

MeaningCloud API (n-gram)
