Automatic detection of text language with Python and NLTK. This script uses a very simple approach based on stopword comparison: the language whose stopword list has the most words in common with the text wins.
You have to install the NLTK package for Python to run this script.
Just give the script a bunch of text to analyse and it will:
- Parse and tokenize your text
- Compare the tokens with the stopword lists in the NLTK corpus, for every available language
- Select the most relevant language
- Calculate the relevancy level of the selected language
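
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code. The function names, the regex tokenizer (a stand-in for an NLTK tokenizer) and the relevancy formula (the winner's share of all stopword matches) are assumptions made for the example. The tiny `DEMO_STOPWORDS` lists are hand-written so the sketch runs without downloading the NLTK stopwords corpus.

```python
import re


def load_nltk_stopwords():
    """Return {language: set of stopwords} from the NLTK stopwords corpus.

    Needs NLTK installed and a one-time nltk.download("stopwords").
    """
    # Imported lazily so the rest of the sketch runs without NLTK.
    from nltk.corpus import stopwords
    return {lang: set(stopwords.words(lang)) for lang in stopwords.fileids()}


def tokenize(text):
    # Assumption: a simple regex stand-in for an NLTK tokenizer.
    return re.findall(r"\w+", text.lower())


def score_languages(text, stopword_sets):
    # Score = number of distinct tokens found in each language's stopword list.
    tokens = set(tokenize(text))
    return {lang: len(tokens & words) for lang, words in stopword_sets.items()}


def detect_language(text, stopword_sets):
    scores = score_languages(text, stopword_sets)
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no stopword matched in any language
    best = max(scores, key=scores.get)
    # Assumption: relevancy is the winner's share of all matches.
    return best, scores[best] / total


# Hand-written lists for illustration only, not the NLTK data.
DEMO_STOPWORDS = {
    "english": {"the", "is", "and", "of", "on", "a"},
    "french": {"le", "la", "est", "et", "de", "sur", "un"},
}

language, relevancy = detect_language(
    "Le chat est sur la table et le chien est content", DEMO_STOPWORDS
)
print(language, relevancy)
```

With NLTK and its stopwords corpus installed, `detect_language(text, load_nltk_stopwords())` would compare the text against every language the corpus provides.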
If you want to know how this script works, have a look at the blog post "Detection de langue en NLP", which I wrote (in French) on my personal blog, le-geek.com.