mmaguero/lang-detection


Detectors

Tools used for this purpose:

*: Supports the Guarani language.

Installation

Pre-requisites:

Install polyglot dependencies.

Install the requirements: pip install -r requirements.txt

Download the fastText library.

Download the crubadan corpus.

# Commented out due to the low precision of textcat; use gcld3 instead.
"""
import nltk
nltk.download('crubadan')
nltk.download('punkt')
"""

Command Line Interface

All commands must be run from the src directory.

Detect language of tweets

python run.py [data_dir] [file_name_of_tweets] [language_lexicon] --detect_language --guarani

data_dir: Path to the data directory, relative to the src directory. Required.
file_name_of_tweets: Name of the CSV file containing the tweets. Required.
language_lexicon: Name of the file containing the word lexicon of the language to identify. Optional. The lexicon is not limited to Guarani; it can belong to any low-resource language.
guarani: Flag indicating that the language to identify is Guarani (or another low-resource language). Optional. Needed when language_lexicon is used.
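The argument layout described above can be sketched with argparse. This is only an illustration of the documented interface, not the actual contents of run.py: the function names build_parser and parse_cli are hypothetical, and the check that language_lexicon requires --guarani is an assumption drawn from the parameter notes above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented run.py arguments (sketch)."""
    parser = argparse.ArgumentParser(prog="run.py")
    # Required positionals, in the documented order.
    parser.add_argument("data_dir",
                        help="data directory, relative to src")
    parser.add_argument("file_name_of_tweets",
                        help="CSV file containing the tweets")
    # Optional positional: nargs='?' lets it be omitted entirely.
    parser.add_argument("language_lexicon", nargs="?", default=None,
                        help="word lexicon of the language to identify")
    # Boolean switches.
    parser.add_argument("--detect_language", action="store_true")
    parser.add_argument("--guarani", action="store_true")
    return parser


def parse_cli(argv):
    """Parse argv and apply the assumed lexicon/flag dependency."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a lexicon is only meaningful together with --guarani.
    if args.language_lexicon is not None and not args.guarani:
        parser.error("language_lexicon requires --guarani")
    return args
```

For example, parse_cli(["../data", "tweets.csv", "lexicon.txt", "--detect_language", "--guarani"]) returns a namespace with all five fields set; the paths and file names here are placeholders.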

See also: lang, lang_2.


Note: Partially forked from https://github.com/social-link-analytics-group-bsc/tw_coronavirus in v1.0.