Polarity.tn is a web platform that detects the language and sentiment polarity of a text written in Arabic characters. It differentiates Arabic texts from Tunisian dialect texts using machine learning techniques.
These steps give an overview of the language identification pipeline of our script:
- Text Cleaning
- Construct a language classifier using supervised learning, trained on our Arabic and Tunisian corpora
- Convert the documents to feature vectors using the BOW TF-IDF method with character n-grams: max_df=0.85, min_df=0.25, ngram_range=(1, 4)
- Train and test our MultinomialNB model (the BOW parameters were chosen according to the accuracy and confusion matrix results (F1) after multiple tests)
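The vectorization and training steps above can be sketched roughly as follows. This is a minimal illustration, not the project's script: it assumes scikit-learn (the parameter names max_df, min_df and ngram_range match its vectorizers), the four sample sentences and the function name `build_language_classifier` are hypothetical, and `analyzer="char"` is an assumption about how the character n-grams are extracted. Text cleaning is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy samples: the first two are labeled Arabic, the last two
# Tunisian. The real training data are the corpus files of the repository.
TEXTS = [
    "كيف حالك اليوم",
    "أحب القراءة كثيرا",
    "شنوة أحوالك اليوم",
    "نحب نقرا برشا",
]
LABELS = ["arabic", "arabic", "tunisian", "tunisian"]


def build_language_classifier():
    """TF-IDF over character 1- to 4-grams, followed by multinomial Naive Bayes."""
    vectorizer = TfidfVectorizer(
        analyzer="char",     # assumption: n-grams may span word boundaries
        ngram_range=(1, 4),  # character n-grams of length 1 to 4
        max_df=0.85,         # drop n-grams found in more than 85% of documents
        min_df=0.25,         # drop n-grams found in fewer than 25% of documents
    )
    return make_pipeline(vectorizer, MultinomialNB())


model = build_language_classifier().fit(TEXTS, LABELS)
prediction = model.predict(["برشا نحب القراءة"])[0]
print(prediction)
```

On a corpus this small the prediction carries no meaning; the point is only the shape of the pipeline. With float values, max_df and min_df are document-frequency proportions, so their effect depends on the corpus size.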
These steps give an overview of the sentiment analysis pipeline of our script:
- Text Cleaning
- Normalization & tokenization
- Remove stop words
- Stemming
- Document representation using BOW
- Learning the classification model: we tested a Naive Bayes (NB) classifier, SVM and LR, and finally chose the NB classifier because it gave the best accuracy and confusion matrix of the three.
- Construct the final model using the entire corpus.
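The model comparison and the final training step above might look like the sketch below. It assumes scikit-learn, reads "LR" as logistic regression, and uses `LinearSVC` as one possible SVM; the sample sentences and the helper names `candidate_models` and `compare_models` are hypothetical. Cleaning, normalization, stop-word removal and stemming are omitted for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

CLASSES = ["negative", "positive"]

# Hypothetical toy data standing in for the preprocessed sentiment corpus.
TRAIN_TEXTS = [
    "الفيلم رائع جدا",
    "خدمة ممتازة وسريعة",
    "أحببت هذا المكان",
    "الفيلم سيء جدا",
    "خدمة رديئة وبطيئة",
    "كرهت هذا المكان",
]
TRAIN_LABELS = ["positive", "positive", "positive",
                "negative", "negative", "negative"]
TEST_TEXTS = ["المكان رائع", "الفيلم سيء"]
TEST_LABELS = ["positive", "negative"]


def candidate_models():
    """The three classifier families mentioned above (assumed implementations)."""
    return {
        "NB": MultinomialNB(),
        "SVM": LinearSVC(),
        "LR": LogisticRegression(max_iter=1000),
    }


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Fit each candidate on BOW counts; return accuracy and confusion matrix."""
    results = {}
    for name, classifier in candidate_models().items():
        pipeline = make_pipeline(CountVectorizer(), classifier)
        pipeline.fit(train_texts, train_labels)
        predicted = pipeline.predict(test_texts)
        results[name] = (
            accuracy_score(test_labels, predicted),
            confusion_matrix(test_labels, predicted, labels=CLASSES),
        )
    return results


results = compare_models(TRAIN_TEXTS, TRAIN_LABELS, TEST_TEXTS, TEST_LABELS)
for name, (accuracy, matrix) in results.items():
    print(name, accuracy, matrix.tolist())

# Final step: refit the chosen classifier (NB) on the entire corpus.
final_model = make_pipeline(CountVectorizer(), MultinomialNB())
final_model.fit(TRAIN_TEXTS + TEST_TEXTS, TRAIN_LABELS + TEST_LABELS)
```

Scores on a toy set like this say nothing about which model is best; the comparison only becomes meaningful on the real corpus.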
Created by Ibtihel Sidhom, Molka Zaouali and Taysir Ben Hamed in December 2018 💻
Run this command under the root directory of this repository:
$ pipenv install
To activate the virtual environment, run this command:
$ pipenv shell
To run the language identification script on the existing corpus files, you can execute this command:
$ python Generating-models/language-identification.py
You can also test it locally by uncommenting the last lines of the script and typing your input text there. Comment out the dumping part to make the script run faster.
To run the sentiment analysis script on the existing corpus files, you can execute this command:
$ python Generating-models/sentiment-analysis.py
You can also test it locally by uncommenting the last lines of the script and typing your input text there. Comment out the dumping part to make the script run faster.
To start the web application, you can execute this command:
$ python Web-application/app.py
To enlarge our dataset, the application asks you for feedback on the predicted results through a short form shown with the results of each text message.
Based on this evaluation, the text is stored in a file so that it can be added to the corpus in the future.
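As an illustration of how such feedback could be persisted, here is a minimal sketch that appends each evaluated message to a CSV file. The file name, the column names and the function `store_feedback` are all hypothetical; the storage format actually used by the web application is not described here.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout for the feedback file.
FIELDS = ["timestamp", "text", "predicted_language",
          "predicted_polarity", "user_agrees"]


def store_feedback(path, text, predicted_language, predicted_polarity,
                   user_agrees):
    """Append one feedback row, writing the header if the file is new."""
    path = Path(path)
    is_new_file = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "text": text,
            "predicted_language": predicted_language,
            "predicted_polarity": predicted_polarity,
            "user_agrees": user_agrees,
        })


if __name__ == "__main__":
    # Example call with made-up values and a made-up file name.
    store_feedback("feedback.csv", "نحب نقرا برشا", "tunisian", "positive", True)
```

Keeping the user's verdict next to the predicted labels lets the rows be reviewed before they are merged into the corpus.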
The amazing background is by the awesome street artist El Seed.