This is the implementation of our paper "SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts". You can find the paper here.
In this paper, we propose an annotated sentiment analysis dataset made of informally written Bangla texts. This dataset comprises public comments on news and videos collected from social media covering 13 different domains, including politics, education, and agriculture. These comments are labeled with one of the polarity labels, namely positive, negative, and neutral. One significant characteristic of the dataset is that each of the comments is noisy in terms of the mix of dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features provide superior performance than neural network and pretrained language models.
- Khondoker Ittehadul Islam 1
- Md Saiful Islam 1, 2
- Sudipta Kar 3
- Mohammad Ruhul Amin 4
1 Shahjalal University of Science and Technology, Bangladesh
2 University of Alberta, Canada
3 Amazon Alexa AI, USA
4 Fordham University, USA
SentNoB Dataset is available here
- Train.csv
- Val.csv
- Test.csv
Column Title | Description |
---|---|
Data | Social media comment |
Label | 0, 1 or 2 . '0' for neutral, '1' for positive and '2' for negative |
Requires the following packages:
- Python 3.9.7 or higher
It is recommended to use virtual environment packages such as virtualenv. Follow the steps below to setup the project:
- Clone this repository via
git clone https://github.com/KhondokerIslam/SentNoB.git
- Use this command to install required packages
pip install -r requirements.txt
- Type
setup.sh
to download bangla fastText embeddings
- Download the SentNoB dataset from here
- Unzip the folder
- Ensure the folder name is "SentNoB Dataset"
- Go to data_processing folder and run
python preprocess.py
to obtain the preprocessed data.
- Go to Models folder
- Use
python feature_based.py
- Type in the model name when you will be asked to specify the model name in the console
- Model Names (Please follow the paper to read the details about experiments):
- Unigram
- Bigram
- Trigram
- U+B
- B+T
- U+B+T
- Char 2-gram
- Char 3-gram
- Char 4-gram
- Char 5-gram
- C2+C3
- C3+C4
- C4+C5
- C2+C3+C4
- C3+C4+C5
- C2+C3+C4+C5
- U+B+C3+C4+C5
- U+B+C2+C3+C4+C5
- U+B+T+C2+C3+C4+C5
- Embeddings
- U+B+C2+C3+C4+C5+E
- U+B+T+C2+C3+C4+C5+E
- Go to Models folder
- Use "python neural_network_(random).py" to run an experiment.
- Go to Models folder
- Use "python neural_network_(fasttext).py" to run an experiment.
- Go to Models folder
- Use "python mbert.py" to run an experiment.
@inproceedings{islam2021sentnob,
title={SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts},
author={Islam, Khondoker Ittehadul and Kar, Sudipta and Islam, Md Saiful and Amin, Mohammad Ruhul},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},
pages={3265--3271},
year={2021}
}