SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts

This is the implementation of our paper "SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts". You can find the paper here.

Abstract

In this paper, we propose an annotated sentiment analysis dataset made of informally written Bangla texts. This dataset comprises public comments on news and videos collected from social media covering 13 different domains, including politics, education, and agriculture. These comments are labeled with one of the polarity labels, namely positive, negative, and neutral. One significant characteristic of the dataset is that each of the comments is noisy in terms of the mix of dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features outperform neural network and pretrained language models.

Authors

  • Khondoker Ittehadul Islam 1
  • Md Saiful Islam 1, 2
  • Sudipta Kar 3
  • Mohammad Ruhul Amin 4

1 Shahjalal University of Science and Technology, Bangladesh

2 University of Alberta, Canada

3 Amazon Alexa AI, USA

4 Fordham University, USA

SentNoB Dataset is available here

List of files

  • Train.csv
  • Val.csv
  • Test.csv

File Format

| Column Title | Description |
| ------------ | ----------- |
| Data         | Social media comment |
| Label        | 0, 1, or 2: 0 for neutral, 1 for positive, and 2 for negative |
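
As a minimal sketch of reading one of these files, the helper below maps each numeric label to its name. It assumes every CSV has a header row with the column titles Data and Label and is encoded as UTF-8; the name load_split is hypothetical and not part of this repository.

```python
import csv

# Label meanings as documented in this README.
LABEL_NAMES = {0: "neutral", 1: "positive", 2: "negative"}


def load_split(path):
    """Read one split and return a list of (comment, label_name) pairs.

    Hypothetical helper: assumes a header row with "Data" and "Label" columns.
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            label = int(record["Label"])
            rows.append((record["Data"], LABEL_NAMES[label]))
    return rows
```

For example, load_split("SentNoB Dataset/Train.csv") would return the training comments paired with their label names, provided the file sits at that (assumed) path.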

Installation

Requires the following packages:

  • Python 3.9.7 or higher
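
A quick way to confirm that the active interpreter satisfies this requirement is sketched below. The function name meets_minimum is hypothetical; only the version number comes from this README.

```python
import sys

# Minimum interpreter version stated in the requirements above.
MIN_VERSION = (3, 9, 7)


def meets_minimum(version=None):
    """Return True if the (major, minor, micro) tuple is at least MIN_VERSION.

    With no argument, the running interpreter's version is checked.
    """
    if version is None:
        version = tuple(sys.version_info[:3])
    return tuple(version) >= MIN_VERSION
```

Calling meets_minimum() inside the virtual environment reports whether that environment's Python is recent enough.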

We recommend using a virtual environment tool such as virtualenv. Follow the steps below to set up the project:

  • Clone this repository: git clone https://github.com/KhondokerIslam/SentNoB.git
  • Install the required packages: pip install -r requirements.txt
  • Run setup.sh to download the Bangla fastText embeddings

Usage

  1. Download the SentNoB dataset from here
  2. Unzip the folder
  3. Ensure the folder name is "SentNoB Dataset"
  4. Go to the data_processing folder and run python preprocess.py to obtain the preprocessed data.
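
Before running the preprocessing step, it can help to confirm that the unzipped folder has the expected name and files. The sketch below is one way to do that. It assumes the three CSV files sit directly inside the "SentNoB Dataset" folder and that the folder is located under the directory passed as root; neither detail is stated in this README, and missing_files is a hypothetical helper.

```python
from pathlib import Path

# Folder and file names as given in this README.
DATASET_DIR = "SentNoB Dataset"
EXPECTED_FILES = ("Train.csv", "Val.csv", "Test.csv")


def missing_files(root="."):
    """Return the expected dataset files that are absent under root.

    Assumption: the CSV files live directly inside the dataset folder.
    """
    folder = Path(root) / DATASET_DIR
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

An empty list means all three splits were found; otherwise the returned names show what still needs to be downloaded or renamed.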

Feature-Based Experiments

  • Go to the Models folder
  • Run python feature_based.py
  • When prompted in the console, type the model name
  • Model names (see the paper for details of the experiments):
    • Unigram
    • Bigram
    • Trigram
    • U+B
    • B+T
    • U+B+T
    • Char 2-gram
    • Char 3-gram
    • Char 4-gram
    • Char 5-gram
    • C2+C3
    • C3+C4
    • C4+C5
    • C2+C3+C4
    • C3+C4+C5
    • C2+C3+C4+C5
    • U+B+C3+C4+C5
    • U+B+C2+C3+C4+C5
    • U+B+T+C2+C3+C4+C5
    • Embeddings
    • U+B+C2+C3+C4+C5+E
    • U+B+T+C2+C3+C4+C5+E
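
The model names appear to combine feature types with "+". The sketch below illustrates that naming scheme under an assumed reading: U, B, and T stand for word unigrams, bigrams, and trigrams, C2 to C5 for character n-grams of that length, and E for embeddings. This mapping is inferred from the list above and is not confirmed by the repository; parse_model_name and ngrams are hypothetical helpers, not the code in feature_based.py.

```python
# Assumed meaning of each abbreviation, inferred from the model name list.
COMPONENTS = {
    "U": ("word", 1), "B": ("word", 2), "T": ("word", 3),
    "C2": ("char", 2), "C3": ("char", 3), "C4": ("char", 4), "C5": ("char", 5),
    "E": ("embedding", None),
}

# Full names from the list mapped to their assumed abbreviation.
FULL_NAMES = {
    "Unigram": "U", "Bigram": "B", "Trigram": "T",
    "Char 2-gram": "C2", "Char 3-gram": "C3",
    "Char 4-gram": "C4", "Char 5-gram": "C5",
    "Embeddings": "E",
}


def parse_model_name(name):
    """Split a model name such as "U+B+C3" into (kind, n) components."""
    name = FULL_NAMES.get(name, name)
    return [COMPONENTS[part] for part in name.split("+")]


def ngrams(text, kind, n):
    """Return word or character n-grams of text as a list of strings."""
    units = text.split() if kind == "word" else list(text)
    joiner = " " if kind == "word" else ""
    return [joiner.join(units[i:i + n]) for i in range(len(units) - n + 1)]
```

Under this reading, "U+B+C3" would combine word unigrams, word bigrams, and character 3-grams. Note that list(text) splits on Unicode code points, which for Bangla script is not the same as splitting on visible characters; how the paper's features treat this is described in the paper, not here.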

Neural Network Experiments

Random Initialization

  • Go to the Models folder
  • Run "python neural_network_(random).py" to run an experiment.

FastText

  • Go to the Models folder
  • Run "python neural_network_(fasttext).py" to run an experiment.

mBert

  • Go to Models folder
  • Use "python mbert.py" to run an experiment.
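
Two of the neural network script names contain parentheses, which shells such as bash treat as special characters, so the file name may need quoting when typed on a command line. The sketch below shows two ways around this: passing the name as a separate argument, and quoting it with shlex.quote. The dictionary keys and function names are hypothetical; only the script names come from this README.

```python
import shlex
import sys

# Script names as given in this README; the keys are illustrative labels.
SCRIPTS = {
    "random": "neural_network_(random).py",
    "fasttext": "neural_network_(fasttext).py",
    "mbert": "mbert.py",
}


def build_command(experiment):
    """Return an argument list that needs no shell quoting."""
    return [sys.executable, SCRIPTS[experiment]]


def shell_line(experiment):
    """Return an equivalent command line with the script name quoted."""
    return "python " + shlex.quote(SCRIPTS[experiment])
```

The argument list can be handed to subprocess.run(build_command("random"), cwd="Models", check=True), assuming the current directory is the repository root; shell_line("random") gives a string that is safe to paste into a POSIX shell.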

BibTeX

@inproceedings{islam2021sentnob,
  title={SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts},
  author={Islam, Khondoker Ittehadul and Kar, Sudipta and Islam, Md Saiful and Amin, Mohammad Ruhul},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},
  pages={3265--3271},
  year={2021}
}