GURA, short for GRU Unit for Recognizing Affect, is a simply sentiment classification model that leverages the power of Bidirectional GRU
and Self-attention
mechanisms.
This is a sample project for a dataset of MyAnimeList in myanimelist-comment-dataset.
You can get the dataset from Hugging Face.
https://huggingface.co/datasets/NatLee/sentiment-classification-dataset-bundle
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone --depth=1 git@hf.co:datasets/NatLee/sentiment-classification-dataset-bundle ./data
-
myanimelist-sts
: This dataset is derived from MyAnimeList, a social networking and cataloging service for anime and manga fans. The dataset typically includes user reviews with ratings. We used skip-thoughts to summarize them. You can find the original source of the dataset myanimelist-comment-dataset and the version is2023-05-11
. -
aclImdb
: The ACL IMDB dataset is a large movie review dataset collected for sentiment analysis tasks. It contains 50,000 highly polar movie reviews, divided evenly into 25,000 training and 25,000 test sets. Each set includes an equal number of positive and negative reviews. The source is from sentiment -
MR
: Movie Review Data (MR) is a dataset that contains 5,331 positive and 5,331 negative processed sentences/lines. This dataset is suitable for binary sentiment classification tasks, and it's a good starting point for text classification models. You can find the source movie-review-data and the section isSentiment scale datasets
. -
MPQA
: The Multi-Perspective Question Answering (MPQA) dataset is a resource for opinion detection and sentiment analysis research. It consists of news articles from a wide variety of sources annotated for opinions and other private states. You can get the source from MPQA -
SST2
: The Stanford Sentiment Treebank version 2 (SST2) is a popular benchmark for sentence-level sentiment analysis. It includes movie review sentences with corresponding sentiment labels (positive or negative). You can obtain the dataset from SST2 -
SUBJ
: The Subjectivity dataset is used for sentiment analysis research. It consists of 5000 subjective and 5000 objective processed sentences, which can help a model to distinguish between subjective and objective (factual) statements. You can find the source movie-review-data and the section isSubjectivity datasets
.
- Build the image for this task
docker-compose build
-
Make tokenizer with your data in the folder
./data
.docker run --rm \ --name sentiment-tokenizer \ -v $(pwd)/src:/src \ -v $(pwd)/data:/data \ -v $(pwd)/model:/model \ sentiment-image:latest \ bash -c "python make-tokenizer.py"
-
Train the sentiment model.
docker run --rm \ --name sentiment-runner \ -v $(pwd)/src:/src \ -v $(pwd)/data:/data \ -v $(pwd)/model:/model \ sentiment-image:latest \ bash -c "python main.py --mpqa"
You can run with all dataset at once. Just change the last line to the following:
python -i main.py --imdb --sst2 --mpqa --mr --subj --anime
-
(Optional) Make an auto-training for getting stable model in accuracy.
docker run --rm \ --name sentiment-runner \ -v $(pwd)/src:/src \ -v $(pwd)/data:/data \ -v $(pwd)/model:/model \ sentiment-image:latest \ bash -c "python command.py"
-
(Optional) Evaluate and inference the model.
docker run --rm \ --name sentiment-evaluate \ -v $(pwd)/src:/src \ -v $(pwd)/data:/data \ -v $(pwd)/model:/model \ sentiment-image:latest \ bash -c "python evaluate.py"
You'll get matrices like this:
2023-05-12 15:28:33.054 | INFO | dataLoader:loadOfficialIMDB:128 - Load official IMDB data. 2023-05-12 15:28:33.053 | INFO | __main__:<module>:55 - Model file path -> /model/modelOurs/imdb-epoch_0003-acc_0.7628-valAcc_0.7310.h5 2048/25000 [=>............................] - ETA: 4s ... 25000/25000 [==============================] - 56s 2ms/step 2023-05-12 15:29:40.376 | INFO | __main__:predict:50 - 0.73096 2023-05-12 15:29:40.444 | INFO | __main__:predict:51 - precision recall f1-score support Negative 0.73 0.72 0.73 12500 Positive 0.73 0.74 0.73 12500 accuracy 0.73 25000 macro avg 0.73 0.73 0.73 25000 weighted avg 0.73 0.73 0.73 25000
When you successfully train the model with specified dataset, you cat get the plot in the ./model
.
OS: Ubuntu 22.04.2 LTS x86_64
Kernel: 5.19.0-41-generic
Shell: zsh 5.8.1
CPU: 12th Gen Intel i5-1235U (12) @ 4.400GHz
Memory: 15770MiB
N. Lee, "GURA: GRU Unit for Recognizing Affect," GitHub, 2019. [Online]. Available: https://github.com/NatLee/GURA-gru-unit-for-recognizing-affect.
Nat Lee |