Official implementation of the paper titled Leveraging Dependency Grammar for Fine-Grained Offensive Language Detection using Graph Convolutional Networks. We propose a novel neural network architecture called SyLSTM, which integrates the inherent dependency structure of an input tweet into the deep feature space of the model to overcome biases introduced by overly used pejorative word senses. We test our approach on two open-source datasets for the task of fine-grained offensive language detection, and find that SyLSTM strong baselines such as the state-of-the-art BERT model while utilizing orders of magnitude fewer number of parameters.
Structure the repo in the following format after downloading the dataset from source.
|-- [root] SyLSTM\
|-- [DIR] sylstm\
|-- [DIR] data\
|-- training_data.csv
|-- [DIR] logs\
|-- [DIR] checkpoints\
- python 3.6 or 3.7
- torch>=1.4.0
- transformers==3.0.0
- scikit-learn
- nltk
- spacy
- pyenchant
- compound-word-splitter
Using a docker image requires an NVIDIA GPU. If you do not have a GPU please follow the directions for installing from source. In order to get GPU support you will have to use the nvidia-docker2 plugin. The docker image is cached on the GPU with id 0. In case of OOM errors at training, pass two GPUs.
# Download data for the chosen task using the respective link.
# Please maintain the directory tree provided above.
# Build the Dockerfile to create a Docker image.
docker build -t dgoel04/sylstm_guest:1.0 .
# This will create a container from the image we just created.
docker run -it --gpus '"device=gpu-ids"' dgoel04/sylstm_guest:1.0
-
Clone this repository.
git clone https://github.com/dv-fenix/SyLSTM.git
cd SyLSTM
-
Download the data for the chosen task using the respective links. Please follow the directory tree given in Repo Structure
NOTE: You may have to change the names of the downloaded data files in order to be compliant with the aforementioned directory tree. -
Create a python virtual environment to run your experiments.
python3 -m venv sylstm_venv
source sylstm_venv/bin/activate
-
Install the requirements given in
requirements.txt
.
pip install --upgrade pip
pip install -r requirements.txt
-
Change working directory to run the desired experiment.
cd sylstm
The first step towards training the SyLSTM is pre-processing the Twitter data and extracting the Dependency Parse Trees (DPTs). We make use of the spaCy toolkit for extracting the DPT of an input tweet. Please refer to our paper for more information on the pre-processing module developed here.
sh preprocess.sh
The code to train the SyLSTM is fairly modular. Here we provide a shell script with a sample training configuration for the SyLSTM.
# For more information on the optional experimental setups and configurations.
python ./sylstm/train.py --help
# You can manually change the arguments in run_train.sh to choose the different SyLSTM configurations.
sh run_train.sh
Please make sure that all the arguments are to your liking before getting started with the training!
If you use this code in your study/project, please cite the paper:
@inproceedings{goel2022leveraging,
title={Leveraging Dependency Grammar for Fine-Grained Offensive Language Detection using Graph Convolutional Networks},
author={Goel, Divyam and Sharma, Raksha},
booktitle={Proceedings of the Tenth International Workshop on Natural Language Processing for Social Media},
pages={45--54},
year={2022}
}