This README describes how to reproduce the experiments in the paper: Mika Juuti, Tommi Gröndahl, Adrian Flanagan, and N. Asokan. "A little goes a long way: Improving toxic language classification despite data scarcity".
This code has been tested on a machine with the following configuration:
- CPU: Intel Core i9-9900K CPU @ 3.60GHz
- RAM: 32 GB
- GPU: GeForce RTX 2080 Ti (Driver Version: 435.21)
Install Anaconda3 from https://www.anaconda.com/distribution/#linux. At the time of writing this report, the most recent version was https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
Download the latest version, e.g.
$ wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh && bash Anaconda3-2020.02-Linux-x86_64.sh
Create a conda environment:
$ conda create python==3.7.6 -n py3
Activate the conda environment, e.g.
$ source activate py3
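To confirm that the correct interpreter is now active, the reported Python version should match the one requested above (3.7.6):
$ python --version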
Install PyTorch and torchvision:
$ conda install pytorch==1.4.0 torchvision==0.5.0 -c pytorch
Install the remaining requirements:
$ pip install -r requirements.txt
Download the spaCy en_core_web_lg model:
$ python -m spacy download en_core_web_lg
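(Optional) As a quick sanity check (not part of the original instructions), loading the model in Python should succeed without errors:
$ python -c "import spacy; spacy.load('en_core_web_lg')"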
(Optional) Verify that PyTorch and CUDA are correctly installed:
$ python
>>> import torch
>>> torch.FloatTensor([1]).cuda()
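The same check can be run non-interactively; the one-liner below is a sketch that also prints whether PyTorch can see a CUDA device:
$ python -c "import torch; print(torch.cuda.is_available()); print(torch.FloatTensor([1]).cuda())"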
(Optional) Install NVIDIA apex for mixed-precision BERT:
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir ./
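(Optional) A quick way to check that apex was installed into the active environment is to import its mixed-precision module (assuming the default Python-only build above):
$ python -c "from apex import amp"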
Download Kaggle's toxic comment classification dataset from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge and extract it to ./data/jigsaw-toxic-comment-classification-challenge
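Alternatively, if you have the Kaggle API client installed and configured with your credentials (an assumption; it is not listed in requirements.txt), the data can be fetched from the command line, e.g.
$ pip install kaggle
$ kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
$ unzip jigsaw-toxic-comment-classification-challenge.zip -d data/jigsaw-toxic-comment-classification-challenge
Note that you must have accepted the competition rules on the Kaggle website for the download to work.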
(Optional) For running EDA, download eda.py:
$ cd src/eda_scripts
$ bash get_eda.sh
Run the experiments:
$ cd src && python run_experiments.py