kvah/ling-573-offensive-tweet-detection

ling-573-group-repo

Task Description

Primary Task

An end-to-end system for classifying English tweets as offensive or non-offensive, based on the OffensEval 2019 Shared Task (subtask A).

Adaptation Task

An end-to-end system for classifying Greek tweets as offensive or non-offensive, based on the OffensEval 2020 Shared Task (subtask A).

Changes in D4

Primary Task

Embeddings and Classification

  • GloVe embeddings + bidirectional LSTM replaced by a RoBERTa-base model
  • Model fine-tuning and hyperparameter tuning

Adaptation Task

Additional Preprocessing

  • Removing diacritics
  • Converting Unicode data to ASCII characters
  • Lemmatization
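As a rough illustration (not the repository's actual code), the diacritic-removal step can be done with Python's standard unicodedata module; the ASCII-conversion and lemmatization steps would need additional tooling:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose each character (NFD), then drop the combining marks,
    # which is where diacritics such as the Greek tonos end up.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("καλημέρα"))  # -> καλημερα
print(strip_diacritics("café"))     # -> cafe
```

This only removes accent marks; it does not transliterate Greek letters to ASCII, so the Unicode-to-ASCII step in the pipeline presumably uses a separate transliteration tool.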

Embeddings and Classification

  • XLM-RoBERTa model
  • Model fine-tuning and hyperparameter tuning

Instructions

1. Prerequisites

Install Anaconda

If necessary, download and install Anaconda by running the following commands:

wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
sh Anaconda3-2021.11-Linux-x86_64.sh

Download best models for primary and adaptation tasks

  • (Not needed for D4.cmd) Download the best model for the primary task and place the entire folder (containing config.json and pytorch.bin) in models/.

  • Download the best model for the adaptation task and place the entire folder (containing config.json and pytorch.bin) in models/.

  • Note that the model for the primary task (the folder containing config.json and pytorch.bin) should be named finetune_roberta, and the model for the adaptation task should be named finetune_xlmr_large_final_greek.

  • Both models should be accessible to anyone logged into a UW Google account.

  • The following is an example of the directory structure of the model for the adaptation task:

models/finetune_xlmr_large_final_greek
models/finetune_xlmr_large_final_greek/config.json
models/finetune_xlmr_large_final_greek/pytorch.bin
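A quick sanity check that a downloaded model folder matches this layout can be sketched as follows (check_model_dir is a hypothetical helper, not part of the repository):

```python
from pathlib import Path

def check_model_dir(model_dir, required=("config.json", "pytorch.bin")):
    """Return the list of required files missing from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]

# Example: report anything missing before launching the pipeline.
missing = check_model_dir("models/finetune_xlmr_large_final_greek")
if missing:
    print(f"Missing files: {missing}")
```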

2. Run the Condor Script

condor_submit D4.cmd

Notes:

  • For the purposes of this deliverable, the preprocessing and training steps are commented out in the main script (D4_run.sh).
  • The Condor script activates an existing conda environment on patas; there is no need to create or update the conda environment.

In summary, the pipeline:

  1. Pre-processes the Offensive Greek Twitter Dataset (OGTD) training and test data.
  2. Fine-tunes the pretrained model (XLM-RoBERTa) on the Greek training data.
  3. Runs the fine-tuned model on the Greek test data and saves the output predictions to outputs/D4/adaptation/evaltest/D4_greek_preds.csv.
  4. Saves the final F1 score to results/D4/adaptation/evaltest/D4_scores.out.
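OffensEval subtask A is conventionally scored with macro-averaged F1 over the two classes (OFF / NOT). A minimal sketch of that metric, assuming gold and predicted label lists (not the repository's scoring code):

```python
def macro_f1(gold, pred, labels=("OFF", "NOT")):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# One OFF tweet misclassified as NOT: both classes get F1 = 2/3.
print(macro_f1(["OFF", "NOT", "OFF"], ["OFF", "NOT", "NOT"]))
```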