Hello!
Here is what you'll find in this repository:
- A high-level overview of the components of our solution that led to our final score (Solution Summary)
- A summary of the individual components of our solution (Component Summary)
- Complete data-processing, model-training, scoring, and blending code for all components
Here is what you'll NOT find in this repository:
- The input / translated data and cached models (due to GitHub's file size limit)
- Point-and-click reproducibility (due to the absence of cached input and processed data / models)
Some additional info can be found on this blog
There are three ways to build the solution:
1. Very fast prediction: a) runs in a few minutes; b) uses precomputed neural network predictions.
2. Ordinary prediction (uses models from Output/Models/Igor/, Output/Models/Moiz/, Input/Ujjwal/Data/step-[2-3]/*h5): a) expect this to run for 5-6 hours; b) uses binary model files.
3. Retrain models: a) expect this to run for about 1-2 days; b) trains all models from scratch; c) follow this with (2) to produce the entire solution from scratch.

The command to run each build is below:
1. very fast prediction (overwrites Output/Predictions/submission.csv): python ./blend.py
2. ordinary prediction (overwrites Output/Predictions/submission.csv, Output/Predictions/Moiz/*csv, Output/Predictions/Ujjwal/*csv, Output/Models/Igor/{lang}/*probs.csv): python ./inference.py
3. retrain models (overwrites models in Output/Models/Igor_dev/, Output/Models/Moiz_dev/, Input/Ujjwal/Data/step-[2-3]/*h5): python ./train.py
Input/Igor/test.csv.zip : original Kaggle test data
Input/Igor/validation.csv.zip : original Kaggle validation data
Input/Igor/train_data_{lang}_google.csv.zip : train data translated via the Google translator (https://www.kaggle.com/miklgr500/jigsaw-train-multilingual-coments-google-api)
Input/Igor/train_data_{lang}_yandex.csv.zip : train data translated via the Yandex translator (https://www.kaggle.com/ma7555/jigsaw-train-translated-yandex-api)
Input/Igor/633287_1126366_compressed_open-subtitles-synthesic.csv.zip : pseudo-labeled Open Subtitles data (https://www.kaggle.com/shonenkov/open-subtitles-toxic-pseudo-labeling)
Input/Common/Raw/ : original Kaggle datasets plus pseudo-labeled Open Subtitles and pseudo-labeled test data
Input/Common/Processed_Ujjwal/ : translated datasets
Input/Moiz/train/ : data produced by ./prepare_data_train.py
Input/Moiz/test/ : data produced by ./prepare_data_inference.py
Output/Models/Igor/{lang}/*bin : PyTorch checkpoints of mono-lingual models
Output/Models/Moiz/*h5 : TF checkpoints of MAS models
Input/Ujjwal/Data/step-[2-3]/*h5 : TF checkpoints of MLM models
Output/Models/Igor/{lang}/*probs.csv : predictions of mono-lingual models
Output/Predictions/Moiz/*csv : predictions of MAS models
Input/Ujjwal/Data/*tta.csv : predictions of MLM models
Output/Predictions/submission.csv : final submission file
- v3-128 TPU - needed for TF training (all TF models were trained via Kaggle). It is important that the instance has 16 GB of memory per core (128 GB in total).
- 64 GB RAM - needed for PyTorch training (all PyTorch models were trained via Google Colab Pro, which has more RAM than a Kaggle instance but less TPU memory per core (8 GB vs 16 GB)).
- Internet access for downloading packages.
Python 3.6.9
WARNING! Do not install pytorch-xla-env-setup.py before starting the TF code: there is an incompatibility between using the TPU via TF and via PyTorch in the same instance runtime. The valid sequence of steps (including package installation) is in ./train.py and ./inference.py.
The train/predict code will also call this script if it has not already been run on the relevant data.
python ./prepare_data_train.py
python ./prepare_data_inference.py
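A minimal sketch of how such a guard could look (the exact check lives in ./train.py and ./inference.py; the marker directories below are assumptions based on the folder layout above):

```python
import os
import subprocess

def ensure_prepared(marker_dir, script):
    """Run the data-preparation script only if its output directory is still empty."""
    if not os.path.isdir(marker_dir) or not os.listdir(marker_dir):
        subprocess.run(["python", script], check=True)

# Hypothetical usage mirroring the entry points described above.
ensure_prepared("Input/Moiz/train", "./prepare_data_train.py")
ensure_prepared("Input/Moiz/test", "./prepare_data_inference.py")
```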
############## Ujjwal model description

The following code produces the submission file for the MLM part.
The code is borrowed from @riblidezso's following notebooks:
This code assumes TPU access.
There are two sources of input data:
- source_1:
  - train_english: the given English dataset of toxic comments
  - train_foreign: the train_english dataset translated to foreign languages
  - valid_english: the validation data translated to English
  - valid_foreign: the original validation dataset
  - test_english: the test dataset translated to English
  - test_foreign: the original test dataset
  - subtitle: the Open Subtitles dataset
  - pseudo_label: the given test dataset, pseudo-labeled based on our model prediction scores
- source_2:
We used three different input pipelines to pre-train the XLM models. We translated each record into several languages (en, es, fr, tr, ru, it, pt) to obtain more data for pre-training the model; a sketch of how these corpora might be assembled follows the list below.
- Version 1: Translated train, valid and test
- Version 2: Translated train and open subtitle dataset
- Version 3: Translated validation and test
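As a minimal illustration (the file names and directory layout below are assumptions, not the actual paths used by the scripts), each version can be thought of as a concatenation of the translated copies of the relevant datasets:

```python
import pandas as pd

def load_translated(name, langs=("en", "es", "fr", "tr", "ru", "it", "pt")):
    """Concatenate the translated copies of one dataset across the target languages."""
    # Hypothetical file layout; the real translated files live under Input/Common/Processed_Ujjwal/.
    frames = [pd.read_csv(f"Input/Common/Processed_Ujjwal/{name}_{lang}.csv") for lang in langs]
    return pd.concat(frames, ignore_index=True)

# The three pre-training corpora described above.
version_1 = pd.concat([load_translated("train"), load_translated("valid"), load_translated("test")])
version_2 = pd.concat([load_translated("train"), load_translated("subtitle")])
version_3 = pd.concat([load_translated("valid"), load_translated("test")])
```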
Code: encode.py
Input: source/source_1 files
Output: encoded npz arrays (step_1)
We encoded the text in the CSV files into numpy arrays of numerical encodings in order to reduce the TPU runtime of the notebooks. The encoded arrays can be found in this Kaggle Dataset.
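The exact tokenizer settings live in encode.py; as a rough illustration, an encoding step along these lines (the file names, max_len, and label column are assumptions) converts a CSV of comments into a compressed npz of token-ID arrays:

```python
import numpy as np
import pandas as pd
from transformers import AutoTokenizer

# Assumed settings; the actual values are defined in encode.py.
MAX_LEN = 192
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def encode_csv(csv_path, text_col="comment_text", label_col=None):
    """Tokenize a CSV of comments into fixed-length token-ID arrays."""
    df = pd.read_csv(csv_path)
    enc = tokenizer(
        df[text_col].astype(str).tolist(),
        max_length=MAX_LEN,
        padding="max_length",
        truncation=True,
        return_attention_mask=False,
    )
    arrays = {"input_ids": np.asarray(enc["input_ids"], dtype=np.int32)}
    if label_col is not None:
        arrays["labels"] = df[label_col].values.astype(np.float32)
    return arrays

# Example: encode the English training data and save it for the TPU notebooks (paths are hypothetical).
out = encode_csv("Input/Ujjwal/Data/source/source_1/train_english.csv", label_col="toxic")
np.savez_compressed("Input/Ujjwal/Data/step_1/train_english.npz", **out)
```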
Code: pretrain_xlm.py
Input: encoded npz arrays (step_1)
Output: XLM model weights (step_2/version*)
We used the three input versions to pre-train three XLM-Roberta models using masked language modeling.
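Masked language modeling randomly hides a fraction of the input tokens and trains the model to reconstruct them. The details are in pretrain_xlm.py; the following is a minimal numpy sketch of the standard BERT-style masking scheme (the 15% rate, 80/10/10 split, and special-token IDs are assumptions, not values taken from the script):

```python
import numpy as np

MASK_TOKEN_ID = 250001   # <mask> id for XLM-R (assumed; take it from the tokenizer in practice)
VOCAB_SIZE = 250002      # XLM-R vocabulary size (assumed)
SPECIAL_IDS = {0, 1, 2}  # <s>, <pad>, </s> are never masked

def mask_tokens(input_ids, mask_prob=0.15, rng=np.random.default_rng(0)):
    """Return (masked_inputs, labels) for masked-language-model pre-training."""
    inputs = input_ids.copy()
    labels = np.full_like(inputs, -100)          # -100 = position ignored by the loss

    candidates = ~np.isin(inputs, list(SPECIAL_IDS))
    selected = candidates & (rng.random(inputs.shape) < mask_prob)
    labels[selected] = inputs[selected]

    # Of the selected tokens: 80% -> <mask>, 10% -> random token, 10% -> left unchanged.
    roll = rng.random(inputs.shape)
    inputs[selected & (roll < 0.8)] = MASK_TOKEN_ID
    replace_random = selected & (roll >= 0.8) & (roll < 0.9)
    random_ids = rng.integers(0, VOCAB_SIZE, size=inputs.shape)
    inputs[replace_random] = random_ids[replace_random]
    return inputs, labels
```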
The Kaggle Scripts corresponding to three versions are:
These three versions output three XLM-Roberta models that are used for supervised training in the next step. The models are saved here.
Code: finetune_xlm.py
Input: source/source_2 files, step-2 input models, fold-idx
Output: step-3 model weights
The three models from the previous step are fine-tuned on the task labels in this step. The models train best with a downsampled 1:1 ratio of toxic to non-toxic labels. To ensure this, each fine-tuning task is triggered ~10 times, each time with a different subset of the non-toxic labels. To add more diversity to the training pipeline, in half of the runs pseudo-labels (generated from our predictions) were added to the validation dataset.
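As a rough illustration, the 1:1 downsampling could be done per run along these lines (the column names, file name, and seeding scheme are assumptions; the actual logic is in finetune_xlm.py):

```python
import pandas as pd

def downsample_balanced(df, run_idx, label_col="toxic"):
    """Keep all toxic rows and a different random subset of non-toxic rows per run."""
    toxic = df[df[label_col] == 1]
    non_toxic = df[df[label_col] == 0].sample(n=len(toxic), random_state=run_idx)
    return pd.concat([toxic, non_toxic]).sample(frac=1.0, random_state=run_idx)

# Each of the ~10 fine-tuning runs sees the same toxic rows but a different non-toxic subset.
train = pd.read_csv("Input/Ujjwal/Data/source/source_2/train_translated.csv")  # hypothetical file name
run_0 = downsample_balanced(train, run_idx=0)
```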
The Kaggle scripts for this version can be found at:
Code: inference.py
Input: step-3 model weights
Output: step-3 score files
This code can be used to run inference with the models trained in the previous step. To score a new file, replace the file located at /Input/Ujjwal/Data/source/source_1/test_foreign.csv with it.
Code: post-process.py
Input: step-3 score files
Output: with_tta.csv, without_tta.csv
The final output is generated by averaging two versions, with and without test-time augmentation (TTA); a rough sketch of the weighting follows the list below.
- Without TTA: use only the records present in the original test set; give zero weight to everything else.
- With TTA: use the records present in the original file (weight = 5.0) and the records obtained from translation (weight = 1.0).
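As a rough illustration of this weighting (the column names and file name are assumptions; the actual logic is in post-process.py), the With-TTA score is a weighted average over the original and translated copies of each test record:

```python
import pandas as pd

# Hypothetical per-record score file: one row per (id, variant) with a model score.
scores = pd.read_csv("Output/Predictions/Ujjwal/step_3_scores.csv")  # columns: id, is_original, toxic

# Without TTA: keep only the original records (translated copies get zero weight).
without_tta = scores[scores["is_original"] == 1].groupby("id")["toxic"].mean()

# With TTA: original records get weight 5, translated copies get weight 1.
scores["weight"] = scores["is_original"].map({1: 5.0, 0: 1.0})
weighted = scores.assign(w_score=scores["toxic"] * scores["weight"])
with_tta = weighted.groupby("id").apply(lambda g: g["w_score"].sum() / g["weight"].sum())

# The final MLM output is a simple average of the two versions.
final = (without_tta + with_tta) / 2.0
```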
These files are available in the output folder. A simple average of these files scores 0.9460 and 0.9446 on the public and private leaderboards respectively. These files are then combined with the scores from my teammates in the final blend.