This is the first-place solution to the DataSolve 2022 competition organized by Wolters Kluwer.
Weights & Biases Dashboard Link
- HuggingFace Transformers and Datasets libraries.
- HuggingFace Trainer and PyTorch.
- XGBoost and CatBoost libraries.
- Weights & Biases for experiment tracking.
- The transformer-based models were trained primarily on jarvislabs.ai and lambdalabs.com, on RTX 5000, A5000, and A100 GPU instances.
- The XGBoost and CatBoost models were trained on Kaggle's P100 GPU.
- Data preprocessing notebook → this notebook contains the EDA and the cross-validation setup used for the competition. The processed and split data is saved in the dataset here.
- Transformer train-test-split pipeline → transformer-based pipeline. Note that this pipeline uses a simple split of the train set for evaluation, which is not the most robust way to evaluate models and carries a high risk of overfitting to the validation set. It was used only for quick-and-dirty experiments; the final experiments were run with the 5-fold pipeline for more reliable results. The same notebook can be found in the GitHub repo too.
- Transformer 5-fold pipeline → all the transformer models used in the final ensemble were trained with this notebook. The pipeline runs 5-fold training; the folds are stratified with the iterstrat package, which supports stratification of multi-label data. Specifically, `MultilabelStratifiedKFold` was used to create the folds (a minimal sketch of the fold setup is included at the end of this writeup). The complete data preprocessing and preparation stage can also be reviewed in this notebook.
- XGBoost pipeline → traditional approaches such as XGBoost on TF-IDF/CountVectorizer features were also used to add diversity to the final ensemble (see the TF-IDF + XGBoost sketch at the end of this writeup). This pipeline reuses the same 5 folds as the transformer pipeline, which makes the results comparable and keeps the ensemble leak-free. The same notebook can be found in the GitHub repo too.
- CatBoost pipeline → replaces the XGBoost model with CatBoost (trained on CPU, since CatBoost does not yet support multi-label training on GPU); a minimal sketch is included at the end of this writeup. The same notebook can be found in the GitHub repo too.
- Hill climbing ensemble → the final leaderboard score (0.92276 private LB) was obtained from this notebook. It uses a hill-climbing algorithm to select the final models and their weights so that the blend achieves the best overall score on the cross-validation setup, and then takes the weighted average of their predictions (a sketch of the hill-climbing search is included at the end of this writeup).
- Many experiments were run over the course of the competition. To keep track of them all, I used wandb.ai; the W&B dashboard can be accessed here. All the configurations, console logs, and saved artifacts/models can be viewed there, as well as the code that went into each experiment, for example the best single model's code.
- To download the out-of-fold (OOF) predictions, test-set predictions, and submission files saved for each experiment, this notebook was used; its output serves as the input dataset for the hill climbing ensemble notebook.
- Even though model efficiency was not the main aim of the competition, out of curiosity I also tried knowledge distillation to make the base model (much smaller and easier to deploy) as performant as the large model (or an ensemble of models), enabling faster inference that is ideally suited for deployment environments. However, due to limited time I could not make it work. This is the corresponding notebook; a rough sketch of the distillation-loss idea is included at the end of this writeup.
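
For reference, here is a minimal sketch of the fold setup described in the 5-fold pipeline bullet above. It assumes the processed data lives in a `train.csv` with a `text` column and one binary column per label; those names are placeholders, not the actual competition schema.

```python
import pandas as pd
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

# "train.csv" and the column layout are assumptions for illustration:
# one "text" column plus one binary column per label.
df = pd.read_csv("train.csv")
label_cols = [c for c in df.columns if c != "text"]

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
df["fold"] = -1
for fold, (_, valid_idx) in enumerate(mskf.split(df, df[label_cols])):
    df.loc[valid_idx, "fold"] = fold  # each row is assigned its validation fold
```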
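A minimal sketch of the TF-IDF + XGBoost baseline, shown for fold 0 only and reusing `df` and `label_cols` from the previous sketch. Wrapping `XGBClassifier` in `OneVsRestClassifier` and the hyperparameters shown are illustrative assumptions rather than the exact configuration in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from xgboost import XGBClassifier

# hold out fold 0 for validation; `df` and `label_cols` come from the previous sketch
train_df = df[df["fold"] != 0]
valid_df = df[df["fold"] == 0]

vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_df["text"])
X_valid = vectorizer.transform(valid_df["text"])

# one-vs-rest fits one binary XGBoost model per label
model = OneVsRestClassifier(XGBClassifier(n_estimators=500, tree_method="gpu_hist"))
model.fit(X_train, train_df[label_cols].values)
valid_probs = model.predict_proba(X_valid)  # out-of-fold probabilities for fold 0
```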
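The CatBoost variant swaps the XGBoost model for `CatBoostClassifier` with the `MultiLogloss` objective, which accepts a multi-label target matrix but runs on CPU only. Variables are reused from the previous sketch; the iteration count is a placeholder.

```python
from catboost import CatBoostClassifier

# MultiLogloss is CatBoost's multi-label objective; it is not supported on GPU.
cat_model = CatBoostClassifier(
    loss_function="MultiLogloss",
    iterations=1_000,  # placeholder value
    task_type="CPU",
    verbose=100,
)
cat_model.fit(
    X_train, train_df[label_cols].values,
    eval_set=(X_valid, valid_df[label_cols].values),
)
cat_probs = cat_model.predict_proba(X_valid)  # per-label probabilities for fold 0
```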
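The hill-climbing ensemble can be sketched as a greedy search over the OOF predictions: start from the best single model and repeatedly add whichever model (repeats allowed, which yields implicit integer weights) most improves the blended cross-validation score. The snippet assumes a dict `oof_preds` mapping model names to OOF probability matrices, a binary label matrix `y_true`, and micro F1 at a 0.5 threshold as the metric; the actual competition metric and weighting scheme may differ.

```python
from sklearn.metrics import f1_score

def score(probs, y_true):
    """Micro F1 at a fixed 0.5 threshold (assumed metric)."""
    return f1_score(y_true, (probs >= 0.5).astype(int), average="micro")

def hill_climb(oof_preds, y_true, max_steps=50):
    names = list(oof_preds)
    # start from the single best model on the OOF predictions
    current = max(names, key=lambda n: score(oof_preds[n], y_true))
    selected = [current]
    blend = oof_preds[current].copy()
    best = score(blend, y_true)
    for _ in range(max_steps):
        # try adding every model (repeats allowed -> implicit integer weights)
        trial = {
            n: score((blend * len(selected) + oof_preds[n]) / (len(selected) + 1), y_true)
            for n in names
        }
        candidate = max(trial, key=trial.get)
        if trial[candidate] <= best:
            break  # no addition improves the blend; stop climbing
        blend = (blend * len(selected) + oof_preds[candidate]) / (len(selected) + 1)
        selected.append(candidate)
        best = trial[candidate]
    return selected, best  # selected model names (with repeats) and the blended CV score
```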
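Finally, a rough, untested sketch of the kind of multi-label distillation loss the last bullet refers to: a weighted mix of the usual BCE against the hard labels and a BCE against the teacher's (or ensemble's) soft probabilities. The `alpha` weight and the exact loss form are assumptions, not what the notebook necessarily used.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, labels, alpha=0.5):
    """Blend of hard-label BCE and soft-target BCE.

    All inputs are float tensors of shape (batch, n_labels); alpha is an
    assumed mixing weight between hard and soft targets.
    """
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    return alpha * hard + (1.0 - alpha) * soft
```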