Low resource text classification
Welcome to the repo for our final class project for CS 505 (NLP). In this project we tackle the Malawi News Classification dataset. Within a limited time span, we tested several techniques: data augmentation, creating and fine-tuning a better embedding space with Transformer-based models, and some data science techniques to boost performance in feature space.
See project presentation here
Baseline models:
- Support Vector Machines
- Random Forests
- XGBoost
- Multi-layer Perceptron
- Logistic Regression
To generate classification results from all the models:
python3.9 experiments/main.py -<data_dir> -<embedding_file>
- data_dir : Directory where the training data is located (text)
- embedding_file : Name of the embedding file
- The results will be generated as a CSV file in this location
- Mixup - Script
python mixUp.py -<train_data_dir> -<embeddings type>
- "Embeddings type" is the kind of embeddings to use when augmenting the data
- Mixup-augmented data will be generated in this location
- NLPAug - Script
- Manual News Scraping - Data
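The Mixup augmentation listed above can be illustrated with a minimal, dependency-free sketch. It assumes Mixup is applied to precomputed embedding vectors with one-hot labels, and that the mixing weight is drawn from a Beta(alpha, alpha) distribution. The function names, the `alpha` default and the toy data are hypothetical and are not taken from mixUp.py.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha, rng):
    """Blend two (embedding, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def mixup_augment(embeddings, labels, num_classes, n_new, alpha=0.2, seed=0):
    """Create n_new synthetic examples by mixing random pairs of training examples.

    alpha=0.2 is an illustrative default, not the value used in this repo.
    """
    rng = random.Random(seed)
    # Convert integer class ids to one-hot vectors so labels can be interpolated.
    one_hot = [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]
    new_x, new_y = [], []
    for _ in range(n_new):
        i, j = rng.sample(range(len(embeddings)), 2)  # two distinct examples
        x, y = mixup_pair(embeddings[i], one_hot[i], embeddings[j], one_hot[j], alpha, rng)
        new_x.append(x)
        new_y.append(y)
    return new_x, new_y


# Toy usage: three 2-dimensional "embeddings" from two classes.
toy_x = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
toy_y = [0, 1, 1]
aug_x, aug_y = mixup_augment(toy_x, toy_y, num_classes=2, n_new=5)
```

Each synthetic label is a soft label that still sums to 1, so a classifier trained on the augmented data needs a loss that accepts soft targets.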
- Count Vectorization
- TFIDF
- English aligned Chichewa MT5 embeddings - Script
python train_mt5_contrastive.py
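As a rough illustration of what contrastive alignment means here, the sketch below computes an InfoNCE-style loss over a batch of parallel sentence embeddings. This README does not state the actual objective used by train_mt5_contrastive.py, so the loss form, the cosine similarity, the `temperature` value and the function names are all assumptions. It is written in plain Python for readability rather than with a deep learning framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(en_batch, ny_batch, temperature=0.1):
    """Mean cross-entropy of picking the true Chichewa translation for each English sentence.

    en_batch[i] and ny_batch[i] are assumed to be embeddings of a parallel
    sentence pair; every other ny_batch[j] in the batch acts as a negative.
    This is the English-to-Chichewa direction only.
    """
    n = len(en_batch)
    total = 0.0
    for i in range(n):
        logits = [cosine(en_batch[i], ny_batch[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the candidate translations.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n


# Toy check: aligned pairs should give a lower loss than swapped pairs.
en = [[1.0, 0.0], [0.0, 1.0]]
ny_aligned = [[0.9, 0.1], [0.1, 0.9]]
ny_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce(en, ny_aligned)
loss_swapped = info_nce(en, ny_swapped)
```

Minimizing a loss of this kind pulls each English sentence toward its Chichewa translation in embedding space and pushes it away from the other sentences in the batch.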
For our alignment experiment, we created our own parallel news dataset. To recreate such data, you need to:
- Download the RealNews dataset from the GROVER repo
- Split the files into smaller chunks for parallel translation (if running models), or into chunks small enough for Google Translate
./split_file_process_template.sh <input_path> <num_partition>
- Translate the files:
- If you are running on SCC and translating with the Marian English-Chichewa translation model, you can run
qsub utils/run_translation_en_ny.qsub
- If you choose to use Google Translate, the easiest free way is to convert the files into Excel sheets no bigger than 2 MB each and submit them manually. The utils directory should have a file conversion script you may find helpful.
- Once you obtain the translation files (or check /projectnb/cs505/projects/realnews on SCC), you can run alignment training with:
python experiments/train_mt5_contrastive.py
(Make sure you modify the paths to the Chichewa and English files in the main section.)
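For readers who prefer Python to the shell template, the file-splitting step above can be sketched as follows. This is not a port of split_file_process_template.sh, whose internals are not shown here. It assumes the input stores one document per line and distributes lines round-robin, so the file is streamed rather than loaded into memory. The function name and the output naming scheme are hypothetical.

```python
import tempfile
from pathlib import Path


def split_round_robin(input_path, num_partitions, out_dir):
    """Stream a line-oriented file into num_partitions smaller files."""
    src = Path(input_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <stem>.part000<suffix>, <stem>.part001<suffix>, ...
    paths = [out / f"{src.stem}.part{k:03d}{src.suffix}" for k in range(num_partitions)]
    handles = [p.open("w", encoding="utf-8") for p in paths]
    try:
        with src.open(encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Line i goes to partition i mod num_partitions.
                handles[i % num_partitions].write(line)
    finally:
        for h in handles:
            h.close()
    return paths


# Usage on a throwaway file with 10 lines split into 3 partitions.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text("".join(f"line {i}\n" for i in range(10)), encoding="utf-8")
    parts = split_round_robin(sample, 3, Path(tmp) / "parts")
    counts = [len(p.read_text(encoding="utf-8").splitlines()) for p in parts]
    part_names = [p.name for p in parts]
```

Round-robin assignment keeps the partitions within one line of each other in size, but it does not preserve contiguous ordering. If the translated chunks must be re-joined in the original order, a contiguous split would be the safer choice.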