This repository contains the code for processing the data and training a NER model for Russian for the BSNLP SlavNER 2021 shared task using spaCy
pip install -r requirements.txt
- See this article if you have trouble using a GPU to fine-tune BERT
- Add pre-trained vectors
- Add pretraining corpus
- Clone this repository
- Download the data from the BSNLP Shared Task page and put it into data/bsnlp2021_train_r1
- Run
python save_data.py $folder1 $folder2 $folder3 split_train split_dev folder4 train_file dev_file test_file
where folder1, folder2, folder3 are the names of the folders for the training and development sets, split_train and split_dev are the percentages of data used for the training and dev sets, folder4 is the name of the folder for the test set, and train_file, dev_file, test_file are the file names of the resulting spaCy binary data sets. Now you have your training data for spaCy v3.
- (optional) Run
save_pretraining.py $folder_names
where folder_names are the names of the folders you want to use for the pretraining corpus. See here for how pretraining might help you obtain better results.
- Run
python -m spacy train config_ner_ruVec_pretrain.cfg --output ./tok2vec_output
to train the model. Specify the paths to the training and development sets inside the config file before training. You may also use a pretraining corpus and choose different pretrained vectors to potentially obtain better results. By default, the vectors come from the ru_core_news_lg Russian model. You can find more information on the training command and config editing here.
- Run
python -m spacy train config_spacy_trans.cfg --output ./multilingual_output
to train the model. Specify the paths to the training and development sets inside the config file before training. You can also change the hyperparameters there and choose a different model from https://huggingface.co/models. By default the model is "bert-base-multilingual-uncased", which was used for the training.
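For the data preparation step, the role of split_train and split_dev can be pictured with a small sketch. This is not the code in save_data.py; it is an illustration that assumes the percentages are given on a 0-100 scale and that documents are shuffled before slicing. The function name split_by_percent and the seed parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_percent(
    items: Sequence[T], split_train: float, split_dev: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Return (train, dev) lists sized by percentages on a 0-100 scale.

    Hypothetical helper: the real save_data.py may split differently.
    """
    if split_train < 0 or split_dev < 0 or split_train + split_dev > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * split_train / 100)
    n_dev = int(len(shuffled) * split_dev / 100)
    # The two slices never overlap, so no document lands in both sets.
    return shuffled[:n_train], shuffled[n_train:n_train + n_dev]


docs = [f"doc_{i}" for i in range(10)]  # placeholder document names
train, dev = split_by_percent(docs, 80, 20)
print(len(train), len(dev))  # 8 2
```

If the two percentages sum to less than 100, the remaining documents are simply left out of both sets in this sketch.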
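As an alternative to editing the paths inside the config file, spaCy's train command accepts config overrides on the command line. The sketch below only builds the command; it assumes the config files define the usual paths.train and paths.dev variables in a [paths] section, which may not match the configs in this repository. The helper build_train_command and the file names train.spacy and dev.spacy are hypothetical.

```python
import sys
from typing import List, Optional


def build_train_command(
    config: str,
    output_dir: str,
    train_path: str,
    dev_path: str,
    gpu_id: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `python -m spacy train` with path overrides."""
    cmd = [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output_dir,
        # Overrides for the [paths] section; assumes the config declares them.
        "--paths.train", train_path,
        "--paths.dev", dev_path,
    ]
    if gpu_id is not None:
        # Select a GPU; when omitted, spaCy trains on CPU.
        cmd += ["--gpu-id", str(gpu_id)]
    return cmd


cmd = build_train_command(
    "config_ner_ruVec_pretrain.cfg", "./tok2vec_output", "train.spacy", "dev.spacy"
)
print(" ".join(cmd[1:]))
# To launch training: import subprocess; subprocess.run(cmd, check=True)
```

The same helper could be pointed at config_spacy_trans.cfg with gpu_id=0 for the transformer run, since fine-tuning BERT on CPU is usually impractical.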