We trained the model in 3 stages. Each stage requires different data, and all data should first be preprocessed with the preprocessing script. We used a single GPU for all stages. More details can be found in the paper.

- For stage 1 we used 9M shuffled sentences from the PIE corpus (the a1 part only).
- For stage 2 we used a shuffled combination of the NUCLE, FCE, Lang8, and W&I + LOCNESS datasets. Note that we used a dump of Lang8 which contained only 947,344 sentences (in 52.5% of them the source and target sentences differed). If you use a newer dump with more sentences, consider sampling.
- For stage 3 we used a shuffled version of the W&I + LOCNESS datasets.
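If you work from a newer, larger Lang8 dump, the sampling mentioned above could be done along the following lines. This is a minimal sketch: the `sample_lang8` helper and the in-memory pair representation are assumptions, not part of the repository.

```python
import random

def sample_lang8(pairs, target_size=947_344, target_errorful_ratio=0.525, seed=42):
    """Downsample (source, target) pairs to roughly match the dump used here:
    ~947k sentences, ~52.5% of which have source != target."""
    rng = random.Random(seed)
    errorful = [p for p in pairs if p[0] != p[1]]
    correct = [p for p in pairs if p[0] == p[1]]
    n_err = min(len(errorful), int(target_size * target_errorful_ratio))
    n_cor = min(len(correct), target_size - n_err)
    sampled = rng.sample(errorful, n_err) + rng.sample(correct, n_cor)
    rng.shuffle(sampled)  # the training data should be shuffled anyway
    return sampled
```

Fixing the seed keeps the subsample reproducible across runs.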
We used the same fixed vocabulary for all stages (vocab_path=data/output_vocabulary).
In our experiments, we used an early stopping mechanism together with a fixed maximum number of epochs:
n_epoch: 20
patience: 3
The problem with this approach is its sensitivity to random seeds, model initialization, data order, etc. The longer you train, the higher recall you get, but at the price of precision, so it is important to stop training at the right time. For reproducibility, we provide below the exact number of epochs for each model and each stage.
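The scheme above (n_epoch as an upper bound, patience as the number of epochs without dev-set improvement) can be sketched as follows; `train_one_epoch` and `evaluate` are hypothetical placeholders, not functions from the repository:

```python
def train_with_early_stopping(train_one_epoch, evaluate, n_epoch=20, patience=3):
    """Stop when the dev score has not improved for `patience` consecutive
    epochs, or after `n_epoch` epochs at most. Returns (best_score, best_epoch)."""
    best_score, best_epoch, bad_epochs = float("-inf"), -1, 0
    for epoch in range(n_epoch):
        train_one_epoch()
        score = evaluate()
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # recall keeps growing with more training, but precision drops
    return best_score, best_epoch
```

Note that a late improvement after `patience` flat epochs is never seen, which is exactly the seed/data-order sensitivity mentioned above.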
Common parameters (all stages):
tune_bert: 1
skip_correct: 1
skip_complex: 0
max_len: 50
batch_size: 64
tag_strategy: keep_one
cold_steps_count: 0
cold_lr: 1e-3
lr: 1e-5
predictor_dropout: 0.0
lowercase_tokens: 0
pieces_per_token: 5
vocab_path: data/output_vocabulary
label_smoothing: 0.0
patience: 0
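Assuming each parameter above maps to a same-named command-line flag of the training script (a convention this sketch assumes rather than guarantees), the listing can be turned into an argument list programmatically:

```python
# Common parameters, copied verbatim from the listing above.
COMMON_PARAMS = {
    "tune_bert": 1,
    "skip_correct": 1,
    "skip_complex": 0,
    "max_len": 50,
    "batch_size": 64,
    "tag_strategy": "keep_one",
    "cold_steps_count": 0,
    "cold_lr": 1e-3,
    "lr": 1e-5,
    "predictor_dropout": 0.0,
    "lowercase_tokens": 0,
    "pieces_per_token": 5,
    "vocab_path": "data/output_vocabulary",
    "label_smoothing": 0.0,
    "patience": 0,
}

def to_cli_args(params):
    """Turn a parameter dict into a flat ["--key", "value", ...] list."""
    args = []
    for key, value in params.items():
        args += [f"--{key}", str(value)]
    return args
```

Keeping the parameters in one dict makes it easy to diff the per-stage overrides against the common baseline.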
Model-specific parameters:

BERT:
transformer_model: bert
special_tokens_fix: 0

XLNet:
transformer_model: xlnet
special_tokens_fix: 0

RoBERTa:
transformer_model: roberta
special_tokens_fix: 1
Stage 1:
n_epoch: 20
cold_steps_count: 2
accumulation_size: 4
updates_per_epoch: 10000
tn_prob: 0
tp_prob: 1
pretrain: ''
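Each stage can be viewed as a set of overrides applied on top of the common and model-specific parameters. A sketch of that composition; the merge order (common, then model-specific, then stage-specific) is an assumption:

```python
# Stage 1 overrides, copied verbatim from the listing above.
STAGE1_OVERRIDES = {
    "n_epoch": 20,
    "cold_steps_count": 2,
    "accumulation_size": 4,
    "updates_per_epoch": 10000,
    "tn_prob": 0,
    "tp_prob": 1,
    "pretrain": "",  # stage 1 trains from scratch
}

def stage_config(common, model_specific, stage_overrides):
    """Merge parameter dicts; later dicts win on conflicts."""
    config = dict(common)
    config.update(model_specific)
    config.update(stage_overrides)
    return config
```

For example, the common cold_steps_count of 0 is overridden to 2 in stage 1.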
Stage 2:
cold_steps_count: 2
accumulation_size: 2
updates_per_epoch: 0
tn_prob: 0
tp_prob: 1
pretrain_folder: FOLDER_OF_BEST_MODEL_FROM_STAGE1
pretrain: BEST_MODEL_FROM_STAGE1
n_epoch: 9 or 5 (the exact value differs per model)
Stage 3:
cold_steps_count: 0
accumulation_size: 2
updates_per_epoch: 0
tn_prob: 1
tp_prob: 1
pretrain_folder: FOLDER_OF_BEST_MODEL_FROM_STAGE2
pretrain: BEST_MODEL_FROM_STAGE2
n_epoch: 4 or 3 (the exact value differs per model)
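Stages 2 and 3 each start from the best checkpoint of the previous stage (the pretrain_folder / pretrain parameters above). A sketch of how that chain might be wired; the directory layout and helper names are assumptions:

```python
def pretrain_params(prev_stage_dir, best_model_name):
    """Point the next stage at the best checkpoint of the previous stage."""
    return {"pretrain_folder": prev_stage_dir, "pretrain": best_model_name}

def chain_stages(stage_dirs, best_models):
    """Build the pretrain settings for each stage from its predecessor.

    stage_dirs / best_models are listed in stage order; stage 1 gets no
    pretrain settings because it trains from scratch (pretrain: '').
    """
    chain = [{}]
    for prev_dir, prev_best in zip(stage_dirs[:-1], best_models[:-1]):
        chain.append(pretrain_params(prev_dir, prev_best))
    return chain
```

Selecting which checkpoint is "best" should be done on the dev set, as with the epoch counts above.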
Inference parameters:
iteration_count: 5
The additional_confidence / min_error_probability pairs below are alternatives; the best values differ per model:
additional_confidence: 0
min_error_probability: 0
additional_confidence: 0.35
min_error_probability: 0.66
additional_confidence: 0.2
min_error_probability: 0.5
Note that these parameters may need to be calibrated for your model; consider tuning them on a dev set.
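The two inference knobs trade recall for precision: additional_confidence biases the model toward keeping a token unchanged, and min_error_probability suppresses corrections the model is not confident about. Below is a toy, token-level sketch of that filtering logic; the actual implementation in the repository applies these tweaks at slightly different points, so treat this only as an illustration of the idea:

```python
def apply_inference_tweaks(keep_prob, edit_probs, additional_confidence=0.2,
                           min_error_probability=0.5):
    """Decide whether to apply the most likely edit to one token.

    keep_prob: model probability of the no-change ($KEEP) tag.
    edit_probs: dict mapping edit tags to probabilities.
    Returns the chosen edit tag, or None to leave the token unchanged.
    """
    biased_keep = keep_prob + additional_confidence  # bias toward no change
    tag, prob = max(edit_probs.items(), key=lambda kv: kv[1])
    if prob <= biased_keep or prob < min_error_probability:
        return None  # not confident enough: keep the token as-is
    return tag
```

Raising either value makes the system more conservative (higher precision, lower recall), which is why the tuned values above are nonzero.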
To evaluate an ensemble, name your models like "xlnet_0_SOMETHING.th" or "roberta_1_SOMETHING.th" and pass them all via the model_path parameter. You also need to set the is_ensemble parameter.
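Assuming the checkpoint-name prefix encodes the encoder and its special_tokens_fix value (consistent with "xlnet_0_..." and "roberta_1_..." above, though the exact convention is an assumption here), the per-model settings can be recovered from the file names:

```python
import os

def parse_ensemble_name(model_path):
    """Recover (transformer_model, special_tokens_fix) from a checkpoint
    name such as "xlnet_0_SOMETHING.th" or "roberta_1_SOMETHING.th"."""
    name = os.path.basename(model_path)
    encoder, fix, _rest = name.split("_", 2)
    return encoder, int(fix)
```

This naming scheme lets one model_path list carry heterogeneous encoders without extra per-model configuration.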