Uncertainty-Aware Curriculum Learning for Neural Machine Translation

Requirements

Python version >= 3.7.4
PyTorch version >= 1.2.0

Getting Started

Data Preprocessing

We use the standard validation and test sets provided in each translation task.

Language Models

The main language models used in this paper can be obtained from the following links.

KenLM: https://github.com/kpu/kenlm
BERT as LM: https://github.com/xu-song/bert-as-language-model

Data Uncertainty

Follow these steps to calculate the data uncertainty and generate the data difficulty JSON file:

1. Calculate the perplexity of each sentence.

2. Sort the sentences by perplexity, from low to high.

3. Divide the sorted data set into several bins, according to your experimental requirements.

4. Extract 1000 sentence pairs from each bin as an estimation set, which is used to measure the model uncertainty at a given training stage.

5. Build the data difficulty JSON file. The following is an example for 4 bins, with one inner list per bin. The numbers are the indices of the sentences in a data file; train_set holds all of the training data, and esti_set holds the estimation set.

{
	"train_set": [
		[116518, 41568, 13049, ..., 39342, 23659, 76413],
		[12051, 113004, 57498, ..., 51064, 47300, 47552],
		[73186, 50806, 17741, ..., 94891, 55986, 44589],
		[69885, 114662, 32893, ..., 103985, 85597, 84899]
	],
	"esti_set": [
		[28948, 87465, 7934, ..., 7839, 89179, 55998],
		[297, 84844, 4712, ..., 112400, 105640, 47525],
		[115014, 71806, 46151, ..., 41996, 43563, 95774],
		[22106, 66255, 72142, ..., 16703, 45681, 5157]
	]
}
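Steps 2 to 5 can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the paper. It assumes the per-sentence perplexities from step 1 are already available as a list, that bins are contiguous and of near-equal size, and that the estimation sentences stay in train_set. The function names build_difficulty and save_difficulty are hypothetical.

```python
import json
import random


def build_difficulty(perplexities, num_bins=4, esti_size=1000, seed=0):
    """Build the train_set / esti_set layout from per-sentence perplexities.

    perplexities[i] is the precomputed perplexity of sentence i (step 1).
    Equal-size bins and keeping estimation indices inside train_set are
    assumptions of this sketch, not confirmed details of the original code.
    """
    n = len(perplexities)
    if not 1 <= num_bins <= n:
        raise ValueError("num_bins must be between 1 and the number of sentences")

    # Step 2: sort sentence indices from low to high perplexity.
    order = sorted(range(n), key=lambda i: perplexities[i])

    # Step 3: cut the sorted indices into contiguous bins of near-equal size.
    bounds = [k * n // num_bins for k in range(num_bins + 1)]
    train_set = [order[bounds[k]:bounds[k + 1]] for k in range(num_bins)]

    # Step 4: draw up to esti_size indices from each bin without replacement.
    rng = random.Random(seed)
    esti_set = [rng.sample(b, min(esti_size, len(b))) for b in train_set]

    # Step 5: same top-level keys as the example above.
    return {"train_set": train_set, "esti_set": esti_set}


def save_difficulty(data, path):
    """Write the difficulty dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
```

For a real corpus, call build_difficulty with the full list of perplexities and pass the result to save_difficulty; the resulting file has the same two keys as the example above.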

Training

An example training script can be found in the scripts folder. Most parameters are self-explanatory. The parameters that need to be set specifically are explained below:

--file_prefix Specify the directory of the dataset.

--difficulty_json Specify the path of the data difficulty JSON file.

--fold_name Specify the directory in which to store the models.
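Before passing a file to --difficulty_json, it can help to check that it matches the layout described under Data Uncertainty. The sketch below is a hypothetical helper, not part of this repository. It assumes that train_set and esti_set must contain the same number of bins and that every entry is a non-negative integer index.

```python
import json


def check_difficulty(data):
    """Return the number of bins, or raise ValueError on a malformed layout."""
    for key in ("train_set", "esti_set"):
        if key not in data:
            raise ValueError(f"missing key: {key}")

    train_bins, esti_bins = data["train_set"], data["esti_set"]
    # Assumption: one estimation bin per training bin, as in the 4-bin example.
    if len(train_bins) != len(esti_bins):
        raise ValueError("train_set and esti_set have different numbers of bins")

    for name, bins in (("train_set", train_bins), ("esti_set", esti_bins)):
        for k, indices in enumerate(bins):
            if not all(isinstance(i, int) and i >= 0 for i in indices):
                raise ValueError(f"{name}[{k}] contains an invalid index")
    return len(train_bins)


def load_difficulty(path):
    """Load a difficulty JSON file and validate it before training."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    check_difficulty(data)
    return data
```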

Translation

An example translation script can be found in the scripts folder.
