
Add preprocess & capability for custom dataset for mt #3269

Merged: 69 commits into PaddlePaddle:develop on Dec 6, 2022

Conversation

@FrostML (Contributor) commented on Sep 14, 2022

PR types

New features

PR changes

Others

Description

Add preprocessing support and the capability to use custom datasets for machine translation.

@FrostML marked this pull request as ready for review on September 19, 2022 03:20
```python
    default=None,
    type=str,
    help="The token used for padding. If it's None, the bos_token will be used. Defaults to None. "
```


Why is None used as the default setting? Maybe we should use `<pad>` as the default instead.

@FrostML (Contributor, Author) replied:


This is for compatibility. In the former example, Transformer used bos_token as pad_token. Hence, during training, if pad_token is None, the bos_token is used for padding; preprocessing keeps this behavior by defaulting pad_token to None.
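
For illustration, a minimal sketch of the fallback described above (the names `pad_token` and `bos_token` mirror the CLI flags; this is not the PR's exact code):

```python
# Hypothetical sketch: resolve the padding token as described in the reply.
# If --pad_token is not given (None), fall back to the bos_token for padding,
# which keeps the behavior of the former Transformer example.
def resolve_pad_token(pad_token, bos_token="<s>"):
    return bos_token if pad_token is None else pad_token

assert resolve_pad_token(None) == "<s>"        # legacy behavior
assert resolve_pad_token("<pad>") == "<pad>"   # explicit pad token wins
```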

```bash
    --bos_token "<s>" \
    --eos_token "</s>" \
    --unk_token "<unk>"
```


Is pad_token not supported in the training step, as well as in the following steps?

@FrostML (Contributor, Author) replied:


It should support pad_token. I'll fix the doc and the train/predict scripts. Thanks.

```bash
    --config ./configs/transformer.base.yaml \
    --src_vocab ${DATA_DEST_DIR}/dict.en.txt \
    --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \
    --bos_token "<s>" \
```


Are the special tokens also required when converting to a static model?

@FrostML (Contributor, Author) replied:


This is for adapting to the vocabulary size and getting the special token ids. `<unk>` is not required; I'll delete it.
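
A minimal sketch of why the export step needs these tokens: the vocabulary file is loaded to size the embedding table, and the token strings are mapped to ids (the file format and path below are assumptions):

```python
# Hypothetical sketch: read a dict file (one token per line, optionally
# followed by a frequency count) and look up the special token ids.
def load_vocab(path):
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                tokens.append(line.split()[0])
    return {tok: idx for idx, tok in enumerate(tokens)}

vocab = load_vocab("dict.en.txt")    # assumed vocab file from preprocessing
vocab_size = len(vocab)              # used to size the embedding/output layers
bos_id, eos_id = vocab["<s>"], vocab["</s>"]
```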


* `--config`: Specifies the Transformer config file to use, including model and training hyperparameters. Defaults to `transformer.big.yaml`, i.e., a Transformer Big model is trained by default.
* `--data_dir`: Specifies the path to the dataset needed for training. There is no need to give the concrete file names of the train, dev, and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev, and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. `--data_dir` takes precedence over the `--train_file`, `--dev_file`, and `--test_file` options described below.
* `--src_lang` (`-s`): The source language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.


Is the indentation here as expected?

@FrostML (Contributor, Author) replied:


Yes. `--src_lang` and `--trg_lang` are only used together with `--data_dir` to locate the train, dev, and test data files, as sketched below.
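
As an illustration of the naming convention, the file paths are built from `--data_dir`, `--src_lang`, and `--trg_lang` roughly like this (a hedged sketch, not the PR's exact code):

```python
import os

# Hypothetical sketch: construct the default train/dev/test file names
# following [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}].
def default_files(data_dir, src_lang, trg_lang):
    files = {}
    for split in ("train", "dev", "test"):
        prefix = f"{split}.{src_lang}-{trg_lang}"
        files[split] = (os.path.join(data_dir, f"{prefix}.{src_lang}"),
                        os.path.join(data_dir, f"{prefix}.{trg_lang}"))
    return files

print(default_files("data", "de", "en")["train"])
# ('data/train.de-en.de', 'data/train.de-en.en')
```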

```bash
python train.py \
    --config ../configs/transformer.base.yaml \
    --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \
    --src_vocab ${DATA_DEST_DIR}/dict.en.txt \
```


Is dev_file not supported in static mode?

@FrostML (Contributor, Author) replied:


True. There is still no validation step in static-mode training.

```python
parser.add_argument(
    "--test_file",
    nargs='+',
    default=None,
    type=str,
    help="The file for testing. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to process testing."
    "The files for test, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ")
```


Why do we need the target language file during testing?

@FrostML (Contributor, Author) replied:


I'll fix the help. It's not necessary to provide the target language file in testing; a sketch of the intended behavior follows.
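
A minimal sketch of the behavior after the fix, where the target file is optional when `--test_file` is given (the argument handling here is an assumption, not the PR's exact code):

```python
# Hypothetical sketch: --test_file accepts one or two paths; only the
# source-language file is required for testing.
def parse_test_files(test_file):
    if test_file is None:
        return None, None                  # fall back to the default WMT14 en-de dataset
    src = test_file[0]
    trg = test_file[1] if len(test_file) > 1 else None  # target file is optional
    return src, trg

print(parse_test_files(["test.de-en.de"]))                   # ('test.de-en.de', None)
print(parse_test_files(["test.de-en.de", "test.de-en.en"]))  # both files given
```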

The meaning of each argument is as follows:

* `--config`: Specifies the Transformer config file to use, including model and training hyperparameters. Defaults to `transformer.big.yaml`, i.e., a Transformer Big model is trained by default.
* `--data_dir`: Specifies the path to the dataset needed for training. There is no need to give the concrete file names of the train, dev, and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev, and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. `--data_dir` takes precedence over the `--train_file`, `--dev_file`, and `--test_file` options described below.


Do we need this feature during testing?

@FrostML (Contributor, Author) replied:


Not necessary; it just mirrors the training scripts.

* `--data_dir`: Specifies the path to the dataset needed for training. There is no need to give the concrete file names of the train, dev, and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev, and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. `--data_dir` takes precedence over the `--train_file`, `--dev_file`, and `--test_file` options described below.
* `--src_lang` (`-s`): The source language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.
* `--trg_lang` (`-t`): The target language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.
* `--test_file`: Specifies the path of the test dataset needed for evaluation. Set it when `--data_dir` is not provided or when the test data file names need to be specified explicitly. It can be given as a parallel pair, i.e., the paths of the source language file and the target language file in order, `--test_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`, or only the source language file may be passed in. For example, `--test_file ${DATA_DEST_DIR}/test.de-en.de ${DATA_DEST_DIR}/test.de-en.en` or `--test_file ${DATA_DEST_DIR}/test.de-en.de`.


Why do we need to provide parallel files during testing? Is there any specific scenario that requires such a feature?

@FrostML (Contributor, Author) replied:


It's not necessary; this is for compatibility. I will delete this description from the doc. Thanks for pointing it out.

@gpengzhi commented on Sep 21, 2022

To verify the correctness of our changes, could you reproduce the experimental results on the IWSLT14 de->en dataset by following the tutorial?

We could add a new config file (transformer.iwslt14.yaml) to exactly match the model (transformer_iwslt_de_en) and training configurations used in the fairseq example. Since we (fairseq and paddlenlp) share the same preprocessing script (prepare-iwslt14.sh), we should achieve similar results.

We could run this experiment on a single GPU, and it won't take too much time to converge. The final BLEU score should be around 34.85.
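
If such a config is added, the values would presumably mirror fairseq's transformer_iwslt_de_en architecture; below is a hedged sketch of the key hyperparameters (the key names are assumptions, not the actual transformer.iwslt14.yaml contents):

```python
# Hypothetical sketch of what transformer.iwslt14.yaml might contain,
# mirroring fairseq's transformer_iwslt_de_en architecture; key names assumed.
iwslt14_config = {
    "d_model": 512,       # embedding / hidden size
    "d_inner_hid": 1024,  # FFN inner size (vs. 2048 in transformer-base)
    "n_head": 4,          # attention heads (vs. 8 in transformer-base)
    "n_layer": 6,         # encoder/decoder layers
    "dropout": 0.3,       # fairseq's IWSLT14 example trains with higher dropout
}
```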

```bash
bash prepare-wmt14en2fr.sh
```

After the data processing is finished, the preprocessing approach mentioned above can likewise be used to build the vocabulary and complete preprocessing. The following again takes preprocessing the WMT14 EN-DE translation dataset as an example:


If we decide to use WMT14 en-de as an example, why do we need to mention IWSLT14 de-en above?

@FrostML (Contributor, Author) replied:


Thanks. This should be fixed.


This example omits the steps for downloading and processing custom data; if needed, see the previous document [Using a custom translation dataset](../README.md).

This example uses the already-processed IWSLT14 data.


If we want to use IWSLT14 as an example, transformer-base may be too large for this dataset. To keep the instructions consistent, we could simply use WMT14 en-de as the example.

@FrostML (Contributor, Author) replied:


Fixed. Thanks.

@FrostML (Contributor, Author) commented on Nov 30, 2022:

@guoshengCS

@guoshengCS previously approved these changes on Dec 2, 2022.

PaddlePaddle locked and limited the conversation to collaborators on Dec 6, 2022, and unlocked it the same day.

@FrostML merged commit d364d5b into PaddlePaddle:develop on Dec 6, 2022.