
Add preprocess & capability for custom dataset for mt #3269

Merged: 69 commits into PaddlePaddle:develop on Dec 6, 2022

Conversation

@FrostML (Contributor) commented on Sep 14, 2022

PR types

New features

PR changes

Others

Description

Add preprocessing support and the capability to use custom datasets for machine translation.

@FrostML marked this pull request as ready for review on September 19, 2022 03:20
```python
    default=None,
    type=str,
    help="The token used for padding. If it's None, the bos_token will be used. Defaults to None. "
```


Why is None used as the default setting? Maybe we should use `<pad>` as the default instead.

@FrostML (Contributor, Author) replied:


This is for compatibility. In the former example, Transformer used bos_token as pad_token. Hence, during training, if pad_token is None, the bos_token is used for padding; preprocessing keeps this behavior by defaulting pad_token to None.
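
For illustration, a minimal sketch of the fallback described above (the names `pad_token` and `bos_token` mirror the CLI flags; this is not the PR's exact code):

```python
# Hypothetical sketch: resolve the padding token as described in the reply.
# If --pad_token is not given (None), fall back to the bos_token for padding,
# which keeps the behavior of the former Transformer example.
def resolve_pad_token(pad_token, bos_token="<s>"):
    return bos_token if pad_token is None else pad_token

assert resolve_pad_token(None) == "<s>"        # legacy behavior
assert resolve_pad_token("<pad>") == "<pad>"   # explicit pad token wins
```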

```bash
    --bos_token "<s>" \
    --eos_token "</s>" \
    --unk_token "<unk>"
```


Is pad_token not supported in the training step, as well as in the following steps?

@FrostML (Contributor, Author) replied:


It should support pad_token. I'll fix the doc and the train/predict scripts. Thanks.

```bash
    --config ./configs/transformer.base.yaml \
    --src_vocab ${DATA_DEST_DIR}/dict.en.txt \
    --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \
    --bos_token "<s>" \
```


Are the special tokens also required when converting to a static model?

@FrostML (Contributor, Author) replied:


This is for adapting to the vocabulary size and getting the special token ids. `<unk>` is not required; I'll delete it.
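
A minimal sketch of why the export step needs these tokens: the vocabulary file is loaded to size the embedding table, and the token strings are mapped to ids (the file format and path below are assumptions):

```python
# Hypothetical sketch: read a dict file (one token per line, optionally
# followed by a frequency count) and look up the special token ids.
def load_vocab(path):
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                tokens.append(line.split()[0])
    return {tok: idx for idx, tok in enumerate(tokens)}

vocab = load_vocab("dict.en.txt")    # assumed vocab file from preprocessing
vocab_size = len(vocab)              # used to size the embedding/output layers
bos_id, eos_id = vocab["<s>"], vocab["</s>"]
```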


* `--config`: Specifies the Transformer config file to use, including model and training hyperparameters. Defaults to `transformer.big.yaml`, i.e., a Transformer Big model is trained by default.
* `--data_dir`: Specifies the path to the dataset needed for training. There is no need to give the concrete file names of the train, dev, and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev, and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. `--data_dir` takes precedence over the `--train_file`, `--dev_file`, and `--test_file` options described below.
* `--src_lang` (`-s`): The source language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.


Is the indentation here as expected?

@FrostML (Contributor, Author) replied:


Yes. `--src_lang` and `--trg_lang` are only used together with `--data_dir` to locate the train, dev, and test data files, as sketched below.
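
As an illustration of the naming convention, the file paths are built from `--data_dir`, `--src_lang`, and `--trg_lang` roughly like this (a hedged sketch, not the PR's exact code):

```python
import os

# Hypothetical sketch: construct the default train/dev/test file names
# following [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}].
def default_files(data_dir, src_lang, trg_lang):
    files = {}
    for split in ("train", "dev", "test"):
        prefix = f"{split}.{src_lang}-{trg_lang}"
        files[split] = (os.path.join(data_dir, f"{prefix}.{src_lang}"),
                        os.path.join(data_dir, f"{prefix}.{trg_lang}"))
    return files

print(default_files("data", "de", "en")["train"])
# ('data/train.de-en.de', 'data/train.de-en.en')
```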

```bash
python train.py \
    --config ../configs/transformer.base.yaml \
    --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \
    --src_vocab ${DATA_DEST_DIR}/dict.en.txt \
```


Is dev_file not supported in static mode?

@FrostML (Contributor, Author) replied:


True. There is still no validation step in static-mode training.

```python
parser.add_argument(
    "--test_file",
    nargs='+',
    default=None,
    type=str,
    help="The file for testing. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to process testing."
    "The files for test, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ")
```


Why do we need the target language file during testing?

@FrostML (Contributor, Author) replied:


I'll fix the help. It's not necessary to provide the target language file in testing; a sketch of the intended behavior follows.
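
A minimal sketch of the behavior after the fix, where the target file is optional when `--test_file` is given (the argument handling here is an assumption, not the PR's exact code):

```python
# Hypothetical sketch: --test_file accepts one or two paths; only the
# source-language file is required for testing.
def parse_test_files(test_file):
    if test_file is None:
        return None, None                  # fall back to the default WMT14 en-de dataset
    src = test_file[0]
    trg = test_file[1] if len(test_file) > 1 else None  # target file is optional
    return src, trg

print(parse_test_files(["test.de-en.de"]))                   # ('test.de-en.de', None)
print(parse_test_files(["test.de-en.de", "test.de-en.en"]))  # both files given
```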

The meaning of each argument is as follows:

* `--config`: Specifies the Transformer config file to use, including model and training hyperparameters. Defaults to `transformer.big.yaml`, i.e., a Transformer Big model is trained by default.
* `--data_dir`: Specifies the path to the dataset needed for training. There is no need to give the concrete file names of the train, dev, and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev, and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. `--data_dir` takes precedence over the `--train_file`, `--dev_file`, and `--test_file` options described below.


Do we need this feature during testing?

@FrostML (Contributor, Author) replied:


Not necessary; it just mirrors the training scripts.

* `--data_dir`: Specifies the path to the dataset needed for training. There is no need to give the concrete file names of the train, dev, and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev, and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. `--data_dir` takes precedence over the `--train_file`, `--dev_file`, and `--test_file` options described below.
* `--src_lang` (`-s`): The source language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.
* `--trg_lang` (`-t`): The target language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.
* `--test_file`: Specifies the path of the test dataset needed for evaluation. Set it when `--data_dir` is not provided or when the test data file names need to be specified explicitly. It can be given as a parallel pair, i.e., the paths of the source language file and the target language file in order, `--test_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`, or only the source language file may be passed in. For example, `--test_file ${DATA_DEST_DIR}/test.de-en.de ${DATA_DEST_DIR}/test.de-en.en` or `--test_file ${DATA_DEST_DIR}/test.de-en.de`.


Why do we need to provide parallel files during testing? Is there any specific scenario that requires such a feature?

@FrostML (Contributor, Author) replied:


It's not necessary; this is for compatibility. I will delete this description from the doc. Thanks for pointing it out.

@gpengzhi commented on Sep 21, 2022

To verify the correctness of our changes, could you reproduce the experimental results on the IWSLT14 de->en dataset by following the tutorial?

We could add a new config file (transformer.iwslt14.yaml) to exactly match the model (transformer_iwslt_de_en) and training configurations used in the fairseq example. Since we (fairseq and paddlenlp) share the same preprocessing script (prepare-iwslt14.sh), we should achieve similar results.

We could run this experiment on a single GPU, and it won't take too much time to converge. The final BLEU score should be around 34.85.
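
If such a config is added, the values would presumably mirror fairseq's transformer_iwslt_de_en architecture; below is a hedged sketch of the key hyperparameters (the key names are assumptions, not the actual transformer.iwslt14.yaml contents):

```python
# Hypothetical sketch of what transformer.iwslt14.yaml might contain,
# mirroring fairseq's transformer_iwslt_de_en architecture; key names assumed.
iwslt14_config = {
    "d_model": 512,       # embedding / hidden size
    "d_inner_hid": 1024,  # FFN inner size (vs. 2048 in transformer-base)
    "n_head": 4,          # attention heads (vs. 8 in transformer-base)
    "n_layer": 6,         # encoder/decoder layers
    "dropout": 0.3,       # fairseq's IWSLT14 example trains with higher dropout
}
```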

```bash
bash prepare-wmt14en2fr.sh
```

After the data processing is finished, the preprocessing approach mentioned above can likewise be used to build the vocabulary and complete preprocessing. The following again takes preprocessing the WMT14 EN-DE translation dataset as an example:


If we decide to use WMT14 en-de as an example, why do we need to mention IWSLT14 de-en above?

@FrostML (Contributor, Author) replied:


Thanks. This should be fixed.


This example omits the steps for downloading and processing custom data; if needed, see the previous document [Using a custom translation dataset](../README.md).

This example uses the already-processed IWSLT14 data.


If we want to use IWSLT14 as an example, transformer-base may be too large for this dataset. To keep the instructions consistent, we could simply use WMT14 en-de as the example.

@FrostML (Contributor, Author) replied:


Fixed. Thanks.

@FrostML (Contributor, Author) commented on Nov 30, 2022:

@guoshengCS

@guoshengCS previously approved these changes on Dec 2, 2022.

PaddlePaddle locked and limited the conversation to collaborators on Dec 6, 2022, and unlocked it the same day.

@FrostML merged commit d364d5b into PaddlePaddle:develop on Dec 6, 2022.