Add preprocess & capability for custom dataset for mt #3269
Conversation
default=None,
type=str,
help=
"The token used for padding. If it's None, the bos_token will be used. Defaults to None. "
Why is `None` used as the default? Maybe we should use `<pad>` as the default instead.
This is for compatibility concerns. In the former example, Transformer used `bos_token` as `pad_token`. Hence, during training for now, if `pad_token` is `None`, the `bos_token` will be used for padding. To preserve this behavior, preprocessing keeps the default `pad_token` as `None`.
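A minimal sketch of the fallback behavior described above (the function name is illustrative, not the PR's actual code):

```python
def resolve_pad_token(pad_token, bos_token="<s>"):
    """Return the token to use for padding.

    When pad_token is None, fall back to bos_token, matching the
    behavior of the former Transformer example.
    """
    return pad_token if pad_token is not None else bos_token


print(resolve_pad_token(None))     # falls back to "<s>"
print(resolve_pad_token("<pad>"))  # an explicit pad token wins
```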
--bos_token "<s>" \
--eos_token "</s>" \
--unk_token "<unk>"
```
Is `pad_token` not supported in the training step, as well as the following steps?
It should support `pad_token`. I'll fix the doc and train/predict. Thanks.
--config ./configs/transformer.base.yaml \
--src_vocab ${DATA_DEST_DIR}/dict.en.txt \
--trg_vocab ${DATA_DEST_DIR}/dict.de.txt \
--bos_token "<s>" \
Are the special tokens also required when converting to a static model?
This is to adapt the vocabulary size and get the special token ids. `<unk>` is not required; I'll delete it.
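To illustrate the point above, a hypothetical sketch (not the PR's code) of how special tokens can be used to recover their ids and the vocabulary size:

```python
def special_token_ids(vocab, bos_token="<s>", eos_token="</s>"):
    """Look up the bos/eos ids in a list-based vocabulary and
    return (vocab_size, bos_id, eos_id)."""
    bos_id = vocab.index(bos_token)
    eos_id = vocab.index(eos_token)
    return len(vocab), bos_id, eos_id


vocab = ["<s>", "</s>", "<unk>", "hello", "world"]
print(special_token_ids(vocab))  # (5, 0, 1)
```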
* `--config`: Specifies the Transformer config file to use, covering model and training hyperparameters. Defaults to `transformer.big.yaml`, i.e., the Transformer Big model is trained by default.
* `--data_dir`: Specifies the path of the dataset needed for training. There is no need to provide the concrete file names of the train, dev and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. Note that `--data_dir` takes precedence over the `--train_file`, `--dev_file` and `--test_file` options mentioned below.
* `--src_lang`(`-s`): The source language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.
Is the indentation here as expected?
Yes. `--src_lang` and `--trg_lang` are only used together with `--data_dir` to specify the train, dev and test data files.
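A hypothetical sketch of how the file names could be derived from `--data_dir`, `--src_lang` and `--trg_lang`, following the naming convention stated above (the helper name is illustrative):

```python
import os


def dataset_files(data_dir, src_lang, trg_lang, split):
    """Build the (source, target) file paths for one split,
    using the [split].{src}-{trg}.[src|trg] convention."""
    pair = f"{src_lang}-{trg_lang}"
    return (
        os.path.join(data_dir, f"{split}.{pair}.{src_lang}"),
        os.path.join(data_dir, f"{split}.{pair}.{trg_lang}"),
    )


print(dataset_files("data", "de", "en", "train"))
```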
python train.py \
--config ../configs/transformer.base.yaml \
--train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \
--src_vocab ${DATA_DEST_DIR}/dict.en.txt \
Is `dev_file` not supported in static mode?
True. There is still no validation step in static-mode training.
parser.add_argument(
    "--test_file",
    nargs='+',
    default=None,
    type=str,
    help=
    "The file for testing. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to process testing."
    "The files for test, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. "
Why do we need a target language file during testing?
I'll fix the `help`. It's not necessary to provide a target language file for testing.
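Since the quoted option uses `nargs='+'`, passing only the source-language file is already valid; a minimal self-contained sketch (file name is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--test_file",
    nargs='+',
    default=None,
    type=str,
    help="The file(s) for test: source language file, optionally followed "
         "by the target language file.")

# One path is enough: nargs='+' collects it into a one-element list.
args = parser.parse_args(["--test_file", "test.de-en.de"])
print(args.test_file)  # ['test.de-en.de']
```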
The meanings of the parameters are as follows:

* `--config`: Specifies the Transformer config file to use, covering model and training hyperparameters. Defaults to `transformer.big.yaml`, i.e., the Transformer Big model is trained by default.
* `--data_dir`: Specifies the path of the dataset needed for training. There is no need to provide the concrete file names of the train, dev and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. Note that `--data_dir` takes precedence over the `--train_file`, `--dev_file` and `--test_file` options mentioned below.
Do we need this feature during testing?
Not necessary; it just keeps the same form as the train scripts.
* `--data_dir`: Specifies the path of the dataset needed for training. There is no need to provide the concrete file names of the train, dev and test files; they are constructed automatically from the languages given by `--src_lang` and `--trg_lang`. The default file names for train, dev and test are [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]. Note that `--data_dir` takes precedence over the `--train_file`, `--dev_file` and `--test_file` options mentioned below.
* `--src_lang`(`-s`): The source language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.
* `--trg_lang`(`-t`): The target language of the translation model, e.g. `de` for German, `en` for English, `fr` for French. Depends on the dataset itself.
* `--test_file`: Specifies the path of the `test` dataset used for evaluation. Set it when `--data_dir` is not provided or when the test data file names need to be given explicitly. It can be given as a pair of parallel files, the source-language file followed by the target-language file, i.e. `--test_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`, or as the source-language file alone. For example, `--test_file ${DATA_DEST_DIR}/test.de-en.de ${DATA_DEST_DIR}/test.de-en.en` or `--test_file ${DATA_DEST_DIR}/test.de-en.de`.
Why do we need to provide parallel files during testing? Is there any specific scenario that requires such a feature?
It's not necessary; this is for compatibility concerns. I will delete this description from the doc. Thanks for pointing it out.
To verify the correctness of our changes, could you reproduce the experimental results on the IWSLT14 de->en dataset by following the tutorial? We could add a new config file. We could run this experiment on a single GPU, and it won't take too much time to converge. The final BLEU score should be around 34.85.
bash prepare-wmt14en2fr.sh
```

After the data processing is done, the vocabulary can likewise be built with the preprocessing approach described above to finish preprocessing. The following again takes preprocessing the WMT14 EN-DE translation dataset as an example:
If we decide to use WMT14 en-de as an example, why do we need to mention IWSLT14 de-en above?
Thanks. This should be fixed.
This example omits the steps for downloading and processing custom data; if needed, refer to the previous document [Using a custom translation dataset](../README.md).

This example uses the processed iwslt14 data.
If we want to use IWSLT14 as an example, transformer-base may be too large for this dataset. To keep the instructions consistent, we could simply use WMT14 en-de as the example.
Fixed. Thanks.
PR types
New features
PR changes
Others
Description
Add preprocess & capability for custom dataset for mt.