diff --git a/examples/machine_translation/README.md b/examples/machine_translation/README.md new file mode 100644 index 000000000000..6d58e14e2015 --- /dev/null +++ b/examples/machine_translation/README.md @@ -0,0 +1,114 @@ +# 机器翻译 + +机器翻译(Machine Translation)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。 + +## 快速开始 + +### 环境依赖 + +使用当前机器翻译示例,需要额外安装配置以下环境: + +* attrdict +* pyyaml +* subword_nmt +* fastBPE (可选,若不使用 preprocessor.py 的 bpe 分词功能可以不需要) + +### 数据准备 + +数据准备部分分成两种模式,一种是使用 PaddleNLP 内置的已经处理好的 WMT14 EN-DE 翻译的数据集,另一种,提供了当前 Transformer demo 使用自定义数据集的方式。以下分别展开介绍。 + +#### 使用内置已经处理完成数据集 + +内置的处理好的数据集是基于公开的数据集:WMT 数据集。 + +WMT 翻译大赛是机器翻译领域最具权威的国际评测大赛,其中英德翻译任务提供了一个中等规模的数据集,这个数据集是较多论文中使用的数据集,也是 Transformer 论文中用到的一个数据集。我们也将 [WMT'14 EN-DE 数据集](http://www.statmt.org/wmt14/translation-task.html) 作为示例提供。 + +可以编写如下代码,即可自动载入处理好的上述的数据,对应的 WMT14 EN-DE 的数据集将会自动下载并且解压到 `~/.paddlenlp/datasets/WMT14ende/`。 + +``` python +datasets = load_dataset('wmt14ende', splits=('train', 'dev')) +``` + +如果使用内置的处理好的数据,那到这里即可完成数据准备一步,可以直接移步 [Transformer 翻译模型](transformer/README.md) 将详细介绍如何使用内置的数据集训一个英德翻译的 Transformer 模型。 + +#### 使用自定义翻译数据集 + +本示例同时提供了自定义数据集的方法。可参考以下执行数据处理方式: + +``` bash +# 数据下载、处理,包括 bpe 的训练 +bash preprocessor/prepare-wmt14en2de.sh --icml17 + +# 数据预处理 +DATA_DIR=examples/translation/wmt14_en_de + +python preprocessor/preprocessor.py \ + --source_lang en \ + --target_lang de \ + --train_pref $DATA_DIR/train \ + --dev_pref $DATA_DIR/dev \ + --test_pref $DATA_DIR/test \ + --dest_dir data/wmt17_en_de \ + --thresholdtgt 0 \ + --thresholdsrc 0 \ + --joined_dictionary +``` + +`preprocessor/preprocessor.py` 支持了在机器翻译中常见的数据预处理方式。在预处理 `preprocessor/preprocessor.py` 脚本中,则提供词表构建,数据集文件整理,甚至于 bpe 分词的功能(bpe 分词过程可选)。最后获取的处理完成的 train,dev,test 数据可以直接用于后面 Transformer 模型的训练、评估和推理中。具体的参数说明如下: + +* `--src_lang`(`-s`): 指明数据处理对应的源语言类型,比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。 +* `--trg_lang`(`-t`): 指明数据处理对应的目标语言的类型,比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。 +* `--train_pref`: 指明前序步骤中,下载的训练数据的路径,以及对应的文件名前缀,比如 `preprocessor/wmt14_en_de/train` 结合 `--src_lang de` 和 `--trg_lang en`,表示在 `preprocessor/wmt14_en_de/` 路径下,源语言是 `preprocessor/wmt14_en_de/train.en`,目标语言是 `preprocessor/wmt14_en_de/train.de`。 +* `--dev_pref`: 指明前序步骤中,下载的验证数据的路径,以及对应的文件名前缀。在验证集语料中,如果有的 token 在训练集中从未出现过,那么将会被 `` 替换。 +* `--test_pref`: 指明前序步骤中,下载的测试数据的路径,以及对应的文件名前缀。在测试集语料中,如果有的 token 在训练集中从未出现过,那么将会被 `` 替换。 +* `--dest_dir`: 完成数据处理之后,保存处理完成数据以及词表的路径。 +* `--threshold_src`: 在源语言中,出现频次小于 `--threshold_src` 指定的频次的 token 将会被替换成 ``。默认为 0,表示不会根据 token 出现的频次忽略 token 本身。 +* `--threshold_trg`: 在目标语言中,出现频次小于 `--threshold_trg` 指定的频次的 token 将会被替换成 ``。默认为 0,表示不会根据 token 出现的频次忽略 token 本身。 +* `--src_vocab`: 源语言词表,默认为 None,表示需要预处理步骤根据训练集语料重新生成一份词表。如果源语言与目标语言共用同一份词表,那么将使用 `--src_vocab` 指定的词表。 +* `--trg_vocab`: 目标语言词表,默认为 None,表示需要预处理步骤根据训练集语料重新生成一份词表。如果源语言与目标语言共用同一份词表,那么将使用 `--src_vocab` 指定的词表。 +* `--nwords_src`: 源语言词表最大的大小,不包括 special token。默认为 None,表示不限制。若源语言和目标语言共用同一份词表,那么将使用 `--nwords_src` 指定的大小。 +* `--nwords_trg`: 目标语言词表最大的大小,不包括 special token。默认为 None,表示不限制。若源语言和目标语言共用同一份词表,那么将使用 `--nwords_src` 指定的大小。 +* `--align_file`: 是否将平行语料文件整合成一个文件。 +* `--joined_dictionary`: 源语言和目标语言是否使用同一份词表。若不共用同一份词表,无需指定。 +* `--only_source`: 是否仅处理源语言。 +* `--dict_only`: 是否仅处理词表。若指定,则仅完成词表处理。 +* `--bos_token`: 指明翻译所用的 `bos_token`,表示一个句子开始。 +* `--eos_token`: 指明翻译所用的 `eos_token`,表示一个句子的结束。 +* `--pad_token`: 指明 `pad_token`,用于将一个 batch 内不同长度的句子 pad 到合适长度。 +* `--unk_token`: 指明 `unk_token`,用于当一个 token 在词表中未曾出现的情况,将使用 `--unk_token` 指明的字符替换。 +* `--apply_bpe`: 是否需要对数据作 bpe 分词。若指定则会在 preprocessor.py 脚本开始执行 bpe 
分词。如果是使用提供的 shell 脚本完成的数据下载,则无需设置,在 shell 脚本中会作 bpe 分词处理。 +* `--bpe_code`: 若指明 `--apply_bpe` 使用 bpe 分词,则需同时提供训练好的 bpe code 文件。 + +除了 WMT14 德英翻译数据集外,我们也提供了其他的 shell 脚本完成数据下载处理,比如 WMT14 英法翻译数据。 + +``` bash +# WMT14 英法翻译的数据下载、处理 +bash prepare-wmt14en2fr.sh +``` + +完成数据处理之后,同样也可以采用上文提到的预处理方式获取词表,完成预处理。 + +如果有或者需要使用其他的平行语料,可以自行完成下载和简单的处理。 + +在下载部分,即在 shell 脚本中,处理需要用到 [mosesdecoder](https://github.com/moses-smt/mosesdecoder) 和 [subword-nmt](https://github.com/rsennrich/subword-nmt) 这两个工具。包括: + +* 使用 `mosesdecoder/scripts/tokenizer/tokenizer.perl` 完成对词做一个初步的切分; +* 基于 `mosesdecoder/scripts/training/clean-corpus-n.perl` 完成数据的清洗; +* 使用 `subword-nmt/subword_nmt/learn_bpe.py` 完成 bpe 的学习; + +此外,基于学到的 bpe code 进行分词的操作目前提供了两种选项,其一是,可以在以上的 shell 脚本中处理完成,使用以下的工具: + +* 使用 `subword-nmt/subword_nmt/apply_bpe.py` 完成分词工作。 + +其二,也可以直接在后面的 `preprocessor/preprocessor.py` 脚本中,指明 `--apply_bpe` 完成分词操作。 + + +### 如何训一个翻译模型 + +前文介绍了如何快速开始完成翻译训练所需平行语料的准备,关于进一步的,模型训练、评估和推理部分,可以根据需要,参考对应的模型的文档: + +* [Transformer 翻译模型](transformer/README.md) + +## Acknowledge + +我们借鉴了 facebookresearch 的 [fairseq](https://github.com/facebookresearch/fairseq) 在翻译数据的预处理上优秀的设计,在此对 fairseq 作者以及其开源社区表示感谢。 diff --git a/examples/machine_translation/preprocessor/prepare-iwslt14.sh b/examples/machine_translation/preprocessor/prepare-iwslt14.sh new file mode 100644 index 000000000000..4746b27c82e7 --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-iwslt14.sh @@ -0,0 +1,135 @@ +#!/usr/bin/env bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' +git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +LC=$SCRIPTS/tokenizer/lowercase.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=10000 + +URL="http://dl.fbaipublicfiles.com/fairseq/data/iwslt14/de-en.tgz" +GZ=de-en.tgz + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=de +tgt=en +lang=de-en +prep=iwslt14.tokenized.de-en +tmp=$prep/tmp +origin=origin + +mkdir -p $origin $tmp $prep + +echo "Downloading data from ${URL}..." +cd $origin +wget "$URL" + +if [ -f $GZ ]; then + echo "Data successfully downloaded." +else + echo "Data not successfully downloaded." + exit +fi + +tar zxvf $GZ +cd .. + +echo "pre-processing train data..." 
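+# The loop below cleans each side (de/en) of the raw IWSLT training files (the
+# grep/sed filters are meant to drop leftover XML markup) and tokenizes it with the
+# Moses tokenizer (8 threads); the tokenized corpus is then length-filtered by
+# clean-corpus-n.perl (ratio 1.5, 1-175 tokens) and lowercased.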
+for l in $src $tgt; do + f=train.tags.$lang.$l + tok=train.tags.$lang.tok.$l + + cat $origin/$lang/$f | \ + grep -v '' | \ + grep -v '' | \ + grep -v '' | \ + sed -e 's///g' | \ + sed -e 's/<\/title>//g' | \ + sed -e 's/<description>//g' | \ + sed -e 's/<\/description>//g' | \ + perl $TOKENIZER -threads 8 -l $l > $tmp/$tok + echo "" +done +perl $CLEAN -ratio 1.5 $tmp/train.tags.$lang.tok $src $tgt $tmp/train.tags.$lang.clean 1 175 +for l in $src $tgt; do + perl $LC < $tmp/train.tags.$lang.clean.$l > $tmp/train.tags.$lang.$l +done + +echo "pre-processing dev/test data..." +for l in $src $tgt; do + for o in `ls $origin/$lang/IWSLT14.TED*.$l.xml`; do + fname=${o##*/} + f=$tmp/${fname%.*} + echo $o $f + grep '<seg id' $o | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -l $l | \ + perl $LC > $f + echo "" + done +done + + +echo "creating train, dev, test..." +for l in $src $tgt; do + awk '{if (NR%23 == 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/dev.$l + awk '{if (NR%23 != 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/train.$l + + cat $tmp/IWSLT14.TED.dev2010.de-en.$l \ + $tmp/IWSLT14.TEDX.dev2012.de-en.$l \ + $tmp/IWSLT14.TED.tst2010.de-en.$l \ + $tmp/IWSLT14.TED.tst2011.de-en.$l \ + $tmp/IWSLT14.TED.tst2012.de-en.$l \ + > $tmp/test.$l +done + +TRAIN=$tmp/train.en-de +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." + python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $prep/$f + done +done + +cd - diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh b/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh new file mode 100644 index 000000000000..32926c4aa96b --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh @@ -0,0 +1,163 @@ +#!/bin/bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' 
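+# subword-nmt provides learn_bpe.py and apply_bpe.py, which are used further below to
+# learn a joint BPE code of $BPE_TOKENS merge operations on the concatenated training
+# data and to segment the train/dev/test splits.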
+git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl +REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=40000 + +URLS=( + "http://statmt.org/wmt13/training-parallel-europarl-v7.tgz" + "http://statmt.org/wmt13/training-parallel-commoncrawl.tgz" + "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz" + "http://data.statmt.org/wmt17/translation-task/dev.tgz" + "http://statmt.org/wmt14/test-full.tgz" +) +FILES=( + "training-parallel-europarl-v7.tgz" + "training-parallel-commoncrawl.tgz" + "training-parallel-nc-v12.tgz" + "dev.tgz" + "test-full.tgz" +) +CORPORA=( + "training/europarl-v7.de-en" + "commoncrawl.de-en" + "training/news-commentary-v12.de-en" +) + +# This will make the dataset compatible to the one used in "Convolutional Sequence to Sequence Learning" +# https://arxiv.org/abs/1705.03122 +if [ "$1" == "--icml17" ]; then + URLS[2]="http://statmt.org/wmt14/training-parallel-nc-v9.tgz" + FILES[2]="training-parallel-nc-v9.tgz" + CORPORA[2]="training/news-commentary-v9.de-en" + OUTDIR=wmt14_en_de +else + OUTDIR=wmt17_en_de +fi + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=en +tgt=de +lang=en-de +prep=$OUTDIR +tmp=$prep/tmp +origin=origin +dev=dev/newstest2013 + +mkdir -p $origin $tmp $prep + +cd $origin + +for ((i=0;i<${#URLS[@]};++i)); do + file=${FILES[i]} + if [ -f $file ]; then + echo "$file already exists, skipping download" + else + url=${URLS[i]} + wget "$url" --no-check-certificate + if [ -f $file ]; then + echo "$url successfully downloaded." + else + echo "$url not successfully downloaded." + exit -1 + fi + if [ ${file: -4} == ".tgz" ]; then + tar zxvf $file + elif [ ${file: -4} == ".tar" ]; then + tar xvf $file + fi + fi +done +cd .. + +echo "pre-processing train data..." +for l in $src $tgt; do + rm $tmp/train.tags.$lang.tok.$l + for f in "${CORPORA[@]}"; do + cat $origin/$f.$l | \ + perl $NORM_PUNC $l | \ + perl $REM_NON_PRINT_CHAR | \ + perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l + done +done + +echo "pre-processing test data..." +for l in $src $tgt; do + if [ "$l" == "$src" ]; then + t="src" + else + t="ref" + fi + grep '<seg id' $origin/test-full/newstest2014-deen-$t.$l.sgm | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l + echo "" +done + +echo "splitting train and dev..." +for l in $src $tgt; do + awk '{if (NR%100 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/dev.$l + awk '{if (NR%100 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l +done + +TRAIN=$tmp/train.de-en +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." 
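+        # Segment this split with the BPE code learned above; the result is written to
+        # $tmp/bpe.<split>.<lang>. Train/dev are then length-filtered into $prep, while
+        # the BPE-segmented test files are copied as-is.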
+ python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f + done +done + +perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250 +perl $CLEAN -ratio 1.5 $tmp/bpe.dev $src $tgt $prep/dev 1 250 + +for L in $src $tgt; do + cp $tmp/bpe.test.$L $prep/test.$L +done + +cd - diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh b/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh new file mode 100644 index 000000000000..3fc3bc10f632 --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh @@ -0,0 +1,157 @@ +#!/bin/bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' +git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl +REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=40000 + +URLS=( + "http://statmt.org/wmt13/training-parallel-europarl-v7.tgz" + "http://statmt.org/wmt13/training-parallel-commoncrawl.tgz" + "http://statmt.org/wmt13/training-parallel-un.tgz" + "http://statmt.org/wmt14/training-parallel-nc-v9.tgz" + "http://statmt.org/wmt10/training-giga-fren.tar" + "http://statmt.org/wmt14/test-full.tgz" +) +FILES=( + "training-parallel-europarl-v7.tgz" + "training-parallel-commoncrawl.tgz" + "training-parallel-un.tgz" + "training-parallel-nc-v9.tgz" + "training-giga-fren.tar" + "test-full.tgz" +) +CORPORA=( + "training/europarl-v7.fr-en" + "commoncrawl.fr-en" + "un/undoc.2000.fr-en" + "training/news-commentary-v9.fr-en" + "giga-fren.release2.fixed" +) + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=en +tgt=fr +lang=en-fr +prep=wmt14_en_fr +tmp=$prep/tmp +origin=origin + +mkdir -p $origin $tmp $prep + +cd $origin + +for ((i=0;i<${#URLS[@]};++i)); do + file=${FILES[i]} + if [ -f $file ]; then + echo "$file already exists, skipping download" + else + url=${URLS[i]} + wget "$url" --no-check-certificate + if [ -f $file ]; then + echo "$url successfully downloaded." + else + echo "$url not successfully downloaded." + exit -1 + fi + if [ ${file: -4} == ".tgz" ]; then + tar zxvf $file + elif [ ${file: -4} == ".tar" ]; then + tar xvf $file + fi + fi +done + +gunzip giga-fren.release2.fixed.*.gz +cd .. + +echo "pre-processing train data..." 
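+# For each language, the loop below concatenates every corpus listed in CORPORA,
+# normalizes punctuation, removes non-printing characters and tokenizes with the
+# Moses tokenizer (aggressive hyphen splitting, 8 threads).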
+for l in $src $tgt; do + rm $tmp/train.tags.$lang.tok.$l + for f in "${CORPORA[@]}"; do + cat $origin/$f.$l | \ + perl $NORM_PUNC $l | \ + perl $REM_NON_PRINT_CHAR | \ + perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l + done +done + +echo "pre-processing test data..." +for l in $src $tgt; do + if [ "$l" == "$src" ]; then + t="src" + else + t="ref" + fi + grep '<seg id' $origin/test-full/newstest2014-fren-$t.$l.sgm | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l + echo "" +done + +echo "splitting train and dev..." +for l in $src $tgt; do + awk '{if (NR%1333 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/dev.$l + awk '{if (NR%1333 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l +done + +TRAIN=$tmp/train.fr-en +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." + python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f + done +done + +perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250 +perl $CLEAN -ratio 1.5 $tmp/bpe.dev $src $tgt $prep/dev 1 250 + +for L in $src $tgt; do + cp $tmp/bpe.test.$L $prep/test.$L +done + +cd - diff --git a/examples/machine_translation/preprocessor/preprocessor.py b/examples/machine_translation/preprocessor/preprocessor.py new file mode 100644 index 000000000000..bc434a764787 --- /dev/null +++ b/examples/machine_translation/preprocessor/preprocessor.py @@ -0,0 +1,326 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. + +import argparse +import os +import shutil +from itertools import zip_longest +from pprint import pprint + +from paddlenlp.data import Vocab +from paddlenlp.utils.log import logger + + +def get_preprocessing_parser(): + parser = argparse.ArgumentParser() + + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--train_pref", default=None, type=str, help="The prefix for train file and also used to save dict. " + ) + parser.add_argument( + "--dev_pref", + default=None, + type=str, + help="The prefixes for dev file and use comma to separate. " + "(words missing from train set are replaced with <unk>)", + ) + parser.add_argument( + "--test_pref", + default=None, + type=str, + help="The prefixes for test file and use comma to separate. 
" + "(words missing from train set are replaced with <unk>)", + ) + parser.add_argument( + "--dest_dir", + default="./data/", + type=str, + help="The destination dir to save processed train, dev and test file. ", + ) + parser.add_argument( + "--threshold_trg", default=0, type=int, help="Map words appearing less than threshold times to unknown. " + ) + parser.add_argument( + "--threshold_src", default=0, type=int, help="Map words appearing less than threshold times to unknown. " + ) + parser.add_argument("--src_vocab", default=None, type=str, help="Reuse given source dictionary. ") + parser.add_argument("--trg_vocab", default=None, type=str, help="Reuse given target dictionary. ") + parser.add_argument("--nwords_trg", default=None, type=int, help="The number of target words to retain. ") + parser.add_argument("--nwords_src", default=None, type=int, help="The number of source words to retain. ") + parser.add_argument("--align_file", default=None, help="An alignment file (optional). ") + parser.add_argument("--joined_dictionary", action="store_true", help="Generate joined dictionary. ") + parser.add_argument("--only_source", action="store_true", help="Only process the source language. ") + parser.add_argument( + "--dict_only", action="store_true", help="Only builds a dictionary and then exits if it's set." + ) + parser.add_argument("--bos_token", default="<s>", type=str, help="bos_token. ") + parser.add_argument("--eos_token", default="</s>", type=str, help="eos_token. ") + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The token used for padding. If it's None, the bos_token will be used. Defaults to None. ", + ) + parser.add_argument("--unk_token", default="<unk>", type=str, help="Unk token. ") + parser.add_argument("--apply_bpe", action="store_true", help="Whether to apply bpe to the files. ") + parser.add_argument( + "--bpe_code", default=None, type=str, help="The code used for bpe. Must be provided when --apply_bpe is set. " + ) + + args = parser.parse_args() + return args + + +def _train_path(lang, train_pref): + return "{}{}".format(train_pref, ("." + lang) if lang else "") + + +def _dev_path(lang, dev_pref): + return "{}{}".format(dev_pref, ("." + lang) if lang else "") + + +def _test_path(lang, test_pref): + return "{}{}".format(test_pref, ("." + lang) if lang else "") + + +def _file_name(prefix, lang): + fname = prefix + if lang is not None: + fname += ".{lang}".format(lang=lang) + return fname + + +def _dest_path(prefix, lang, dest_dir): + return os.path.join(dest_dir, _file_name(prefix, lang)) + + +def _dict_path(lang, dest_dir): + return _dest_path("dict", lang, dest_dir) + ".txt" + + +def _build_dictionary(filenames, args, src=False, trg=False): + assert src ^ trg, "src and trg cannot be both True or both False. 
" + + if not isinstance(filenames, (list, tuple)): + filenames = [filenames] + + tokens = [] + for file in filenames: + with open(file, "r") as f: + lines = f.readlines() + for line in lines: + tokens.append(line.strip().split()) + + return Vocab.build_vocab( + tokens, + max_size=args.nwords_src if src else args.nwords_trg, + min_freq=args.threshold_src if src else args.threshold_trg, + unk_token=args.unk_token, + pad_token=args.pad_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + ) + + +def _make_dataset(vocab, input_prefix, output_prefix, lang, args): + # Copy original text file to destination folder + output_text_file = _dest_path( + output_prefix + ".{}-{}".format(args.src_lang, args.trg_lang), + lang, + args.dest_dir, + ) + + shutil.copyfile(_file_name(input_prefix, lang), output_text_file) + + +def _make_all(lang, vocab, args): + if args.train_pref: + _make_dataset(vocab, args.train_pref, "train", lang, args=args) + + if args.dev_pref: + for k, dev_pref in enumerate(args.dev_pref.split(",")): + out_prefix = "dev{}".format(k) if k > 0 else "dev" + _make_dataset(vocab, dev_pref, out_prefix, lang, args=args) + + if args.test_pref: + for k, test_pref in enumerate(args.test_pref.split(",")): + out_prefix = "test{}".format(k) if k > 0 else "test" + _make_dataset(vocab, test_pref, out_prefix, lang, args=args) + + +def _align_files(args, src_vocab, trg_vocab): + assert args.train_pref, "--train_pref must be set if --align_file is specified" + src_file_name = _train_path(args.src_lang, args.train_pref) + trg_file_name = _train_path(args.trg_lang, args.train_pref) + freq_map = {} + + with open(args.align_file, "r", encoding="utf-8") as align_file: + with open(src_file_name, "r", encoding="utf-8") as src_file: + with open(trg_file_name, "r", encoding="utf-8") as trg_file: + for a, s, t in zip_longest(align_file, src_file, trg_file): + si = src_vocab.to_indices(s) + ti = trg_vocab.to_indices(t) + ai = list(map(lambda x: tuple(x.split("\t")), a.split())) + for sai, tai in ai: + src_idx = si[int(sai)] + trg_idx = ti[int(tai)] + if src_idx != src_vocab.get_unk_token_id() and trg_idx != trg_vocab.get_unk_token_id(): + assert src_idx != src_vocab.get_pad_token_id() + assert src_idx != src_vocab.get_eos_token_id() + assert trg_idx != trg_vocab.get_pad_token_id() + assert trg_idx != trg_vocab.get_eos_token_id() + if src_idx not in freq_map: + freq_map[src_idx] = {} + if trg_idx not in freq_map[src_idx]: + freq_map[src_idx][trg_idx] = 1 + else: + freq_map[src_idx][trg_idx] += 1 + + align_dict = {} + for src_idx in freq_map.keys(): + align_dict[src_idx] = max(freq_map[src_idx], key=freq_map[src_idx].get) + + with open( + os.path.join( + args.dest_dir, + "alignment.{}-{}.txt".format(args.src_lang, args.trg_lang), + ), + "w", + encoding="utf-8", + ) as f: + for k, v in align_dict.items(): + print("{} {}".format(src_vocab[k], trg_vocab[v]), file=f) + + +def main(args): + os.makedirs(args.dest_dir, exist_ok=True) + pprint(args) + + if args.apply_bpe: + import fastBPE + + bpe = fastBPE.fastBPE(args.bpe_code) + filenames = [_train_path(lang, args.train_pref) for lang in [args.src_lang, args.trg_lang]] + for k, dev_pref in enumerate(args.dev_pref.split(",")): + filenames.extend([_dev_path(lang, args.dev_pref) for lang in [args.src_lang, args.trg_lang]]) + for k, test_pref in enumerate(args.test_pref.split(",")): + filenames.extend([_test_path(lang, args.test_pref) for lang in [args.src_lang, args.trg_lang]]) + + for file in filenames: + sequences = [] + with open(file, "r") as f: + lines = 
f.readlines() + for seq in lines: + sequences.append(seq.strip()) + + bpe_sequences = bpe.apply(sequences) + os.makedirs(os.path.join(args.train_pref, "tmp_bpe"), exist_ok=True) + shutil.copyfile(file, os.path.join(args.train_pref, "tmp_bpe", os.path.split(file)[-1])) + + with open(file, "w") as f: + for bpe_seq in bpe_sequences: + f.write(bpe_seq + "\n") + + # build dictionaries + target = not args.only_source + + if not args.src_vocab and os.path.exists(_dict_path(args.src_lang, args.dest_dir)): + raise FileExistsError(_dict_path(args.src_lang, args.dest_dir)) + + if target and not args.trg_vocab and os.path.exists(_dict_path(args.trg_lang, args.dest_dir)): + raise FileExistsError(_dict_path(args.trg_lang, args.dest_dir)) + + if args.joined_dictionary: + assert ( + not args.src_vocab or not args.trg_vocab + ), "Cannot use both --src_vocab and --trg_vocab with --joined_dictionary" + + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif args.trg_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --src_vocab is not specified. " + src_vocab = _build_dictionary( + [_train_path(lang, args.train_pref) for lang in [args.src_lang, args.trg_lang]], args=args, src=True + ) + + trg_vocab = src_vocab + else: + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --src_vocab is not specified" + src_vocab = _build_dictionary([_train_path(args.src_lang, args.train_pref)], args=args, src=True) + + if target: + if args.trg_vocab: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --trg_vocab is not specified" + trg_vocab = _build_dictionary([_train_path(args.trg_lang, args.train_pref)], args=args, trg=True) + else: + trg_vocab = None + + # save dictionaries + src_vocab.save_vocabulary(_dict_path(args.src_lang, args.dest_dir)) + if target and trg_vocab is not None: + trg_vocab.save_vocabulary(_dict_path(args.trg_lang, args.dest_dir)) + + if args.dict_only: + return + + _make_all(args.src_lang, src_vocab, args) + if target: + _make_all(args.trg_lang, trg_vocab, args) + + logger.info("Wrote preprocessed data to {}".format(args.dest_dir)) + + if args.align_file: + _align_files(args, src_vocab=src_vocab, trg_vocab=trg_vocab) + + +if __name__ == "__main__": + args = get_preprocessing_parser() + main(args) diff --git a/examples/machine_translation/requirements.txt b/examples/machine_translation/requirements.txt new file mode 100644 index 000000000000..ddfd632b3409 --- /dev/null +++ b/examples/machine_translation/requirements.txt @@ -0,0 +1,3 @@ +attrdict +pyyaml +subword_nmt diff --git a/examples/machine_translation/transformer/README.md b/examples/machine_translation/transformer/README.md index 20c1ba794158..1e4b0ba74188 100644 --- a/examples/machine_translation/transformer/README.md +++ b/examples/machine_translation/transformer/README.md @@ -29,7 +29,7 @@ Transformer 中的 
Encoder 由若干相同的 layer 堆叠组成,每个 layer - Multi-Head Attention 在这里用于实现 Self-Attention,相比于简单的 Attention 机制,其将输入进行多路线性变换后分别计算 Attention 的结果,并将所有结果拼接后再次进行线性变换作为输出。参见图2,其中 Attention 使用的是点积(Dot-Product),并在点积后进行了 scale 的处理以避免因点积结果过大进入 softmax 的饱和区域。 - Feed-Forward 网络会对序列中的每个位置进行相同的计算(Position-wise),其采用的是两次线性变换中间加以 ReLU 激活的结构。 -此外,每个 sub-layer 后还施以 Residual Connection [3]和 Layer Normalization [4]来促进梯度传播和模型收敛。 +此外,每个 sub-layer 后还施以 Residual Connection [3] 和 Layer Normalization [4] 来促进梯度传播和模型收敛。 <p align="center"> <img src="images/multi_head_attention.png" height=300 hspace='10'/> <br /> @@ -38,27 +38,15 @@ Transformer 中的 Encoder 由若干相同的 layer 堆叠组成,每个 layer Decoder 具有和 Encoder 类似的结构,只是相比于组成 Encoder 的 layer ,在组成 Decoder 的 layer 中还多了一个 Multi-Head Attention 的 sub-layer 来实现对 Encoder 输出的 Attention,这个 Encoder-Decoder Attention 在其他 Seq2Seq 模型中也是存在的。 -## 环境依赖 - - attrdict - - pyyaml - -安装命令:`pip install attrdict pyyaml` - -**注意:如果需要使用混合精度训练,需要使用基于 PaddlePaddle develop 分支编译的包。** - ## 数据准备 -公开数据集:WMT 翻译大赛是机器翻译领域最具权威的国际评测大赛,其中英德翻译任务提供了一个中等规模的数据集,这个数据集是较多论文中使用的数据集,也是 Transformer 论文中用到的一个数据集。我们也将[WMT'14 EN-DE 数据集](http://www.statmt.org/wmt14/translation-task.html)作为示例提供。 - -同时,我们提供了一份已经处理好的数据集,可以编写如下代码,对应的数据集将会自动下载并且解压到 `~/.paddlenlp/datasets/WMT14ende/`。 - -``` python -datasets = load_dataset('wmt14ende', splits=('train', 'dev')) -``` +本示例可以使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译的数据进行训练、预测,也可以使用自定义数据集。数据准备部分可以参考前页文档 [使用自定义翻译数据集](../README.md)。 ## 动态图 -### 单机训练 +### 使用内置数据集进行训练 + +以下文档,介绍了使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译数据集的训练方式。 #### 单机单卡 @@ -85,11 +73,76 @@ python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py --config . 与上面的情况相似,可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中设置相应的参数。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 +### 使用自定义数据集进行训练 + +自定义数据集与内置数据集训练的方式基本上是一致的,不过需要额外提供数据文件的路径。可以参照以下文档。 + +#### 单机单卡 + +本示例这里略去自定义数据下载、处理的步骤,如果需要,可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +本示例以处理好的 WMT14 数据为例。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --dev_file ${DATA_DEST_DIR}/dev.de-en.en ${DATA_DEST_DIR}/dev.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" \ + --pad_token "<s>" +``` + +`train.py` 脚本中,各个参数的含义如下: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--train_file`: 指明训练所需要的 `train` 训练集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--train_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en`。 +* `--dev_file`: 指明训练所需要的 `dev` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--dev_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--dev_file ${DATA_DEST_DIR}/dev.de-en.de ${DATA_DEST_DIR}/dev.de-en.en`。 +* 
`--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--batch_size`: 指明训练时,一个 batch 里面,最多的 token 的数目。默认为 config 中设置的 4096。 +* `--max_iter`: 指明训练时,需要训练的最大的 step 的数目,默认为 None。表示使用 config 中指定的 `epoch: 30` 来作为最大的迭代的 epoch 的数量,而不是 step。 +* `--use_amp`: 是否使用混合精度训练。设置的类型是一个 `str`,可以是 `['true', 'false', 'True', 'False']` 中任意一个。默认不使用混合精度训练。 +* `--amp_level`: 若使用混合精度,则指明混合精度的级别。可以是 `['O1', 'O2']` 中任意一个。默认是 `O1`。 + +#### 单机多卡 + +单机多卡的执行方式与单机打卡差别不大,需要额外加上单机多卡的启动命令,如下所示: + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --dev_file ${DATA_DEST_DIR}/dev.de-en.en ${DATA_DEST_DIR}/dev.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +其余启动参数与单机单卡相同,这里不再累述。 + ### 模型推断 -#### 使用动态图预测 +#### 使用内置数据集进行预测 -以英德翻译数据为例,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: +如果是基于内置的数据集训练得到的英德翻译的模型,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: ``` sh # setting visible devices for prediction @@ -99,9 +152,44 @@ python predict.py --config ./configs/transformer.base.yaml 翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 - 需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +另外 `predict.py` 中使用的 `TransformerGenerator` 接口对于GPU预测将在适配的条件下自动切换到 `FasterTransformer` 预测加速版本(期间会进行jit编译), `FasterTransformer`的更多内容可以参考 `faster_transformer/README.md`。 + +#### 基于自定义数据集进行预测 + +本示例同样支持自定义数据集进行预测。可以参照以下文档。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python predict.py \ + --config configs/transformer.base.yaml \ + --test_file ${DATA_DEST_DIR}/test.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +以下是各个参数的含义: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 
表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--without_ft`: 本示例在预测时,支持了 GPU 的翻译预测的加速,如果不使用加速特性,可以设置 `--without_ft` 即会执行普通的 PaddlePaddle 动态图预测。 - 另外 `predict.py` 中使用的 `TransformerGenerator` 接口对于GPU预测将在适配的条件下自动切换到 `FasterTransformer` 预测加速版本(期间会进行jit编译), `FasterTransformer`的更多内容可以参考 `faster_transformer/README.md`。 +翻译结果会输出到 config 文件中 `output_file` 条目指定的文件中。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。 #### 导出静态图预测模型与预测引擎预测 @@ -115,6 +203,28 @@ python export_model.py --config ./configs/transformer.base.yaml 模型默认保存在 `infer_model/` 路径下面。可以在 `configs/` 路径下的配置文件中更改 `inference_model_dir` 配置,从而保存至自定义的路径。 +同样,因为模型导出会用到模型的词表等信息,所以如果是**自定义数据集**,仍需要传入所使用的词表。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python export_model.py \ + --config ./configs/transformer.base.yaml \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" +``` + +其中: + +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 + #### 使用 Paddle Inference API 进行推理 准备好以上模型之后,可以使用预测引擎 Paddle Inference API 进行推理。 @@ -129,7 +239,9 @@ python export_model.py --config ./configs/transformer.base.yaml ## 静态图 -### 单机训练 +在静态图中,本示例仍然可以选择内置数据集进行训练或是使用自定义数据集进行训练。 + +### 使用内置数据集进行训练 #### 单机单卡 @@ -140,7 +252,7 @@ export CUDA_VISIBLE_DEVICES=0 python train.py --config ../configs/transformer.base.yaml ``` -我们建议可以在单卡执行的时候,尝试增大 `warmup_steps`。可以修改 `configs/transformer.big.yaml` 或是 `configs/transformer.base.yaml` 中对应参数。 +建议可以在单卡执行的时候,尝试增大 `warmup_steps`。可以修改 `configs/transformer.big.yaml` 或是 `configs/transformer.base.yaml` 中对应参数。 #### 单机多卡 @@ -162,9 +274,89 @@ python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" train.py --config . 
需要注意的是,使用 fleet 的方式启动单机多卡务必设置 `--distributed`。 -#### 模型推断 +### 使用自定义数据集进行训练 + +静态图和动态图在训练脚本启动上差别不大,仍然需要指明对应的文件的位置。可以参照以下文档。 + +#### 单机单卡 + +本示例这里略去自定义数据下载、处理的步骤,如果需要,可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +本示例以处理好的 WMT14 数据为例。 + +``` bash +cd static/ +export CUDA_VISIBLE_DEVICES=0 + +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +`train.py` 脚本中,各个参数的含义如下: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--train_file`: 指明训练所需要的 `train` 训练集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--train_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--batch_size`: 指明训练时,一个 batch 里面,最多的 token 的数目。默认为 config 中设置的 4096。 +* `--max_iter`: 指明训练时,需要训练的最大的 step 的数目,默认为 None。表示使用 config 中指定的 `epoch: 30` 来作为最大的迭代的 epoch 的数量,而不是 step。 + +#### 单机多卡 + +单机多卡下,执行方式与上文所述单机单卡传入自定义数据集方式相同。因静态图多卡有两种方式执行,所以这里会多一个参数: + +* `--distributed`:(**多卡训练需要**)指明是否是使用 fleet 来启动多卡。若设置,则使用 fleet 启动多卡。具体使用方式如下。 + +##### PE 的方式启动单机多卡: +``` shell +cd static/ +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python train.py \ + --config ../configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +##### fleet 的方式启动单机多卡: +``` shell +cd static/ +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" train.py \ + --config ../configs/transformer.base.yaml \ + --distributed \ + --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +需要注意的是,使用 fleet 的方式启动单机多卡务必设置 
`--distributed`。 + +#### 使用内置数据集进行预测 -同样,以英德翻译数据为例,在静态图模式下,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: +如果是基于内置的数据集训练得到的英德翻译的模型,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: ``` sh # setting visible devices for prediction @@ -173,9 +365,46 @@ export CUDA_VISIBLE_DEVICES=0 python predict.py --config ../configs/transformer.base.yaml ``` - 由 `predict_file` 指定的文件中文本的翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 +由 `predict_file` 指定的文件中文本的翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +#### 基于自定义数据集进行预测 + +本示例同样支持自定义数据集进行预测。可以参照以下文档。 + +``` bash +cd static/ +export CUDA_VISIBLE_DEVICES=0 + +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ +python predict.py \ + --config configs/transformer.base.yaml \ + --test_file ${DATA_DEST_DIR}/test.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +以下是各个参数的含义: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--without_ft`: 本示例在预测时,支持了 GPU 的翻译预测的加速,如果不使用加速特性,可以设置 `--without_ft` 即会执行普通的 PaddlePaddle 动态图预测。 - 需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 +翻译结果会输出到 config 文件中 `output_file` 条目指定的文件中。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。 ## 使用 FasterTransformer 实现预测 @@ -202,7 +431,7 @@ BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len ## FAQ **Q:** 预测结果中样本数少于输入的样本数是什么原因 -**A:** 若样本中最大长度超过 `transformer.yaml` 中 `max_length` 的默认设置,请注意运行时增大 `--max_length` 的设置,否则超长样本将被过滤。 +**A:** 若样本中最大长度超过 `transformer.base.yaml` 或是 `transformer.big.yaml` 中 
`max_length` 的默认设置,请注意运行时增大 `max_length` 的设置,否则超长样本将被过滤。 **Q:** 预测时最大长度超过了训练时的最大长度怎么办 **A:** 由于训练时 `max_length` 的设置决定了保存模型 position encoding 的大小,若预测时长度超过 `max_length`,请调大该值,会重新生成更大的 position encoding 表。 diff --git a/examples/machine_translation/transformer/configs/transformer.base.yaml b/examples/machine_translation/transformer/configs/transformer.base.yaml index 2aab9089fd46..78ab6e385527 100644 --- a/examples/machine_translation/transformer/configs/transformer.base.yaml +++ b/examples/machine_translation/transformer/configs/transformer.base.yaml @@ -123,6 +123,8 @@ dropout: 0.1 # The flag indicating whether to share embedding and softmax weights. # Vocabularies in source and target should be same for weight sharing. weight_sharing: True +# Whether to apply pre-normalization or not. +normalize_before: True # Mixed precision training use_amp: False diff --git a/examples/machine_translation/transformer/configs/transformer.big.yaml b/examples/machine_translation/transformer/configs/transformer.big.yaml index b2f73ef93552..a5da31f84f22 100644 --- a/examples/machine_translation/transformer/configs/transformer.big.yaml +++ b/examples/machine_translation/transformer/configs/transformer.big.yaml @@ -123,6 +123,8 @@ dropout: 0.1 # The flag indicating whether to share embedding and softmax weights. # Vocabularies in source and target should be same for weight sharing. weight_sharing: True +# Whether to apply pre-normalization or not. +normalize_before: True # Mixed precision training use_amp: False diff --git a/examples/machine_translation/transformer/deploy/python/inference.py b/examples/machine_translation/transformer/deploy/python/inference.py index 161b513526c9..b4fe9a77c6c1 100644 --- a/examples/machine_translation/transformer/deploy/python/inference.py +++ b/examples/machine_translation/transformer/deploy/python/inference.py @@ -12,22 +12,20 @@ # See the License for the specific language governing permissions and # limitations under the License. +import argparse import os import sys - -import argparse -import numpy as np -import yaml -from attrdict import AttrDict from pprint import pprint import paddle +import yaml +from attrdict import AttrDict from paddle import inference from paddlenlp.utils.log import logger sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))) -import reader +import reader # noqa: E402 def parse_args(): @@ -52,12 +50,18 @@ def parse_args(): help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", ) parser.add_argument("--profile", action="store_true", help="Whether to profile. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--test_file", nargs="+", default=None, type=str, - help="The file for testing. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to process testing.", + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--save_log_path", @@ -71,6 +75,20 @@ def parse_args(): type=str, help="The vocab file. 
Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -83,6 +101,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) args = parser.parse_args() return args @@ -253,16 +277,48 @@ def do_inference(args): args.model_name = "transformer_base" if "base" in ARGS.config else "transformer_big" if ARGS.model_dir != "": args.inference_model_dir = ARGS.model_dir - args.test_file = ARGS.test_file args.save_log_path = ARGS.save_log_path - args.vocab_file = ARGS.vocab_file + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token pprint(args) if args.profile: import importlib + import tls.recorder as recorder try: diff --git a/examples/machine_translation/transformer/export_model.py b/examples/machine_translation/transformer/export_model.py index ac9a236dd49f..f23f8b0ce1e6 100644 --- a/examples/machine_translation/transformer/export_model.py +++ b/examples/machine_translation/transformer/export_model.py @@ -1,12 +1,25 @@ -import os -import yaml +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import argparse +import os from pprint import pprint -from attrdict import AttrDict import paddle - import reader +import yaml +from attrdict import AttrDict from paddlenlp.transformers import InferTransformerModel, position_encoding_init from paddlenlp.utils.log import logger @@ -29,10 +42,16 @@ def parse_args(): help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) parser.add_argument( - "--unk_token", + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", default=None, type=str, - help="The unknown token. It should be provided when use custom vocab_file. ", + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", ) parser.add_argument( "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " @@ -40,6 +59,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) args = parser.parse_args() return args @@ -61,9 +86,11 @@ def do_export(args): weight_sharing=args.weight_sharing, bos_id=args.bos_idx, eos_id=args.eos_idx, + pad_id=args.pad_idx, beam_size=args.beam_size, max_out_len=args.max_out_len, beam_search_version=args.beam_search_version, + normalize_before=args.get("normalize_before", True), rel_len=args.use_rel_len, alpha=args.alpha, ) @@ -104,10 +131,34 @@ def do_export(args): with open(yaml_file, "rt") as f: args = AttrDict(yaml.safe_load(f)) args.benchmark = ARGS.benchmark - args.vocab_file = ARGS.vocab_file - args.unk_token = ARGS.unk_token + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. 
" + ) + args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token pprint(args) do_export(args) diff --git a/examples/machine_translation/transformer/faster_transformer/README.md b/examples/machine_translation/transformer/faster_transformer/README.md index 595d3881556a..5319cc421b3c 100644 --- a/examples/machine_translation/transformer/faster_transformer/README.md +++ b/examples/machine_translation/transformer/faster_transformer/README.md @@ -58,14 +58,7 @@ transformer = FasterTransformer( #### 数据准备 -公开数据集:WMT 翻译大赛是机器翻译领域最具权威的国际评测大赛,其中英德翻译任务提供了一个中等规模的数据集,这个数据集是较多论文中使用的数据集,也是 Transformer 论文中用到的一个数据集。我们也将[WMT'14 EN-DE 数据集](http://www.statmt.org/wmt14/translation-task.html)作为示例提供。 - -同时,我们提供了一份已经处理好的数据集,可以编写如下代码,对应的数据集将会自动下载并且解压到 `~/.paddlenlp/datasets/WMT14ende/`。 - -``` python -datasets = load_dataset('wmt14ende', splits=('test')) -``` - +本示例可以使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译的数据进行训练、预测,也可以使用自定义数据集。数据准备部分可以参考前页文档 [使用自定义翻译数据集](../README.md)。 #### 模型推断 @@ -87,11 +80,14 @@ tar -zxf transformer-base-wmt_ende_bpe.tar.gz ``` sh # setting visible devices for prediction export CUDA_VISIBLE_DEVICES=0 -export FLAGS_fraction_of_gpu_memory_to_use=0.1 # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ ./decoding_gemm 8 4 8 64 38512 32 512 0 -python encoder_decoding_predict.py --config ../configs/transformer.base.yaml --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so --decoding_strategy beam_search --beam_size 5 +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --decoding_strategy beam_search \ + --beam_size 5 ``` 其中: @@ -108,7 +104,6 @@ python encoder_decoding_predict.py --config ../configs/transformer.base.yaml --d 翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `./sample/config/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 base model 的配置。 - #### 使用动态图预测(使用 float16 decoding 预测) float16 与 float32 预测的基本流程相同,不过在使用 float16 的 decoding 进行预测的时候,需要再加上 `--use_fp16_decoding` 选项,表示使用 fp16 进行预测。后按照与之前相同的方式执行即可。具体执行方式如下: @@ -116,11 +111,15 @@ float16 与 float32 预测的基本流程相同,不过在使用 float16 的 de ``` sh # setting visible devices for prediction export CUDA_VISIBLE_DEVICES=0 -export FLAGS_fraction_of_gpu_memory_to_use=0.1 # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ ./decoding_gemm 8 4 8 64 38512 32 512 1 -python encoder_decoding_predict.py --config ../configs/transformer.base.yaml --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so --use_fp16_decoding --decoding_strategy beam_search --beam_size 5 +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --use_fp16_decoding \ + --decoding_strategy beam_search \ + --beam_size 5 ``` 其中,`--config` 选项用于指明配置文件的位置,而 `--decoding_lib` 选项用于指明编译好的 FasterTransformer decoding lib 的位置。 @@ -129,6 +128,47 @@ python encoder_decoding_predict.py --config ../configs/transformer.base.yaml --d 需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 +#### 使用自定义数据集进行预测 + +如果需要使用准备好的自定义数据集进行高性能推理,同样可以通过在执行 `encoder_decoding_predict.py` 脚本时指明以下参数,从而引入自定义数据集。 + +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 
`--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 + +比如: + +``` bash +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/iwslt14.tokenized.de-en/ + +# 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 +cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ +./decoding_gemm 8 4 8 64 38512 32 512 1 + +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --use_fp16_decoding \ + --decoding_strategy beam_search \ + --beam_size 5 \ + --test_file ${DATA_DEST_DIR}/test.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dev.de-en.de \ + --trg_vocab ${DATA_DEST_DIR}/dev.de-en.en \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + #### 导出基于 FasterTransformer 的预测库使用模型文件 我们提供一个已经基于动态图训练好的 base model 的 checkpoint 以供使用,当前 checkpoint 是基于 WMT 英德翻译的任务训练。可以通过[transformer-base-wmt_ende_bpe](https://bj.bcebos.com/paddlenlp/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz)下载。 @@ -136,7 +176,10 @@ python encoder_decoding_predict.py --config ../configs/transformer.base.yaml --d 使用 C++ 预测库,首先,我们需要做的是将动态图的 checkpoint 导出成预测库能使用的模型文件和参数文件。可以执行 `export_model.py` 实现这个过程。 ``` sh -python export_model.py --config ../configs/transformer.base.yaml --decoding_strategy beam_search --beam_size 5 +python export_model.py \ + --config ../configs/transformer.base.yaml \ + --decoding_strategy beam_search \ + --beam_size 5 ``` 若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op 所需的动态库,可以参考 [文本生成高性能加速](../../../../paddlenlp/ops/README.md)。编译好后,可以在执行 `export_model.py` 时使用 `--decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so` 可以完成导入。 diff --git a/examples/machine_translation/transformer/faster_transformer/encoder_decoding_predict.py b/examples/machine_translation/transformer/faster_transformer/encoder_decoding_predict.py index 46026e3a4442..0bf680f795e1 100644 --- a/examples/machine_translation/transformer/faster_transformer/encoder_decoding_predict.py +++ b/examples/machine_translation/transformer/faster_transformer/encoder_decoding_predict.py @@ -12,27 +12,21 @@ # See the License for 
the specific language governing permissions and # limitations under the License. -import sys -import os -import numpy as np -from attrdict import AttrDict import argparse -import time +import os +import sys +from pprint import pprint +import numpy as np import paddle -import paddle.nn as nn -import paddle.nn.functional as F - import yaml -from pprint import pprint +from attrdict import AttrDict -from paddlenlp.transformers import TransformerModel -from paddlenlp.transformers import position_encoding_init from paddlenlp.ops import FasterTransformer from paddlenlp.utils.log import logger sys.path.append("../") -import reader +import reader # noqa: E402 def parse_args(): @@ -73,12 +67,18 @@ def parse_args(): parser.add_argument( "--profile", action="store_true", help="Whether to profile the performance using newstest2014 dataset. " ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--test_file", nargs="+", default=None, type=str, - help="The file for testing. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to process testing.", + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--benchmark", @@ -91,6 +91,20 @@ def parse_args(): type=str, help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -103,6 +117,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. 
", + ) args = parser.parse_args() return args @@ -145,6 +165,7 @@ def do_predict(args): weight_sharing=args.weight_sharing, bos_id=args.bos_idx, eos_id=args.eos_idx, + pad_id=args.pad_idx, decoding_strategy=args.decoding_strategy, beam_size=args.beam_size, max_out_len=args.max_out_len, @@ -230,11 +251,42 @@ def do_predict(args): args.benchmark = ARGS.benchmark if ARGS.batch_size: args.infer_batch_size = ARGS.batch_size + args.data_dir = ARGS.data_dir args.test_file = ARGS.test_file - args.vocab_file = ARGS.vocab_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token pprint(args) do_predict(args) diff --git a/examples/machine_translation/transformer/faster_transformer/export_model.py b/examples/machine_translation/transformer/faster_transformer/export_model.py index c64f96ef8e79..1bb30dc1a7e8 100644 --- a/examples/machine_translation/transformer/faster_transformer/export_model.py +++ b/examples/machine_translation/transformer/faster_transformer/export_model.py @@ -1,24 +1,31 @@ -import sys -import os -import numpy as np -from attrdict import AttrDict +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import argparse -import time +import os +import sys +from pprint import pprint import paddle -import paddle.nn as nn -import paddle.nn.functional as F - import yaml -from pprint import pprint +from attrdict import AttrDict -from paddlenlp.transformers import TransformerModel -from paddlenlp.transformers import position_encoding_init from paddlenlp.ops import FasterTransformer from paddlenlp.utils.log import logger sys.path.append("../") -import reader +import reader # noqa: E402 def parse_args(): @@ -66,10 +73,16 @@ def parse_args(): help="The vocab file. 
Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) parser.add_argument( - "--unk_token", + "--src_vocab", default=None, type=str, - help="The unknown token. It should be provided when use custom vocab_file. ", + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", ) parser.add_argument( "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " @@ -77,6 +90,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) args = parser.parse_args() return args @@ -100,6 +119,7 @@ def do_predict(args): weight_sharing=args.weight_sharing, bos_id=args.bos_idx, eos_id=args.eos_idx, + pad_id=args.pad_idx, decoding_strategy=args.decoding_strategy, beam_size=args.beam_size, max_out_len=args.max_out_len, @@ -150,10 +170,35 @@ def do_predict(args): args.topk = ARGS.topk args.topp = ARGS.topp args.benchmark = ARGS.benchmark - args.vocab_file = ARGS.vocab_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token pprint(args) do_predict(args) diff --git a/examples/machine_translation/transformer/predict.py b/examples/machine_translation/transformer/predict.py index adf0dc6e2103..8a7275150bdf 100644 --- a/examples/machine_translation/transformer/predict.py +++ b/examples/machine_translation/transformer/predict.py @@ -12,18 +12,16 @@ # See the License for the specific language governing permissions and # limitations under the License. -import os -import yaml -import logging import argparse -import numpy as np +import os from pprint import pprint -from attrdict import AttrDict import paddle -from paddlenlp.ops import TransformerGenerator - import reader +import yaml +from attrdict import AttrDict + +from paddlenlp.ops import TransformerGenerator def parse_args(): @@ -36,12 +34,18 @@ def parse_args(): action="store_true", help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. 
", ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--test_file", nargs="+", default=None, type=str, - help="The file for testing. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to process testing.", + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument("--without_ft", action="store_true", help="Whether to use FasterTransformer to do predict. ") parser.add_argument( @@ -50,6 +54,20 @@ def parse_args(): type=str, help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -62,6 +80,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) parser.add_argument( "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu", "mlu"], help="Device selected for inference." 
) @@ -116,10 +140,12 @@ def do_predict(args): weight_sharing=args.weight_sharing, bos_id=args.bos_idx, eos_id=args.eos_idx, + pad_id=args.pad_idx, beam_size=args.beam_size, max_out_len=args.max_out_len, use_ft=not args.without_ft, beam_search_version=args.beam_search_version, + normalize_before=args.get("normalize_before", True), rel_len=args.use_rel_len, # only works when using FT or beam search v2 alpha=args.alpha, # only works when using beam search v2 diversity_rate=args.diversity_rate, # only works when using FT @@ -164,12 +190,44 @@ def do_predict(args): with open(yaml_file, "rt") as f: args = AttrDict(yaml.safe_load(f)) args.benchmark = ARGS.benchmark - args.test_file = ARGS.test_file args.without_ft = ARGS.without_ft - args.vocab_file = ARGS.vocab_file + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + args.device = ARGS.device pprint(args) diff --git a/examples/machine_translation/transformer/reader.py b/examples/machine_translation/transformer/reader.py index 84d381b33fba..6f4764ad0c7e 100644 --- a/examples/machine_translation/transformer/reader.py +++ b/examples/machine_translation/transformer/reader.py @@ -12,61 +12,137 @@ # See the License for the specific language governing permissions and # limitations under the License. -import sys -import os -import io import itertools +import os +import sys from functools import partial import numpy as np -from paddle.io import BatchSampler, DataLoader, Dataset import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader + from paddlenlp.data import Pad, Vocab -from paddlenlp.datasets import load_dataset -from paddlenlp.data.sampler import SamplerHelper def min_max_filer(data, max_len, min_len=0): # 1 for special tokens. 
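    # Note: with the switch to the `language_pair` dataset in this patch, each
    # sample reaching this filter is a dict such as
    #     {"id": "0", "source": [5, 8, 13], "target": [7, 2]}
    # (token ids after `convert_samples`) rather than a (source, target) tuple,
    # so the length checks below index by key. The "+ 1" accounts for the one
    # special token added to each side (eos on the source, bos on the target).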
- data_min_len = min(len(data[0]), len(data[1])) + 1 - data_max_len = max(len(data[0]), len(data[1])) + 1 + data_min_len = min(len(data["source"]), len(data["target"])) + 1 + data_max_len = max(len(data["source"]), len(data["target"])) + 1 return (data_min_len >= min_len) and (data_max_len <= max_len) +def padding_vocab(x, args): + return (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor + + def create_data_loader(args, places=None): - if args.train_file is not None and args.dev_file is not None: - datasets = load_dataset("wmt14ende", data_files=[args.train_file, args.dev_file], splits=("train", "dev")) - elif args.train_file is None and args.dev_file is None: - datasets = load_dataset("wmt14ende", splits=("train", "dev")) + use_custom_dataset = args.train_file is not None or args.dev_file is not None or args.data_dir is not None + map_kwargs = {} + if use_custom_dataset: + data_files = {} + if args.data_dir is not None: + if os.path.exist( + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["train"] = [ + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)), + ] + if os.path.exist( + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["dev"] = [ + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)), + ] + else: + # datasets.load_dataset doesn't support tuple + if args.train_file is not None: + data_files["train"] = list(args.train_file) + if args.dev_file is not None: + data_files["dev"] = list(args.dev_file) + + from datasets import load_dataset + + if len(data_files) > 0: + for split in data_files: + if isinstance(data_files[split], (list, tuple)): + for i, path in enumerate(data_files[split]): + data_files[split][i] = os.path.abspath(data_files[split][i]) + else: + data_files[split] = os.path.abspath(data_files[split]) + + datasets = load_dataset("language_pair", data_files=data_files, split=("train", "dev")) + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --src_vocab must be specified when using custom dataset. ") + else: - raise ValueError("--train_file and --dev_file must be both or neither set. 
") + from paddlenlp.datasets import load_dataset - if args.vocab_file is not None: - src_vocab = Vocab.load_vocabulary( - filepath=args.vocab_file, unk_token=args.unk_token, bos_token=args.bos_token, eos_token=args.eos_token - ) - elif not args.benchmark: - src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["bpe"]) + datasets = load_dataset("wmt14ende", splits=("train", "dev")) + + map_kwargs["lazy"] = False + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif not args.benchmark: + src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["bpe"]) + else: + src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["benchmark"]) + + if use_custom_dataset and not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. ") else: - src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["benchmark"]) - trg_vocab = src_vocab + trg_vocab = src_vocab - padding_vocab = lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor - args.src_vocab_size = padding_vocab(len(src_vocab)) - args.trg_vocab_size = padding_vocab(len(trg_vocab)) + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) + + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx def convert_samples(sample): - source = sample[args.src_lang].split() - target = sample[args.trg_lang].split() + source = sample["source"].split() + sample["source"] = src_vocab.to_indices(source) - source = src_vocab.to_indices(source) - target = trg_vocab.to_indices(target) + target = sample["target"].split() + sample["target"] = trg_vocab.to_indices(target) - return source, target + return sample data_loaders = [(None)] * 2 for i, dataset in enumerate(datasets): - dataset = dataset.map(convert_samples, lazy=False).filter(partial(min_max_filer, max_len=args.max_length)) + dataset = dataset.map(convert_samples, **map_kwargs).filter(partial(min_max_filer, max_len=args.max_length)) batch_sampler = TransformerBatchSampler( dataset=dataset, batch_size=args.batch_size, @@ -91,7 +167,7 @@ def convert_samples(sample): prepare_train_input, bos_idx=args.bos_idx, eos_idx=args.eos_idx, - pad_idx=args.bos_idx, + pad_idx=args.pad_idx, pad_seq=args.pad_seq, dtype=args.input_dtype, ), @@ -102,46 +178,112 @@ def convert_samples(sample): def create_infer_loader(args): - if args.test_file is not None: - dataset = load_dataset("wmt14ende", data_files=[args.test_file], splits=["test"]) + use_custom_dataset = args.test_file is not None or args.data_dir is not None + map_kwargs = {} + if use_custom_dataset: + data_files = {} + if args.data_dir is not None: + if os.path.exist( + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["test"] = [ + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, 
"test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)) + if os.path.exist( + os.path.join( + args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang) + ) + ) + else None, + ] + else: + if args.test_file is not None: + # datasets.load_dataset doesn't support tuple + data_files["test"] = list(args.test_file) if isinstance(args.test_file, tuple) else args.test_file + + from datasets import load_dataset + + dataset = load_dataset("language_pair", data_files=data_files, split=("test")) + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --src_vocab must be specified when using custom dataset. ") + else: + from paddlenlp.datasets import load_dataset + dataset = load_dataset("wmt14ende", splits=("test")) - if args.vocab_file is not None: - src_vocab = Vocab.load_vocabulary( - filepath=args.vocab_file, unk_token=args.unk_token, bos_token=args.bos_token, eos_token=args.eos_token - ) - elif not args.benchmark: - src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["bpe"]) + map_kwargs["lazy"] = False + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif not args.benchmark: + src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["bpe"]) + else: + src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["benchmark"]) + + if use_custom_dataset and not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. 
") else: - src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["benchmark"]) - trg_vocab = src_vocab + trg_vocab = src_vocab - padding_vocab = lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor - args.src_vocab_size = padding_vocab(len(src_vocab)) - args.trg_vocab_size = padding_vocab(len(trg_vocab)) + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) - def convert_samples(sample): - source = sample[args.src_lang].split() - target = sample[args.trg_lang].split() + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx - source = src_vocab.to_indices(source) - target = trg_vocab.to_indices(target) + def convert_samples(sample): + source = sample["source"].split() + sample["source"] = src_vocab.to_indices(source) - return source, target + if "target" in sample.keys() and sample["target"] != "": + target = sample["target"].split() + sample["target"] = trg_vocab.to_indices(target) - dataset = dataset.map(convert_samples, lazy=False) + return sample - batch_sampler = SamplerHelper(dataset).batch(batch_size=args.infer_batch_size, drop_last=False) + dataset = dataset.map(convert_samples, **map_kwargs) data_loader = DataLoader( dataset=dataset, - batch_sampler=batch_sampler, + batch_size=args.infer_batch_size, + shuffle=False, + drop_last=False, collate_fn=partial( prepare_infer_input, bos_idx=args.bos_idx, eos_idx=args.eos_idx, - pad_idx=args.bos_idx, + pad_idx=args.pad_idx, pad_seq=args.pad_seq, dtype=args.input_dtype, ), @@ -152,21 +294,42 @@ def convert_samples(sample): def adapt_vocab_size(args): - if args.vocab_file is not None: + if args.src_vocab: src_vocab = Vocab.load_vocabulary( - filepath=args.vocab_file, unk_token=args.unk_token, bos_token=args.bos_token, eos_token=args.eos_token + filepath=args.src_vocab, bos_token=args.bos_token, eos_token=args.eos_token, pad_token=args.pad_token ) + elif not args.benchmark: + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("test")) + src_vocab = Vocab.load_vocabulary(**datasets.vocab_info["bpe"]) else: - dataset = load_dataset("wmt14ende", splits=("test")) - if not args.benchmark: - src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["bpe"]) + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("test")) + src_vocab = Vocab.load_vocabulary(**datasets.vocab_info["benchmark"]) + + if not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, bos_token=args.bos_token, eos_token=args.eos_token, pad_token=args.pad_token + ) else: - src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["benchmark"]) - trg_vocab = src_vocab + raise ValueError("The --trg_vocab must be specified when the dict is not joined. 
") + else: + trg_vocab = src_vocab + + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) - padding_vocab = lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor - args.src_vocab_size = padding_vocab(len(src_vocab)) - args.trg_vocab_size = padding_vocab(len(trg_vocab)) + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx def prepare_train_input(insts, bos_idx, eos_idx, pad_idx, pad_seq=1, dtype="int64"): @@ -174,12 +337,18 @@ def prepare_train_input(insts, bos_idx, eos_idx, pad_idx, pad_seq=1, dtype="int6 Put all padded data needed by training into a list. """ word_pad = Pad(pad_idx, dtype=dtype) - src_max_len = (max([len(inst[0]) for inst in insts]) + pad_seq) // pad_seq * pad_seq - trg_max_len = (max([len(inst[1]) for inst in insts]) + pad_seq) // pad_seq * pad_seq - src_word = word_pad([inst[0] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst[0])) for inst in insts]) - trg_word = word_pad([[bos_idx] + inst[1] + [pad_idx] * (trg_max_len - 1 - len(inst[1])) for inst in insts]) + + src_max_len = (max([len(inst["source"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + trg_max_len = (max([len(inst["target"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + src_word = word_pad( + [inst["source"] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst["source"])) for inst in insts] + ) + trg_word = word_pad( + [[bos_idx] + inst["target"] + [pad_idx] * (trg_max_len - 1 - len(inst["target"])) for inst in insts] + ) lbl_word = np.expand_dims( - word_pad([inst[1] + [eos_idx] + [pad_idx] * (trg_max_len - 1 - len(inst[1])) for inst in insts]), axis=2 + word_pad([inst["target"] + [eos_idx] + [pad_idx] * (trg_max_len - 1 - len(inst["target"])) for inst in insts]), + axis=2, ) data_inputs = [src_word, trg_word, lbl_word] @@ -192,8 +361,11 @@ def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx, pad_seq=1, dtype="int6 Put all padded data needed by beam search decoder into a list. """ word_pad = Pad(pad_idx, dtype=dtype) - src_max_len = (max([len(inst[0]) for inst in insts]) + pad_seq) // pad_seq * pad_seq - src_word = word_pad([inst[0] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst[0])) for inst in insts]) + + src_max_len = (max([len(inst["source"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + src_word = word_pad( + [inst["source"] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst["source"])) for inst in insts] + ) return [ src_word, @@ -288,7 +460,7 @@ def __init__( self._local_rank = rank self._sample_infos = [] for i, data in enumerate(self._dataset): - lens = [len(data[0]), len(data[1])] + lens = [len(data["source"]), len(data["target"])] self._sample_infos.append(SampleInfo(i, lens, self._pad_seq)) def __iter__(self): diff --git a/examples/machine_translation/transformer/static/predict.py b/examples/machine_translation/transformer/static/predict.py index 4d5a744031ae..bc5c42f7e417 100644 --- a/examples/machine_translation/transformer/static/predict.py +++ b/examples/machine_translation/transformer/static/predict.py @@ -1,20 +1,31 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse import os -import time import sys +from pprint import pprint -import argparse -import logging import numpy as np +import paddle import yaml from attrdict import AttrDict -from pprint import pprint - -import paddle from paddlenlp.transformers import InferTransformerModel sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) -import reader +import reader # noqa: E402 def cast_parameters_to_fp32(place, program, scope=None): @@ -40,12 +51,18 @@ def parse_args(): action="store_true", help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--test_file", nargs="+", default=None, type=str, - help="The file for testing. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to process testing.", + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--vocab_file", @@ -53,6 +70,20 @@ def parse_args(): type=str, help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -65,6 +96,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. 
", + ) args = parser.parse_args() return args @@ -111,6 +148,7 @@ def do_predict(args): weight_sharing=args.weight_sharing, bos_id=args.bos_idx, eos_id=args.eos_idx, + pad_id=args.pad_idx, beam_size=args.beam_size, max_out_len=args.max_out_len, ) @@ -152,11 +190,42 @@ def do_predict(args): with open(yaml_file, "rt") as f: args = AttrDict(yaml.safe_load(f)) args.benchmark = ARGS.benchmark + args.data_dir = ARGS.data_dir args.test_file = ARGS.test_file - args.vocab_file = ARGS.vocab_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token pprint(args) do_predict(args) diff --git a/examples/machine_translation/transformer/static/train.py b/examples/machine_translation/transformer/static/train.py index 7018bddc88c0..97f0f879e6ae 100644 --- a/examples/machine_translation/transformer/static/train.py +++ b/examples/machine_translation/transformer/static/train.py @@ -1,24 +1,37 @@ -import os -import time -import sys +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import argparse import logging -import numpy as np -import yaml -from attrdict import AttrDict +import os +import sys +import time from pprint import pprint +import numpy as np import paddle -import paddle.distributed.fleet as fleet import paddle.distributed as dist +import paddle.distributed.fleet as fleet +import yaml +from attrdict import AttrDict +from paddlenlp.transformers import CrossEntropyCriterion, TransformerModel from paddlenlp.utils import profiler -from paddlenlp.transformers import TransformerModel, CrossEntropyCriterion sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) -import reader -from tls.record import AverageStatistical +import reader # noqa: E402 +from tls.record import AverageStatistical # noqa: E402 FORMAT = "%(asctime)s-%(levelname)s: %(message)s" logging.basicConfig(level=logging.INFO, format=FORMAT) @@ -37,19 +50,25 @@ def parse_args(): ) parser.add_argument("--distributed", action="store_true", help="Whether to use fleet to launch. ") parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--train_file", nargs="+", default=None, type=str, - help="The files for training, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to train. ", + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--dev_file", nargs="+", default=None, type=str, - help="The files for validation, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to do validation. ", + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--vocab_file", @@ -57,6 +76,20 @@ def parse_args(): type=str, help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -69,6 +102,13 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + parser.add_argument("--weight_decay", default=None, type=float, help="Weight Decay for optimizer. ") # For benchmark. 
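    # The new --weight_decay flag is forwarded unchanged to paddle.optimizer.Adam in
    # do_train below; leaving it at the default None keeps the previous behaviour (no decay).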
parser.add_argument( @@ -128,6 +168,7 @@ def do_train(args): weight_sharing=args.weight_sharing, bos_id=args.bos_idx, eos_id=args.eos_idx, + pad_id=args.pad_idx, ) # Define loss criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx) @@ -145,6 +186,7 @@ def do_train(args): beta2=args.beta2, epsilon=float(args.eps), parameters=transformer.parameters(), + weight_decay=args.weight_decay, ) if args.is_distributed: @@ -210,7 +252,6 @@ def do_train(args): for pass_id in range(args.epoch): batch_id = 0 batch_start = time.time() - pass_start_time = batch_start for data in train_loader: # NOTE: used for benchmark and use None as default. if args.max_iter and step_idx == args.max_iter: @@ -334,12 +375,45 @@ def do_train(args): args.is_distributed = ARGS.distributed if ARGS.max_iter: args.max_iter = ARGS.max_iter + args.weight_decay = ARGS.weight_decay + + args.data_dir = ARGS.data_dir args.train_file = ARGS.train_file args.dev_file = ARGS.dev_file - args.vocab_file = ARGS.vocab_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token pprint(args) args.profiler_options = ARGS.profiler_options diff --git a/examples/machine_translation/transformer/train.py b/examples/machine_translation/transformer/train.py index 819f2fedb0b8..ac0cdff57921 100644 --- a/examples/machine_translation/transformer/train.py +++ b/examples/machine_translation/transformer/train.py @@ -12,27 +12,25 @@ # See the License for the specific language governing permissions and # limitations under the License. +import argparse +import inspect import os import time - -import yaml -import argparse -import numpy as np from pprint import pprint -from attrdict import AttrDict -import inspect +import numpy as np import paddle import paddle.distributed as dist - import reader -from paddlenlp.transformers import TransformerModel, CrossEntropyCriterion -from paddlenlp.utils.log import logger -from paddlenlp.utils import profiler - +import yaml +from attrdict import AttrDict from tls.record import AverageStatistical from tls.to_static import apply_to_static +from paddlenlp.transformers import CrossEntropyCriterion, TransformerModel +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger + def parse_args(): parser = argparse.ArgumentParser() @@ -45,19 +43,25 @@ def parse_args(): help="Whether to print logs on each cards and use benchmark vocab. 
Normally, not necessary to set --benchmark. ", ) parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--train_file", nargs="+", default=None, type=str, - help="The files for training, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to train. ", + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--dev_file", nargs="+", default=None, type=str, - help="The files for validation, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to do validation. ", + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--vocab_file", @@ -65,6 +69,20 @@ def parse_args(): type=str, help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -77,6 +95,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) parser.add_argument("--batch_size", default=None, type=int, help="The maximum tokens per batch. ") parser.add_argument( "--use_amp", @@ -95,6 +119,7 @@ def parse_args(): choices=["O1", "O2"], help="The amp level if --use_amp is on. Can be one of [O1, O2]. ", ) + parser.add_argument("--weight_decay", default=None, type=float, help="Weight Decay for optimizer. ") # For benchmark. 
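    # Example invocation with a custom parallel corpus prepared by
    # preprocessor/preprocessor.py (a sketch: the --config path, data locations and
    # vocab file name are illustrative, and a shared vocab requires
    # weight_sharing: True in the config):
    #
    #   python train.py \
    #       --config ../configs/transformer.base.yaml \
    #       --data_dir ${DATA_DEST_DIR} \
    #       --src_lang de --trg_lang en \
    #       --src_vocab ${DATA_DEST_DIR}/dict.de-en \
    #       --trg_vocab ${DATA_DEST_DIR}/dict.de-en \
    #       --bos_token "<s>" --eos_token "</s>" --unk_token "<unk>" \
    #       --weight_decay 0.01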
parser.add_argument( @@ -154,12 +179,14 @@ def do_train(args): weight_sharing=args.weight_sharing, bos_id=args.bos_idx, eos_id=args.eos_idx, + pad_id=args.pad_idx, + normalize_before=args.get("normalize_before", True), ) transformer = apply_to_static(args, transformer) # Define loss - criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx) + criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx if args.pad_idx is None else args.pad_idx) scheduler = paddle.optimizer.lr.NoamDecay(args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0) @@ -171,6 +198,7 @@ def do_train(args): beta2=args.beta2, epsilon=float(args.eps), parameters=transformer.parameters(), + weight_decay=args.weight_decay, ) else: optimizer = paddle.optimizer.Adam( @@ -180,6 +208,7 @@ def do_train(args): epsilon=float(args.eps), parameters=transformer.parameters(), use_multi_tensor=True, + weight_decay=args.weight_decay, ) # Init from some checkpoint, to resume the previous training @@ -395,12 +424,48 @@ def do_train(args): args.use_amp = False if ARGS.amp_level: args.use_pure_fp16 = ARGS.amp_level == "O2" + args.weight_decay = ARGS.weight_decay + + args.data_dir = ARGS.data_dir args.train_file = ARGS.train_file args.dev_file = ARGS.dev_file - args.vocab_file = ARGS.vocab_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token if ARGS.to_static: args.to_static = ARGS.to_static args.device = ARGS.device diff --git a/paddlenlp/data/vocab.py b/paddlenlp/data/vocab.py index aba9dc5a162c..a17810f6ca58 100644 --- a/paddlenlp/data/vocab.py +++ b/paddlenlp/data/vocab.py @@ -15,10 +15,11 @@ import collections import io import json -import numpy as np import os import warnings +import numpy as np + class Vocab(object): """ @@ -553,3 +554,26 @@ def load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eo token_to_idx, unk_token=unk_token, pad_token=pad_token, bos_token=bos_token, eos_token=eos_token, **kwargs ) return vocab + + def save_vocabulary(self, filepath): + """ + Save the :class:`Vocab` to a specific file. Can be reloaded by calling `load_vocabulary`. + + Args: + filepath (str): the path of file to save vocabulary. 
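        Examples (the paths are illustrative):
            .. code-block:: python

                from paddlenlp.data import Vocab

                vocab = Vocab.load_vocabulary('data/vocab.en', unk_token='<unk>')
                vocab.save_vocabulary('data/vocab_copy.en')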
+ """ + with open(filepath, "w") as f: + for idx in range(len(self._idx_to_token)): + f.write(self._idx_to_token[idx] + "\n") + + def get_unk_token_id(self): + return self._token_to_idx[self.unk_token] if self.unk_token is not None else self.unk_token + + def get_bos_token_id(self): + return self._token_to_idx[self.bos_token] if self.bos_token is not None else self.bos_token + + def get_eos_token_id(self): + return self._token_to_idx[self.eos_token] if self.eos_token is not None else self.eos_token + + def get_pad_token_id(self): + return self._token_to_idx[self.pad_token] if self.pad_token is not None else self.pad_token diff --git a/paddlenlp/datasets/hf_datasets/language_pair.py b/paddlenlp/datasets/hf_datasets/language_pair.py new file mode 100644 index 000000000000..85643f70959a --- /dev/null +++ b/paddlenlp/datasets/hf_datasets/language_pair.py @@ -0,0 +1,189 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import datasets + +logger = datasets.logging.get_logger(__name__) + +_DESCRIPTION = """ +LanguagePairDataset used for machine translation between any pair of languages. """ + +_URL = "https://bj.bcebos.com/paddlenlp/datasets/WMT14.en-de.tar.gz" + + +class LanguagePairConfig(datasets.BuilderConfig): + """BuilderConfig for a general LanguagePairDataset.""" + + def __init__(self, **kwargs): + """BuilderConfig for LanguagePairDataset. + + Args: + **kwargs: keyword arguments forwarded to super. + """ + super(LanguagePairConfig, self).__init__(**kwargs) + + +class LanguagePairDataset(datasets.GeneratorBasedBuilder): + BUILDER_CONFIGS = [ + LanguagePairConfig( + name="LanguagePair", + version=datasets.Version("1.0.0", ""), + description=_DESCRIPTION, + ), + ] + + def _info(self): + logger.warning( + "LanguagePairDataset is an experimental API which we will continue to optimize and may be changed." + ) + + return datasets.DatasetInfo( + description=_DESCRIPTION, + features=datasets.Features( + { + "id": datasets.Value("string"), + "source": datasets.Value("string"), + "target": datasets.Value("string"), + } + ), + supervised_keys=None, + ) + + def _split_generators(self, dl_manager): + is_downloaded = False + + # Train files. + if hasattr(self.config, "data_files") and "train" in self.config.data_files: + train_split = datasets.SplitGenerator( + name="train", + gen_kwargs={ + "source_filepath": os.path.abspath(self.config.data_files["train"][0]), + "target_filepath": os.path.abspath(self.config.data_files["train"][1]), + }, + ) + + else: + if not is_downloaded: + dl_dir = dl_manager.download_and_extract(_URL) + is_downloaded = True + train_split = datasets.SplitGenerator( + name="train", + gen_kwargs={ + "source_filepath": os.path.join( + dl_dir, "WMT14.en-de", "wmt14_ende_data_bpe", "train.tok.clean.bpe.33708.en" + ), + "target_filepath": os.path.join( + dl_dir, "WMT14.en-de", "wmt14_ende_data_bpe", "train.tok.clean.bpe.33708.de" + ), + }, + ) + + # Dev files. 
+ if hasattr(self.config, "data_files") and "dev" in self.config.data_files: + dev_split = datasets.SplitGenerator( + name="dev", + gen_kwargs={ + "source_filepath": os.path.abspath(self.config.data_files["dev"][0]), + "target_filepath": os.path.abspath(self.config.data_files["dev"][1]), + }, + ) + + else: + if not is_downloaded: + dl_dir = dl_manager.download_and_extract(_URL) + is_downloaded = True + dev_split = datasets.SplitGenerator( + name="dev", + gen_kwargs={ + "source_filepath": os.path.join( + dl_dir, "WMT14.en-de", "wmt14_ende_data_bpe", "newstest2013.tok.bpe.33708.en" + ), + "target_filepath": os.path.join( + dl_dir, "WMT14.en-de", "wmt14_ende_data_bpe", "newstest2013.tok.bpe.33708.de" + ), + }, + ) + + # Test files. + if hasattr(self.config, "data_files") and "test" in self.config.data_files: + # test may not contain target languages. + if isinstance(self.config.data_files["test"], str): + self.config.data_files["test"] = [self.config.data_files["test"], None] + elif ( + isinstance(self.config.data_files["test"], (list, tuple)) and len(self.config.data_files["test"]) == 1 + ): + self.config.data_files["test"].append(None) + + test_split = datasets.SplitGenerator( + name="test", + gen_kwargs={ + "source_filepath": os.path.abspath(self.config.data_files["test"][0]), + "target_filepath": os.path.abspath(self.config.data_files["test"][1]), + }, + ) + + else: + if not is_downloaded: + dl_dir = dl_manager.download_and_extract(_URL) + is_downloaded = True + test_split = datasets.SplitGenerator( + name="test", + gen_kwargs={ + "source_filepath": os.path.join( + dl_dir, "WMT14.en-de", "wmt14_ende_data_bpe", "newstest2014.tok.bpe.33708.en" + ), + "target_filepath": os.path.join( + dl_dir, "WMT14.en-de", "wmt14_ende_data_bpe", "newstest2014.tok.bpe.33708.de" + ), + }, + ) + + return [train_split, dev_split, test_split] + + def _generate_examples(self, source_filepath, target_filepath): + """This function returns the examples in the raw (text) form.""" + + logger.info("generating examples from = source: {} & target: {}".format(source_filepath, target_filepath)) + key = 0 + + with open(source_filepath, "r", encoding="utf-8") as src_fin: + if target_filepath is not None: + with open(target_filepath, "r", encoding="utf-8") as tgt_fin: + src_seq = src_fin.readlines() + tgt_seq = tgt_fin.readlines() + + for i, src in enumerate(src_seq): + source = src.strip() + target = tgt_seq[i].strip() + + yield key, { + "id": str(key), + "source": source, + "target": target, + } + key += 1 + else: + src_seq = src_fin.readlines() + for i, src in enumerate(src_seq): + source = src.strip() + + yield key, { + "id": str(key), + "source": source, + # None is not allowed. + "target": "", + } + key += 1 diff --git a/paddlenlp/datasets/wmt14ende.py b/paddlenlp/datasets/wmt14ende.py index 0cf3d83bc57e..ffa896d522e2 100644 --- a/paddlenlp/datasets/wmt14ende.py +++ b/paddlenlp/datasets/wmt14ende.py @@ -14,12 +14,12 @@ import collections import os -import warnings -from paddle.io import Dataset from paddle.dataset.common import md5file from paddle.utils.download import get_path_from_url + from paddlenlp.utils.env import DATA_HOME + from . 
import DatasetBuilder __all__ = ["WMT14ende"] @@ -118,7 +118,7 @@ def _read(self, filename, *args): tgt_line = tgt_line.strip() if not src_line and not tgt_line: continue - yield {"en": src_line, "de": tgt_line} + yield {"source": src_line, "target": tgt_line} def get_vocab(self): bpe_vocab_fullname = os.path.join(DATA_HOME, self.__class__.__name__, self.VOCAB_INFO[0][0]) diff --git a/paddlenlp/ops/faster_transformer/transformer/faster_transformer.py b/paddlenlp/ops/faster_transformer/transformer/faster_transformer.py index 1e364e5ced49..ba2ccbe962c0 100644 --- a/paddlenlp/ops/faster_transformer/transformer/faster_transformer.py +++ b/paddlenlp/ops/faster_transformer/transformer/faster_transformer.py @@ -13,47 +13,43 @@ # limitations under the License. import os import shutil -import numpy as np +import numpy as np import paddle import paddle.nn as nn import paddle.nn.functional as F -from paddlenlp.transformers import ( - TransformerModel, - WordEmbedding, - PositionalEmbedding, - position_encoding_init, - InferTransformerModel, - GPTModel, -) from paddlenlp.ops import ( - InferTransformerDecoding, - InferGptDecoding, - InferUnifiedDecoding, InferBartDecoding, + InferGptDecoding, + InferGptJDecoding, InferMBartDecoding, InferOptDecoding, - InferGptJDecoding, InferPegasusDecoding, + InferTransformerDecoding, + InferUnifiedDecoding, ) - -from .encoder import enable_faster_encoder, disable_faster_encoder -from paddlenlp.ops.ext_utils import load -from paddlenlp.utils.log import logger from paddlenlp.transformers import ( - GPTChineseTokenizer, - GPTTokenizer, - UnifiedTransformerPretrainedModel, - UNIMOPretrainedModel, BartPretrainedModel, + CodeGenPreTrainedModel, + GPTChineseTokenizer, + GPTJPretrainedModel, GPTPretrainedModel, + GPTTokenizer, + InferTransformerModel, MBartPretrainedModel, OPTPretrainedModel, - GPTJPretrainedModel, - CodeGenPreTrainedModel, PegasusPretrainedModel, + PositionalEmbedding, + TransformerModel, + UnifiedTransformerPretrainedModel, + UNIMOPretrainedModel, + WordEmbedding, + position_encoding_init, ) +from paddlenlp.utils.log import logger + +from .encoder import enable_faster_encoder class FasterTransformer(TransformerModel): @@ -95,6 +91,8 @@ class FasterTransformer(TransformerModel): The start token id and also is used as padding id. Defaults to 0. eos_id (int, optional): The end token id. Defaults to 1. + pad_id (int, optional): + The pad token id. Defaults to None. If it's None, the bos_id will be used as pad_id. decoding_strategy (str, optional): Indicating the strategy of decoding. It can be 'beam_search', 'beam_search_v2', 'topk_sampling' and 'topp_sampling'. 
For beam search strategies, @@ -156,6 +154,7 @@ def __init__( act_dropout=None, bos_id=0, eos_id=1, + pad_id=None, decoding_strategy="beam_search", beam_size=4, topk=1, @@ -195,6 +194,7 @@ def __init__( self.trg_vocab_size = trg_vocab_size self.d_model = d_model self.bos_id = bos_id + self.pad_id = pad_id if pad_id is not None else self.bos_id self.max_length = max_length super(FasterTransformer, self).__init__(**args) @@ -234,9 +234,9 @@ def __init__( def forward(self, src_word, trg_word=None): src_max_len = paddle.shape(src_word)[-1] src_slf_attn_bias = ( - paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 + paddle.cast(src_word == self.pad_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 ) - src_pos = paddle.cast(src_word != self.bos_id, dtype=src_word.dtype) * paddle.arange(start=0, end=src_max_len) + src_pos = paddle.cast(src_word != self.pad_id, dtype=src_word.dtype) * paddle.arange(start=0, end=src_max_len) # Run encoder src_emb = self.src_word_embedding(src_word) @@ -254,7 +254,7 @@ def forward(self, src_word, trg_word=None): elif not self.use_fp16_decoding and enc_output.dtype != paddle.float32: enc_output = paddle.cast(enc_output, dtype="float32") - mem_seq_lens = paddle.sum(paddle.cast(src_word != self.bos_id, dtype="int32"), dtype="int32", axis=1) + mem_seq_lens = paddle.sum(paddle.cast(src_word != self.pad_id, dtype="int32"), dtype="int32", axis=1) ids = self.decoding(enc_output, mem_seq_lens, trg_word=trg_word) return ids @@ -475,6 +475,10 @@ class TransformerGenerator(paddle.nn.Layer): The beam width for beam search. Defaults to 4. max_out_len (int, optional): The maximum output length. Defaults to 256. + activation (str, optional): + The activation used in FFN. Defaults to "relu". + normalize_before (bool, optional): + Whether to apply pre-normalization. Defaults to True. kwargs: The key word arguments can be `output_time_major`, `use_ft`, `use_fp16_decoding`, `rel_len`, `alpha`: @@ -532,8 +536,11 @@ def __init__( weight_sharing, bos_id=0, eos_id=1, + pad_id=None, beam_size=4, max_out_len=256, + activation="relu", + normalize_before=True, **kwargs ): logger.warning("TransformerGenerator is an experimental API and subject to change.") @@ -553,7 +560,9 @@ def __init__( rel_len = kwargs.pop("rel_len", False) alpha = kwargs.pop("alpha", 0.6) - if use_ft: + # TODO: Faster version needs to update attr to support custom + # activation and normalize_before, which are both supported in the C++ code.
+ if use_ft and activation == "relu" and normalize_before: try: decoding_strategy = "beam_search_v2" if beam_search_version == "v2" else "beam_search" self.transformer = FasterTransformer( @@ -569,6 +578,7 @@ def __init__( weight_sharing=weight_sharing, bos_id=bos_id, eos_id=eos_id, + pad_id=pad_id, beam_size=beam_size, max_out_len=max_out_len, diversity_rate=diversity_rate, @@ -598,10 +608,13 @@ def __init__( weight_sharing=weight_sharing, bos_id=bos_id, eos_id=eos_id, + pad_id=pad_id, beam_size=beam_size, max_out_len=max_out_len, output_time_major=self.output_time_major, beam_search_version=beam_search_version, + activation=activation, + normalize_before=normalize_before, rel_len=rel_len, alpha=alpha, ) @@ -623,10 +636,13 @@ def __init__( weight_sharing=weight_sharing, bos_id=bos_id, eos_id=eos_id, + pad_id=pad_id, beam_size=beam_size, max_out_len=max_out_len, output_time_major=self.output_time_major, beam_search_version=beam_search_version, + activation=activation, + normalize_before=normalize_before, rel_len=rel_len, alpha=alpha, ) @@ -1679,7 +1695,6 @@ def forward( "encoder_output" ] - batch_size = paddle.shape(encoder_output)[0] if seq_len is None: assert input_ids is not None, "You have to specify either input_ids when generating seq_len." seq_len = paddle.sum(paddle.cast(input_ids != self.pad_token_id, dtype="int32"), axis=-1, dtype="int32") diff --git a/paddlenlp/ops/optimizer/__init__.py b/paddlenlp/ops/optimizer/__init__.py index e554fdb133f9..dd46359ca17e 100644 --- a/paddlenlp/ops/optimizer/__init__.py +++ b/paddlenlp/ops/optimizer/__init__.py @@ -11,9 +11,9 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import os +from .adamwdl import AdamWDL, layerwise_lr_decay from .ema import ExponentialMovingAverage -from .adamwdl import layerwise_lr_decay, AdamWDL +from .lr import InverseSquareRootSchedule -__all__ = ["layerwise_lr_decay", "AdamWDL", "ExponentialMovingAverage"] +__all__ = ["layerwise_lr_decay", "AdamWDL", "ExponentialMovingAverage", "InverseSquareRootSchedule"] diff --git a/paddlenlp/ops/optimizer/lr.py b/paddlenlp/ops/optimizer/lr.py new file mode 100644 index 000000000000..b685cc4fa0ab --- /dev/null +++ b/paddlenlp/ops/optimizer/lr.py @@ -0,0 +1,57 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. + +from paddle.optimizer.lr import LRScheduler + + +class InverseSquareRootSchedule(LRScheduler): + """ + Decay the LR based on the inverse square root of the update number. + + We also support a warmup phase where we linearly increase the learning rate + from some initial learning rate until the configured learning rate. 
Thereafter + we decay proportional to the number of updates, with a decay factor set to + align with the configured learning rate. + + Args: + warmup_steps(int): + The number of warmup steps. A super parameter. + learning_rate(float, optional): + The learning rate. It is a python float number. Defaults to 1.0. + last_epoch(int, optional): + The index of last epoch. Can be set to restart training. Default: -1, + means initial learning rate. + verbose(bool, optional): + If ``True``, prints a message to stdout for each + update. Defaults to ``False``. + """ + + def __init__(self, warmup_steps, learning_rate=1.0, last_epoch=-1, verbose=False): + self.warmup_steps = warmup_steps + warmup_end_lr = learning_rate + self.warmup_init_lr = 0.0 + self.lr_step = (warmup_end_lr - self.warmup_init_lr) / self.warmup_steps + self.decay_factor = warmup_end_lr * (self.warmup_steps**0.5) + + super(InverseSquareRootSchedule, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + if self.last_epoch < self.warmup_steps: + return self.warmup_init_lr + self.last_epoch * self.lr_step + else: + return self.decay_factor * (self.last_epoch**-0.5) diff --git a/paddlenlp/transformers/transformer/modeling.py b/paddlenlp/transformers/transformer/modeling.py index 8c3c60178d51..d3e9debb3895 100644 --- a/paddlenlp/transformers/transformer/modeling.py +++ b/paddlenlp/transformers/transformer/modeling.py @@ -13,11 +13,16 @@ # limitations under the License. import numpy as np - import paddle import paddle.nn as nn import paddle.nn.functional as F from paddle.fluid.layers.utils import map_structure +from paddle.nn import ( + TransformerDecoder, + TransformerDecoderLayer, + TransformerEncoder, + TransformerEncoderLayer, +) __all__ = [ "position_encoding_init", @@ -28,6 +33,7 @@ "TransformerBeamSearchDecoder", "TransformerModel", "InferTransformerModel", + "LabelSmoothedCrossEntropyCriterion", ] @@ -288,6 +294,50 @@ def forward(self, predict, label): return sum_cost, avg_cost, token_num +def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=None, reduce=True): + if target.dim() == lprobs.dim() - 1: + target = target.unsqueeze(-1) + + num_tokens = paddle.shape(lprobs)[0] + index = paddle.arange(0, num_tokens, dtype="int64").unsqueeze(-1) + index = paddle.concat([index, target], axis=-1) + index.stop_gradient = True + + log_probs = -lprobs + + nll_loss = paddle.gather_nd(log_probs, index=index).unsqueeze(-1) + smooth_loss = log_probs.sum(axis=-1, keepdim=True) + + pad_mask = paddle.cast(target != ignore_index, dtype=paddle.get_default_dtype()) + nll_loss = nll_loss * pad_mask + smooth_loss = smooth_loss * pad_mask + if reduce: + nll_loss = nll_loss.sum() + smooth_loss = smooth_loss.sum() + eps_i = epsilon / (lprobs.shape[-1] - 1) + loss = (1.0 - epsilon - eps_i) * nll_loss + eps_i * smooth_loss + token_num = paddle.sum(pad_mask) + return loss, loss / token_num, token_num + + +class LabelSmoothedCrossEntropyCriterion(nn.Layer): + def __init__(self, label_smoothing, padding_idx=0): + super().__init__() + self.eps = label_smoothing + self.padding_idx = padding_idx + + def forward(self, predict, label, reduce=True): + return self.compute_loss(predict, label, reduce=reduce) + + def get_lprobs_and_target(self, predict, label): + lprobs = paddle.nn.functional.log_softmax(predict, axis=-1) + return lprobs.reshape([-1, lprobs.shape[-1]]), label.reshape([-1]) + + def compute_loss(self, predict, label, reduce=True): + lprobs, label = self.get_lprobs_and_target(predict, label) + return 
label_smoothed_nll_loss(lprobs, label, self.eps, ignore_index=self.padding_idx, reduce=reduce) + + class TransformerDecodeCell(nn.Layer): """ This layer wraps a Transformer decoder combined with embedding @@ -650,8 +700,12 @@ class TransformerModel(nn.Layer): The start token id and also be used as padding id. Defaults to 0. eos_id (int, optional): The end token id. Defaults to 1. + pad_id (int, optional): + The pad token id. Defaults to None. If it's None, the bos_id will be used as pad_id. activation (str, optional): The activation used in FFN. Defaults to "relu". + normalize_before (bool, optional): + Whether to apply pre-normalization. Defaults to True. """ def __init__( @@ -670,16 +724,19 @@ def __init__( act_dropout=None, bos_id=0, eos_id=1, + pad_id=None, activation="relu", + normalize_before=True, ): super(TransformerModel, self).__init__() self.trg_vocab_size = trg_vocab_size self.emb_dim = d_model self.bos_id = bos_id self.eos_id = eos_id + self.pad_id = pad_id if pad_id is not None else self.bos_id self.dropout = dropout - self.src_word_embedding = WordEmbedding(vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id) + self.src_word_embedding = WordEmbedding(vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.pad_id) self.src_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length) if weight_sharing: assert ( @@ -688,9 +745,34 @@ def __init__( self.trg_word_embedding = self.src_word_embedding self.trg_pos_embedding = self.src_pos_embedding else: - self.trg_word_embedding = WordEmbedding(vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id) + self.trg_word_embedding = WordEmbedding(vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.pad_id) self.trg_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length) + if not normalize_before: + encoder_layer = TransformerEncoderLayer( + d_model=d_model, + nhead=n_head, + dim_feedforward=d_inner_hid, + dropout=dropout, + activation=activation, + attn_dropout=attn_dropout, + act_dropout=act_dropout, + normalize_before=normalize_before, + ) + encoder_with_post_norm = TransformerEncoder(encoder_layer, num_encoder_layers) + + decoder_layer = TransformerDecoderLayer( + d_model=d_model, + nhead=n_head, + dim_feedforward=d_inner_hid, + dropout=dropout, + activation=activation, + attn_dropout=attn_dropout, + act_dropout=act_dropout, + normalize_before=normalize_before, + ) + decoder_with_post_norm = TransformerDecoder(decoder_layer, num_decoder_layers) + self.transformer = paddle.nn.Transformer( d_model=d_model, nhead=n_head, @@ -701,7 +783,9 @@ def __init__( attn_dropout=attn_dropout, act_dropout=act_dropout, activation=activation, - normalize_before=True, + normalize_before=normalize_before, + custom_encoder=None if normalize_before else encoder_with_post_norm, + custom_decoder=None if normalize_before else decoder_with_post_norm, ) if weight_sharing: @@ -761,16 +845,16 @@ def forward(self, src_word, trg_word): src_max_len = paddle.shape(src_word)[-1] trg_max_len = paddle.shape(trg_word)[-1] src_slf_attn_bias = ( - paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4 + paddle.cast(src_word == self.pad_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4 ) src_slf_attn_bias.stop_gradient = True trg_slf_attn_bias = self.transformer.generate_square_subsequent_mask(trg_max_len) trg_slf_attn_bias.stop_gradient = True trg_src_attn_bias = src_slf_attn_bias - src_pos = paddle.cast(src_word != self.bos_id, dtype=src_word.dtype) * 
paddle.arange( + src_pos = paddle.cast(src_word != self.pad_id, dtype=src_word.dtype) * paddle.arange( start=0, end=src_max_len, dtype=src_word.dtype ) - trg_pos = paddle.cast(trg_word != self.bos_id, dtype=src_word.dtype) * paddle.arange( + trg_pos = paddle.cast(trg_word != self.pad_id, dtype=src_word.dtype) * paddle.arange( start=0, end=trg_max_len, dtype=trg_word.dtype ) @@ -835,6 +919,8 @@ class InferTransformerModel(TransformerModel): The start token id and also is used as padding id. Defaults to 0. eos_id (int, optional): The end token id. Defaults to 1. + pad_id (int, optional): + The pad token id. Defaults to None. If it's None, the bos_id will be used as pad_id. beam_size (int, optional): The beam width for beam search. Defaults to 4. max_out_len (int, optional): @@ -851,6 +937,8 @@ class InferTransformerModel(TransformerModel): penalty. Default to `v1`. activation (str, optional): The activation used in FFN. Defaults to "relu". + normalize_before (bool, optional): + Whether to apply pre-normalization. Defaults to True. kwargs: The key word arguments can be `rel_len` and `alpha`: @@ -880,11 +968,13 @@ def __init__( act_dropout=None, bos_id=0, eos_id=1, + pad_id=None, beam_size=4, max_out_len=256, output_time_major=False, beam_search_version="v1", activation="relu", + normalize_before=True, **kwargs ): args = dict(locals()) @@ -955,17 +1045,17 @@ def forward(self, src_word, trg_word=None): src_word=paddle.randint(low=3, high=30000, shape=[batch_size, seq_len])) """ if trg_word is not None: - trg_length = paddle.sum(paddle.cast(trg_word != self.bos_id, dtype="int32"), axis=-1) + trg_length = paddle.sum(paddle.cast(trg_word != self.pad_id, dtype="int32"), axis=-1) else: trg_length = None if self.beam_search_version == "v1": src_max_len = paddle.shape(src_word)[-1] src_slf_attn_bias = ( - paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4 + paddle.cast(src_word == self.pad_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4 ) trg_src_attn_bias = src_slf_attn_bias - src_pos = paddle.cast(src_word != self.bos_id, dtype=src_word.dtype) * paddle.arange( + src_pos = paddle.cast(src_word != self.pad_id, dtype=src_word.dtype) * paddle.arange( start=0, end=src_max_len, dtype=src_word.dtype ) @@ -1036,10 +1126,10 @@ def merge_beam_dim(tensor): # run encoder src_max_len = paddle.shape(src_word)[-1] src_slf_attn_bias = ( - paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4 + paddle.cast(src_word == self.pad_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e4 ) src_slf_attn_bias.stop_gradient = True - src_pos = paddle.cast(src_word != self.bos_id, dtype=src_word.dtype) * paddle.arange( + src_pos = paddle.cast(src_word != self.pad_id, dtype=src_word.dtype) * paddle.arange( start=0, end=src_max_len, dtype=src_word.dtype ) src_emb = self.src_word_embedding(src_word) @@ -1058,8 +1148,8 @@ def merge_beam_dim(tensor): else (enc_output.shape[1] + max_len if self.rel_len else max_len) ) - ### initialize states of beam search ### - ## init for the alive ## + # initialize states of beam search + # init for the alive initial_log_probs = paddle.assign(np.array([[0.0] + [-inf] * (beam_size - 1)], dtype="float32")) alive_log_probs = paddle.tile(initial_log_probs, [batch_size, 1]) @@ -1067,7 +1157,7 @@ def merge_beam_dim(tensor): paddle.cast(paddle.assign(np.array([[[self.bos_id]]])), src_word.dtype), [batch_size, beam_size, 1] ) - ## init for the finished ## + # init for the finished finished_scores 
= paddle.assign(np.array([[-inf] * beam_size], dtype="float32")) finished_scores = paddle.tile(finished_scores, [batch_size, 1]) @@ -1076,14 +1166,14 @@ def merge_beam_dim(tensor): ) finished_flags = paddle.zeros_like(finished_scores) - ### initialize inputs and states of transformer decoder ### - ## init inputs for decoder, shaped `[batch_size*beam_size, ...]` + # initialize inputs and states of transformer decoder + # init inputs for decoder, shaped `[batch_size*beam_size, ...]` pre_word = paddle.reshape(alive_seq[:, :, -1], [batch_size * beam_size, 1]) trg_src_attn_bias = src_slf_attn_bias trg_src_attn_bias = merge_beam_dim(expand_to_beam_size(trg_src_attn_bias, beam_size)) enc_output = merge_beam_dim(expand_to_beam_size(enc_output, beam_size)) - ## init states (caches) for transformer, need to be updated according to selected beam + # init states (caches) for transformer, need to be updated according to selected beam caches = self.transformer.decoder.gen_cache(enc_output, do_zip=False) if trg_word is not None: diff --git a/tests/test_tipc/configs/transformer/base/train_infer_python.txt b/tests/test_tipc/configs/transformer/base/train_infer_python.txt index 8a07c2cdf64f..5a0b92339837 100644 --- a/tests/test_tipc/configs/transformer/base/train_infer_python.txt +++ b/tests/test_tipc/configs/transformer/base/train_infer_python.txt @@ -27,7 +27,7 @@ null:null ===========================infer_params=========================== null:null null:null -norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.base.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33712 --unk_token "<unk>" --bos_token "<s>" --eos_token "<e>" --benchmark +norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.base.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33712 --bos_token "<s>" --eos_token "<e>" --benchmark quant_export:null fpgm_export:null distill_export:null diff --git a/tests/test_tipc/configs/transformer/base/transformer_base_dygraph_params.txt b/tests/test_tipc/configs/transformer/base/transformer_base_dygraph_params.txt index d6b96e534109..7a9bb5a4f37d 100644 --- a/tests/test_tipc/configs/transformer/base/transformer_base_dygraph_params.txt +++ b/tests/test_tipc/configs/transformer/base/transformer_base_dygraph_params.txt @@ -27,7 +27,7 @@ null:null ===========================infer_params=========================== null:null null:null -norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.base.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33708 --unk_token "<unk>" --bos_token "<s>" --eos_token "<e>" +norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.base.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33708 --bos_token "<s>" --eos_token "<e>" quant_export:null fpgm_export:null distill_export:null diff --git a/tests/test_tipc/configs/transformer/big/train_infer_python.txt b/tests/test_tipc/configs/transformer/big/train_infer_python.txt index 689e12ed6993..f3995c1c5124 100644 --- a/tests/test_tipc/configs/transformer/big/train_infer_python.txt +++ b/tests/test_tipc/configs/transformer/big/train_infer_python.txt @@ -27,7 +27,7 @@ null:null 
===========================infer_params=========================== null:null null:null -norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.big.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33712 --unk_token "<unk>" --bos_token "<s>" --eos_token "<e>" --benchmark +norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.big.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33712 --bos_token "<s>" --eos_token "<e>" --benchmark quant_export:null fpgm_export:null distill_export:null diff --git a/tests/test_tipc/configs/transformer/big/transformer_big_dygraph_params.txt b/tests/test_tipc/configs/transformer/big/transformer_big_dygraph_params.txt index 4e8992411071..e72c03e70751 100644 --- a/tests/test_tipc/configs/transformer/big/transformer_big_dygraph_params.txt +++ b/tests/test_tipc/configs/transformer/big/transformer_big_dygraph_params.txt @@ -27,7 +27,7 @@ null:null ===========================infer_params=========================== null:null null:null -norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.big.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33708 --unk_token "<unk>" --bos_token "<s>" --eos_token "<e>" +norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.big.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33708 --bos_token "<s>" --eos_token "<e>" quant_export:null fpgm_export:null distill_export:null diff --git a/tests/test_tipc/configs/transformer/train_infer_python.txt b/tests/test_tipc/configs/transformer/train_infer_python.txt index d6b96e534109..7a9bb5a4f37d 100644 --- a/tests/test_tipc/configs/transformer/train_infer_python.txt +++ b/tests/test_tipc/configs/transformer/train_infer_python.txt @@ -27,7 +27,7 @@ null:null ===========================infer_params=========================== null:null null:null -norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.base.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33708 --unk_token "<unk>" --bos_token "<s>" --eos_token "<e>" +norm_export:../examples/machine_translation/transformer/export_model.py --config ../examples/machine_translation/transformer/configs/transformer.base.yaml --vocab_file ../examples/machine_translation/transformer/vocab_all.bpe.33708 --bos_token "<s>" --eos_token "<e>" quant_export:null fpgm_export:null distill_export:null diff --git a/tests/test_tipc/transformer/modeling.py b/tests/test_tipc/transformer/modeling.py index 072f88326829..55e87b3e4cce 100644 --- a/tests/test_tipc/transformer/modeling.py +++ b/tests/test_tipc/transformer/modeling.py @@ -1,5 +1,18 @@ -import numpy as np +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np import paddle import paddle.nn as nn import paddle.nn.functional as F @@ -13,7 +26,6 @@ "TransformerDecodeCell", "TransformerBeamSearchDecoder", "TransformerModel", - "InferTransformerModel", ] @@ -611,6 +623,10 @@ class TransformerModel(nn.Layer): The start token id and also be used as padding id. Defaults to 0. eos_id (int, optional): The end token id. Defaults to 1. + pad_id (int, optional): + The pad token id. Defaults to None. If it's None, the bos_id will be used as pad_id. + activation (str, optional): + The activation used in FFN. Defaults to "relu". """ def __init__( @@ -629,12 +645,15 @@ def __init__( act_dropout=None, bos_id=0, eos_id=1, + pad_id=None, + activation="relu", ): super(TransformerModel, self).__init__() self.trg_vocab_size = trg_vocab_size self.emb_dim = d_model self.bos_id = bos_id self.eos_id = eos_id + self.pad_id = pad_id if pad_id is not None else self.bos_id self.dropout = dropout self.src_word_embedding = WordEmbedding(vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id) @@ -658,7 +677,7 @@ def __init__( dropout=dropout, attn_dropout=attn_dropout, act_dropout=act_dropout, - activation="relu", + activation=activation, normalize_before=True, ) diff --git a/tests/test_tipc/transformer/train.py b/tests/test_tipc/transformer/train.py index 15dd81967613..32b26aab40e1 100644 --- a/tests/test_tipc/transformer/train.py +++ b/tests/test_tipc/transformer/train.py @@ -1,17 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse import os import sys import time - -import yaml -import argparse -import numpy as np from pprint import pprint -from attrdict import AttrDict +import numpy as np import paddle import paddle.distributed as dist +import yaml +from attrdict import AttrDict +from modeling import CrossEntropyCriterion, TransformerModel -from modeling import TransformerModel, CrossEntropyCriterion from paddlenlp.utils.log import logger sys.path.append( @@ -19,8 +32,8 @@ os.path.join(os.path.dirname(__file__), os.pardir, os.pardir, "examples", "machine_translation", "transformer") ) ) -import reader -from tls.record import AverageStatistical +import reader # noqa: E402 +from tls.record import AverageStatistical # noqa: E402 paddle.set_default_dtype("float64") @@ -36,19 +49,25 @@ def parse_args(): help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", ) parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. 
") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--train_file", nargs="+", default=None, type=str, - help="The files for training, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to train. ", + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--dev_file", nargs="+", default=None, type=str, - help="The files for validation, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to do validation. ", + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--vocab_file", @@ -56,6 +75,20 @@ def parse_args(): type=str, help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -68,6 +101,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) args = parser.parse_args() return args @@ -305,12 +344,43 @@ def do_train(args): args.benchmark = ARGS.benchmark if ARGS.max_iter: args.max_iter = ARGS.max_iter + args.data_dir = ARGS.data_dir args.train_file = ARGS.train_file args.dev_file = ARGS.dev_file - args.vocab_file = ARGS.vocab_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. 
" + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token pprint(args) do_train(args) diff --git a/tests/transformer/modeling.py b/tests/transformer/modeling.py index 072f88326829..55e87b3e4cce 100644 --- a/tests/transformer/modeling.py +++ b/tests/transformer/modeling.py @@ -1,5 +1,18 @@ -import numpy as np +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np import paddle import paddle.nn as nn import paddle.nn.functional as F @@ -13,7 +26,6 @@ "TransformerDecodeCell", "TransformerBeamSearchDecoder", "TransformerModel", - "InferTransformerModel", ] @@ -611,6 +623,10 @@ class TransformerModel(nn.Layer): The start token id and also be used as padding id. Defaults to 0. eos_id (int, optional): The end token id. Defaults to 1. + pad_id (int, optional): + The pad token id. Defaults to None. If it's None, the bos_id will be used as pad_id. + activation (str, optional): + The activation used in FFN. Defaults to "relu". """ def __init__( @@ -629,12 +645,15 @@ def __init__( act_dropout=None, bos_id=0, eos_id=1, + pad_id=None, + activation="relu", ): super(TransformerModel, self).__init__() self.trg_vocab_size = trg_vocab_size self.emb_dim = d_model self.bos_id = bos_id self.eos_id = eos_id + self.pad_id = pad_id if pad_id is not None else self.bos_id self.dropout = dropout self.src_word_embedding = WordEmbedding(vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id) @@ -658,7 +677,7 @@ def __init__( dropout=dropout, attn_dropout=attn_dropout, act_dropout=act_dropout, - activation="relu", + activation=activation, normalize_before=True, ) diff --git a/tests/transformer/train.py b/tests/transformer/train.py index 989911586efb..8fd1ffa921e8 100644 --- a/tests/transformer/train.py +++ b/tests/transformer/train.py @@ -12,20 +12,19 @@ # See the License for the specific language governing permissions and # limitations under the License. +import argparse import os import sys import time - -import yaml -import argparse -import numpy as np from pprint import pprint -from attrdict import AttrDict +import numpy as np import paddle import paddle.distributed as dist +import yaml +from attrdict import AttrDict +from modeling import CrossEntropyCriterion, TransformerModel -from modeling import TransformerModel, CrossEntropyCriterion from paddlenlp.utils.log import logger sys.path.append( @@ -33,8 +32,8 @@ os.path.join(os.path.dirname(__file__), os.pardir, os.pardir, "examples", "machine_translation", "transformer") ) ) -import reader -from tls.record import AverageStatistical +import reader # noqa: E402 +from tls.record import AverageStatistical # noqa: E402 paddle.set_default_dtype("float64") @@ -50,19 +49,25 @@ def parse_args(): help="Whether to print logs on each cards and use benchmark vocab. 
Normally, not necessary to set --benchmark. ", ) parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) parser.add_argument( "--train_file", nargs="+", default=None, type=str, - help="The files for training, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to train. ", + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--dev_file", nargs="+", default=None, type=str, - help="The files for validation, including [source language file, target language file]. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used to do validation. ", + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", ) parser.add_argument( "--vocab_file", @@ -70,6 +75,20 @@ def parse_args(): type=str, help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") parser.add_argument( "--unk_token", default=None, @@ -82,6 +101,12 @@ def parse_args(): parser.add_argument( "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) parser.add_argument( "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu"], help="Device selected for inference." 
) @@ -331,12 +356,44 @@ def do_train(args): args.benchmark = ARGS.benchmark if ARGS.max_iter: args.max_iter = ARGS.max_iter + args.data_dir = ARGS.data_dir args.train_file = ARGS.train_file args.dev_file = ARGS.dev_file - args.vocab_file = ARGS.vocab_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistent when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + args.unk_token = ARGS.unk_token args.bos_token = ARGS.bos_token args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + args.device = ARGS.device pprint(args)
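As a reading aid for the schedule added in `paddlenlp/ops/optimizer/lr.py`, its formulas can be sketched in plain Python. The values `warmup_steps=4000` and `learning_rate=1.0` below are illustrative assumptions rather than defaults taken from any config, and the function is a standalone sketch, not a drop-in `LRScheduler` subclass:

``` python
# Minimal sketch of the InverseSquareRootSchedule formulas (assumed values:
# warmup_steps=4000, learning_rate=1.0); mirrors paddlenlp/ops/optimizer/lr.py.
def inverse_sqrt_lr(step, warmup_steps=4000, learning_rate=1.0, warmup_init_lr=0.0):
    lr_step = (learning_rate - warmup_init_lr) / warmup_steps
    decay_factor = learning_rate * warmup_steps**0.5
    if step < warmup_steps:
        # Linear warmup from warmup_init_lr up to the configured learning rate.
        return warmup_init_lr + step * lr_step
    # Afterwards, decay proportionally to the inverse square root of the step.
    return decay_factor * step**-0.5


for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

The sketch makes the two regimes explicit: the rate grows linearly to `learning_rate` during warmup and then decays with the inverse square root of the step, so it equals exactly `learning_rate` at `step == warmup_steps`.

Similarly, the recurring switch from `self.bos_id` to `self.pad_id` when building the encoder self-attention bias and position ids can be illustrated with NumPy. The toy batch and `pad_id=0` are made-up values; the real models compute the same tensors with `paddle.cast` and `paddle.arange`:

``` python
import numpy as np

pad_id = 0  # assumed pad token id for this toy example
src_word = np.array([[5, 7, 9, 0, 0],
                     [3, 4, 0, 0, 0]], dtype="int64")  # padded toy batch

src_max_len = src_word.shape[-1]
# Large negative bias on padding positions, broadcastable over heads and query positions.
src_slf_attn_bias = (src_word == pad_id).astype("float32")[:, None, None, :] * -1e4
# Position ids are zeroed on padding positions, mirroring the expressions in the diff.
src_pos = (src_word != pad_id).astype("int64") * np.arange(0, src_max_len)

print(src_slf_attn_bias[0, 0, 0])  # [0. 0. 0. -10000. -10000.]
print(src_pos)                     # [[0 1 2 0 0] [0 1 0 0 0]]
```

Padding positions receive a large negative attention bias and a zero position index, which is why a dedicated `pad_id` (falling back to `bos_id` when it is `None`) is threaded through the models above.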