Norm-Based Curriculum Learning for Neural Machine Translation (ACL 2020)

This repo contains the source code and instructions to reproduce the results of our paper.

Reference:

@inproceedings{NORMCL20,
  title={Norm-Based Curriculum Learning for Neural Machine Translation},
  author={Liu, Xuebo and Lai, Houtim and Wong, Derek F. and Chao, Lidia S.},
  booktitle={ACL 2020},
  year={2020}
}

INSTALLATION

Requirements

* [Boost] 1.64.0
* [CMAKE] 3.13.2
* [CUDA] 8.0
* [Fasttext] https://github.com/facebookresearch/fastText
* See the Marian documentation for more installation details: https://marian-nmt.github.io/

Clone and build this repository

git clone https://github.com/NLP2CT/norm-nmt
cd norm-nmt
mkdir build
cd build
cmake ..
make -j

Training and Testing

Train Fasttext to get norm-based sentence difficulty

# Install Fasttext
mkdir ~/fast && cd ~/fast
wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
unzip v0.9.1.zip
cd fastText-0.9.1
make -j
cp fasttext ../
# Use tokenized data for fasttext (run the commands below from the norm-nmt repository root, where CL_tools lives)
mkdir -p cl/mod
cat $TRAIN.$SRCL | $MARIAN_VOCAB > cl/vocab.$SRCL.yml
python CL_tools/process_fasttext.py -i $TRAIN.$SRCL -o $SRCL.emb -v cl/vocab.$SRCL.yml -w ~/fast/fasttext
python CL_tools/build_cdf_mod.py --emb_vector $SRCL.emb.orig.vec $TRAIN.$SRCL cl/mod/$SRCL-mod
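
The idea behind these two steps can be illustrated with a minimal Python sketch. It is not the code in CL_tools: it assumes that a sentence's difficulty is the L2 norm of the sum of its word vectors, and that the CDF file stores the empirical cumulative distribution of those norms. The names `sentence_norm`, `cdf_difficulty`, and `toy_embeddings` are hypothetical and exist only for this example.

```python
import bisect
import math


def sentence_norm(tokens, embeddings):
    """L2 norm of the summed word vectors of one tokenized sentence."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            # Assumption: out-of-vocabulary tokens are skipped in this sketch.
            continue
        for i, x in enumerate(vec):
            total[i] += x
    return math.sqrt(sum(x * x for x in total))


def cdf_difficulty(norms):
    """Map each norm to its empirical CDF value, a number in (0, 1]."""
    ordered = sorted(norms)
    n = len(ordered)
    return [bisect.bisect_right(ordered, v) / n for v in norms]


# Toy data standing in for fastText vectors and a tokenized corpus.
toy_embeddings = {"the": [0.1, 0.0], "cat": [0.6, 0.8], "sat": [0.0, 1.0]}
corpus = [["the"], ["the", "cat"], ["the", "cat", "sat"]]

norms = [sentence_norm(sentence, toy_embeddings) for sentence in corpus]
difficulty = cdf_difficulty(norms)
print(difficulty)
```

In this toy corpus the norm grows with each added content word, so the three sentences are ranked from easiest to hardest.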

Translation

# Training (8 GPUs)
 $MARIAN_TRAIN \
        --model $MODEL_DIR/model_revo.npz --type transformer \
        --train-sets $TRAIN.$SRCL $TRAIN.$TGTL \
        --max-length 140 \
        --vocabs $MODEL_DIR/vocab.$SRCL.yml $MODEL_DIR/vocab.$TGTL.yml \
        --mini-batch-fit -w 9250 --maxi-batch 5000 \
        --early-stopping 10 --cost-type=ce-mean-words \
        --valid-freq 2500 --save-freq 2500 --disp-freq 1 \
        --valid-metrics ce-mean-words perplexity translation \
        --valid-sets $CORPUS_DIR/$VALID.$SRCL $CORPUS_DIR/$VALID.$TGTL \
        --valid-script-path "bash ./validate-${SRCL}-${TGTL}.sh" \
        --valid-translation-output $OUTPUT_DIR/$ID.tf.$SRCL$TGTL.single --quiet-translation \
        --valid-mini-batch 64 \
        --beam-size 6 --normalize 0.6 \
        --log $MODEL_DIR/train.log --valid-log $MODEL_DIR/valid.log \
        --enc-depth 6 --dec-depth 6 \
        --transformer-heads 8 \
        --transformer-postprocess-emb d \
        --transformer-postprocess dan \
        --transformer-dropout 0.1 --label-smoothing 0.1 \
        --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
        --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
        --devices $GPUS --sync-sgd --seed $ID$ID$ID$ID --keep-best --overwrite \
        --exponential-smoothing --no-nccl --filter_corpus 0.85 \
        --sr-freq-file cl/mod/en-mod.txt cl/mod/en-mod-cdf_base.npz 2.5 0.01 mod d 0.5 \
        --after-batches 100000

# Evaluation
./decode_validate.sh $MODEL_DIR > $MODEL_DIR/result

# For more details, see runner/run.sh

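# Competence parameters
# NBCL   : --sr-freq-file cl/mod/en-mod.txt cl/mod/en-mod-cdf_base.npz 2.5 0.01 mod [d] [0.5]
# params : word_stat_file CDF_file [MOD: ratio] percentage_of_starting_corpus(c0) mode(mod) [dynamic_weight] [dynamic_ratio]
# In the example, the positional values map to the parameters as follows:
#   word_stat_file                    = cl/mod/en-mod.txt
#   CDF_file                          = cl/mod/en-mod-cdf_base.npz
#   ratio (used in mod mode)          = 2.5
#   percentage_of_starting_corpus(c0) = 0.01
#   mode                              = mod
#   dynamic_weight                    = d
#   dynamic_ratio                     = 0.5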
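
To show how c0 and the CDF file might interact, here is a minimal Python sketch of competence-based data selection. It is an illustration under stated assumptions, not the logic in src/data/competence.h: it assumes each sentence has a CDF difficulty in (0, 1], that c0 is a fraction (so 0.01 is read as the easiest 1% of the corpus), and that a sentence is eligible once its difficulty does not exceed the current competence. The function `eligible_indices` and the sample values are hypothetical.

```python
def eligible_indices(difficulty, competence):
    """Indices of sentences whose CDF difficulty is at most the competence."""
    return [i for i, d in enumerate(difficulty) if d <= competence]


# Hypothetical CDF difficulties for six training sentences.
difficulty = [0.005, 0.01, 0.2, 0.55, 0.9, 1.0]

c0 = 0.01  # starting competence, as in the example command
start_pool = eligible_indices(difficulty, c0)   # easiest sentences only
mid_pool = eligible_indices(difficulty, 0.6)    # competence has grown
full_pool = eligible_indices(difficulty, 1.0)   # whole corpus is available

print(start_pool, mid_pool, full_pool)
```

As the competence rises toward 1, the pool widens until every sentence is eligible. How the competence itself is updated during training is left out of this sketch.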

Procedure

All tools can be found in the runner folder.

1. Compile the Marian source code (see the Marian repository for details).
2. Install fasttext.
3. Prepare your corpus data.
4. Use the NBCL tools to build the word_stat and CDF files.
5. Train the model.

Mainly modified code:

src/data/competence.h
src/data/gap_training.h
src/data/batch_generator.h

Many thanks to the Marian community

This project is based on a fork of the Marian codebase (version of 13 Dec 2018).
