This repository contains the code for the ACL 2021 paper:
Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction
Gyubok Lee*, Seongjun Yang*, Edward Choi (*: equal contribution)
Our code is built upon Leca, since it supports translation both with and without a bilingual term dictionary.
DE-EN OPUS Acquis, EMEA
Download the DE-EN OPUS (Acquis, EMEA) dataset from this link -
DE-EN IATE dictionary
Download the DE-EN IATE dictionary from this link -
KO-EN Law data
Download the KO-EN Law corpus from this link -
KO-EN Law dictionary
Download the KO-EN Law dictionary from this link
- PyTorch version == 1.7.1
- Python version >= 3.7
To install fairseq from source and develop locally:
git clone https://github.com/wns823/NMT_SSP.git
cd NMT_SSP
git clone https://github.com/moses-smt/mosesdecoder.git
git clone https://github.com/rsennrich/subword-nmt.git
pip install --editable .
pip install wandb
pip install spacy==2.2.4
pip install mecab-python3==0.996.5
pip install konlpy==0.5.2
pip install tokenizers==0.10.2
pip install parmap
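After installing, a quick sanity check (illustrative only; it just verifies that the pinned PyTorch version and the editable fairseq install are importable):

python -c "import torch; print(torch.__version__)"      # expect 1.7.1
python -c "import fairseq; print(fairseq.__version__)"  # confirms the editable install worked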
Move dict_law_en_ko.json and iate_en_de_all.json into the dictionary folder.
Move the downloaded corpora into the raw_data folder.
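For example, assuming the downloaded files keep the names used in the commands below, the layout can be prepared like this (a sketch; rename to match your actual downloads):

mkdir -p dictionary raw_data
mv dict_law_en_ko.json iate_en_de_all.json dictionary/
mv JRC-Acquis.de-en.de JRC-Acquis.de-en.en EMEA.de-en.de EMEA.de-en.en law-all.ko law-all.en raw_data/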
python filter_dict.py --min 4 --max 20
ex) python data_split_algorithm.py --domain acquis --src_path raw_data/JRC-Acquis.de-en.de --tgt_path raw_data/JRC-Acquis.de-en.en --directory_path dictionary/iate_en_de_filter.json --src_lang de
ex) python data_split_algorithm.py --domain emea --src_path raw_data/EMEA.de-en.de --tgt_path raw_data/EMEA.de-en.en --directory_path dictionary/iate_en_de_filter.json --src_lang de
ex) python data_split_algorithm.py --domain law --src_path raw_data/law-all.ko --tgt_path raw_data/law-all.en --directory_path dictionary/dict_law_en_ko.json --src_lang ko
For DE-EN:
bash tokenizing_bpe_gen.sh domain
ex) bash tokenizing_bpe_gen.sh acquis
bash tokenizing_bpe_apply.sh domain split
ex) bash tokenizing_bpe_apply.sh acquis valid
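To cover both DE-EN domains and every split in one pass (a convenience sketch; the train/valid/test split names follow the filter_data.py examples further below):

for domain in acquis emea; do
    bash tokenizing_bpe_gen.sh $domain
    for split in train valid test; do
        bash tokenizing_bpe_apply.sh $domain $split
    done
done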
For KO-EN:
bash tokenizing_bpe_gen_ko.sh
bash tokenizing_bpe_apply_ko.sh
ex) python filter_data.py --domain emea --src_lang de --tgt_lang en --split train --min 5 --max 80
ex) python filter_data.py --domain emea --src_lang de --tgt_lang en --split valid --min 5 --max 80
ex) python filter_data.py --domain emea --src_lang de --tgt_lang en --split test --min 5 --max 80
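The same filtering as a loop over the DE-EN domains and splits (a sketch; adjust --min/--max if you want different length bounds per domain, and use --domain law --src_lang ko for the KO-EN Law data):

for domain in acquis emea; do
    for split in train valid test; do
        python filter_data.py --domain $domain --src_lang de --tgt_lang en --split $split --min 5 --max 80
    done
done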
bash binarize_dataset.sh emea de en
ex) bash make_tok.sh de en emea
python make_span.py --directory emea_deen --src de --tgt en --saved data-bin/emea_deen
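These three steps can be wrapped per domain; the sketch below infers the positional-argument order of the two shell scripts and the emea_deen directory naming from the examples above, so treat it as an assumption rather than documented usage:

preprocess_domain () {
    local domain=$1 src=$2 tgt=$3
    bash binarize_dataset.sh $domain $src $tgt    # e.g. emea de en
    bash make_tok.sh $src $tgt $domain            # e.g. de en emea
    python make_span.py --directory ${domain}_${src}${tgt} --src $src --tgt $tgt --saved data-bin/${domain}_${src}${tgt}
}
preprocess_domain emea de en    # reproduces the three commands above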
bash train.sh gpu domain src tgt model_path span loss_ratio min_span max_span dropout
ex) bash train.sh 2 emea de en emea_leca_span_0.3 span 0.5 1 10 0.3
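For reference, the example maps onto the usage line as follows (argument names taken from the usage string above):

# gpu=2, domain=emea, src=de, tgt=en, model_path=emea_leca_span_0.3,
# span=span, loss_ratio=0.5, min_span=1, max_span=10, dropout=0.3
bash train.sh 2 emea de en emea_leca_span_0.3 span 0.5 1 10 0.3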
bash inference.sh gpu domain src tgt model_path with_dictionary
ex) bash inference.sh 0 emea de en emea_leca_span_0.3 1 (without dictionary)
ex) bash inference.sh 0 emea de en emea_leca_span_0.3 0 (with dictionary)
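Likewise for inference; per the two examples above, the final flag is 1 to decode without the dictionary and 0 to decode with it:

# gpu=0, domain=emea, src=de, tgt=en, model_path=emea_leca_span_0.3, with_dictionary=1|0
bash inference.sh 0 emea de en emea_leca_span_0.3 1    # without the dictionary
bash inference.sh 0 emea de en emea_leca_span_0.3 0    # with the dictionary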
ex) python ngram_inference.py --domain emea --src_lang de --tgt_lang en --outputfile inference_result/emea_leca_span_0.3_1.txt
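If both decoding modes were run, the term-level evaluation can be repeated for each output file; note that the inference_result/emea_leca_span_0.3_<flag>.txt naming is inferred from the single example above and may not match your setup:

for flag in 0 1; do
    python ngram_inference.py --domain emea --src_lang de --tgt_lang en --outputfile inference_result/emea_leca_span_0.3_${flag}.txt
done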
@inproceedings{lee-etal-2021-improving,
title = "Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction",
author = "Lee, Gyubok and
Yang, Seongjun and
Choi, Edward",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-short.94",
doi = "10.18653/v1/2021.acl-short.94",
pages = "743--753",
abstract = "Accurate terminology translation is crucial for ensuring the practicality and reliability of neural machine translation (NMT) systems. To address this, lexically constrained NMT explores various methods to ensure pre-specified words and phrases appear in the translation output. However, in many cases, those methods are studied on general domain corpora, where the terms are mostly uni- and bi-grams ({\textgreater}98{\%}). In this paper, we instead tackle a more challenging setup consisting of domain-specific corpora with much longer n-gram and highly specialized terms. Inspired by the recent success of masked span prediction models, we propose a simple and effective training strategy that achieves consistent improvements on both terminology and sentence-level translation for three domain-specific corpora in two language pairs.",
}