PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation

PhoMT is a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs. The dataset statistics are as follows:

Details of the dataset construction and experimental results can be found in our EMNLP 2021 paper:

@inproceedings{PhoMT,
title     = {{PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation}},
author    = {Long Doan and Linh The Nguyen and Nguyen Luong Tran and Thai Hoang and Dat Quoc Nguyen},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
year      = {2021},
pages     = {4495--4503}
}

Please follow this LINK to download the PhoMT dataset. By downloading this dataset, USER agrees:

- to use the dataset for research or educational purposes only.
- to not distribute the dataset or part of the dataset in any original or modified form.
- and to cite our EMNLP 2021 paper "PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation" whenever the dataset is used to help produce published results.

Note: We performed Vietnamese tone normalization on the Vietnamese sentences, using a Python script.
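
For readers unfamiliar with the term, tone normalization generally means rewriting words so that tone marks follow a single placement convention (for example, "hóa" versus "hoá"). The snippet below is a minimal sketch of that idea, not the script used to build PhoMT. The function name `normalize_tones`, the table `TONE_MAP`, the choice of vowel clusters (oa, oe, uy), and the direction of the mapping (tone mark moved from the first vowel to the second) are all assumptions made for illustration.

```python
import unicodedata

# Combining marks for the five marked Vietnamese tones:
# grave, acute, hook above, tilde, dot below.
_TONE_MARKS = ["\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]

# Vowel clusters handled by this sketch (an assumption, not an exhaustive list).
_CLUSTERS = [("o", "a"), ("o", "e"), ("u", "y")]


def _build_tone_map():
    """Build a map from 'tone on first vowel' to 'tone on second vowel'."""
    mapping = {}
    for first, second in _CLUSTERS:
        for mark in _TONE_MARKS:
            old = unicodedata.normalize("NFC", first + mark) + second
            new = first + unicodedata.normalize("NFC", second + mark)
            # Cover lowercase, Capitalized, and UPPERCASE spellings.
            mapping[old] = new
            mapping[old.capitalize()] = new.capitalize()
            mapping[old.upper()] = new.upper()
    return mapping


# Hypothetical lookup table; the direction of the mapping is an assumption.
TONE_MAP = _build_tone_map()


def normalize_tones(text):
    """Return text with tone marks in the handled clusters moved to one convention."""
    # Compose characters first so decomposed input matches the table keys.
    text = unicodedata.normalize("NFC", text)
    for old, new in TONE_MAP.items():
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    print(normalize_tones("hóa học"))
```

A plain substring replacement like this ignores word-level context, so a production script would likely need a fuller mapping table and more careful handling than this sketch provides.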

THE DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE DATA OR THE USE OR OTHER DEALINGS IN THE
DATA.

Copyright (c) 2021 VinAI