This repository contains utility tools for Cloud AutoML Translation. With these tools, you can validate, convert, count, or randomly autosplit datasets before uploading them to AutoML.
- Check out the repository.
- Optional: create and activate a new virtual env.
- Install libraries.
```sh
git clone https://github.com/GoogleCloudPlatform/automl-translation-tools.git
cd automl-translation-tools
virtualenv env
. env/bin/activate
pip install -r automl/requirements.txt
```
```sh
git clone https://github.com/GoogleCloudPlatform/automl-translation-tools.git
cd automl-translation-tools
# To run the tool, replace `python parser.py [FLAGS]` in the examples below with:
bazel run automl:parser -- [FLAGS]
# To run the tests:
bazel test automl/...
```
This tool validates whether a tsv/tmx file is valid. Each line of a tsv file is split on `\t`, and a valid tsv line contains exactly two tab-separated sentences (one sentence pair).
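The tsv rule above can be sketched as a small predicate. This is a minimal illustration of the stated rule, not the repository's actual implementation; the function name is hypothetical.

```python
def is_valid_tsv_line(line):
    """Return True if the line splits on tab into exactly two non-empty fields."""
    fields = line.rstrip("\n").split("\t")
    return len(fields) == 2 and all(f.strip() for f in fields)

# Example usage:
print(is_valid_tsv_line("Hello world\tBonjour le monde"))  # a valid pair
print(is_valid_tsv_line("only one column"))                # missing the target side
print(is_valid_tsv_line("a\tb\tc"))                        # too many fields
```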
Currently, the parser handles a subset of the TMX spec:
- `<tmx>` is required and should wrap all the content.
- `<header>` is required and should be the first element inside `<tmx>`. However, the parser will not return an error when `<tmx>` is empty (e.g. `<tmx></tmx>`). The `srclang` attribute is required; all other attributes are optional for now.
- `<body>` is required and should come right after the `<header>` element. However, the parser will not return an error when there is no `<body>` element.
- `<tu>` is an element inside `<body>`. Each `<tu>` contains one (src_lang, dst_lang) pair and is expected to have two `<tuv>` elements.
- `<tuv>` is an element inside `<tu>`. The `xml:lang` attribute is required. Each `<tuv>` is expected to have one `<seg>` containing the phrase.
- `<seg>` contains the parallel phrase in either the source or the target language.
- Other unsupported tags (e.g. `<entry_metadata>`) are skipped.
- For each `<tu>` from which a (src_lang, dst_lang) pair cannot be parsed, the `<tu>` is skipped and its info is appended to the `_skipped_phrases` list.
Example TMX structure:

```xml
<tmx>
  <header srclang="en" />
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>XXX</seg>
      </tuv>
      <tuv xml:lang="zh">
        <seg>???</seg>
      </tuv>
    </tu>
  </body>
</tmx>
```
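The parsing rules above can be sketched with the standard library's `xml.etree.ElementTree`. This is a simplified illustration of the rules, not the repository's parser; the function name and the returned `(pairs, skipped)` shape are assumptions. Note that ElementTree exposes `xml:lang` under the XML namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree maps the `xml:` prefix to this well-known namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_string, src_lang, dst_lang):
    """Collect (src, dst) phrase pairs; skip <tu> elements missing either side."""
    root = ET.fromstring(tmx_string)
    pairs, skipped = [], []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            lang = tuv.get(XML_LANG)
            if seg is not None and lang:
                segs[lang] = seg.text or ""
        if src_lang in segs and dst_lang in segs:
            pairs.append((segs[src_lang], segs[dst_lang]))
        else:
            # Cannot form a (src_lang, dst_lang) pair: record and move on.
            skipped.append(ET.tostring(tu, encoding="unicode"))
    return pairs, skipped

EXAMPLE = """<tmx><header srclang="en"/><body>
<tu><tuv xml:lang="en"><seg>XXX</seg></tuv>
    <tuv xml:lang="zh"><seg>???</seg></tuv></tu>
</body></tmx>"""

print(extract_pairs(EXAMPLE, "en", "zh"))
```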
```sh
# You can specify multiple input files.
python parser.py \
  --cmd=validate \
  --input_files=$INPUT_FILE \
  --src_lang_code=en \
  --dst_lang_code=zh
```
If the number of sentence pairs is smaller than 100k, the dataset is randomly autosplit with an 8:1:1 ratio. Otherwise, the dataset is split into:
- Training: total_number - 20k
- Validation: 10k
- Test: 10k
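The split rules above can be sketched as follows. This is a minimal illustration of the stated size rules, assuming a uniform random shuffle; the function name and seed handling are hypothetical, not the tool's implementation.

```python
import random

def autosplit(pairs, seed=0):
    """Shuffle and split pairs into (train, validation, test) per the rules above."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    if n < 100_000:
        n_valid = n_test = n // 10   # 8:1:1 ratio
    else:
        n_valid = n_test = 10_000    # fixed 10k validation and test sets
    test = pairs[:n_test]
    valid = pairs[n_test:n_test + n_valid]
    train = pairs[n_test + n_valid:]
    return train, valid, test

# Example usage: 100 pairs -> 80/10/10.
train, valid, test = autosplit(range(100))
print(len(train), len(valid), len(test))
```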
```sh
python parser.py \
  --cmd=autosplit \
  --input_files=$INPUT_FILE \
  --train_dataset=$TRAINING_DATASET_OUTPUT \
  --validation_dataset=$VALIDATION_DATASET_OUTPUT \
  --test_dataset=$TEST_DATASET_OUTPUT
```
This tool converts files between the tsv and tmx formats.
```sh
python parser.py \
  --cmd=convert \
  --input_files=$INPUT_FILE \
  --src_lang_code=en \
  --dst_lang_code=zh \
  --output_file=$OUTPUT_FILE
```
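One direction of the conversion (tsv to the TMX structure shown earlier) can be sketched with `xml.etree.ElementTree`. This is an illustrative sketch, not the tool's converter; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tsv_to_tmx(tsv_lines, src_lang, dst_lang):
    """Build a minimal TMX document from tab-separated sentence pairs."""
    tmx = ET.Element("tmx")
    ET.SubElement(tmx, "header", srclang=src_lang)
    body = ET.SubElement(tmx, "body")
    for line in tsv_lines:
        src, dst = line.rstrip("\n").split("\t")
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (dst_lang, dst)):
            tuv = ET.SubElement(tu, "tuv")
            tuv.set(XML_LANG, lang)  # serialized back as xml:lang="..."
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

# Example usage:
print(tsv_to_tmx(["Hello\tBonjour"], "en", "fr"))
```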
This tool counts the number of sentence pairs in the input files.
```sh
python parser.py \
  --cmd=count \
  --input_files=$INPUT_FILE \
  --src_lang_code=en \
  --dst_lang_code=zh
```