# Cloud AutoML Translation Tools

This repository contains utility tools for Cloud AutoML Translation. With these tools, you can validate, convert, count, or randomly autosplit a dataset before uploading it to AutoML.

## How to use this tool

  1. Check out the repository.
  2. Optional: create and activate a new virtualenv.
  3. Install the required libraries.

```sh
git clone https://github.com/GoogleCloudPlatform/automl-translation-tools.git
cd automl-translation-tools
virtualenv env
. env/bin/activate
pip install -r automl/requirements.txt
```

Or use Bazel:

```sh
git clone https://github.com/GoogleCloudPlatform/automl-translation-tools.git
cd automl-translation-tools
# Run the tool, replacing `python parser.py [FLAGS]` with:
bazel run automl:parser -- [FLAGS]
# Run the tests:
bazel test automl/...
```

## Validate input file

This command validates whether a TSV/TMX file is well-formed.

### Valid tsv file

Each line of the TSV file is split on `\t`. A valid line contains exactly two tab-separated segments: the source sentence and the target sentence (one sentence pair).
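The line check above can be sketched as follows (a minimal illustration under the stated rule; the function name is hypothetical, not the repository's actual implementation):

```python
def is_valid_tsv_line(line: str) -> bool:
    """A line is valid when splitting on tab yields exactly two non-empty segments."""
    segments = line.rstrip("\n").split("\t")
    return len(segments) == 2 and all(seg.strip() for seg in segments)

# One source/target pair per line is valid; missing or extra columns are not.
print(is_valid_tsv_line("Hello world\t你好，世界"))  # True
print(is_valid_tsv_line("Hello world"))              # False
print(is_valid_tsv_line("a\tb\tc"))                  # False
```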

### Valid tmx file

Currently, the parser handles a subset of the TMX spec:

  1. The `<tmx>` element is required and should wrap all of the content.
  2. `<header>` is required and should be the first element inside the `<tmx>` element, but the parser will not return an error when `<tmx>` is empty (e.g. `<tmx></tmx>`). The `srclang` attribute is required; all other attributes are optional for now.
  3. `<body>` is required and should come right after the `<header>` element, but the parser will not return an error when there is no `<body>` element.
  4. `<tu>` elements sit inside `<body>`. Each `<tu>` holds one (src_lang, dst_lang) pair and is expected to have two `<tuv>` elements.
  5. `<tuv>` elements sit inside `<tu>`. The `xml:lang` attribute is required. Each `<tuv>` is expected to have one `<seg>` containing the phrase.
  6. `<seg>` contains the parallel phrase in either the source or the target language.
  7. Other, unsupported tags (e.g. `<entry_metadata>`) are skipped.
  8. For each `<tu>` from which a (src_lang, dst_lang) pair cannot be parsed, that `<tu>` is skipped and its info is appended to the `_skipped_phrases` list.

Example TMX structure:

```xml
<tmx>
  <header srclang="en" />
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>XXX</seg>
      </tuv>
      <tuv xml:lang="zh">
        <seg>???</seg>
      </tuv>
    </tu>
  </body>
</tmx>
```

```sh
# You can specify multiple input files.
python parser.py              \
    --cmd=validate            \
    --input_files=$INPUT_FILE \
    --src_lang_code=en        \
    --dst_lang_code=zh
```
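The pair-extraction rules above can be illustrated with a short sketch built on Python's standard `xml.etree.ElementTree` (an assumption made for illustration; the repository's parser may be implemented differently):

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to the full XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_text, src_lang, dst_lang):
    """Collect (src, dst) segments from each <tu>; skip unparseable <tu> elements."""
    root = ET.fromstring(tmx_text)
    pairs, skipped = [], []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                segs[lang] = seg.text or ""
        if src_lang in segs and dst_lang in segs:
            pairs.append((segs[src_lang], segs[dst_lang]))
        else:
            skipped.append(ET.tostring(tu, encoding="unicode"))
    return pairs, skipped

tmx = """<tmx><header srclang="en"/><body>
  <tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
      <tuv xml:lang="zh"><seg>你好</seg></tuv></tu>
</body></tmx>"""
print(extract_pairs(tmx, "en", "zh")[0])  # [('Hello', '你好')]
```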

## Randomly autosplit dataset

If the number of sentence pairs is smaller than 100k, the dataset is randomly split with an 8:1:1 ratio. Otherwise, the dataset is split into:

  • Training: total_number - 20k
  • Validation: 10k
  • Test: 10k

```sh
python parser.py                                     \
    --cmd=autosplit                                  \
    --input_files=$INPUT_FILE                        \
    --train_dataset=$TRAINING_DATASET_OUTPUT         \
    --validation_dataset=$VALIDATION_DATASET_OUTPUT  \
    --test_dataset=$TEST_DATASET_OUTPUT
```
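The split rule above can be sketched as follows (a minimal illustration of the stated sizes; the function names and the seeded shuffle are assumptions, not the repository's actual code):

```python
import random

def split_sizes(total):
    """8:1:1 below 100k pairs; otherwise a fixed 10k validation and 10k test."""
    if total < 100_000:
        validation = test = total // 10
    else:
        validation = test = 10_000
    return total - validation - test, validation, test

def autosplit(pairs, seed=0):
    """Shuffle, then slice into train/validation/test according to split_sizes."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # reproducible random assignment
    n_train, n_val, _ = split_sizes(len(pairs))
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]

print(split_sizes(1_000))    # (800, 100, 100)
print(split_sizes(500_000))  # (480000, 10000, 10000)
```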

## Convert file format

This command converts files between the TSV and TMX formats.

```sh
python parser.py              \
    --cmd=convert             \
    --input_files=$INPUT_FILE \
    --src_lang_code=en        \
    --dst_lang_code=zh        \
    --output_file=$OUTPUT_FILE
```
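One direction of the conversion (TSV to the TMX structure shown earlier) can be sketched like this (an illustrative helper, not the repository's converter; the function name is hypothetical):

```python
from xml.sax.saxutils import escape

def tsv_to_tmx(tsv_lines, src_lang, dst_lang):
    """Turn tab-separated sentence pairs into a minimal TMX document."""
    tus = []
    for line in tsv_lines:
        src, dst = line.rstrip("\n").split("\t")
        tus.append(
            "    <tu>\n"
            f'      <tuv xml:lang="{src_lang}"><seg>{escape(src)}</seg></tuv>\n'
            f'      <tuv xml:lang="{dst_lang}"><seg>{escape(dst)}</seg></tuv>\n'
            "    </tu>"
        )
    body = "\n".join(tus)
    return (f'<tmx>\n  <header srclang="{src_lang}" />\n'
            f"  <body>\n{body}\n  </body>\n</tmx>")

print(tsv_to_tmx(["Hello\t你好"], "en", "zh"))
```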

## Count the total number of sentence pairs

This command counts the number of sentence pairs in the input files.

```sh
python parser.py              \
    --cmd=count               \
    --input_files=$INPUT_FILE \
    --src_lang_code=en        \
    --dst_lang_code=zh
```
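For TSV input, the count amounts to tallying well-formed lines, which can be sketched as (an illustration under the two-segment rule above; not the repository's implementation):

```python
def count_pairs(lines):
    """Count sentence pairs: lines with exactly two tab-separated fields."""
    return sum(1 for line in lines if len(line.rstrip("\n").split("\t")) == 2)

sample = ["Hello\t你好\n", "malformed line\n", "Bye\t再见\n"]
print(count_pairs(sample))  # 2
```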