v2.2.0

@mjpost released this 25 Jul 21:12

This release contains an internal reworking of the data representations, contributed by @BrightXiaoHan. This enables the following features:

  • Added WMT21 datasets (which are properly XML-encoded)
  • Exposed corpus metadata via --echo (including origlang, docid, and genre, which are all available for most WMT corpora)

We also added a Korean tokenizer (--tok ko-mecab), contributed by @NoUnique.

In addition, there are a number of bug fixes and minor changes:

  • Empty references (#161) are now allowed. Some of our speech test sets could not be used before this was fixed!
  • We now recommend that people use the spm tokenizer, particularly for CJK languages.
  • Internally, the tarball downloads and the extracted test and metadata files now have globally unique names (e.g., .sacrebleu/wmt21/wmt_21.en-de.ref instead of .sacrebleu/wmt21/de-en.ref). The file extension corresponds to the field that gets passed to --echo.
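
To illustrate the new naming scheme, here is a minimal sketch that splits a cached file name into its parts. The helper `describe_cached_file` is hypothetical and not part of sacrebleu, and the `<testset>.<langpair>.<field>` layout is an assumption inferred only from the single example path above.

```python
from pathlib import PurePosixPath


def describe_cached_file(path: str) -> tuple[str, str, str]:
    """Split a cached file name into (test set, language pair, field).

    Hypothetical helper: assumes the name looks like
    <testset>.<langpair>.<field>, as in the example from these notes.
    """
    name = PurePosixPath(path).name          # e.g. "wmt_21.en-de.ref"
    parts = name.rsplit(".", 2)              # split off the last two dot-separated parts
    if len(parts) != 3:
        raise ValueError(f"unexpected file name layout: {name!r}")
    testset, langpair, field = parts
    return testset, langpair, field


if __name__ == "__main__":
    # The last element is the extension, i.e. the field name passed to --echo.
    print(describe_cached_file(".sacrebleu/wmt21/wmt_21.en-de.ref"))
```

An old-style name such as de-en.ref has only two parts, so this sketch rejects it with a ValueError rather than guessing the test set.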