v2.2.0
This release contains an internal reworking of the data representations, contributed by @BrightXiaoHan. This enables the following features:
- Added WMT21 datasets (which are properly XML-encoded)
- Exposed corpus metadata via --echo (including origlang, docid, and genre, which are all available for most WMT corpora)
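As a rough illustration of the metadata feature, the following Python sketch assembles a command line that asks for the source text together with two metadata fields. The helper name `build_echo_command` is hypothetical, and the sketch assumes that `--echo` accepts several field names in one call; if it does not, pass one field per invocation. It only builds and prints the command, it does not run it.

```python
import shlex


def build_echo_command(test_set, lang_pair, fields):
    """Return an argv list asking sacrebleu to print the given fields.

    Hypothetical helper for illustration; it is not part of sacrebleu.
    """
    if not fields:
        raise ValueError("at least one field is required")
    # Assumption: several field names may follow a single --echo flag.
    return ["sacrebleu", "-t", test_set, "-l", lang_pair, "--echo", *fields]


cmd = build_echo_command("wmt21", "en-de", ["src", "docid", "origlang"])
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` once the tool is installed and the test set has been downloaded.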
We also added a Korean tokenizer (--tok ko-mecab), contributed by @NoUnique.
In addition, there are a number of bug fixes and minor changes:
- Empty references (#161) are now allowed. Some of our speech test sets could not be used before this was fixed!
- We now recommend that people use the spm tokenizer, particularly for CJK languages.
- Internally, the tarball downloads and extracted test and metadata files now have names that are globally unique (e.g., .sacrebleu/wmt21/wmt_21.en-de.ref instead of .sacrebleu/wmt21/de-en.ref). The file extension corresponds to the field that gets passed to --echo.
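The naming scheme in the last item can be read off a file name mechanically. The sketch below splits the example path from these notes into its three dot-separated parts; the helper name `split_cache_name` is hypothetical, and the assumption that every extracted file follows a `prefix.langpair.field` pattern is drawn only from the single example above.

```python
from pathlib import PurePosixPath


def split_cache_name(path):
    """Split an extracted file name into (prefix, lang_pair, field).

    Hypothetical helper; assumes a 'prefix.langpair.field' file name.
    Raises ValueError if the name has fewer than three parts.
    """
    name = PurePosixPath(path).name
    # Split from the right so dots inside the prefix are left alone.
    prefix, lang_pair, field = name.rsplit(".", 2)
    return prefix, lang_pair, field


parts = split_cache_name(".sacrebleu/wmt21/wmt_21.en-de.ref")
print(parts)  # ('wmt_21', 'en-de', 'ref')
```

The last element is the extension, which per the note above matches the field name passed to --echo.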