Release v2.2.0 · mjpost/sacrebleu

This release contains an internal reworking of the data representations, contributed by @BrightXiaoHan. This enables the following features:

Added WMT21 datasets (which are properly XML-encoded)
Exposed corpus metadata via --echo (including origlang, docid, and genre, which are all available for most WMT corpora)
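
As a rough illustration of the --echo feature above, the sketch below only assembles a command line and does not run sacrebleu or download anything. The helper name echo_command is hypothetical, the en-de language pair and the ref field are borrowed from the file-name example in these notes, and passing several fields to a single --echo is an assumption; check `sacrebleu --help` for your installed version.

```python
import shlex


def echo_command(test_set, langpair, fields):
    """Build (but do not run) a sacrebleu call that echoes corpus fields.

    Hypothetical helper for illustration; it is not part of sacrebleu.
    """
    if not fields:
        raise ValueError("need at least one field to echo")
    # -t selects the test set, -l the language pair,
    # and --echo names the field(s) to print.
    # Assumption: several fields may follow one --echo flag.
    return ["sacrebleu", "-t", test_set, "-l", langpair, "--echo", *fields]


cmd = echo_command("wmt21", "en-de", ["ref", "docid", "origlang"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once sacrebleu is installed; building it as a list rather than a single string avoids shell-quoting problems.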

We also added a Korean tokenizer (--tok ko-mecab), contributed by @NoUnique.

In addition, there are a number of bug fixes and minor changes:

Empty references (#161) are now allowed. Some of our speech test sets could not be used before this was fixed!
We now recommend that people use the spm tokenizer, particularly for CJK languages.
Internally, the downloaded tarballs and the extracted test and metadata files now have globally unique names (e.g., .sacrebleu/wmt21/wmt_21.en-de.ref instead of .sacrebleu/wmt21/de-en.ref). The file extension corresponds to the field that gets passed to --echo.
