
Replicability of mteval-v14.pl -c --international-tokenization #138

Closed
martinpopel opened this issue Feb 19, 2021 · 6 comments

@martinpopel
Collaborator

In #46, @ozancaglayan reported:

I don't seem to get the mteval-v14.pl -c --international-tokenization results exactly

I am just creating a new issue for this. I'm not saying sacrebleu --tok intl must always return the same result as mteval-v14.pl -c --international-tokenization, but we should at least know the reason for the differences.

Let me comment on some differences between

https://github.com/moses-smt/mosesdecoder/blob/7dd812/scripts/generic/mteval-v14.pl#L954-L983

and

https://github.com/mjpost/sacrebleu/blob/7bd5d88/sacrebleu/tokenizers/tokenizer_intl.py#L73-L76

$norm_text =~ s/\p{Zl}/ /g;
# We don't need to join lines in sacrebleu

$norm_text =~ s/&quot;/\"/g;
$norm_text =~ s/&amp;/&/g;
...
# mteval expects sgm on the input, but instead of proper SGML/XML parsing it does this hack,
# which is really part of SGML parsing, not of tokenization.
# sacrebleu has plain text on the input, so de-escaping entities should not be done.

$norm_text =~ s/\p{Z}+/ /g; # one space only between words
# This has no effect because sacrebleu uses line.split(), not line.split(' ')
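The last point can be checked directly. A minimal stdlib sketch (sacrebleu itself uses the third-party `regex` module, and the function names here are illustrative) shows why collapsing Z-category characters first changes nothing when the split already handles any Unicode whitespace:

```python
import unicodedata

def mteval_style(text):
    # emulate mteval's s/\p{Zl}/ /g and s/\p{Z}+/ /g, then split on whitespace
    mapped = ''.join(' ' if unicodedata.category(c).startswith('Z') else c
                     for c in text)
    return mapped.split()

def sacrebleu_style(text):
    # line.split() with no argument already splits on runs of any Unicode
    # whitespace, so the \p{Z}+ collapsing step is redundant
    return text.split()

s = 'a\u2028b\u00a0\u00a0c'  # LINE SEPARATOR (Zl) and two NO-BREAK SPACEs (Zs)
assert mteval_style(s) == sacrebleu_style(s) == ['a', 'b', 'c']
```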
@ozancaglayan
Collaborator

I am now generating a bunch of scores with mteval-v14.pl to compare against.

@ozancaglayan
Collaborator

ozancaglayan commented Feb 19, 2021

I selected 1-2 systems per language pair, amongst the wmt17 submissions that are also used by the test.sh script. Funnily enough @martinpopel, only the en-cs and cs-en pairs turned out to differ, and only for one system, not for the online one. So it sounds like a difference in tokenizing Czech? (Although the source language should not matter, words may be copied verbatim into the target by NMT.)

newstest2017.online-A.0.en-cs.sgm
 mteval-v14a.pl -c        => 16.5523
 sacrebleu --tok 13a      => 16.5523
 mteval-v14a.pl -c [intl] => 16.6493
 sacrebleu --tok intl     => 16.6493
newstest2017.PJATK.4761.en-cs.sgm
 mteval-v14a.pl -c        => 16.1798
 sacrebleu --tok 13a      => 16.1798
 mteval-v14a.pl -c [intl] => 16.0849
 sacrebleu --tok intl     => 15.9111
newstest2017.PJATK.4760.cs-en.sgm
 mteval-v14a.pl -c        => 23.1520
 sacrebleu --tok 13a      => 23.1520
 mteval-v14a.pl -c [intl] => 24.9324
 sacrebleu --tok intl     => 23.4651
newstest2017.online-A.0.cs-en.sgm
 mteval-v14a.pl -c        => 25.1221
 sacrebleu --tok 13a      => 25.1221
 mteval-v14a.pl -c [intl] => 25.5598
 sacrebleu --tok intl     => 25.5598

@ozancaglayan
Collaborator

ozancaglayan commented Feb 19, 2021

OK, found the culprit: we can never be sure that an MT system's output does not contain escaped characters for punctuation, and that is exactly what happens here. We need to add those de-escaping bits to intl as well; we already have them for 13a.
By the way, 13a misses &apos;, but it was added to v14a-intl. Total mess =)
I'll fix this in my branch.
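A minimal sketch of the de-escaping step in question, assuming the entity list mteval uses (the helper name is illustrative, not sacrebleu's actual code); note that replacing &amp; before &lt;/&gt; can double-unescape strings like &amp;lt;:

```python
# De-escape the punctuation entities before tokenization, mirroring the
# substitutions in mteval-v14.pl; &apos; is the one this thread is about.
def unescape_mteval(line):
    for entity, char in (('&quot;', '"'),
                         ('&amp;', '&'),
                         ('&lt;', '<'),
                         ('&gt;', '>'),
                         ('&apos;', "'")):
        line = line.replace(entity, char)
    return line

assert unescape_mteval('The victim&apos;s brother') == "The victim's brother"
```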

@martinpopel
Collaborator Author

If a system really produced &apos;, but the reference says there should be an apostrophe, then we should not count it as a match, no matter what mteval says.
End users may not even know what &apos; means.

Of course, if there is &apos; in the sgm output, we should decode it as an apostrophe in the plain-text files (both references and system outputs), but this is a matter of SGML parsing, not of tokenization.

@ozancaglayan
Collaborator

ozancaglayan commented Feb 19, 2021

Well, it's a matter of choice. Seeing that we are trying our best to stay compatible with mteval, why not stay compatible in this case as well? Also, we already process these tokens in the default 13a tokenizer, i.e. although the SGM files would not include such strings, any system output with escaped symbols was already mapped to its surface form in sacreBLEU. I assume we will not change 13a to stop processing those tokens, so I think we need to make intl match even then.

ozancaglayan added a commit that referenced this issue Feb 19, 2021
This commit switches to regex module to make punctuation and symbol
processing a lot faster (4.5s -> 0.9s for the tested en-cs system)
during --tokenize intl tokenization (#46)

This also handles #138, a small de-escaping glitch that left sacreBLEU out of
sync with mteval-v14.pl --international-tokenization.
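The speed-up in that commit comes from doing the punctuation and symbol splitting with the `regex` module's `\p{P}`/`\p{S}` Unicode property classes. A simplified stdlib stand-in (the real intl rules also condition on adjacent digits, so this is only a sketch) pads the same character categories with spaces:

```python
import unicodedata

# Simplified intl-style rule: make every Unicode punctuation (P*) or
# symbol (S*) character its own token by padding it with spaces.
def pad_punct_and_symbols(line):
    padded = ''.join(f' {ch} ' if unicodedata.category(ch)[0] in 'PS' else ch
                     for ch in line)
    return ' '.join(padded.split())

assert pad_punct_and_symbols('Hello, world!') == 'Hello , world !'
```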
@martinpopel
Collaborator Author

We should fix the real cause, which is how someone converted sgm to txt in http://data.statmt.org/wmt17/translation-task/wmt17-submitted-data-v1.0.tgz.
When you check sgm/system-outputs/newstest2017/cs-en/newstest2017.PJATK.4760.cs-en.sgm, there is
<seg id="3">The victim&apos;s brother ...
so the PJATK system correctly translated the sentence as The victim's brother ... and the PJATK authors correctly wrapped that into sgm as The victim&apos;s brother ....
Unfortunately, some WMT organizers incorrectly converted this to txt/system-outputs/newstest2017/cs-en/newstest2017.PJATK.4760.cs-en
with the entity left unescaped, as The victim&apos;s brother ....
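For illustration, a correct sgm-to-txt conversion would decode the entities while extracting the segments. This is a hypothetical helper, not the actual WMT conversion script:

```python
import html
import re

# Extract the text of a <seg> element and decode XML entities,
# so &apos; becomes a real apostrophe in the plain-text output.
def seg_to_text(sgm_line):
    m = re.search(r'<seg id="\d+">(.*?)</seg>', sgm_line)
    return html.unescape(m.group(1)) if m else None

line = '<seg id="3">The victim&apos;s brother went home.</seg>'
assert seg_to_text(line) == "The victim's brother went home."
```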

Luckily, there is a recent initiative by @mjpost et al. to re-package all WMT references and system submissions, fixing these problems.

I think no WMT news test reference contains &apos;, &amp;, etc. Even the WMT2016 IT-domain test set contains no sentences about programming or HTML coding with such escapes (there are some occurrences of &#8220; etc., but these are handled by neither mteval nor sacrebleu).

So I think we should also fix the 13a tokenization code in sacrebleu to not do any &apos;, &amp;, etc. conversions, as soon as the plain-text system submissions at statmt.org are fixed.
I don't care much about mteval, but someone could send its maintainers a patch (mteval-v15?) that does proper SGML parsing, fixing their tokenization code as well.
