
Replicability of mteval-v14.pl -c --international-tokenization #138

Closed
martinpopel opened this issue Feb 19, 2021 · 6 comments

@martinpopel
Collaborator

In #46, @ozancaglayan reported:

I don't seem to get the mteval-v14.pl -c --international-tokenization results exactly

I am just creating a new issue for this. I'm not saying sacrebleu --tok intl must always return the same result as mteval-v14.pl -c --international-tokenization, but we should at least know the reason for the differences.

Let me comment on some differences between

https://github.com/moses-smt/mosesdecoder/blob/7dd812/scripts/generic/mteval-v14.pl#L954-L983

and

https://github.com/mjpost/sacrebleu/blob/7bd5d88/sacrebleu/tokenizers/tokenizer_intl.py#L73-L76

$norm_text =~ s/\p{Zl}/ /g;
# We don't need to join lines in sacrebleu

$norm_text =~ s/&quot;/\"/g;
$norm_text =~ s/&amp;/&/g;
...
# mteval expects sgm on the input, but instead of proper SGML/XML parsing it does this hack,
# which is really part of SGML parsing, not of tokenization.
# sacrebleu has plain text on the input, so de-escaping entities should not be done.

$norm_text =~ s/\p{Z}+/ /g; # one space only between words
# This has no effect because sacrebleu uses line.split(), not line.split(' ')
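The last point can be checked directly. A minimal stdlib sketch (sacrebleu itself uses the third-party `regex` module, and the function names here are illustrative) shows why collapsing Z-category characters first changes nothing when the split already handles any Unicode whitespace:

```python
import unicodedata

def mteval_style(text):
    # emulate mteval's s/\p{Zl}/ /g and s/\p{Z}+/ /g, then split on whitespace
    mapped = ''.join(' ' if unicodedata.category(c).startswith('Z') else c
                     for c in text)
    return mapped.split()

def sacrebleu_style(text):
    # line.split() with no argument already splits on runs of any Unicode
    # whitespace, so the \p{Z}+ collapsing step is redundant
    return text.split()

s = 'a\u2028b\u00a0\u00a0c'  # LINE SEPARATOR (Zl) and two NO-BREAK SPACEs (Zs)
assert mteval_style(s) == sacrebleu_style(s) == ['a', 'b', 'c']
```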
@ozancaglayan
Collaborator

I am now generating a bunch of scores with mteval-v14.pl to compare against.

@ozancaglayan
Collaborator

ozancaglayan commented Feb 19, 2021

I selected 1-2 systems per language pair, amongst the wmt17 submissions that are also used by the test.sh script. Funnily enough @martinpopel, only the en-cs and cs-en pairs turned out to differ, and only for one system, not for the online one. So it sounds like a difference in tokenizing Czech? (Although the source language should not matter, words may be copied verbatim into the target by NMT.)

newstest2017.online-A.0.en-cs.sgm
 mteval-v14a.pl -c        => 16.5523
 sacrebleu --tok 13a      => 16.5523
 mteval-v14a.pl -c [intl] => 16.6493
 sacrebleu --tok intl     => 16.6493
newstest2017.PJATK.4761.en-cs.sgm
 mteval-v14a.pl -c        => 16.1798
 sacrebleu --tok 13a      => 16.1798
 mteval-v14a.pl -c [intl] => 16.0849
 sacrebleu --tok intl     => 15.9111
newstest2017.PJATK.4760.cs-en.sgm
 mteval-v14a.pl -c        => 23.1520
 sacrebleu --tok 13a      => 23.1520
 mteval-v14a.pl -c [intl] => 24.9324
 sacrebleu --tok intl     => 23.4651
newstest2017.online-A.0.cs-en.sgm
 mteval-v14a.pl -c        => 25.1221
 sacrebleu --tok 13a      => 25.1221
 mteval-v14a.pl -c [intl] => 25.5598
 sacrebleu --tok intl     => 25.5598

@ozancaglayan
Collaborator

ozancaglayan commented Feb 19, 2021

OK, found the culprit: we can never be sure that an MT system's output does not contain escaped characters for punctuation, and that is exactly what happens here. We need to add those de-escaping bits to intl as well; we already have them for 13a.
By the way, 13a misses &apos;, but it was added to v14a-intl. Total mess =)
I'll fix this in my branch.
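A minimal sketch of the de-escaping step in question, assuming the entity list mteval uses (the helper name is illustrative, not sacrebleu's actual code); note that replacing &amp; before &lt;/&gt; can double-unescape strings like &amp;lt;:

```python
# De-escape the punctuation entities before tokenization, mirroring the
# substitutions in mteval-v14.pl; &apos; is the one this thread is about.
def unescape_mteval(line):
    for entity, char in (('&quot;', '"'),
                         ('&amp;', '&'),
                         ('&lt;', '<'),
                         ('&gt;', '>'),
                         ('&apos;', "'")):
        line = line.replace(entity, char)
    return line

assert unescape_mteval('The victim&apos;s brother') == "The victim's brother"
```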

@martinpopel
Collaborator Author

If a system really produced &apos;, but the reference says there should be an apostrophe, then we should not count it as a match, no matter what mteval says.
End users may not even know what &apos; means.

Of course, if there is &apos; in the sgm output, we should decode it as an apostrophe in the plain-text files (both references and system outputs), but this is a matter of SGML parsing, not of tokenization.

@ozancaglayan
Collaborator

ozancaglayan commented Feb 19, 2021

Well, it's a matter of choice. Seeing that we are trying our best to stay compatible with mteval, why not stay compatible in this case as well? Also, we already process these tokens in the default 13a tokenizer, i.e. although the SGM files would not include such strings, any system output with escaped symbols was already mapped to its surface form in sacreBLEU. I assume we will not change 13a to stop processing those tokens, so I think we need to make intl match even then.

ozancaglayan added a commit that referenced this issue Feb 19, 2021
This commit switches to regex module to make punctuation and symbol
processing a lot faster (4.5s -> 0.9s for the tested en-cs system)
during --tokenize intl tokenization (#46)

This also handles #138, a small de-escaping glitch that left sacreBLEU out of
sync with mteval-v14.pl --international-tokenization.
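The speed-up in that commit comes from doing the punctuation and symbol splitting with the `regex` module's `\p{P}`/`\p{S}` Unicode property classes. A simplified stdlib stand-in (the real intl rules also condition on adjacent digits, so this is only a sketch) pads the same character categories with spaces:

```python
import unicodedata

# Simplified intl-style rule: make every Unicode punctuation (P*) or
# symbol (S*) character its own token by padding it with spaces.
def pad_punct_and_symbols(line):
    padded = ''.join(f' {ch} ' if unicodedata.category(ch)[0] in 'PS' else ch
                     for ch in line)
    return ' '.join(padded.split())

assert pad_punct_and_symbols('Hello, world!') == 'Hello , world !'
```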
@martinpopel
Collaborator Author

We should fix the real cause, which is how someone converted sgm to txt in http://data.statmt.org/wmt17/translation-task/wmt17-submitted-data-v1.0.tgz.
When you check sgm/system-outputs/newstest2017/cs-en/newstest2017.PJATK.4760.cs-en.sgm, there is
<seg id="3">The victim&apos;s brother ...
so the PJATK system correctly translated the sentence as The victim's brother ... and the PJATK authors correctly wrapped that into sgm as The victim&apos;s brother ....
Unfortunately, some WMT organizers incorrectly converted this to txt/system-outputs/newstest2017/cs-en/newstest2017.PJATK.4760.cs-en
with the entity left unescaped, as The victim&apos;s brother ....
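For illustration, a correct sgm-to-txt conversion would decode the entities while extracting the segments. This is a hypothetical helper, not the actual WMT conversion script:

```python
import html
import re

# Extract the text of a <seg> element and decode XML entities,
# so &apos; becomes a real apostrophe in the plain-text output.
def seg_to_text(sgm_line):
    m = re.search(r'<seg id="\d+">(.*?)</seg>', sgm_line)
    return html.unescape(m.group(1)) if m else None

line = '<seg id="3">The victim&apos;s brother went home.</seg>'
assert seg_to_text(line) == "The victim's brother went home."
```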

Luckily, there is a recent initiative by @mjpost et al. to re-package all WMT references and system submissions, fixing these problems.

I think no WMT news test reference contains &apos;, &amp;, etc. Even the WMT2016 IT-domain test set contains no sentences about programming or HTML coding with such escapes (there are some occurrences of &#8220; etc., but these are handled by neither mteval nor sacrebleu).

So I think we should also fix the 13a tokenization code in sacrebleu to not do any &apos;, &amp;, etc. conversions, as soon as the plain-text system submissions at statmt.org are fixed.
I don't care much about mteval, but someone could send its maintainers a patch (mteval-v15?) that does proper SGML parsing, fixing their tokenization code as well.
