Replicability of mteval-v14.pl -c --international-tokenization #138
I am now generating a bunch of scores with |
I selected 1-2 systems per each language pair, amongst the wmt17 submissions that are also used by the
|
Ok found the culprit: We can never be sure that an MT system's output does not contain escaped chars for punctuations, and here this is what happens. We need to add those bits to |
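The de-escaping step mentioned in this comment can be sketched roughly as follows. This is only an illustration, assuming the four XML entities that mteval-v14.pl is understood to replace before tokenizing (`&quot;`, `&amp;`, `&lt;`, `&gt;`). The function name `unescape_xml_entities` is hypothetical and is not part of sacrebleu.

```python
# Entity/character pairs, applied in this order.
# ASSUMPTION: this list mirrors what mteval-v14.pl de-escapes; check the
# Perl script before relying on it.
_XML_ENTITIES = (
    ("&quot;", '"'),
    ("&amp;", "&"),
    ("&lt;", "<"),
    ("&gt;", ">"),
)


def unescape_xml_entities(line):
    """Replace escaped punctuation in a system output line (hypothetical helper)."""
    for entity, char in _XML_ENTITIES:
        line = line.replace(entity, char)
    return line


print(unescape_xml_entities("He said &quot;R&amp;D&quot;"))
```

Python's `html.unescape` would decode many more entities than these four, so it would probably not stay in sync with the Perl script.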
If a system really produced Of course, if there is |
Well it's a matter of choice. Seen that we are trying our best to stay compatible with |
This commit switches to the regex module to make punctuation and symbol processing a lot faster (4.5s -> 0.9s for the tested en-cs system) during --tokenize intl tokenization (#46). This also handles #138, a small de-escaping glitch that leaves sacreBLEU out of sync with mteval14.pl --international-tokenization.
We should fix the real cause and that is how someone converted sgm to txt in http://data.statmt.org/wmt17/translation-task/wmt17-submitted-data-v1.0.tgz
Luckily, there is a recent initiative by @mjpost et al. to re-package all WMT references and system submissions, fixing these problems.
I think no WMT news test reference contains So I think we should fix also the 13a tokenization code in sacrebleu to not do any |
In #46, @ozancaglayan reported:
I am just creating a new issue for this. I don't say `sacrebleu --tok intl` must always return the same result as `mteval-v14.pl -c --international-tokenization`, but we should at least know what is the reason for the differences.

Let me comment on some differences between
https://github.com/moses-smt/mosesdecoder/blob/7dd812/scripts/generic/mteval-v14.pl#L954-L983
and
https://github.com/mjpost/sacrebleu/blob/7bd5d88/sacrebleu/tokenizers/tokenizer_intl.py#L73-L76
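For readers without the two files open, here is a minimal sketch of the kind of rules being compared. It assumes the international tokenization boils down to three substitutions: split punctuation from a preceding non-number, split punctuation from a following non-number, and isolate symbols. It uses only the standard library (`re` plus `unicodedata`) instead of `\p{...}` properties. The name `tokenize_intl_sketch` is hypothetical, and neither linked implementation is guaranteed to behave exactly like this.

```python
import re
import sys
import unicodedata


def _build_char_classes():
    """Collect regex-escaped characters for Unicode major categories P, S and N."""
    buckets = {"P": [], "S": [], "N": []}
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        major = unicodedata.category(char)[0]
        if major in buckets:
            buckets[major].append(re.escape(char))
    return {key: "".join(chars) for key, chars in buckets.items()}


_CLASSES = _build_char_classes()
_PUNCT = "[" + _CLASSES["P"] + "]"
_SYMBOL = "[" + _CLASSES["S"] + "]"
_NON_NUMBER = "[^" + _CLASSES["N"] + "]"

# ASSUMPTION: these three rules approximate the international tokenization;
# they are a reading aid, not a copy of either implementation.
_RULES = [
    (re.compile("(" + _NON_NUMBER + ")(" + _PUNCT + ")"), r"\1 \2 "),
    (re.compile("(" + _PUNCT + ")(" + _NON_NUMBER + ")"), r" \1 \2"),
    (re.compile("(" + _SYMBOL + ")"), r" \1 "),
]


def tokenize_intl_sketch(line):
    """Apply the three rules in order, then collapse whitespace."""
    for pattern, replacement in _RULES:
        line = pattern.sub(replacement, line)
    return " ".join(line.split())


print(tokenize_intl_sketch("Hello, world!"))
print(tokenize_intl_sketch("&quot;hi&quot;"))
```

Under these rules a line that still contains an escaped entity such as `&quot;` is split into several tokens, whereas the de-escaped quote would be a single token. That is one way the two tools can disagree on the same system output.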