SGML entities de-escaping in tokenization #15
Comments
Nice catch. I'm going to fix this instead of removing it, though, because my goal with tokenization_13a is to replicate mteval_13a.pl, which is used by Moses. This bug has no effect on the WMT datasets, which haven't used these codes since 2009. There are, however, other test sets that I may add someday that do use them (for example, the Chinese NIST datasets from the mid-2000s).
If there are *.sgm test sets which use &quot; etc., then these should already be de-escaped when exporting the SGML to plain text.
This seems to be already handled, right?
The obvious bug was fixed (converting all entities to double quotes), but the de-escaping of SGML entities (converting
@mjpost can we have a decision on this? Martin strongly disagrees on keeping these in
I agree that this is ugly, but I do not think that we should change the behavior of the tokenizer. We do not want a situation where someone upgrades sacrebleu and suddenly gets different results. Furthermore, there are many other things wrong with the tokenizer. I think it would be good to see sacrebleu move to using a subword vocabulary model (#118). Note also that I'm working with Barry and Ondrej to repackage all WMT test sets in proper XML format, so this issue will go away from the data side. I think that can be part of the 2.0 release.
I agree. So let's keep v13a buggy, but don't introduce the bug into
https://github.com/mjpost/sacreBLEU/blob/b38690e1537cd4719c3517ef77c8255c5a107cc8/sacrebleu.py#L396-L399
First, there is a bug: all four entities (&quot; &amp; &lt; &gt;) are converted to double quotes. This is probably a copy-paste error (it is not present in the original Perl implementation).
Second, I think we should delete this completely. The de-escaping was needed in the original implementation because the translation and reference files were in SGML (*.sgm) format. SacréBLEU expects plain-text input (or API calls), so this is not needed. I think it is the responsibility of a modern MT system to clean the data (ideally the training data) and produce human-readable sentences (i.e. without escaped HTML/SGML entities).
Similarly, if the input format is expected to be one sentence per line, there is no need for replace('-\n', ''). It has no effect when the string contains no newlines; it just obfuscates the code.
And guess what my opinion is on replace('<skipped>', '').
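As a rough illustration of why these two replacements are no-ops for one-sentence-per-line input, the sketch below applies them to a single-line sentence and to a string that still carries SGML-style residue. The helper name `strip_sgml_residue` and the sample strings are made up for this example.

```python
def strip_sgml_residue(line: str) -> str:
    # The two replacements discussed above, applied in sequence.
    line = line.replace("<skipped>", "")
    line = line.replace("-\n", "")
    return line


# A plain-text sentence on a single line: nothing to remove.
sentence = "A state-of-the-art system translated this sentence."
unchanged = strip_sgml_residue(sentence) == sentence

# Only input with a hyphenated line break or a <skipped> tag is affected.
wrapped = "transla-\ntion <skipped>quality"
rejoined = strip_sgml_residue(wrapped)
```

Ordinary hyphens inside a line are left alone, since only a hyphen immediately followed by a newline matches.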