Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SGML entities de-escaping in tokenization #15

Closed
martinpopel opened this issue Nov 22, 2017 · 8 comments
Closed

SGML entities de-escaping in tokenization #15

martinpopel opened this issue Nov 22, 2017 · 8 comments

Comments

@martinpopel
Copy link
Collaborator

https://github.com/mjpost/sacreBLEU/blob/b38690e1537cd4719c3517ef77c8255c5a107cc8/sacrebleu.py#L396-L399

First, there is a bug: all four entities " & < > are converted to double quotes. Probably a copy-paste error (not present in the original Perl implementation).

Second, I think we should delete this completely. The de-escaping was needed in the original implementation because the translation and reference files were in SGML (*.sgm) format. SacréBLEU expects plain-text input (or API calls), so this is not needed. I think it is a responsibility of a modern MT system to clean the data (ideally the training data) and produce human-readable sentences (i.e. without escaped html/sgml entities).

Similarly, if the input format is expected to be one sentence per line, there is no need for replace('-\n', ''), but this does not matter if there are no newlines in the string, it just obfuscates the code.

And guess what is my opinion on replace('<skipped>', '').

@mjpost
Copy link
Owner

mjpost commented Nov 22, 2017

Nice catch. I'm going to fix this instead of remove it, though, because my goal with the tokenization_13a is to replicate mteval_13a.pl, which is used by Moses. This bug has no effect on the WMT datasets, which hasn't used these codes since 2009. There are, however, other test sets that I may add someday that do use these (for example, the Chinese NIST datasets from the mid-2000s).

@martinpopel
Copy link
Collaborator Author

If there are *.sgm test sets which use " etc. then these should be de-escaped already when exporting the sgml to plain-text.

@ozancaglayan
Copy link
Collaborator

this seems to be already handled, right?

@martinpopel
Copy link
Collaborator Author

The obvious bug was fixed (converting all entities to double quotes), but the de-escaping of SGML entities (converting &amp; to &, &lt; to < etc.) is still present in the default tokenizer.
What I suggested in this issue is that the entities should be de-escaped when reading SGML/XML references, but it should not be part of the tokenizer. This is still not done, which is why the issue is still open.

@martinpopel martinpopel mentioned this issue Jul 17, 2020
@ozancaglayan
Copy link
Collaborator

@mjpost can we have a decision on this? Martin strongly disagrees on keeping these in v13a tokenizer. In #138 I actually added the same things to the intl tokenizer to make it exactly equivalent to mteval-v14.pl. I still lean towards having them in both tokenizers to satisfy the initial objective of sacreBLEU, that is obtaining the same scores with those evaluation scripts.

@mjpost
Copy link
Owner

mjpost commented Feb 25, 2021

I agree that this is ugly, but I do not think that we should change the behavior of the tokenizer. We do not want a situation where someone upgrades sacrebleu and suddenly gets different results. Furthermore, there are many other things wrong with the v13a tokenizer (have you seen what it does to URLs?), and it doesn't make sense to fix just this one.

I think it to see sacrebleu move to using a subword vocabulary model (#118).

Note also that I'm working with Barry and Ondrej to repackage all WMT test sets in proper XML format, so this issue will go away from the data side. I think that can be part of the 2.0 release.

@martinpopel
Copy link
Collaborator Author

We do not want a situation where someone upgrades sacrebleu and suddenly gets different results.

I agree. So let's keep v13a buggy, but don't introduce the bug into -tok intl.

@ozancaglayan
Copy link
Collaborator

Okay. So if we say that we keep it for v13a and not introduce it for v14, we can close this issue and #138 . I'll revert my changes in the branch to undo the #138 fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants