SGML entities de-escaping in tokenization #15

martinpopel · 2017-11-22T09:16:04Z

https://github.com/mjpost/sacreBLEU/blob/b38690e1537cd4719c3517ef77c8255c5a107cc8/sacrebleu.py#L396-L399

First, there is a bug: all four entities &quot; &amp; &lt; &gt; are converted to double quotes. Probably a copy-paste error (not present in the original Perl implementation).
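To make the reported behaviour concrete, here is a minimal sketch contrasting the intended de-escaping with the bug. The function names are hypothetical and are not taken from sacrebleu.py; the replacement order is assumed to follow the Perl script.

```python
# Entity/replacement pairs; the order is an assumption modelled on the Perl script.
SGML_ENTITIES = [
    ("&quot;", '"'),
    ("&amp;", "&"),
    ("&lt;", "<"),
    ("&gt;", ">"),
]


def deescape_sgml_entities(text: str) -> str:
    """Hypothetical helper: replace each entity with its own character."""
    for entity, char in SGML_ENTITIES:
        text = text.replace(entity, char)
    return text


def buggy_deescape(text: str) -> str:
    """Reproduces the reported bug: every entity becomes a double quote."""
    for entity, _ in SGML_ENTITIES:
        text = text.replace(entity, '"')
    return text


sample = "&quot;R&amp;D&quot; &lt;b&gt;"
print(deescape_sgml_entities(sample))  # "R&D" <b>
print(buggy_deescape(sample))          # "R"D" "b"
```

Note that sequential replacement in this order turns a doubly escaped `&amp;lt;` into `<`, which is one more reason to do the de-escaping once, at a well-defined place.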

Second, I think we should delete this completely. The de-escaping was needed in the original implementation because the translation and reference files were in SGML (*.sgm) format. SacréBLEU expects plain-text input (or API calls), so this is not needed. I think it is a responsibility of a modern MT system to clean the data (ideally the training data) and produce human-readable sentences (i.e. without escaped html/sgml entities).

Similarly, if the input format is expected to be one sentence per line, there is no need for replace('-\n', ''). It does no harm when the string contains no newlines, but it obfuscates the code.

And guess what my opinion is on replace('<skipped>', '').


mjpost · 2017-11-22T19:07:08Z

Nice catch. I'm going to fix this rather than remove it, though, because my goal with the tokenization_13a is to replicate mteval_13a.pl, which is used by Moses. This bug has no effect on the WMT datasets, which haven't used these codes since 2009. There are, however, other test sets that I may add someday that do use these (for example, the Chinese NIST datasets from the mid-2000s).

martinpopel · 2017-11-22T20:03:46Z

If there are *.sgm test sets which use &quot; etc. then these should be de-escaped already when exporting the sgml to plain-text.

ozancaglayan · 2020-06-08T14:25:58Z

this seems to be already handled, right?

martinpopel · 2020-06-08T14:43:04Z

The obvious bug was fixed (converting all entities to double quotes), but the de-escaping of SGML entities (converting &amp; to &, &lt; to < etc.) is still present in the default tokenizer.
What I suggested in this issue is that the entities should be de-escaped when reading SGML/XML references, but it should not be part of the tokenizer. This is still not done, which is why the issue is still open.
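A rough sketch of that separation, assuming the SGML segments sit in `<seg>` elements: entities are de-escaped once, while the file is being converted to plain text, and the tokenizer never sees them. The helper name and the regular expression are hypothetical and not part of sacreBLEU. `html.unescape` from the standard library handles more entities than the four discussed here.

```python
import html
import re
from typing import List

# Assumption: each segment is wrapped as <seg ...>text</seg>.
SEG_RE = re.compile(r"<seg[^>]*>(.*?)</seg>", re.DOTALL | re.IGNORECASE)


def sgml_segments_to_plain_text(sgml: str) -> List[str]:
    """Hypothetical reader: extract segments and de-escape entities once."""
    return [html.unescape(match.group(1).strip()) for match in SEG_RE.finditer(sgml)]


sample = (
    '<doc docid="d1">\n'
    '<seg id="1">AT&amp;T said &quot;hi&quot;</seg>\n'
    '<seg id="2">a &lt; b</seg>\n'
    "</doc>"
)
for line in sgml_segments_to_plain_text(sample):
    print(line)
```

With the reader doing this work, plain-text input that legitimately contains a literal `&amp;` would pass through the tokenizer unchanged.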

ozancaglayan · 2021-02-24T19:42:37Z

@mjpost can we have a decision on this? Martin strongly disagrees on keeping these in v13a tokenizer. In #138 I actually added the same things to the intl tokenizer to make it exactly equivalent to mteval-v14.pl. I still lean towards having them in both tokenizers to satisfy the initial objective of sacreBLEU, that is obtaining the same scores with those evaluation scripts.

mjpost · 2021-02-25T20:19:56Z

I agree that this is ugly, but I do not think that we should change the behavior of the tokenizer. We do not want a situation where someone upgrades sacrebleu and suddenly gets different results. Furthermore, there are many other things wrong with the v13a tokenizer (have you seen what it does to URLs?), and it doesn't make sense to fix just this one.

I would like to see sacrebleu move to using a subword vocabulary model (#118).

Note also that I'm working with Barry and Ondrej to repackage all WMT test sets in proper XML format, so this issue will go away from the data side. I think that can be part of the 2.0 release.

martinpopel · 2021-02-25T20:23:44Z

> We do not want a situation where someone upgrades sacrebleu and suddenly gets different results.

I agree. So let's keep v13a buggy, but don't introduce the bug into -tok intl.

ozancaglayan · 2021-02-25T21:26:55Z

Okay. So if we say that we keep it for v13a and do not introduce it for v14, we can close this issue and #138. I'll revert my changes in the branch to undo the #138 fix.

martinpopel mentioned this issue Jul 17, 2020

Add TER #102

Merged

martinpopel closed this as completed Feb 25, 2021
