
TER above 100? #208

Closed
JoyeBright opened this issue Sep 9, 2022 · 12 comments

JoyeBright commented Sep 9, 2022

Dear all,

I am getting TER scores above 100 for some MT-Ref pairs. Shouldn't the scores be between 0 and 100? Is there a recent update to sacrebleu that changes the range? Any idea what is going on?

Examples:

  • Panna Crema per latticini 300.0
  • 3,4-dimetossibenzaldeide Benzaldeide, 3,4-dimetossi- 200.0
  • Beta-metil-naftilchetone Metil beta-naftil chetone 300.0
  • BSA-minore di-1,71-PET-BASSA BSA-Inferiore A 1,71-PET-LOW 150.0
  • Dove: In cui: 200.0
mjpost (Owner) commented Sep 9, 2022

What dataset is this?

JoyeBright (Author) commented:

Chemical patent data – not publicly available! Does this matter?

mjpost (Owner) commented Sep 12, 2022

Well, the best way to diagnose this is if I can run your exact command.

Can you share the invocation you're using, and maybe some sample data? How many references are you using?

bricksdont commented:

@JoyeBright you would need to show a reproducible example that demonstrates how to get a TER score above 100. Here is some advice: https://stackoverflow.com/help/minimal-reproducible-example

Either include code here or perhaps a link to a Colab. Make sure to mention the exact version of SacreBLEU or include a pip install command.

JoyeBright (Author) commented:

@mjpost, @bricksdont Below is the link to the Colab; you can see three examples that achieved TER above 100! https://colab.research.google.com/drive/13K16f9znwH_xYhVg9RoT0I0UiJADUuTA?usp=sharing

Any idea? Thank you!

bricksdont commented Oct 17, 2022

Thank you @JoyeBright,

This is a score scaling issue. If you install an earlier version, the resulting scores are the same, except 100 times smaller. Example:

from sacrebleu import TER
from argparse import Namespace

# TER options passed as an argparse Namespace (the 1.x-style constructor)
args = Namespace(
        normalized=False, no_punct=False,
        asian_support=False, case_sensitive=False)

ter = TER(args)

hyp = "Inoltre, a meno che altrimenti indicato, i disegni non sono in scala."
ref = "Tabella 16:"

# sacrebleu==2.2.1 (newest)
print(ter.sentence_score(hyp, [ref]).score)
# 600.0

# sacrebleu==1.5.0
print(ter.sentence_score(hyp, [ref]).score)
# 6.0

As a quick workaround you can just divide the scores by 100, but this is still an issue that should be fixed.
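
For completeness, here is a sketch of the same check using the top-level sentence_ter convenience helper (which wraps TER with its default options) instead of constructing the metric by hand; the exact output depends on the installed sacrebleu version:

import sacrebleu

hyp = "Inoltre, a meno che altrimenti indicato, i disegni non sono in scala."
ref = "Tabella 16:"

# Under sacrebleu 2.x this prints the percentage-style score (600.0 here),
# i.e. 100 times the 0-1 style value that 1.5.0 reports.
print(sacrebleu.sentence_ter(hyp, [ref]).score)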

JoyeBright (Author) commented:

Dear @bricksdont, thanks for spotting the problem!

bricksdont commented:

@JoyeBright maybe leave this issue open; I think it needs to be addressed.

martinpopel (Collaborator) commented:

I am reopening this issue, as it seems like quite a serious problem if all TER scores are reported 100 times higher than they should be.

martinpopel reopened this Oct 19, 2022
mjpost (Owner) commented Oct 19, 2022

It looks like this scaling was added for the 2.0 release: https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/ter.py#L129

I can revert that and do a 2.3.2 release.

martinpopel (Collaborator) commented Oct 19, 2022

Wait. It seems I was wrong - and the current implementation is OK. I don't have enough time now to double check everything, but:

In the original TER paper, TER is defined as # of edits / average # of reference words, which usually gives a number between 0 and 1 (but see below), and already in that original paper the scores are reported as percentages, i.e. numbers between 0 and 100.

However, if the prediction is longer than the reference and does not share any words with the reference, we need to first delete all the words from the prediction and then add the words from the reference, so the # of edits will be higher than the # of reference words and the TER score will be higher than 100%.

It seems there has been no attempt to clip the scores at 100%. For example, the original Java implementation (java -jar tercom.7.25.jar) seems to report numbers mostly between 0 and 1, but values higher than 1 are possible as well.

In SacreBLEU v2.0, we have decided to report TER as percentages, i.e. the formula 100 * (total_edits / sum_ref_lengths) seems to be OK.
In rare cases, we should expect TER scores higher than 100% (while an empty system prediction would get exactly 100%).
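
To make the arithmetic concrete, here is a small sketch with made-up sentences (not data from this issue): the hypothesis is longer than the reference and shares no words with it, so the number of edits exceeds the reference length and the score lands above 100.

import sacrebleu

hyp = "the quick brown fox jumps"  # 5 tokens
ref = "hello world"                # 2 tokens

# 2 substitutions + 3 deletions = 5 edits against a 2-word reference,
# so 100 * 5 / 2 = 250.0 is the expected TER under sacrebleu 2.x.
print(sacrebleu.sentence_ter(hyp, [ref]).score)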

mjpost (Owner) commented Oct 19, 2022

I was just looking into this myself and came to the same conclusion. See also #169 where @ozancaglayan notes this same phenomenon (also #140 and the release notes in #152).

I am going to close this again—it seems the implementation is correct.

mjpost closed this as completed Oct 19, 2022