
TER above 100? #208

Closed
JoyeBright opened this issue Sep 9, 2022 · 12 comments

JoyeBright commented Sep 9, 2022

Dear all,

I am getting TER scores above 100 for some MT-Ref pairs. Shouldn't the scores be between 0 and 100? Is there a recent update to sacrebleu that changes the range? Any idea what is going on?

Examples:

  • Panna Crema per latticini 300.0
  • 3,4-dimetossibenzaldeide Benzaldeide, 3,4-dimetossi- 200.0
  • Beta-metil-naftilchetone Metil beta-naftil chetone 300.0
  • BSA-minore di-1,71-PET-BASSA BSA-Inferiore A 1,71-PET-LOW 150.0
  • Dove: In cui: 200.0
mjpost (Owner) commented Sep 9, 2022

What dataset is this?

JoyeBright (Author) commented:

Chemical patent data – not publicly available! Does this matter?

mjpost (Owner) commented Sep 12, 2022

Well, the best way to diagnose this is if I can run your exact command.

Can you share the invocation you're using, and maybe some sample data? How many references are you using?

bricksdont commented:

@JoyeBright you would need to show a reproducible example that demonstrates how to get a TER score above 100. Here is some advice: https://stackoverflow.com/help/minimal-reproducible-example

Either include code here or perhaps a link to a Colab. Make sure to mention the exact version of SacreBLEU or include a pip install command.

JoyeBright (Author) commented:

@mjpost, @bricksdont Below is the link to the Colab; you can see three examples that achieved TER above 100! https://colab.research.google.com/drive/13K16f9znwH_xYhVg9RoT0I0UiJADUuTA?usp=sharing

Any idea? Thank you!

bricksdont commented Oct 17, 2022

Thank you @JoyeBright,

This is a score scaling issue. If you install an earlier version, the resulting scores are the same, except 100 times smaller. Example:

from sacrebleu import TER
from argparse import Namespace

# TER options passed as an argparse Namespace (the 1.x-style constructor)
args = Namespace(
        normalized=False, no_punct=False,
        asian_support=False, case_sensitive=False)

ter = TER(args)

hyp = "Inoltre, a meno che altrimenti indicato, i disegni non sono in scala."
ref = "Tabella 16:"

# sacrebleu==2.2.1 (newest)
print(ter.sentence_score(hyp, [ref]).score)
# 600.0

# sacrebleu==1.5.0
print(ter.sentence_score(hyp, [ref]).score)
# 6.0

As a quick workaround you can just divide the scores by 100, but this is still an issue that should be fixed.
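
For completeness, here is a sketch of the same check using the top-level sentence_ter convenience helper (which wraps TER with its default options) instead of constructing the metric by hand; the exact output depends on the installed sacrebleu version:

import sacrebleu

hyp = "Inoltre, a meno che altrimenti indicato, i disegni non sono in scala."
ref = "Tabella 16:"

# Under sacrebleu 2.x this prints the percentage-style score (600.0 here),
# i.e. 100 times the 0-1 style value that 1.5.0 reports.
print(sacrebleu.sentence_ter(hyp, [ref]).score)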

JoyeBright (Author) commented:

Dear @bricksdont, thanks for spotting the problem!

bricksdont commented:

@JoyeBright maybe leave this issue open; I think it needs to be addressed.

martinpopel (Collaborator) commented:

I am reopening this issue, as it seems like quite a serious problem if all TER scores are reported 100 times higher than they should be.

martinpopel reopened this Oct 19, 2022
mjpost (Owner) commented Oct 19, 2022

It looks like this scaling was added for the 2.0 release: https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/ter.py#L129

I can revert that and do a 2.3.2 release.

martinpopel (Collaborator) commented Oct 19, 2022

Wait. It seems I was wrong - and the current implementation is OK. I don't have enough time now to double check everything, but:

In the original TER paper, TER is defined as # of edits / average # of reference words, which usually gives a number between 0 and 1 (but see below), and already in that original paper the scores are reported as percentages, i.e. numbers between 0 and 100.

However, if the prediction is longer than the reference and does not share any words with the reference, we need to first delete all the words from the prediction and then add the words from the reference, so the # of edits will be higher than the # of reference words and the TER score will be higher than 100%.

It seems there has been no attempt to clip the scores at 100%. For example, the original Java implementation (java -jar tercom.7.25.jar) seems to report numbers mostly between 0 and 1, but values higher than 1 are possible as well.

In SacreBLEU v2.0, we have decided to report TER as percentages, i.e. the formula 100 * (total_edits / sum_ref_lengths) seems to be OK.
In rare cases, we should expect TER scores higher than 100% (while an empty system prediction would get exactly 100%).
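
To make the arithmetic concrete, here is a small sketch with made-up sentences (not data from this issue): the hypothesis is longer than the reference and shares no words with it, so the number of edits exceeds the reference length and the score lands above 100.

import sacrebleu

hyp = "the quick brown fox jumps"  # 5 tokens
ref = "hello world"                # 2 tokens

# 2 substitutions + 3 deletions = 5 edits against a 2-word reference,
# so 100 * 5 / 2 = 250.0 is the expected TER under sacrebleu 2.x.
print(sacrebleu.sentence_ter(hyp, [ref]).score)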

mjpost (Owner) commented Oct 19, 2022

I was just looking into this myself and came to the same conclusion. See also #169 where @ozancaglayan notes this same phenomenon (also #140 and the release notes in #152).

I am going to close this again—it seems the implementation is correct.

mjpost closed this as completed Oct 19, 2022