
Refactoring ideas #125

Closed
ozancaglayan opened this issue Nov 25, 2020 · 4 comments

@ozancaglayan
Collaborator

ozancaglayan commented Nov 25, 2020

Hi all,

I would like to do another round of refactoring that would mainly target the way hypotheses and references are read. Right now, the corpus- and sentence-level methods of each metric make some assumptions, fix wrong inputs using isinstance() checks, and bail out if nothing can be done. IMHO, this creates a lot of confusion, especially with the introduction of standalone metric classes vs. the compatibility API. I remember that we encountered a few issues regarding this, which were fixed here and there.

I think the metrics should stick to a single signature (i.e. no Union type annotations for the sys and ref streams), and the isinstance() checks should be moved to the compatibility layer if required.

I would suggest the following signatures, raising an exception if the inputs do not conform (see the sketch below):

  • hyp: str, refs: [str1, str2, ..., strC] for sentence-level scorers
  • hyps: [str1, str2, ..., strN], refs: [[str1, str2, ..., strC]_1, ..., [str1, str2, ..., strC]_N] for corpus-level scorers (EDIT: This is wrong, see below)

Another option is to explicitly implement wrapper classes: Hypothesis, HypothesisList, Reference.

Moreover, the metrics should not receive streams that are processed on the fly but rather already-populated lists. We could separate out the tokenization parts of the metrics and call them first. This would also simplify future significance testing, which will require reusing shuffled statistics many times.
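
To make the signature suggestion concrete, here is a rough sketch of what the strict signature plus compatibility layer could look like (names such as sentence_score and sentence_score_compat are hypothetical, not the actual sacrebleu API):

from typing import Sequence

def sentence_score(hyp: str, refs: Sequence[str]) -> float:
    # Strict signature: one hypothesis string, a sequence of reference strings.
    if not isinstance(hyp, str):
        raise TypeError(f"hyp must be str, got {type(hyp).__name__}")
    if isinstance(refs, str) or not all(isinstance(r, str) for r in refs):
        raise TypeError("refs must be a sequence of str")
    ...  # actual metric computation goes here

def sentence_score_compat(hyp, refs):
    # Compatibility layer: all the isinstance() fix-ups live here, not in the metric.
    if isinstance(refs, str):  # a single bare reference string
        refs = [refs]
    return sentence_score(hyp, list(refs))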

What do you think?

@bricksdont

Hi Ozan,

I agree with the function signature part (as you might have guessed :-).

Lists as arguments instead of Iterables: I am not sure why you would like to require lists explicitly.

Separating out tokenization for significance testing: in my opinion, that is a good idea as well, but what would need to be separated out / cached is not only tokenization but everything that can be reused, such as n-gram statistics.
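
For instance, something along these lines (a sketch with hypothetical names, not existing sacrebleu code): tokenize and count n-grams once per segment, so that the bootstrap loop only resamples the cached statistics:

import random
from collections import Counter

def extract_ngrams(tokens, max_order=4):
    # Count all n-grams up to max_order in one tokenized segment.
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def precompute_stats(hyps, refs, tokenize):
    # Done once: tokenization and n-gram counting are the expensive parts.
    return [(extract_ngrams(tokenize(h)), extract_ngrams(tokenize(r)))
            for h, r in zip(hyps, refs)]

def bootstrap(stats, score_from_stats, n_samples=1000):
    # Each iteration reuses the cached counts; nothing is re-tokenized.
    return [score_from_stats(random.choices(stats, k=len(stats)))
            for _ in range(n_samples)]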

@ozancaglayan
Collaborator Author

Sorry, when I said lists, I meant Iterables :)

@martinpopel
Collaborator

Hi, I agree there was/is a mess with the various accepted inputs and isinstance() checks. Ideally, these should be kept only in the legacy API, while the main API should have just a single argument signature per method (no Union) and provide helpful error messages when used incorrectly. The problem is that the Unions are already in the main API, so removing them would cause problems for those who are already using it. There should be some deprecation period.
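
For example, a deprecation period could look something like this (a sketch only; the exact warning text and the input shape being deprecated are just assumptions):

import warnings

def corpus_score(hyps, refs):
    # Legacy shape is still accepted for now, but flagged for removal.
    if isinstance(hyps, str):
        warnings.warn(
            "passing a single string as `hyps` is deprecated; "
            "pass a list of strings instead",
            DeprecationWarning,
            stacklevel=2,
        )
        hyps = [hyps]
    ...  # rest of the computation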

hyps: [str1, str2, ..., strN] refs: [[str1, str2, ..., strC]_1, ..., [str1, str2, ..., strC]_N]

C is the number of references and N is the number of sentences, am I right?
So the suggestion is to change the main API (of corpus_bleu) from ref[reference_number][sentence_number] to ref[sentence_number][reference_number]?
This would naturally allow a different number of references for each sentence (which some users need, see e.g. #69), but I am not sure I like this idea:

  • If we allow a different number of references for each sentence, we won't be able to test for accidental errors.
  • Such a big change to the API would cause a lot of confusion. It is difficult to detect wrong usage if someone passes the old ref[reference_number][sentence_number] layout instead of the new ref[sentence_number][reference_number]. Of course, there are usually more sentences than references (unless called from sentence_bleu)...
  • Most MT test sets have just a single reference. This typical usage is currently simple:
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
refs = ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.']
corpus_bleu(sys, [refs])

but with the suggested API, we would need to run

corpus_bleu(sys, [[r] for r in refs])

(or use the legacy API, which would do this conversion automatically).
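
For what it's worth, converting between the two layouts is just a transposition (a sketch; it assumes every sentence has the same number of references):

# ref[reference_number][sentence_number] -- the current corpus_bleu layout
refs_by_ref = [['r1 s1', 'r1 s2', 'r1 s3'],
               ['r2 s1', 'r2 s2', 'r2 s3']]

# zip(*...) transposes it into ref[sentence_number][reference_number] (and back).
refs_by_sent = [list(rs) for rs in zip(*refs_by_ref)]
assert refs_by_sent == [['r1 s1', 'r2 s1'], ['r1 s2', 'r2 s2'], ['r1 s3', 'r2 s3']]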

As for reusing the tokenization and the correct & matching n-gram counts when doing bootstrap resampling significance tests, I think this is already done this way in #78. Unfortunately, that PR is still not merged and is getting out of sync with master, so there is a risk it will become stale, similarly to the previous attempt at adding bootstrap resampling to sacrebleu (#11).

@ozancaglayan
Collaborator Author

Hi,

Sorry, I made a mistake and assumed that the current API organized the reference stream that way. You can ignore that part, i.e. I don't propose changing it; we can keep it as it is.
