
Refactoring ideas #125

Closed
ozancaglayan opened this issue Nov 25, 2020 · 4 comments

@ozancaglayan
Collaborator

ozancaglayan commented Nov 25, 2020

Hi all,

I would like to do another round of refactoring that would mainly target the way hypotheses and references are read. Right now, the corpus- and sentence-level methods of each metric make some assumptions, fix wrong inputs using isinstance() checks, and bail out if nothing can be done. IMHO, this creates a lot of confusion, especially with the introduction of standalone metric classes vs. the compatibility API. I remember that we encountered a few issues regarding this, which were fixed here and there.

I think the metrics should stick to a single signature (i.e. no Union type annotations for the sys and ref streams), and the isinstance() checks should be moved to the compatibility layer if required.

I would suggest the following signatures, raising an exception if the inputs do not conform (see the sketch below):

  • hyp: str, refs: [str1, str2, ..., strC] for sentence-level scorers
  • hyps: [str1, str2, ..., strN], refs: [[str1, str2, ..., strC]_1, ..., [str1, str2, ..., strC]_N] for corpus-level scorers (EDIT: This is wrong, see below)

Another option is to explicitly implement wrapper classes: Hypothesis, HypothesisList, Reference.

Moreover, the metrics should not receive streams that are processed on the fly but rather already-populated lists. We could separate out the tokenization parts of the metrics and call them first. This would also simplify future significance testing, which will require reusing shuffled statistics many times.
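
To make the signature suggestion concrete, here is a rough sketch of what the strict signature plus compatibility layer could look like (names such as sentence_score and sentence_score_compat are hypothetical, not the actual sacrebleu API):

from typing import Sequence

def sentence_score(hyp: str, refs: Sequence[str]) -> float:
    # Strict signature: one hypothesis string, a sequence of reference strings.
    if not isinstance(hyp, str):
        raise TypeError(f"hyp must be str, got {type(hyp).__name__}")
    if isinstance(refs, str) or not all(isinstance(r, str) for r in refs):
        raise TypeError("refs must be a sequence of str")
    ...  # actual metric computation goes here

def sentence_score_compat(hyp, refs):
    # Compatibility layer: all the isinstance() fix-ups live here, not in the metric.
    if isinstance(refs, str):  # a single bare reference string
        refs = [refs]
    return sentence_score(hyp, list(refs))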

What do you think?

@bricksdont

Hi Ozan,

I agree with the function signature part (as you might have guessed :-).

Lists as arguments instead of Iterables: I am not sure why you would like to require lists explicitly.

Separating out tokenization for significance testing: in my opinion, that is a good idea as well, but what would need to be separated out / cached is not only tokenization but everything that can be reused, such as n-gram statistics.
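
For instance, something along these lines (a sketch with hypothetical names, not existing sacrebleu code): tokenize and count n-grams once per segment, so that the bootstrap loop only resamples the cached statistics:

import random
from collections import Counter

def extract_ngrams(tokens, max_order=4):
    # Count all n-grams up to max_order in one tokenized segment.
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def precompute_stats(hyps, refs, tokenize):
    # Done once: tokenization and n-gram counting are the expensive parts.
    return [(extract_ngrams(tokenize(h)), extract_ngrams(tokenize(r)))
            for h, r in zip(hyps, refs)]

def bootstrap(stats, score_from_stats, n_samples=1000):
    # Each iteration reuses the cached counts; nothing is re-tokenized.
    return [score_from_stats(random.choices(stats, k=len(stats)))
            for _ in range(n_samples)]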

@ozancaglayan
Collaborator Author

Sorry, when I said lists, I meant Iterables :)

@martinpopel
Collaborator

Hi, I agree there was/is a mess with the various accepted inputs and isinstance() checks. Ideally, these should be kept only in the legacy API, while the main API should have just a single argument signature per method (no Union) and provide helpful error messages when used incorrectly. The problem is that the Unions are already in the main API, so removing them would cause problems for those who are already using it. There should be some deprecation period.
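
For example, a deprecation period could look something like this (a sketch only; the exact warning text and the input shape being deprecated are just assumptions):

import warnings

def corpus_score(hyps, refs):
    # Legacy shape is still accepted for now, but flagged for removal.
    if isinstance(hyps, str):
        warnings.warn(
            "passing a single string as `hyps` is deprecated; "
            "pass a list of strings instead",
            DeprecationWarning,
            stacklevel=2,
        )
        hyps = [hyps]
    ...  # rest of the computation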

hyps: [str1, str2, ..., strN] refs: [[str1, str2, ..., strC]_1, ..., [str1, str2, ..., strC]_N]

C is the number of references and N is the number of sentences, am I right?
So the suggestion is to change the main API (of corpus_bleu) from ref[reference_number][sentence_number] to ref[sentence_number][reference_number]?
This would naturally allow a different number of references for each sentence (which some users need, see e.g. #69), but I am not sure I like this idea:

  • If we allow a different number of references for each sentence, we won't be able to test for accidental errors.
  • Such a big change to the API would cause a lot of confusion. It is difficult to detect wrong usage if someone passes the old ref[reference_number][sentence_number] layout instead of the new ref[sentence_number][reference_number]. Of course, there are usually more sentences than references (unless called from sentence_bleu)...
  • Most MT test sets have just a single reference. This typical usage is currently simple:
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
refs = ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.']
corpus_bleu(sys, [refs])

but with the suggested API, we would need to run

corpus_bleu(sys, [[r] for r in refs])

(or use the legacy API, which would do this conversion automatically).
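
For what it's worth, converting between the two layouts is just a transposition (a sketch; it assumes every sentence has the same number of references):

# ref[reference_number][sentence_number] -- the current corpus_bleu layout
refs_by_ref = [['r1 s1', 'r1 s2', 'r1 s3'],
               ['r2 s1', 'r2 s2', 'r2 s3']]

# zip(*...) transposes it into ref[sentence_number][reference_number] (and back).
refs_by_sent = [list(rs) for rs in zip(*refs_by_ref)]
assert refs_by_sent == [['r1 s1', 'r2 s1'], ['r1 s2', 'r2 s2'], ['r1 s3', 'r2 s3']]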

As for reusing the tokenization and the correct & matching n-gram counts when doing bootstrap resampling significance tests, I think this is already done this way in #78. Unfortunately, that PR is still not merged and is getting out of sync with master, so there is a risk it will become stale, similarly to the previous attempt at adding bootstrap resampling to sacrebleu (#11).

@ozancaglayan
Collaborator Author

Hi,

Sorry, I made a mistake and assumed that the current API organized the reference stream that way. You can ignore that part, i.e. I don't propose changing it; we can keep it as it is.
