Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLEU with a variable number of references #130

Closed
tuetschek opened this issue Jan 8, 2021 · 5 comments · Fixed by #132
Closed

BLEU with a variable number of references #130

tuetschek opened this issue Jan 8, 2021 · 5 comments · Fixed by #132

Comments

@tuetschek
Copy link
Contributor

The current implementation of BLEU in SacreBLEU doesn't allow a variable number of references (different segments having a different number of references). Some NLG datasets (e.g. E2E, WebNLG) do have a variable number of references and it would be great if SacreBLEU worked on them.

Is that a design decision, or would you be open to a PR that would enable this?

@martinpopel
Copy link
Collaborator

It seems that this feature is often needed (e.g. #69) and the current workaround (duplicate another reference for a given segment) is not user friendly (and it is not efficient), so I would welcome such a PR.

File format

Currently, sacreBLEU supports both a single file with tab-separated references and multiple files (each reference in a single file).

The former format currently needs --num-refs=Integer. So do we want to allow also a range, e.g. --num-refs=1-5, so there is still same sanity check of the input data?

Do we want to allow variable number of references also in the latter format? That would mean e.g. re-defining an empty string as a missing reference (which should not affect the brevity penalty). How would then someone represent a real empty segment as a reference? (Suppose there is a language where the best translation of "Um, heh." is to omit it.) We could disambiguate the two use cases by special options such as --variable-number-of-references or --treat-empty-as-empty, but that seems too complicated (even with better names).

The Python API

Currently, we have corpus_bleu(sys, [ref1, ref2,...refN]), where each of the N references is a list of all segments. I am not sure if we want to change it (cf. #125).
We could represent missing references (for a given segment) with None here, preventing thus the ambiguity of empty strings.

@tuetschek
Copy link
Contributor Author

Thanks for such a quick and detailed answer! I'll try to implement it then 🙂!

I was actually only considering the Python API so far (for the upcoming GEM NLG shared task/benchmark), so having the possibility just via the API and not exposed on the command line would be fine for my use case. I agree that representing missing references with None is the easiest way to go.

Re. Command line/file format: I think that allowing variable numbers of references just in the tab-separated format is fine. Allowing a range for --num-refs makes sense.

Regardless of whether we want to support the multi-file format or not, would it make sense to represent real empty references with a space character instead of a truly empty line/tab column, or is that too obscure?

@ozancaglayan
Copy link
Collaborator

Hello

Thanks for the issue and the PR!

I am starting to think that all these should be discussed to allow for a well-defined API, instead of yet another processing trick through isinstance checks. See #121 for another small error that's due to (not) fixing the user's inputs' type. It becomes quite unreadable and confusing in terms of understanding what are the accepted input format for corpus_bleu method.

In overall, I think that the range of possibilities for the hyp_stream and ref_stream arguments is quite confusing and we should in fact drop all the sanity/robustness fixes in a major version, so that to avoid people relying on sacreBLEU to fix erroneous inputs.

@martinpopel
Copy link
Collaborator

@ozancaglayan: I think this issue (variable number of references for different sentences) is almost* orthogonal to the isinstance checks and API allowing several input types (single reference or a list of references).
I vote for merging #132 now unless there are some objections.

*) There is an option to change the corpus_bleu(sys, refs) API from
refs=[ref1,...refN] where ref1=[ref1_sent1,...ref1_sentM]
to
refs=[refs_sent1,...refs_M] where refs_sent1=[ref1_sent1,...refN_sent1].
Such a change would make it easy to use different number of references N in each sentence and we would not need to introduce None refs.
However, I don't think such a big change in the API is wise - it would confuse many users and using single reference BLEU would be more difficult.
@ozancaglayan, you wrote in #125 "EDIT: This is wrong, see below", so I expect you also don't want this change.
Thus I think we should accept the PR #132 with None representing missing references.

@ozancaglayan
Copy link
Collaborator

Okay then. Let's also think about adding documentation for this to the README file in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants