BLEU with a variable number of references #130

tuetschek · 2021-01-08T21:54:24Z

The current implementation of BLEU in SacreBLEU doesn't allow a variable number of references (different segments having a different number of references). Some NLG datasets (e.g. E2E, WebNLG) do have a variable number of references and it would be great if SacreBLEU worked on them.

Is that a design decision, or would you be open to a PR that would enable this?

martinpopel · 2021-01-08T23:45:52Z

It seems that this feature is often needed (e.g. #69) and the current workaround (duplicate another reference for a given segment) is not user friendly (and it is not efficient), so I would welcome such a PR.

File format

Currently, sacreBLEU supports both a single file with tab-separated references and multiple files (each reference in a single file).

The former format currently needs --num-refs=Integer. So do we want to allow also a range, e.g. --num-refs=1-5, so there is still same sanity check of the input data?

Do we want to allow variable number of references also in the latter format? That would mean e.g. re-defining an empty string as a missing reference (which should not affect the brevity penalty). How would then someone represent a real empty segment as a reference? (Suppose there is a language where the best translation of "Um, heh." is to omit it.) We could disambiguate the two use cases by special options such as --variable-number-of-references or --treat-empty-as-empty, but that seems too complicated (even with better names).

The Python API

Currently, we have corpus_bleu(sys, [ref1, ref2,...refN]), where each of the N references is a list of all segments. I am not sure if we want to change it (cf. #125).
We could represent missing references (for a given segment) with None here, preventing thus the ambiguity of empty strings.

tuetschek · 2021-01-09T17:21:54Z

Thanks for such a quick and detailed answer! I'll try to implement it then 🙂!

I was actually only considering the Python API so far (for the upcoming GEM NLG shared task/benchmark), so having the possibility just via the API and not exposed on the command line would be fine for my use case. I agree that representing missing references with None is the easiest way to go.

Re. Command line/file format: I think that allowing variable numbers of references just in the tab-separated format is fine. Allowing a range for --num-refs makes sense.

Regardless of whether we want to support the multi-file format or not, would it make sense to represent real empty references with a space character instead of a truly empty line/tab column, or is that too obscure?

ozancaglayan · 2021-01-12T11:19:55Z

Hello

Thanks for the issue and the PR!

I am starting to think that all these should be discussed to allow for a well-defined API, instead of yet another processing trick through isinstance checks. See #121 for another small error that's due to (not) fixing the user's inputs' type. It becomes quite unreadable and confusing in terms of understanding what are the accepted input format for corpus_bleu method.

In overall, I think that the range of possibilities for the hyp_stream and ref_stream arguments is quite confusing and we should in fact drop all the sanity/robustness fixes in a major version, so that to avoid people relying on sacreBLEU to fix erroneous inputs.

martinpopel · 2021-01-12T14:50:32Z

@ozancaglayan: I think this issue (variable number of references for different sentences) is almost* orthogonal to the isinstance checks and API allowing several input types (single reference or a list of references).
I vote for merging #132 now unless there are some objections.

*) There is an option to change the corpus_bleu(sys, refs) API from
refs=[ref1,...refN] where ref1=[ref1_sent1,...ref1_sentM]
to
refs=[refs_sent1,...refs_M] where refs_sent1=[ref1_sent1,...refN_sent1].
Such a change would make it easy to use different number of references N in each sentence and we would not need to introduce None refs.
However, I don't think such a big change in the API is wise - it would confuse many users and using single reference BLEU would be more difficult.
@ozancaglayan, you wrote in #125 "EDIT: This is wrong, see below", so I expect you also don't want this change.
Thus I think we should accept the PR #132 with None representing missing references.

ozancaglayan · 2021-01-12T14:57:24Z

Okay then. Let's also think about adding documentation for this to the README file in the future.

tuetschek mentioned this issue Jan 12, 2021

Allow variable number of references for BLEU via API #132

Merged

ozancaglayan linked a pull request Jan 18, 2021 that will close this issue

Allow variable number of references for BLEU via API #132

Merged

ozancaglayan closed this as completed in #132 Jan 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BLEU with a variable number of references #130

BLEU with a variable number of references #130

tuetschek commented Jan 8, 2021

martinpopel commented Jan 8, 2021

tuetschek commented Jan 9, 2021

ozancaglayan commented Jan 12, 2021

martinpopel commented Jan 12, 2021

ozancaglayan commented Jan 12, 2021

BLEU with a variable number of references #130

BLEU with a variable number of references #130

Comments

tuetschek commented Jan 8, 2021

martinpopel commented Jan 8, 2021

File format

The Python API

tuetschek commented Jan 9, 2021

ozancaglayan commented Jan 12, 2021

martinpopel commented Jan 12, 2021

ozancaglayan commented Jan 12, 2021