Changes for 2.0.0 #152
- Return 0.0 BLEU if no matches occur (#141)
- Separate out TER functionality into lib_ter.py
- … replace some re's with python
- Add a pre-packaged WMT17 EN-DE hyps/refs package for significance testing
- Significance testing tests: compare results to Moses and Multeval
- Add more docs for significance part
Okay, I'll remove it then. For the second part, yes, we can make it 1000; my initial motivation was to make the estimation more robust, since in terms of speed there is not much difference between 1000 and 2000.
I'm trying to do some testing now. How much confidence do you have in the AR and BSR implementations? Has anyone code-reviewed them? Just want to make sure we have the details right, since people will likely start using this!
Nitpicking now, but what do you think about removing the parens from the value in the JSON format?
Separately, we could add […]. (The former, with the parens, is the only thing we can't change once we release.)
I'm definitely not an expert but quite confident that they should be OK. If you have some people in mind, it would of course be better to let them review the code.
I think we can merge this, and process any additional changes on the main branch prior to the 2.0 release.
One last question: the README had plenty of examples demonstrating sacreBLEU with the old […]. Thank you!
Sure, and maybe note "-f text" when you mention that?
Okay, I am done with the README updates as well. Care to take a final look there?
@mjpost Mmm, I can't merge this as Travis doesn't work...
You can click on "command line instructions" and follow the instructions. It says "If ... an automatic merge cannot be performed, you can perform a manual merge on the command line."
Oh okay, I thought that Travis would block that too.
It still fails; maybe we should temporarily disable the restriction in the repo settings?
I think this is the story: https://daniel.haxx.se/blog/2021/06/14/bye-bye-travis-ci/
Hello,
Here's a detailed summary of this PR. It'll probably be quite hard to review, as the modifications to the metrics will appear as large diff blocks. But as a first pass, we can go through the summary and examples below as well as the code changes. I tested this extensively, but it's possible that some combinations of CLI flags may raise errors, who knows. If we merge this, we could do a release candidate first to let people test it.
In terms of backward compatibility, I tried to be conservative. The fix for "Sentence BLEU yields non-intuitive scores" (#141) is definitely backward-incompatible, but I think it's the correct behavior. Another incompatible change is how signatures are formatted on the terminal.
Two things are hopefully handled correctly but are actually untested on Windows: (1) colored output (should be disabled on Windows through a platform check), and (2) the multi-CPU significance test (should fall back to 1 CPU on Windows).
Questions:
Should we keep the single-system confidence (--confidence) functionality, or is it confusing, given that it does not provide very valuable information by itself?
For "Having better defaults for ChrF" (#124), should we switch to chrF++ by default or continue computing the plain old chrF for backward compatibility?
Thanks!
General
- `setup.py`: […] so that they are shown on PyPI. […] `portalocker` version pinning; add `regex`, `tabulate`, `numpy` dependencies.
- […] `isinstance` checks. If the user does not obey the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past ("Sentence CHRF silently accepts single reference as string", #121).
- […] still only available through the API.
- Colored output is disabled if the platform is Windows, if the output is redirected into a file, or if `--no-color` is passed.

Tokenizers
- […] `regex` dependency and use it in the `V14International` tokenizer. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation. ("Speed up (w/ numpy)", #46)
Metrics
- chrF++ support is added through the `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (`--chrf-word-order`). The default is still chrF and not chrF++ ("Having better defaults for ChrF", #124). A hedged CLI sketch is given at the end of this list.
- […] the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. I kept the effective ordering as the default since this only affects sentence-level scoring with very short sentences. ("chrF not compatible with chrF++, Moses and NLTK for sentence-level smoothing", #144)
- Use `string.translate` for OP code mapping in one line of code.
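As a rough illustration of the chrF++ mode mentioned in the list above: the file names below are placeholders, `-m` is the usual sacreBLEU metric-selection flag (not discussed in this PR), and the assumption that a word n-gram order of 2 corresponds to chrF++ is mine, not stated above.

```bash
# Hypothetical invocation: compute chrF with word n-grams enabled (chrF++-style).
# ref.de / output.detok.de are placeholder file names.
sacrebleu ref.de -i output.detok.de -m chrf --chrf-word-order 2
```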
Metric API
- […] `argparse.Namespace`.
- A `Metric` base class is introduced to guide further metric development. This class defines the methods that should be implemented in the derived classes and offers boilerplate methods for the common functionality.
- Metrics can receive a `references` argument at initialization time to process and cache the references. Further evaluations of different systems against the same references become faster this way, for example when using significance testing.
Signatures
- The number of references in the signature is reported as `var` if a variable number of references is used.

CLI
- `--input/-i` can now ingest multiple systems. For this reason, the positional `references` should always precede the `-i` flag.
- […] `--help` is printed. I did not add `--bleu` prefixes to the BLEU arguments.
- […] `=` as follows: […]
- […] in a parseable JSON format. Arguments such as `--short` and `--score-only` are ignored and full information is dumped when `-f json` is given (a hedged sketch follows this list).
- […] `json` falls back to plain text. Other options exist in this mode, such as `latex`, `rst` and `html`. (More on this later)
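A minimal sketch of the `-f json` mode from the list above, with placeholder file names and only flags mentioned in this PR:

```bash
# Hypothetical: positional references come first, then -i, then the output format.
# -f json dumps the full information regardless of --short / --score-only.
sacrebleu ref.de -i output.detok.de -f json
```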
Multi-system evaluation mode
sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way.
Through the use of the `tabulate` package, the results are nicely rendered into a table in plain text, LaTeX, HTML or RST (cf. the `--format/-f` argument).
The systems can either be given as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will be automatically used as system names.
If you give the same file twice, sacreBLEU will issue an error.
Explicit filenames:
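For instance, something along these lines; the file names are placeholders whose basenames become the system names, and `-m` (metric selection) is an assumed standard sacreBLEU flag not described in this PR:

```bash
# Hypothetical multi-system run with explicit files (systemA/systemB/systemC become the system names).
sacrebleu ref.de -i systemA.de systemB.de systemC.de -m bleu chrf
```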
Tab-separated STDIN:
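Presumably the same evaluation can be fed through a tab-separated stream, e.g. built with `paste` (same placeholder file names as above):

```bash
# Hypothetical: build a tab-separated stream of the three systems and pipe it into sacreBLEU.
paste systemA.de systemB.de systemC.de | sacrebleu ref.de -m bleu chrf
```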
LaTeX mode:
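And the same run rendered as a LaTeX table via the format flag described above:

```bash
# Hypothetical: render the multi-system results table as LaTeX.
sacrebleu ref.de -i systemA.de systemB.de systemC.de -m bleu chrf -f latex
```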
Single-system bootstrap confidence intervals (requires numpy) (#40 and #78)
95% confidence intervals are provided only for the single-system evaluation mode.
If you have multiple systems, we recommend using paired tests that will provide
both the confidence intervals and the p-values.
The feature is enabled by passing `--confidence` to the CLI. The default number of bootstrap resamples is 2000; this can be changed with the `--confidence-n` flag. The random number generator's seed is fixed to 12345 by default. The seed can be modified by exporting the `SACREBLEU_SEED` environment variable. If the exported value is `[Nn]one`, the seed is uninitialized, yielding non-deterministic results.
Unit tests are added to compare the results to Moses' significance Perl script.
Fixed seed:
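A sketch of such a run, using only the flags described above and placeholder file names; the seed defaults to 12345 and is exported here explicitly for emphasis:

```bash
# Hypothetical: 95% bootstrap CI for a single system with a fixed seed.
SACREBLEU_SEED=12345 sacrebleu ref.de -i systemA.de --confidence
```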
Random seed:
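And presumably with the seed disabled, so that repeated runs resample differently; `--confidence-n` is included just to show the flag:

```bash
# Hypothetical: uninitialized seed, non-deterministic resampling; 1000 resamples instead of the default 2000.
SACREBLEU_SEED=None sacrebleu ref.de -i systemA.de --confidence --confidence-n 1000
```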
Multi-system paired significance tests (#40 and #78)
When you have multiple systems to evaluate for a given test set and language pair, you can now use paired significance tests to obtain p-values. The first system provided to `--input/-i` (or the first column of hypotheses if the pasted `STDIN` method is used) will be flagged as the baseline system. When using `--input/-i`, sacreBLEU will automatically discard the baseline system if it appears more than one time. This is useful when using shell globs.

Two types of paired tests are provided: bootstrap resampling (`bs`) and approximate randomization (`ar`). `bs` replicates the behavior of Moses' significance Perl script, whereas `ar` follows Multeval for performing approximate randomization. The feature is enabled by passing one of these two methods to the `--paired` flag. The default numbers of samples/trials for `bs` and `ar` are 2,000 and 10,000, respectively; this can be changed using the `--paired-n/-pan` flag. The `bs` test will also print a 95% CI around the true mean as additional information. To enable the same type of CIs for the AR test, pass `--paired-ar-confidence-n 0`, for example, to use the default value of 2000 resamples. Verbose information printed during the tests can be disabled with `--quiet`.

Example of evaluating 16 WMT17 submissions with 2 metrics (a hedged sketch is given at the end of this section):

If you also enable TER, it takes ~3.5 minutes to complete. Therefore, it is also possible to run the tests using multiple workers (only on Linux and Mac OS X). Passing `0` to `--paired-jobs/-paj` will launch as many workers as the number of systems (up to the limit of the number of CPUs on the machine), whereas passing a value `> 0` will manually set the number of workers in the pool. For the above example, it takes ~40 seconds to complete using 15 workers (for 15 candidate systems, excluding the baseline).
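A sketch of the WMT17 example mentioned above. The glob and the `-t/-l/-m` flags (test set, language pair, metric selection) are assumptions used for illustration; `--paired bs` is the method flag described in this section:

```bash
# Hypothetical: paired bootstrap resampling over all submissions matched by the glob.
# The first file in the expanded glob is treated as the baseline system.
sacrebleu -t wmt17 -l en-de -i newstest2017.*.en-de -m bleu chrf --paired bs
```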
Approximate randomization method:
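Presumably the approximate randomization variant just swaps the method, optionally with parallel workers (same illustrative glob and flags as above):

```bash
# Hypothetical: approximate randomization, one worker per candidate system (--paired-jobs 0).
sacrebleu -t wmt17 -l en-de -i newstest2017.*.en-de -m bleu chrf --paired ar --paired-jobs 0
```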
SacreBLEU 2.0.0 performance tests
[…] where caching kicks in. (The reason why no-cache also speeds up things after the 1st evaluation is the caching in the tokenizers.)
[…] performance diff between versions.