Changes for 2.0.0 #152
- Return 0.0 BLEU if no matches occur (#141)
- Separate out TER functionality into lib_ter.py
- … replace some re's with python
- Add a pre-packaged WMT17 EN-DE hyps/refs package for significance testing
- Significance testing tests: compare results to Moses and Multeval
- Add more docs for significance part
Okay, I'll remove it then. For the second part, yes, we can make it 1000; my initial motivation was to make the estimation more robust, since in terms of speed there is not much difference between 1000 and 2000.
I'm trying to do some testing now. How much confidence do you have in the AR and BSR implementations? Has anyone code-reviewed them? Just want to make sure we have the details right, since people will likely start using this!
Nitpicking now, but what do you think about removing the parens from the value in the JSON format?
Separately, we could add […]. (The former, with the parens, is the only thing we can't change once we release.)
I'm definitely not an expert but quite confident that they should be OK. If you have some people in mind, it would of course be better to let them review the code.
I think we can merge this, and process any additional changes on the main branch prior to the 2.0 release.
One last question: the README had plenty of examples demonstrating sacreBLEU with the old […]. Thank you!
Sure, and maybe note "-f text" when you mention that?
Okay, I am done with the README updates as well. Care to take a final look there?
@mjpost Mmm, I can't merge this as Travis doesn't work...
You can click on "command line instructions" and follow the instructions. It says "If ... an automatic merge cannot be performed, you can perform a manual merge on the command line."
Oh okay, I thought that Travis would block that too.
It still fails; maybe we should temporarily disable the restriction in the repo settings?
I think this is the story: https://daniel.haxx.se/blog/2021/06/14/bye-bye-travis-ci/
Hello,
Here's a detailed summary of this PR. It'll probably be quite hard to review, as the modifications to the metrics will appear as large diff blocks. But as a first pass, we can go through the summary and examples below as well as the code changes. I tested this extensively, but it's possible that some combinations of CLI flags may raise errors, who knows. If we merge this, we could do a release candidate first to let people test it.
In terms of backward compatibility, I tried to be conservative. The fix for "Sentence BLEU yields non-intuitive scores" (#141) is definitely backward-incompatible, but I think it's the correct behavior. Another incompatible change is how signatures are formatted on the terminal.
Two things are hopefully handled correctly but are actually untested on Windows: (1) colored output (should be disabled on Windows through a platform check), and (2) the multi-CPU significance test (should fall back to 1 CPU on Windows).
Questions:
Should we keep the single-system confidence (--confidence) functionality, or is it confusing, given that it does not provide very valuable information by itself?
For "Having better defaults for ChrF" (#124), should we switch to chrF++ by default or continue computing the plain old chrF for backward compatibility?
Thanks!
General
- `setup.py`: […] so that they are shown on PyPI. […] `portalocker` version pinning; add `regex`, `tabulate`, `numpy` dependencies.
- […] `isinstance` checks. If the user does not obey the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past ("Sentence CHRF silently accepts single reference as string", #121).
- […] still only available through the API.
- Colored output is disabled if the platform is Windows, if the output is redirected into a file, or if `--no-color` is passed.

Tokenizers
- […] `regex` dependency and use it in the `V14International` tokenizer. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation. ("Speed up (w/ numpy)", #46)
Metrics
- chrF++ support is added through the `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (`--chrf-word-order`). The default is still chrF and not chrF++ ("Having better defaults for ChrF", #124). A hedged CLI sketch is given at the end of this list.
- […] the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. I kept the effective ordering as the default since this only affects sentence-level scoring with very short sentences. ("chrF not compatible with chrF++, Moses and NLTK for sentence-level smoothing", #144)
- Use `string.translate` for OP code mapping in one line of code.
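As a rough illustration of the chrF++ mode mentioned in the list above: the file names below are placeholders, `-m` is the usual sacreBLEU metric-selection flag (not discussed in this PR), and the assumption that a word n-gram order of 2 corresponds to chrF++ is mine, not stated above.

```bash
# Hypothetical invocation: compute chrF with word n-grams enabled (chrF++-style).
# ref.de / output.detok.de are placeholder file names.
sacrebleu ref.de -i output.detok.de -m chrf --chrf-word-order 2
```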
Metric API
- […] `argparse.Namespace`.
- A `Metric` base class is introduced to guide further metric development. This class defines the methods that should be implemented in the derived classes and offers boilerplate methods for the common functionality.
- Metrics can receive a `references` argument at initialization time to process and cache the references. Further evaluations of different systems against the same references become faster this way, for example when using significance testing.
Signatures
- The number of references in the signature is reported as `var` if a variable number of references is used.

CLI
- `--input/-i` can now ingest multiple systems. For this reason, the positional `references` should always precede the `-i` flag.
- […] `--help` is printed. I did not add `--bleu` prefixes to the BLEU arguments.
- […] `=` as follows: […]
- […] in a parseable JSON format. Arguments such as `--short` and `--score-only` are ignored and full information is dumped when `-f json` is given (a hedged sketch follows this list).
- […] `json` falls back to plain text. Other options exist in this mode, such as `latex`, `rst` and `html`. (More on this later)
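A minimal sketch of the `-f json` mode from the list above, with placeholder file names and only flags mentioned in this PR:

```bash
# Hypothetical: positional references come first, then -i, then the output format.
# -f json dumps the full information regardless of --short / --score-only.
sacrebleu ref.de -i output.detok.de -f json
```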
Multi-system evaluation mode
sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way.
Through the use of the `tabulate` package, the results are nicely rendered into a table in plain text, LaTeX, HTML or RST (cf. the `--format/-f` argument).
The systems can either be given as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will be automatically used as system names.
If you give the same file twice, sacreBLEU will issue an error.
Explicit filenames:
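For instance, something along these lines; the file names are placeholders whose basenames become the system names, and `-m` (metric selection) is an assumed standard sacreBLEU flag not described in this PR:

```bash
# Hypothetical multi-system run with explicit files (systemA/systemB/systemC become the system names).
sacrebleu ref.de -i systemA.de systemB.de systemC.de -m bleu chrf
```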
Tab-separated STDIN:
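Presumably the same evaluation can be fed through a tab-separated stream, e.g. built with `paste` (same placeholder file names as above):

```bash
# Hypothetical: build a tab-separated stream of the three systems and pipe it into sacreBLEU.
paste systemA.de systemB.de systemC.de | sacrebleu ref.de -m bleu chrf
```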
LaTeX mode:
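And the same run rendered as a LaTeX table via the format flag described above:

```bash
# Hypothetical: render the multi-system results table as LaTeX.
sacrebleu ref.de -i systemA.de systemB.de systemC.de -m bleu chrf -f latex
```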
Single-system bootstrap confidence intervals (requires numpy) (#40 and #78)
95% confidence intervals are provided only for the single-system evaluation mode.
If you have multiple systems, we recommend using paired tests that will provide
both the confidence intervals and the p-values.
The feature is enabled by passing `--confidence` to the CLI. The default number of bootstrap resamples is 2000; this can be changed with the `--confidence-n` flag. The random number generator's seed is fixed to 12345 by default. The seed can be modified by exporting the `SACREBLEU_SEED` environment variable. If the exported value is `[Nn]one`, the seed is uninitialized, yielding non-deterministic results.
Unit tests are added to compare the results to Moses' significance Perl script.
Fixed seed:
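A sketch of such a run, using only the flags described above and placeholder file names; the seed defaults to 12345 and is exported here explicitly for emphasis:

```bash
# Hypothetical: 95% bootstrap CI for a single system with a fixed seed.
SACREBLEU_SEED=12345 sacrebleu ref.de -i systemA.de --confidence
```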
Random seed:
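And presumably with the seed disabled, so that repeated runs resample differently; `--confidence-n` is included just to show the flag:

```bash
# Hypothetical: uninitialized seed, non-deterministic resampling; 1000 resamples instead of the default 2000.
SACREBLEU_SEED=None sacrebleu ref.de -i systemA.de --confidence --confidence-n 1000
```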
Multi-system paired significance tests (#40 and #78)
When you have multiple systems to evaluate for a given test set and language pair, you can now use paired significance tests to obtain p-values. The first system provided to `--input/-i` (or the first column of hypotheses if the pasted `STDIN` method is used) will be flagged as the baseline system. When using `--input/-i`, sacreBLEU will automatically discard the baseline system if it appears more than one time. This is useful when using shell globs.

Two types of paired tests are provided: bootstrap resampling (`bs`) and approximate randomization (`ar`). `bs` replicates the behavior of Moses' significance Perl script, whereas `ar` follows Multeval for performing approximate randomization. The feature is enabled by passing one of these two methods to the `--paired` flag. The default numbers of samples/trials for `bs` and `ar` are 2,000 and 10,000, respectively; this can be changed using the `--paired-n/-pan` flag. The `bs` test will also print a 95% CI around the true mean as additional information. To enable the same type of CIs for the AR test, pass `--paired-ar-confidence-n 0`, for example, to use the default value of 2000 resamples. Verbose information printed during the tests can be disabled with `--quiet`.

Example of evaluating 16 WMT17 submissions with 2 metrics (a hedged sketch is given at the end of this section):

If you also enable TER, it takes ~3.5 minutes to complete. Therefore, it is also possible to run the tests using multiple workers (only on Linux and Mac OS X). Passing `0` to `--paired-jobs/-paj` will launch as many workers as the number of systems (up to the limit of the number of CPUs on the machine), whereas passing a value `> 0` will manually set the number of workers in the pool. For the above example, it takes ~40 seconds to complete using 15 workers (for 15 candidate systems, excluding the baseline).
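A sketch of the WMT17 example mentioned above. The glob and the `-t/-l/-m` flags (test set, language pair, metric selection) are assumptions used for illustration; `--paired bs` is the method flag described in this section:

```bash
# Hypothetical: paired bootstrap resampling over all submissions matched by the glob.
# The first file in the expanded glob is treated as the baseline system.
sacrebleu -t wmt17 -l en-de -i newstest2017.*.en-de -m bleu chrf --paired bs
```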
Approximate randomization method:
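Presumably the approximate randomization variant just swaps the method, optionally with parallel workers (same illustrative glob and flags as above):

```bash
# Hypothetical: approximate randomization, one worker per candidate system (--paired-jobs 0).
sacrebleu -t wmt17 -l en-de -i newstest2017.*.en-de -m bleu chrf --paired ar --paired-jobs 0
```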
SacreBLEU 2.0.0 performance tests
[…] where caching kicks in. (The reason why no-cache also speeds up things after the 1st evaluation is the caching in the tokenizers.)
[…] performance diff between versions.