Speed up (w/ numpy) #46

Closed
martinpopel opened this issue Sep 10, 2019 · 14 comments · Fixed by #152

@martinpopel
Collaborator

Can we make SacreBLEU faster, possibly using numpy, multithreading or even a GPU, while still keeping it reliable and easy to install?

This issue should serve for sharing ideas and coordinating our efforts (PRs).

I am not aware of any particular numpy BLEU implementation. I just know (and I guess @mjpost does too) that the chrF implementation in SacreBLEU is taken from Sockeye, but it uses List[float] instead of np.array. I am not sure whether this has any substantial impact on speed.
I have not done any profiling, but I guess most of the time is spent on tokenization and maybe on n-gram extraction and intersection, which could be replaced with a Counter intersection similarly to the chrF implementation, assuming that Python 3's Counter is C-optimized and faster (see the sketch below).
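For illustration, a minimal sketch of the Counter-based idea (not sacrebleu's actual code): extracting n-grams into a Counter and intersecting the hypothesis and reference counters yields exactly the clipped match counts that BLEU needs, with the counting done in C-level dict operations.

```python
from collections import Counter

def extract_ngrams(tokens, max_order=4):
    """Count all n-grams of the token list up to max_order."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
# Counter & Counter keeps the element-wise minimum, i.e. the clipped counts.
clipped = extract_ngrams(hyp) & extract_ngrams(ref)
print(sum(v for ng, v in clipped.items() if len(ng) == 1))  # clipped 1-gram matches: 5
```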

Numpy can be useful if bootstrap resampling is added (cf. #40, #11).

The international tokenization has been optimized using lru_cache. However, there is still a loop over all Unicode code points in _property_chars on every execution of sacrebleu, which could be avoided by adding the regex dependency (imported conditionally, only when --tok intl is requested).

@ozancaglayan
Collaborator

numpy, maybe. I had some preliminary code somewhere that constructs the n-grams very easily by playing with the strides of an object tensor, but the subsequent counting of n-grams etc. seemed to require going back and forth between numpy and Python. Multiprocessing can be useful for significance testing, but I doubt that a GPU would help =)
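A hedged sketch of the stride idea (illustrative only, not the WIP code; assumes numpy >= 1.20 for sliding_window_view): the n-gram windows come out of numpy for free, but counting them falls back to Python tuples, which is the back-and-forth issue mentioned above.

```python
from collections import Counter

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

tokens = np.array("the cat sat on the mat".split(), dtype=object)
bigrams = sliding_window_view(tokens, 2)  # zero-copy view, one row per bigram
print(bigrams.shape)                      # (5, 2)
# Counting still requires converting rows back into hashable Python tuples.
print(Counter(map(tuple, bigrams))[('the', 'cat')])  # 1
```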

@mjpost
Owner

mjpost commented Jul 2, 2020

Yeah, plus I'd hate to introduce a GPU dependency...

One thing we could do is cache the counts for (reference, tokenization) pairs and store them under $SACREBLEU, perhaps keyed by a hash. Then these could just be loaded instead of recomputed. I'm not sure how much time this would save, though, since the bulk of the time is probably spent computing sentence-level n-gram precisions.
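A rough sketch of how such a cache might look (the directory layout and the helper name here are hypothetical, not part of sacrebleu):

```python
import hashlib
import os
import pickle

def cached_ref_counts(ref_path, tokenizer_name, compute_fn):
    """Load reference n-gram counts from a $SACREBLEU cache, computing them only once."""
    cache_dir = os.environ.get('SACREBLEU', os.path.expanduser('~/.sacrebleu'))
    key = hashlib.sha256(f'{ref_path}:{tokenizer_name}'.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, 'cache', f'{key}.pkl')
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)                  # cache hit: just load
    counts = compute_fn(ref_path, tokenizer_name)  # cache miss: do the expensive work
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(counts, f)
    return counts
```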

@ozancaglayan
Collaborator

@martinpopel How does regex avoid the cycling through all Unicode code points? There is little documentation for regex; does it ship a cached version of the code points?

@martinpopel
Collaborator Author

martinpopel commented Feb 19, 2021

@ozancaglayan
Currently we use something like the following (note that adding @functools.lru_cache(maxsize=1) makes it even slower):

%timeit import sys, unicodedata, re; punct = ''.join([chr(x) for x in range(sys.maxunicode) if unicodedata.category(chr(x)).startswith('P')]); r = re.compile(r'([^\d])([' + punct + r'])')
238 ms ± 8.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit re.match(r, 'aa')
2.08 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit re.match(r, 'a.')
1.89 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

By using regex we could make the initialization much faster at the cost of slightly slower runtime:

%timeit import regex; r = regex.compile(r'([^\d])(\p{P})')
1.54 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit regex.match(r, 'aa')
5.44 µs ± 47.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit regex.match(r, 'a.')
5.56 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

We should benchmark both versions on a typical file size, but (238 ms - 1.54 µs) / (5.44 µs - 2.08 µs) ≈ 70k, so for use cases with fewer than roughly 70k match calls, regex should be faster overall.

Maybe we should give up readability (and the possibility of new Unicode versions defining new punctuation code points, though that would not be good for sacrebleu's replicability anyway) and come up with an even faster implementation, e.g. something like:

%timeit punct = '!"#%&\'()*,-./:;?@[\\]_{}¡§«¶·»¿;·՚՛՜՝՞՟։֊־׀׃׆׳״؉؊،؍؛؞؟٪٫٬٭۔܀܁܂܃܄܅܆܇܈܉܊܋܌܍߷߸߹࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾࡞।॥॰૰෴๏๚๛༄༅༆༇༈༉༊་༌།༎༏༐༑༒༔༺༻༼༽྅࿐࿑࿒࿓࿔࿙࿚၊။၌၍၎၏჻፠፡።፣፤፥፦፧፨᐀᙭᙮᚛᚜᛫᛬᛭᜵᜶។៕៖៘៙៚᠀᠁᠂᠃᠄᠅᠆᠇᠈᠉᠊᥄᥅᨞᨟᪠᪡᪢᪣᪤᪥᪦᪨᪩᪪᪫᪬᪭᭚᭛᭜᭝᭞᭟᭠᯼᯽᯾᯿᰻᰼᰽᰾᰿᱾᱿᳀᳁᳂᳃᳄᳅᳆᳇᳓‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‰‱′″‴‵‶‷‸‹›※‼‽‾‿⁀⁁⁂⁃⁅⁆⁇⁈⁉⁊⁋⁌⁍⁎⁏⁐⁑⁓⁔⁕⁖⁗⁘⁙⁚⁛⁜⁝⁞⁽⁾₍₎⌈⌉⌊⌋〈〉❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫⟬⟭⟮⟯⦃⦄⦅⦆⦇⦈⦉⦊⦋⦌⦍⦎⦏⦐⦑⦒⦓⦔⦕⦖⦗⦘⧘⧙⧚⧛⧼⧽⳹⳺⳻⳼⳾⳿⵰⸀⸁⸂⸃⸄⸅⸆⸇⸈⸉⸊⸋⸌⸍⸎⸏⸐⸑⸒⸓⸔⸕⸖⸗⸘⸙⸚⸛⸜⸝⸞⸟⸠⸡⸢⸣⸤⸥⸦⸧⸨⸩⸪⸫⸬⸭⸮⸰⸱⸲⸳⸴⸵⸶⸷⸸⸹⸺⸻⸼⸽⸾⸿⹀⹁⹂⹃⹄、。〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〽゠・꓾꓿꘍꘎꘏꙳꙾꛲꛳꛴꛵꛶꛷꡴꡵꡶꡷꣎꣏꣸꣹꣺꣼꤮꤯꥟꧁꧂꧃꧄꧅꧆꧇꧈꧉꧊꧋꧌꧍꧞꧟꩜꩝꩞꩟꫞꫟꫰꫱꯫﴾﴿︐︑︒︓︔︕︖︗︘︙︰︱︲︳︴︵︶︷︸︹︺︻︼︽︾︿﹀﹁﹂﹃﹄﹅﹆﹇﹈﹉﹊﹋﹌﹍﹎﹏﹐﹑﹒﹔﹕﹖﹗﹘﹙﹚﹛﹜﹝﹞﹟﹠﹡﹣﹨﹪﹫!"#%&'()*,-./:;?@[\]_{}⦅⦆。「」、・𐄀𐄁𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐𐩑𐩒𐩓𐩔𐩕𐩖𐩗𐩘𐩿𐫰𐫱𐫲𐫳𐫴𐫵𐫶𐬹𐬺𐬻𐬼𐬽𐬾𐬿𐮙𐮚𐮛𐮜𑁇𑁈𑁉𑁊𑁋𑁌𑁍𑂻𑂼𑂾𑂿𑃀𑃁𑅀𑅁𑅂𑅃𑅴𑅵𑇅𑇆𑇇𑇈𑇉𑇍𑇛𑇝𑇞𑇟𑈸𑈹𑈺𑈻𑈼𑈽𑊩𑑋𑑌𑑍𑑎𑑏𑑛𑑝𑓆𑗁𑗂𑗃𑗄𑗅𑗆𑗇𑗈𑗉𑗊𑗋𑗌𑗍𑗎𑗏𑗐𑗑𑗒𑗓𑗔𑗕𑗖𑗗𑙁𑙂';
9.32 ns ± 0.171 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

%timeit '.' in punct
29.7 ns ± 0.351 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit 'a' in punct
69.9 ns ± 0.567 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

@ozancaglayan
Collaborator

  1. regex.compile(r'([^\d])(\p{P})')
     I don't think this covers all punctuation classes, only P.

  2. Puncts are small, but symbols are huge, around 8K :) So we can't hardcode them; it would be too ugly.

  3. I did cache them using pickle in my WIP branch, though; that gave a ~10% speedup once the cache is built.

  4. I don't understand your last set of results, where you simply check whether something is inside a str. How would that integrate into our case?

@martinpopel
Collaborator Author

I don't think this covers all punctuation classes, only P.

I don't understand which other classes of punctuation you mean. Unicode category P includes them all: connector Pc, dash Pd,
initial quote Pi, final quote Pf, open Ps, close Pe, and other Po.
Of course, you may also want to tokenize on some other symbols, but that would not be compatible with mteval-v14.pl --international-tokenization, which is the whole point of sacrebleu --tok intl.

Puncts are small, but symbols are huge, around 8K :) So we can't hardcode them; it would be too ugly.

Where do you need 8K symbols? In is_chinese_char? I did not consider that, although it could perhaps be optimized as well.

I did cache them using pickle in my WIP branch, though; that gave a ~10% speedup once the cache is built.

I think using pickle to load a string of 748 characters is overkill, though it is still better than the current solution of iterating through sys.maxunicode (>1M) code points.

I don't understand your last set of results, where you simply check whether something is inside a str. How would that integrate into our case?

We just need to check for punctuation preceded or followed by a non-digit and separate them with a space. So we could, e.g., iterate over the string and, when a punctuation character is found, check whether the previous or following character is a non-digit. Implementing this in C would be straightforward and faster than regexes (but less readable). In Python it is trickier because strings are immutable, but there may be ways (see the sketch below). That said, I think this discussion is an example of premature optimization (the root of all evil): 238 ms of init time is surely not the biggest bottleneck in sacrebleu. We can replace the UnicodeRegex.punctuation() implementation with punct = '!"#%&\...' and we are done; there is no need to be faster than re.sub.
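A minimal sketch of the character-iteration idea (illustrative only; in pure Python this is unlikely to beat re.sub, as noted above): pad a punctuation character with spaces unless it sits between two digits, which roughly mirrors what the two punctuation regexes do.

```python
import unicodedata

def split_punct(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch).startswith('P'):
            prev_is_digit = i > 0 and text[i - 1].isdigit()
            next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
            if not prev_is_digit:
                out.append(' ')
            out.append(ch)
            if not next_is_digit:
                out.append(' ')
        else:
            out.append(ch)
    return ' '.join(''.join(out).split())  # normalize whitespace at the end

print(split_punct('Hello, world! It is 3.14 now.'))
# Hello , world ! It is 3.14 now .
```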

@ozancaglayan
Collaborator

So the current intl tokenizer fetches all characters whose category starts with P for punctuation and with S for symbols.
I initially thought there was a standalone P class in the spec, but apparently not.
So \p{P} probably encapsulates Pc, Pd, Ps, Pe, Pi, Pf and Po, right? The current code gives me a length of 792 for this large string.

If you check the intl tokenizer, the same thing is done for symbols, i.e. the S* classes. This set is huge, around 7292 characters, with all the emojis and so on. That's probably why computing BLEU with this tokenizer takes around 5.5 seconds for me; with caching it is down to 4.9. (See: $norm_text =~ s/(\p{S})/ $1 /g; # tokenize symbols in mteval)
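For reference, the Perl rule quoted above maps directly onto the regex module's Unicode property classes, so the huge precomputed character-class string is not needed at all (a hedged sketch, assuming the third-party regex package is installed):

```python
import regex  # third-party module, not the stdlib re

symbol_re = regex.compile(r'(\p{S})')
tokenized = symbol_re.sub(r' \1 ', 'price=5€+tax')
print(' '.join(tokenized.split()))  # price = 5 € + tax
```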

@ozancaglayan
Collaborator

regex brings a huge improvement to the intl tokenizer: it goes from ~4 seconds to 0.9 on one particular cs-en test. What do you think about adding regex as a dependency to sacrebleu?

More curiously, I don't seem to reproduce the mteval-v14.pl -c --international-tokenization results exactly. There were no tests for this until I added one a couple of months ago, assuming that the sacreBLEU of that time was OK. But maybe the intl tokenizer has never produced the same results as the Perl variant?

Looking at the code in the Moses repo, I see more tokenization steps than the 3 regexps we apply for punctuation and symbols. I'll try to synchronise and see what happens.

@martinpopel
Collaborator Author

If you check the intl tokenizer, the same thing is done for symbols, i.e. the S* classes.

Oh, I see (I forgot, even though I wrote that code).

This set is huge, around 7292 characters,

Yes, now I remember: this was the reason I made the comment about regex.

What do you think about adding regex as a dependency to sacrebleu?

I don't see a reason why not. (There was a reason back when sacrebleu was a single script that required no installation.)

@mjpost
Owner

mjpost commented Feb 19, 2021

I prefer regex—the challenge is just to make sure it computes the same results.

ozancaglayan added a commit that referenced this issue Feb 19, 2021
This commit switches to the regex module to make punctuation and symbol
processing a lot faster (4.5s -> 0.9s for the tested en-cs system)
during --tokenize intl tokenization (#46)

This also handles #138, a small de-escaping glitch that left sacreBLEU out of
sync with mteval-v14.pl --international-tokenization.
ozancaglayan added this to the 2.0.0 milestone Feb 24, 2021
ozancaglayan added a commit that referenced this issue Feb 27, 2021
Replacing multiple whitespaces with one ' ' is faster in Python than RE
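A small, hedged illustration of the claim in that commit message (absolute timings vary by machine): whitespace normalization via str.split/str.join compared with a precompiled regex substitution.

```python
import re
import timeit

text = "Hello   world\t foo  bar " * 100
ws_re = re.compile(r'\s+')

print(timeit.timeit(lambda: ' '.join(text.split()), number=10000))
print(timeit.timeit(lambda: ws_re.sub(' ', text).strip(), number=10000))
```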
ozancaglayan added a commit that referenced this issue Feb 27, 2021
This speeds up the international tokenizer substantially, by around 5x.
@ozancaglayan
Collaborator

Working towards significance tests these days, I noticed that the slowness of TER makes things quite prohibitive when multiple systems are provided. I added multiprocessing support as well, but I wondered whether there would be a chance to numpy'ize the TER internals, @ales-t, and whether that would help?

@ales-t
Contributor

ales-t commented Mar 5, 2021

@ozancaglayan Yes, TER is really slow and the main culprit is the edit distance calculation, which needs to be done many times for a single sentence. There are already some optimizations in place, such as caching and "beam search"; without them it would be essentially unusable.

The BeamEditDistance class uses quite inefficient data types (lists of str), so one step would be to represent the sentence as a (numpy?) array of vocab IDs and work with that.

But it would probably not be trivial to express the algorithm in terms of efficient numpy array/matrix transformations. Maybe the easiest way to speed it up would be to write the core routine (including "beam" and caching) in C. But it's just a guess.
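A minimal sketch of the vocab-ID idea (purely illustrative, not sacrebleu's TER code): encode tokens as integers once, so the inner edit-distance loops compare ints or numpy slices instead of Python strings.

```python
import numpy as np

def to_ids(sentences, vocab=None):
    """Encode tokenized sentences as int32 numpy arrays sharing one vocabulary."""
    vocab = {} if vocab is None else vocab
    encoded = []
    for tokens in sentences:
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        encoded.append(np.asarray(ids, dtype=np.int32))
    return encoded, vocab

(hyp, ref), vocab = to_ids([["the", "cat", "sat"], ["the", "cat", "sat", "down"]])
print(hyp, ref)                      # [0 1 2] [0 1 2 3]
print(np.array_equal(hyp, ref[:3]))  # True: cheap integer comparison
```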

@ozancaglayan
Collaborator

Do you think this library would help for BeamEditDistance, or is BeamEditDistance quite different from a normal edit/Levenshtein distance?
https://github.com/roy-ht/editdistance

@ales-t
Contributor

ales-t commented Mar 5, 2021

I think it's too different. We need the backtrace (the sequence of edit operations), and I believe that without the optimizations (cache + beam) the algorithm would be too slow even in C++.
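To make the point about the backtrace concrete, here is an unoptimized sketch (no beam, no cache, nothing like sacrebleu's actual BeamEditDistance) of a Levenshtein DP that also recovers the sequence of edit operations, which distance-only libraries typically do not return:

```python
def edit_ops(hyp, ref):
    """Return (distance, ops) where ops uses '=', 'S', 'D', 'I'."""
    n, m = len(hyp), len(ref)
    # dist[i][j] = cost of turning hyp[:i] into ref[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace: walk from (n, m) to (0, 0), recording which move was optimal.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append('=' if hyp[i - 1] == ref[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return dist[n][m], ops[::-1]

d, ops = edit_ops("saturday", "sunday")
print(d, ''.join(ops))  # 3, plus one optimal op sequence over '=', 'S', 'D', 'I'
```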

ozancaglayan linked a pull request Mar 26, 2021 that will close this issue
ozancaglayan added a commit that referenced this issue Jul 18, 2021
  - Build: Add Windows and OS X testing to github workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not obey
    the expected annotations, exceptions will be raised. Robustness attempts led to
    confusion and obfuscated score errors in the past (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) with
    the help of the `colorama` package.
  - tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove the '+' separator, as it was
    interfering with chrF++). The field separator is now '|' and keys and values
    are separated with ':' rather than '.'.
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: allow modifying max_ngram_order (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
    shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument).
    The systems can be either given as a list of plain text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case,
    the basenames of the files will be automatically used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).