Speed up (w/ numpy) #46

Closed
martinpopel opened this issue Sep 10, 2019 · 14 comments · Fixed by #152

@martinpopel
Collaborator

Can we make SacreBLEU faster, possibly using numpy, multithreading or even a GPU, while still keeping it reliable and easy to install?

This issue should serve for sharing ideas and coordinating our efforts (PRs).

I am not aware of any particular numpy BLEU implementation. I just know (and I guess @mjpost does too) that the chrF implementation in SacreBLEU is taken from Sockeye, but it uses List[float] instead of np.array. I am not sure whether this has any substantial impact on speed.
I have not done any profiling, but I guess most of the time is spent on tokenization and maybe on n-gram extraction and intersection, which could be replaced with a Counter intersection similarly to the chrF implementation, assuming that Python 3's Counter is C-optimized and faster (see the sketch below).
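For illustration, a minimal sketch of the Counter-based idea (not sacrebleu's actual code): extracting n-grams into a Counter and intersecting the hypothesis and reference counters yields exactly the clipped match counts that BLEU needs, with the counting done in C-level dict operations.

```python
from collections import Counter

def extract_ngrams(tokens, max_order=4):
    """Count all n-grams of the token list up to max_order."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
# Counter & Counter keeps the element-wise minimum, i.e. the clipped counts.
clipped = extract_ngrams(hyp) & extract_ngrams(ref)
print(sum(v for ng, v in clipped.items() if len(ng) == 1))  # clipped 1-gram matches: 5
```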

Numpy can be useful if bootstrap resampling is added (cf. #40, #11).

The international tokenization has been optimized using lru_cache. However, there is still a loop over all Unicode code points in _property_chars on every execution of sacrebleu, which could be avoided by adding the regex dependency (imported conditionally, only when --tok intl is requested).

@ozancaglayan
Collaborator

numpy, maybe. I had some preliminary code somewhere that constructs the n-grams very easily by playing with the strides of an object tensor, but the subsequent counting of n-grams etc. seemed to require going back and forth between numpy and Python. Multiprocessing can be useful for significance testing, but I doubt that a GPU would help =)
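A hedged sketch of the stride idea (illustrative only, not the WIP code; assumes numpy >= 1.20 for sliding_window_view): the n-gram windows come out of numpy for free, but counting them falls back to Python tuples, which is the back-and-forth issue mentioned above.

```python
from collections import Counter

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

tokens = np.array("the cat sat on the mat".split(), dtype=object)
bigrams = sliding_window_view(tokens, 2)  # zero-copy view, one row per bigram
print(bigrams.shape)                      # (5, 2)
# Counting still requires converting rows back into hashable Python tuples.
print(Counter(map(tuple, bigrams))[('the', 'cat')])  # 1
```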

@mjpost
Owner

mjpost commented Jul 2, 2020

Yeah, plus I'd hate to introduce a GPU dependency...

One thing we could do is cache the counts for (reference, tokenization) pairs and store them under $SACREBLEU, perhaps keyed by a hash. Then these could just be loaded instead of recomputed. I'm not sure how much time this would save, though, since the bulk of the time is probably spent computing sentence-level n-gram precisions.
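A rough sketch of how such a cache might look (the directory layout and the helper name here are hypothetical, not part of sacrebleu):

```python
import hashlib
import os
import pickle

def cached_ref_counts(ref_path, tokenizer_name, compute_fn):
    """Load reference n-gram counts from a $SACREBLEU cache, computing them only once."""
    cache_dir = os.environ.get('SACREBLEU', os.path.expanduser('~/.sacrebleu'))
    key = hashlib.sha256(f'{ref_path}:{tokenizer_name}'.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, 'cache', f'{key}.pkl')
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)                  # cache hit: just load
    counts = compute_fn(ref_path, tokenizer_name)  # cache miss: do the expensive work
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(counts, f)
    return counts
```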

@ozancaglayan
Collaborator

@martinpopel How does regex avoid the cycling through all Unicode code points? There is little documentation for regex; does it ship a cached version of the code points?

@martinpopel
Collaborator Author

martinpopel commented Feb 19, 2021

@ozancaglayan
Currently we use something like the following (note that adding @functools.lru_cache(maxsize=1) makes it even slower):

%timeit import sys, unicodedata, re; punct = ''.join([chr(x) for x in range(sys.maxunicode) if unicodedata.category(chr(x)).startswith('P')]); r = re.compile(r'([^\d])([' + punct + r'])')
238 ms ± 8.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit re.match(r, 'aa')
2.08 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit re.match(r, 'a.')
1.89 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

By using regex we could make the initialization much faster at the cost of slightly slower runtime:

%timeit import regex; r = regex.compile(r'([^\d])(\p{P})')
1.54 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit regex.match(r, 'aa')
5.44 µs ± 47.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit regex.match(r, 'a.')
5.56 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

We should benchmark both versions on a typical file size, but (238 ms - 1.54 µs) / (5.44 µs - 2.08 µs) ≈ 70k, so for use cases with fewer than roughly 70k match calls, regex should be faster overall.

Maybe we should give up readability (and the possibility of new Unicode versions defining new punctuation code points, though that would not be good for sacrebleu's replicability anyway) and come up with an even faster implementation, e.g. something like:

%timeit punct = '!"#%&\'()*,-./:;?@[\\]_{}¡§«¶·»¿;·՚՛՜՝՞՟։֊־׀׃׆׳״؉؊،؍؛؞؟٪٫٬٭۔܀܁܂܃܄܅܆܇܈܉܊܋܌܍߷߸߹࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾࡞।॥॰૰෴๏๚๛༄༅༆༇༈༉༊་༌།༎༏༐༑༒༔༺༻༼༽྅࿐࿑࿒࿓࿔࿙࿚၊။၌၍၎၏჻፠፡።፣፤፥፦፧፨᐀᙭᙮᚛᚜᛫᛬᛭᜵᜶។៕៖៘៙៚᠀᠁᠂᠃᠄᠅᠆᠇᠈᠉᠊᥄᥅᨞᨟᪠᪡᪢᪣᪤᪥᪦᪨᪩᪪᪫᪬᪭᭚᭛᭜᭝᭞᭟᭠᯼᯽᯾᯿᰻᰼᰽᰾᰿᱾᱿᳀᳁᳂᳃᳄᳅᳆᳇᳓‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‰‱′″‴‵‶‷‸‹›※‼‽‾‿⁀⁁⁂⁃⁅⁆⁇⁈⁉⁊⁋⁌⁍⁎⁏⁐⁑⁓⁔⁕⁖⁗⁘⁙⁚⁛⁜⁝⁞⁽⁾₍₎⌈⌉⌊⌋〈〉❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫⟬⟭⟮⟯⦃⦄⦅⦆⦇⦈⦉⦊⦋⦌⦍⦎⦏⦐⦑⦒⦓⦔⦕⦖⦗⦘⧘⧙⧚⧛⧼⧽⳹⳺⳻⳼⳾⳿⵰⸀⸁⸂⸃⸄⸅⸆⸇⸈⸉⸊⸋⸌⸍⸎⸏⸐⸑⸒⸓⸔⸕⸖⸗⸘⸙⸚⸛⸜⸝⸞⸟⸠⸡⸢⸣⸤⸥⸦⸧⸨⸩⸪⸫⸬⸭⸮⸰⸱⸲⸳⸴⸵⸶⸷⸸⸹⸺⸻⸼⸽⸾⸿⹀⹁⹂⹃⹄、。〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〽゠・꓾꓿꘍꘎꘏꙳꙾꛲꛳꛴꛵꛶꛷꡴꡵꡶꡷꣎꣏꣸꣹꣺꣼꤮꤯꥟꧁꧂꧃꧄꧅꧆꧇꧈꧉꧊꧋꧌꧍꧞꧟꩜꩝꩞꩟꫞꫟꫰꫱꯫﴾﴿︐︑︒︓︔︕︖︗︘︙︰︱︲︳︴︵︶︷︸︹︺︻︼︽︾︿﹀﹁﹂﹃﹄﹅﹆﹇﹈﹉﹊﹋﹌﹍﹎﹏﹐﹑﹒﹔﹕﹖﹗﹘﹙﹚﹛﹜﹝﹞﹟﹠﹡﹣﹨﹪﹫!"#%&'()*,-./:;?@[\]_{}⦅⦆。「」、・𐄀𐄁𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐𐩑𐩒𐩓𐩔𐩕𐩖𐩗𐩘𐩿𐫰𐫱𐫲𐫳𐫴𐫵𐫶𐬹𐬺𐬻𐬼𐬽𐬾𐬿𐮙𐮚𐮛𐮜𑁇𑁈𑁉𑁊𑁋𑁌𑁍𑂻𑂼𑂾𑂿𑃀𑃁𑅀𑅁𑅂𑅃𑅴𑅵𑇅𑇆𑇇𑇈𑇉𑇍𑇛𑇝𑇞𑇟𑈸𑈹𑈺𑈻𑈼𑈽𑊩𑑋𑑌𑑍𑑎𑑏𑑛𑑝𑓆𑗁𑗂𑗃𑗄𑗅𑗆𑗇𑗈𑗉𑗊𑗋𑗌𑗍𑗎𑗏𑗐𑗑𑗒𑗓𑗔𑗕𑗖𑗗𑙁𑙂';
9.32 ns ± 0.171 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

%timeit '.' in punct
29.7 ns ± 0.351 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit 'a' in punct
69.9 ns ± 0.567 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

@ozancaglayan
Collaborator

  1. regex.compile(r'([^\d])(\p{P})')
     I don't think this covers all punctuation classes, only P.

  2. Puncts are small, but symbols are huge, around 8K :) So we can't hardcode them; it would be too ugly.

  3. I did cache them using pickle in my WIP branch, though; that gave a ~10% speedup once the cache is built.

  4. I don't understand your last set of results, where you simply check whether something is inside a str. How would that integrate into our case?

@martinpopel
Collaborator Author

I don't think this covers all punctuation classes, only P.

I don't understand which other classes of punctuation you mean. Unicode category P includes them all: connector Pc, dash Pd,
initial quote Pi, final quote Pf, open Ps, close Pe, and other Po.
Of course, you may also want to tokenize on some other symbols, but that would not be compatible with mteval-v14.pl --international-tokenization, which is the whole point of sacrebleu --tok intl.

Puncts are small, but symbols are huge, around 8K :) So we can't hardcode them; it would be too ugly.

Where do you need 8K symbols? In is_chinese_char? I did not consider that, although it could perhaps be optimized as well.

I did cache them using pickle in my WIP branch, though; that gave a ~10% speedup once the cache is built.

I think using pickle to load a string of 748 characters is overkill, though it is still better than the current solution of iterating through sys.maxunicode (>1M) code points.

I don't understand your last set of results, where you simply check whether something is inside a str. How would that integrate into our case?

We just need to check for punctuation preceded or followed by a non-digit and separate them with a space. So we could, e.g., iterate over the string and, when a punctuation character is found, check whether the previous or following character is a non-digit. Implementing this in C would be straightforward and faster than regexes (but less readable). In Python it is trickier because strings are immutable, but there may be ways (see the sketch below). That said, I think this discussion is an example of premature optimization (the root of all evil): 238 ms of init time is surely not the biggest bottleneck in sacrebleu. We can replace the UnicodeRegex.punctuation() implementation with punct = '!"#%&\...' and we are done; there is no need to be faster than re.sub.
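A minimal sketch of the character-iteration idea (illustrative only; in pure Python this is unlikely to beat re.sub, as noted above): pad a punctuation character with spaces unless it sits between two digits, which roughly mirrors what the two punctuation regexes do.

```python
import unicodedata

def split_punct(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch).startswith('P'):
            prev_is_digit = i > 0 and text[i - 1].isdigit()
            next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
            if not prev_is_digit:
                out.append(' ')
            out.append(ch)
            if not next_is_digit:
                out.append(' ')
        else:
            out.append(ch)
    return ' '.join(''.join(out).split())  # normalize whitespace at the end

print(split_punct('Hello, world! It is 3.14 now.'))
# Hello , world ! It is 3.14 now .
```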

@ozancaglayan
Collaborator

So the current intl tokenizer fetches all characters whose category starts with P for punctuation and with S for symbols.
I initially thought there was a standalone P class in the spec, but apparently not.
So \p{P} probably encapsulates Pc, Pd, Ps, Pe, Pi, Pf and Po, right? The current code gives me a length of 792 for this large string.

If you check the intl tokenizer, the same thing is done for symbols, i.e. the S* classes. This set is huge, around 7292 characters, with all the emojis and so on. That's probably why computing BLEU with this tokenizer takes around 5.5 seconds for me; with caching it is down to 4.9. (See: $norm_text =~ s/(\p{S})/ $1 /g; # tokenize symbols in mteval)
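For reference, the Perl rule quoted above maps directly onto the regex module's Unicode property classes, so the huge precomputed character-class string is not needed at all (a hedged sketch, assuming the third-party regex package is installed):

```python
import regex  # third-party module, not the stdlib re

symbol_re = regex.compile(r'(\p{S})')
tokenized = symbol_re.sub(r' \1 ', 'price=5€+tax')
print(' '.join(tokenized.split()))  # price = 5 € + tax
```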

@ozancaglayan
Collaborator

regex brings a huge improvement to the intl tokenizer: it goes from ~4 seconds to 0.9 on one particular cs-en test. What do you think about adding regex as a dependency to sacrebleu?

More curiously, I don't seem to reproduce the mteval-v14.pl -c --international-tokenization results exactly. There were no tests for this until I added one a couple of months ago, assuming that the sacreBLEU of that time was OK. But maybe the intl tokenizer has never produced the same results as the Perl variant?

Looking at the code in the Moses repo, I see more tokenization steps than the 3 regexps we apply for punctuation and symbols. I'll try to synchronise and see what happens.

@martinpopel
Collaborator Author

If you check the intl tokenizer, the same thing is done for symbols, i.e. the S* classes.

Oh, I see (I forgot, even though I wrote that code).

This set is huge, around 7292 characters,

Yes, now I remember: this was the reason I made the comment about regex.

What do you think about adding regex as a dependency to sacrebleu?

I don't see a reason why not. (There was a reason back when sacrebleu was a single script that required no installation.)

@mjpost
Owner

mjpost commented Feb 19, 2021

I prefer regex—the challenge is just to make sure it computes the same results.

ozancaglayan added a commit that referenced this issue Feb 19, 2021
This commit switches to the regex module to make punctuation and symbol
processing a lot faster (4.5s -> 0.9s for the tested en-cs system)
during --tokenize intl tokenization (#46)

This also handles #138, a small de-escaping glitch that left sacreBLEU out of
sync with mteval-v14.pl --international-tokenization.
ozancaglayan added this to the 2.0.0 milestone Feb 24, 2021
ozancaglayan added a commit that referenced this issue Feb 27, 2021
Replacing multiple whitespaces with one ' ' is faster in Python than RE
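A small, hedged illustration of the claim in that commit message (absolute timings vary by machine): whitespace normalization via str.split/str.join compared with a precompiled regex substitution.

```python
import re
import timeit

text = "Hello   world\t foo  bar " * 100
ws_re = re.compile(r'\s+')

print(timeit.timeit(lambda: ' '.join(text.split()), number=10000))
print(timeit.timeit(lambda: ws_re.sub(' ', text).strip(), number=10000))
```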
ozancaglayan added a commit that referenced this issue Feb 27, 2021
This speeds up the international tokenizer substantially, by around 5x.
@ozancaglayan
Collaborator

Working towards significance tests these days, I noticed that the slowness of TER makes things quite prohibitive when multiple systems are provided. I added multiprocessing support as well, but I wondered whether there would be a chance to numpy'ize the TER internals, @ales-t, and whether that would help?

@ales-t
Contributor

ales-t commented Mar 5, 2021

@ozancaglayan Yes, TER is really slow and the main culprit is the edit distance calculation, which needs to be done many times for a single sentence. There are already some optimizations in place, such as caching and "beam search"; without them it would be essentially unusable.

The BeamEditDistance class uses quite inefficient data types (lists of str), so one step would be to represent the sentence as a (numpy?) array of vocab IDs and work with that.

But it would probably not be trivial to express the algorithm in terms of efficient numpy array/matrix transformations. Maybe the easiest way to speed it up would be to write the core routine (including "beam" and caching) in C. But it's just a guess.
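A minimal sketch of the vocab-ID idea (purely illustrative, not sacrebleu's TER code): encode tokens as integers once, so the inner edit-distance loops compare ints or numpy slices instead of Python strings.

```python
import numpy as np

def to_ids(sentences, vocab=None):
    """Encode tokenized sentences as int32 numpy arrays sharing one vocabulary."""
    vocab = {} if vocab is None else vocab
    encoded = []
    for tokens in sentences:
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        encoded.append(np.asarray(ids, dtype=np.int32))
    return encoded, vocab

(hyp, ref), vocab = to_ids([["the", "cat", "sat"], ["the", "cat", "sat", "down"]])
print(hyp, ref)                      # [0 1 2] [0 1 2 3]
print(np.array_equal(hyp, ref[:3]))  # True: cheap integer comparison
```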

@ozancaglayan
Collaborator

Do you think this library would help for BeamEditDistance, or is BeamEditDistance quite different from a normal edit/Levenshtein distance?
https://github.com/roy-ht/editdistance

@ales-t
Contributor

ales-t commented Mar 5, 2021

I think it's too different. We need the backtrace (the sequence of edit operations), and I believe that without the optimizations (cache + beam) the algorithm would be too slow even in C++.
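To make the point about the backtrace concrete, here is an unoptimized sketch (no beam, no cache, nothing like sacrebleu's actual BeamEditDistance) of a Levenshtein DP that also recovers the sequence of edit operations, which distance-only libraries typically do not return:

```python
def edit_ops(hyp, ref):
    """Return (distance, ops) where ops uses '=', 'S', 'D', 'I'."""
    n, m = len(hyp), len(ref)
    # dist[i][j] = cost of turning hyp[:i] into ref[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace: walk from (n, m) to (0, 0), recording which move was optimal.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append('=' if hyp[i - 1] == ref[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return dist[n][m], ops[::-1]

d, ops = edit_ops("saturday", "sunday")
print(d, ''.join(ops))  # 3, plus one optimal op sequence over '=', 'S', 'D', 'I'
```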

ozancaglayan linked a pull request Mar 26, 2021 that will close this issue
ozancaglayan added a commit that referenced this issue Jul 18, 2021
  - Build: Add Windows and OS X testing to github workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not obey
    the expected annotations, exceptions will be raised. Robustness attempts led to
    confusion and obfuscated score errors in the past (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) with
    the help of the `colorama` package.
  - tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove the '+' separator, as it was
    interfering with chrF++). The field separator is now '|' and keys and values
    are separated with ':' rather than '.'.
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: allow modifying max_ngram_order (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
    shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument).
    The systems can be either given as a list of plain text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case,
    the basenames of the files will be automatically used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).