Speed up (w/ numpy) #46
numpy, maybe. I did have some preliminary code somewhere that constructs the n-grams very easily by playing with the strides of an object tensor. But afterwards, counting the n-grams etc. seemed to require going back and forth between numpy and Python. Multi-processing can be useful for significance testing, but I doubt that a GPU would help =)
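For illustration, a minimal sketch of the stride trick described above (assuming numpy >= 1.20 for `sliding_window_view`; this is not the preliminary code referred to):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ngram_windows(tokens, n):
    """All n-grams of `tokens` as rows of a 2-D strided view (no data copy)."""
    arr = np.array(tokens, dtype=object)
    if arr.size < n:
        return np.empty((0, n), dtype=object)
    return sliding_window_view(arr, n)

# ngram_windows("the cat sat on the mat".split(), 2)
# -> [['the' 'cat'] ['cat' 'sat'] ['sat' 'on'] ['on' 'the'] ['the' 'mat']]
```

Counting these windows still requires turning rows back into hashable tuples (e.g. for a `Counter`), which is exactly the numpy/Python back-and-forth mentioned above.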
Yeah, plus I'd hate to introduce a GPU dependency... One thing we could do is cache the counts for (reference, tokenization) pairs, and store them under …
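A hedged sketch of an in-memory variant of that idea (the tokenizer registry and function names here are hypothetical, not sacreBLEU's API):

```python
from collections import Counter
from functools import lru_cache

# Hypothetical tokenizer registry, for the sake of the example only.
TOKENIZERS = {"none": str.split}

@lru_cache(maxsize=None)
def cached_ref_ngrams(reference: str, tokenizer_name: str, max_order: int = 4):
    """Count reference n-grams once per (reference, tokenization) pair."""
    tokens = TOKENIZERS[tokenizer_name](reference)
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```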
@martinpopel How does …
@ozancaglayan

```
%timeit import sys, unicodedata, re; punct = ''.join([chr(x) for x in range(sys.maxunicode) if unicodedata.category(chr(x)).startswith('P')]); r = re.compile(r'([^\d])([' + punct + r'])')
238 ms ± 8.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit re.match(r, 'aa')
2.08 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit re.match(r, 'a.')
1.89 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

By using the `regex` module:

```
%timeit import regex; r = regex.compile(r'([^\d])(\p{P})')
1.54 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit regex.match(r, 'aa')
5.44 µs ± 47.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit regex.match(r, 'a.')
5.56 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

We should benchmark both versions on a typical file size, but (238 ms - 1.54 µs) / (5.44 µs - 2.08 µs) ≈ 70k, so for use cases with fewer than roughly 70k match calls the `regex` version wins overall (its compilation is essentially free, while its per-match cost is higher). Maybe we should give up readability (and the possibility of new Unicode versions defining new punctuation code points, though that would not be good for sacrebleu's replicability anyway) and come up with an even faster implementation, e.g. something like the sketch below:
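The snippet that originally followed is not preserved in this excerpt; below is a minimal sketch of the kind of approach discussed here and later in the thread: build the punctuation set once (or ship it precomputed) and do a single pass with set-membership checks instead of a huge compiled character class. The function names and the exact replacement rule are assumptions, not the original code.

```python
import sys
import unicodedata
from functools import lru_cache

@lru_cache(maxsize=1)
def _punct_set():
    # Could also be precomputed offline and shipped as a literal string
    # (~748 characters) to avoid scanning all code points at runtime.
    return frozenset(
        chr(cp) for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith('P')
    )

def separate_punct(text: str) -> str:
    # Pad punctuation preceded by a non-digit, mirroring the ([^\d])(\p{P})
    # rule benchmarked above; the exact replacement behaviour is an assumption.
    punct = _punct_set()
    out, prev = [], ''
    for ch in text:
        if ch in punct and prev and not prev.isdigit():
            out.append(' ')
        out.append(ch)
        prev = ch
    return ''.join(out)
```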
I don't understand which other classes of punctuation you mean. Unicode category P includes them all: connector Pc, dash Pd, open Ps, close Pe, initial quote Pi, final quote Pf, and other Po.
Where do you need 8K symbols? In …
I think using pickle for loading a string of 748 characters is overkill, though it is still better than the current solution of iterating through all Unicode code points.
We just need to check for a digit followed or preceded by a punctuation symbol and separate them with a space. So we could, e.g., iterate over the string and, when a digit is found, check whether the previous or following character is a punctuation mark. Implementing this in C would be straightforward and faster than regexes (but less readable). In Python it is trickier because strings are immutable, but there may be ways. That said, I think this discussion is an example of premature optimization (the root of all evil): 238 ms of init time is surely not the biggest bottleneck in sacrebleu. We can replace …
So the current … If you check the intl tokenizer, the same thing is done for symbols, i.e. …
But more curiously, I don't seem to get the … Looking at the code in the Moses repo, I see more tokenization steps than the 3 regexps that we apply for punctuation and symbols. I'll try to synchronise and see what happens.
Oh, I see (I forgot, even though it was me who wrote the code).
Yes. Now I remember, this was the reason I made the comment about …
I don't see a reason why not. (There was a reason at the time when sacrebleu was a single script with no need to install.)
I prefer …
This commit switches to the `regex` module to make punctuation and symbol processing a lot faster (4.5s -> 0.9s for the tested en-cs system) during `--tokenize intl` tokenization (#46). This also handles #138, a small de-escaping glitch that leaves sacreBLEU out of sync with mteval14.pl --international-tokenization.
Replacing multiple whitespaces with a single ' ' is faster in pure Python than with a regex. This speeds up the international tokenizer substantially, by around 5x.
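The non-regex idiom referred to here is presumably something along these lines (a sketch, not the exact commit):

```python
import re

_WS_RE = re.compile(r'\s+')

def squash_ws_regex(line: str) -> str:
    return _WS_RE.sub(' ', line).strip()

def squash_ws_python(line: str) -> str:
    # str.split() with no argument already splits on runs of arbitrary
    # whitespace and drops leading/trailing whitespace, so join/split
    # does the same job without the regex engine.
    return ' '.join(line.split())
```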
Working towards significance tests these days, I noticed that the slowness of TER makes things quite prohibitive when multiple systems are provided. I added …
@ozancaglayan Yes, TER is really slow and the main culprit is the edit distance calculation, which needs to be done many times for a single sentence. There are already some optimizations in place (without them it would be essentially unusable), such as caching and "beam search". The … But it would probably not be trivial to express the algorithm in terms of efficient …
Do you think this library would help for …
I think it's too different. We need the backtrace (sequence of edit operations), and I believe that without the optimizations (cache+beam), even in C++ the algorithm would be too slow.
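For context, a bare-bones sketch of the kind of dynamic program being discussed: edit distance plus the backtrace of operations. This is not sacreBLEU's TER code, which adds caching, beam pruning and shift moves on top of this idea.

```python
def edit_distance_with_trace(hyp, ref):
    """Return (cost, operations) for aligning token lists hyp and ref."""
    m, n = len(hyp), len(ref)
    # dp[i][j] = cost of aligning hyp[:i] with ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append('match' if hyp[i - 1] == ref[j - 1] else 'sub')
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append('del')
            i -= 1
        else:
            ops.append('ins')
            j -= 1
    return dp[m][n], ops[::-1]
```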
- Build: Add Windows and OS X testing to the GitHub workflow.
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Drop input type manipulation through `isinstance` checks. If the user does not obey the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past (fixes #121).
- Use colored strings in tabular outputs (multi-system evaluation mode) through the help of the `colorama` package.
- tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
- `intl` tokenizer: Use the `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation (fixes #46).
- Signature: Formatting changed (mostly to remove the '+' separator, as it was interfering with chrF++). The field separator is now '|' and key values are separated with ':' rather than '.'.
- Metrics: Scale all metrics into the [0, 100] range (fixes #140).
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
- BLEU: Allow modifying `max_ngram_order` (fixes #156).
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added a test case.
- CHRF: Added chrF+ support through the `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (--chrf-word-order) (fixes #124).
- CHRF: Add the possibility to disable effective order smoothing (pass --chrf-eps-smoothing). This way, the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. We keep the effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences (fixes #144).
- CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your shell.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of the `tabulate` package, the results are nicely rendered into a plain text table, LaTeX, HTML or RST (cf. the --format/-f argument). The systems can be given either as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will automatically be used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (`--confidence` flag), as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).
Can we make SacreBLEU faster, possibly using numpy, multithreading or even GPU? And still keep it reliable and easy to install?
This issue should serve for sharing ideas and coordinating our efforts (PRs).
I am not aware of any particular numpy BLEU implementation. I just know (and I guess @mjpost does too) that the chrF implementation in SacreBLEU is taken from Sockeye, but it uses `List[float]` instead of `np.array`. I am not sure whether this has any substantial impact on the speed. I have not done any profiling, but I guess most of the time is spent on tokenization and maybe n-gram extraction and intersection, which could be substituted with Counter intersection, similarly to the chrF implementation, supposing that Python 3's Counter is C-optimized and faster.
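A small sketch of the Counter-intersection idea (illustrative, not the sacreBLEU code path): the `&` operator on two `Counter`s takes element-wise minima, which is exactly the clipping step in BLEU's n-gram matching.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()

clipped = ngram_counts(hyp, 2) & ngram_counts(ref, 2)   # element-wise min
matches = sum(clipped.values())                          # clipped 2-gram matches (here: 3)
```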
Numpy can be useful if bootstrap resampling is added (cf. #40, #11).
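As an illustration of where numpy would pay off, a hedged sketch of bootstrap resampling over per-sentence scores (the array layout and function name are assumptions, not sacreBLEU's implementation, which would resample and re-aggregate sufficient statistics instead):

```python
import numpy as np

def bootstrap_means(sentence_scores, n_samples=1000, seed=12345):
    """Resample sentence-level scores with replacement and return the
    distribution of corpus means across n_samples bootstrap replicas."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(sentence_scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_samples, len(scores)))
    return scores[idx].mean(axis=1)
```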
The international tokenization has been optimized using `lru_cache`. However, there is still a loop through all Unicode code points in `_property_chars` for each execution of `sacrebleu`, which could be prevented by adding the `regex` dependency (importing it conditionally, only if `--tok intl` is required).
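A possible shape for that conditional import (illustrative only; the function name is made up, and the pattern is the one benchmarked earlier in the thread):

```python
def _compile_intl_punct_pattern():
    # Defer the optional `regex` dependency until the intl tokenizer
    # is actually requested (e.g. via --tok intl).
    try:
        import regex
    except ImportError as err:
        raise RuntimeError(
            "the international tokenizer requires the optional 'regex' package"
        ) from err
    return regex.compile(r'([^\d])(\p{P})')
```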