switch regex engine from oniguruma to fancy-regex #18

bminixhofer · 2021-02-02T09:52:20Z

I would like to switch from rust-onig to fancy-regex.

This would probably come with a speedup and remove the last non-Rust dependency. This is nice in general and would enable compiling to WebAssembly.

Changing this in NLPRule would be easy but it is currently blocked by fancy-regex/fancy-regex#59 and fancy-regex/fancy-regex#49.

robinst · 2021-02-15T03:30:58Z

> This would probably come with a speedup

I'd be curious to see if there's a speedup or not. Keep in mind that Oniguruma is highly optimized (for its type of regex engine). The situations where I could see fancy-regex win are patterns that can be delegated to the regex crate whereas onig needs backtracking.

bminixhofer · 2021-02-15T08:26:53Z

I did some early comparisons in the meantime and my findings are roughly consistent with trishume/syntect#34. I get a ~15% slowdown from switching to fancy-regex. I'll do some more investigation. The vast majority of regexes nlprule runs should be delegated to the regex crate, but there might be a few fancy regexes that take most of the time and whose results could be cached.

I also already pre-compute lots of regex matches (making speed pretty much irrelevant for those) but there is room for improvement there too which should decrease the overall % of time spent on regex matching.

I definitely want to switch to fancy-regex. If things don't work out I might implement something like trishume/syntect#270 but in general I'm OK with slightly slower regex matching if the next release is in total faster than the previous release.
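The caching idea mentioned above could be sketched roughly as follows. This is a minimal illustration, not NLPRule's actual code: `CachedMatcher` is a hypothetical name, and the matcher is modeled as a plain closure so that any engine (fancy-regex, Oniguruma, ...) could sit behind it.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical wrapper that memoizes the result of an expensive match.
/// `M` stands in for any matcher, e.g. a closure around a compiled regex.
/// Note: `RefCell` makes this single-threaded; a parallel pipeline would
/// need a `Mutex` or a per-thread cache instead.
struct CachedMatcher<M: Fn(&str) -> bool> {
    matcher: M,
    cache: RefCell<HashMap<String, bool>>,
}

impl<M: Fn(&str) -> bool> CachedMatcher<M> {
    fn new(matcher: M) -> Self {
        CachedMatcher {
            matcher,
            cache: RefCell::new(HashMap::new()),
        }
    }

    /// Returns the cached result if `text` was seen before; otherwise runs
    /// the underlying matcher once and stores the result.
    fn is_match(&self, text: &str) -> bool {
        let cached = self.cache.borrow().get(text).copied();
        if let Some(hit) = cached {
            return hit;
        }
        let result = (self.matcher)(text);
        self.cache.borrow_mut().insert(text.to_owned(), result);
        result
    }
}
```

Whether this pays off depends on how often the same input reaches the same slow regex; for inputs that rarely repeat, the hash lookup is pure overhead.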

robinst · 2021-02-16T00:39:07Z

If you have a particular regex that's delegated to regex and is unexpectedly slow, we can also ask burntsushi to have a look, he's very helpful.

bminixhofer · 2021-02-19T16:45:14Z

There is now a PR for this: #36. I opted for the solution from trishume/syntect#270 which works quite well. I already had a wrapper around the Regex anyway for serialization so this didn't add a lot of complexity and was quite easy to implement.
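For readers unfamiliar with the approach, a modular backend behind a serializable wrapper might look roughly like the sketch below. All names (`RegexBackend`, `LiteralBackend`, `LazyRegex`) are hypothetical and not taken from #36 or syntect; the lazy compilation on first use is an assumption of this sketch, and the toy backend only does substring matching so that the example needs nothing beyond the standard library.

```rust
use std::sync::OnceLock;

/// Hypothetical backend abstraction: each engine knows how to compile a
/// pattern and how to test a string against the compiled form.
trait RegexBackend {
    type Compiled;
    fn compile(pattern: &str) -> Result<Self::Compiled, String>;
    fn is_match(compiled: &Self::Compiled, text: &str) -> bool;
}

/// Toy stand-in backend that treats the pattern as a literal substring.
/// A real backend would wrap the fancy-regex or Oniguruma regex type here.
struct LiteralBackend;

impl RegexBackend for LiteralBackend {
    type Compiled = String;

    fn compile(pattern: &str) -> Result<Self::Compiled, String> {
        if pattern.is_empty() {
            Err("empty pattern".to_owned())
        } else {
            Ok(pattern.to_owned())
        }
    }

    fn is_match(compiled: &Self::Compiled, text: &str) -> bool {
        text.contains(compiled.as_str())
    }
}

/// Wrapper that keeps the pattern source (the part one would serialize)
/// and compiles it with the chosen backend on first use.
struct LazyRegex<B: RegexBackend> {
    source: String,
    compiled: OnceLock<Result<B::Compiled, String>>,
}

impl<B: RegexBackend> LazyRegex<B> {
    fn new(source: &str) -> Self {
        LazyRegex {
            source: source.to_owned(),
            compiled: OnceLock::new(),
        }
    }

    fn source(&self) -> &str {
        &self.source
    }

    fn is_match(&self, text: &str) -> Result<bool, String> {
        match self.compiled.get_or_init(|| B::compile(&self.source)) {
            Ok(compiled) => Ok(B::is_match(compiled, text)),
            Err(e) => Err(e.clone()),
        }
    }
}
```

With this shape, switching engines is a matter of changing the type parameter (or a feature-gated type alias), while the serialized form stays the plain pattern string.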

At the moment I still have some problems with mismatches between oniguruma and fancy-regex / regex:

1. Oniguruma has better case folding support:

fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)ss|ﬁ").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)ss|ﬁ").unwrap();

    for text in &["ß", "ss", "fi", "ﬁ"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

ß       false   true
ss      true    true
fi      false   true
ﬁ       true    true

2. Unicode property classes in a case-insensitive oniguruma regex are still case sensitive:

fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)\p{Lu}").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)\p{Lu}").unwrap();

    for text in &["A", "a"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

A       true    true
a       true    false

It is enough to reliably detect and disable regexes affected by 1.) since there are only a few, but I need a fix for 2.). I tried to sort of escape the \p{Lu} by adding (?-i) before and (?i) afterwards, but I don't think that works in every case, e.g. inside [] sets. I also don't know how to reliably detect 1.); I don't think that's trivial.

bminixhofer · 2021-02-19T18:02:08Z

I think I can get around both of these issues by using regex-syntax to naively construct case-insensitive regexes, replacing e.g. a with [aA] for all literals instead of using the (?i) flag. I am parsing regexes which are also used in a Java project, so this behavior would probably be closest to how they behave there.
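To make the idea concrete, here is a rough, standard-library-only sketch of the naive expansion. It is not the regex-syntax based implementation described above: `expand_case_insensitive` is a hypothetical helper that walks the pattern string directly, and it assumes the `(?i)` flag has already been stripped and that the pattern contains no named groups or hex escapes. Character classes are copied verbatim, so letters inside `[]` are not expanded; a proper AST-based pass would be needed for those.

```rust
/// Returns `c` plus its single-character lowercase/uppercase variants.
/// Characters whose case mapping is multi-character (e.g. 'ß' -> "SS")
/// keep only themselves, which is what makes the expansion "naive".
fn case_variants(c: char) -> Vec<char> {
    let mut variants = vec![c];
    let lower: Vec<char> = c.to_lowercase().collect();
    let upper: Vec<char> = c.to_uppercase().collect();
    for mapped in [lower, upper] {
        if mapped.len() == 1 && !variants.contains(&mapped[0]) {
            variants.push(mapped[0]);
        }
    }
    variants
}

/// Hypothetical helper: rewrites literal letters outside character classes
/// into two-element classes, e.g. `a` -> `[aA]`, and leaves escapes such as
/// `\d` or `\p{Lu}` untouched so they stay case sensitive.
fn expand_case_insensitive(pattern: &str) -> String {
    let mut out = String::new();
    let mut chars = pattern.chars().peekable();
    let mut in_class = false;

    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                // Copy the escape verbatim.
                out.push(c);
                if let Some(escaped) = chars.next() {
                    out.push(escaped);
                    // Copy `\p{...}` / `\P{...}` through the closing brace.
                    if (escaped == 'p' || escaped == 'P') && chars.peek() == Some(&'{') {
                        for d in chars.by_ref() {
                            out.push(d);
                            if d == '}' {
                                break;
                            }
                        }
                    }
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            _ if c.is_alphabetic() && !in_class => {
                let variants = case_variants(c);
                if variants.len() == 1 {
                    out.push(c);
                } else {
                    out.push('[');
                    out.extend(variants);
                    out.push(']');
                }
            }
            _ => out.push(c),
        }
    }
    out
}
```

For example, `expand_case_insensitive(r"ab\p{Lu}")` yields `[aA][bB]\p{Lu}`, so the property class keeps its case-sensitive meaning in every engine, and `ß` or `ﬁ` are left as plain literals rather than relying on engine-specific case folding.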

bminixhofer · 2021-02-21T12:02:21Z

This is now solved by:

- a modular regex backend like in trishume/syntect#270 ("Move all regex usage to separate module to add support for fancy-regex")
- a function from_java_regex which uses regex-syntax to parse the regular expressions, fix errors (e.g. unnecessary escaped chars) and get them to a state in which both fancy-regex and Oniguruma do the same thing (e.g. removing the case-insensitive flag and instead naively making it case insensitive).

In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that fancy-regex behaves the same as Oniguruma in every case now.
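The kind of cross-engine check described above can be sketched as a small differential test. This is an illustration under assumptions, not the project's test code: `find_mismatches` is a hypothetical helper, and the two engines are passed in as closures taking `(pattern, input)` so the sketch stays independent of any particular regex crate.

```rust
/// Runs every pattern against every input with two engines and collects
/// the (pattern, input) pairs on which they disagree.
fn find_mismatches<'a, A, B>(
    patterns: &[&'a str],
    inputs: &[&'a str],
    engine_a: A,
    engine_b: B,
) -> Vec<(&'a str, &'a str)>
where
    A: Fn(&str, &str) -> bool,
    B: Fn(&str, &str) -> bool,
{
    let mut mismatches = Vec::new();
    for &pattern in patterns {
        for &input in inputs {
            if engine_a(pattern, input) != engine_b(pattern, input) {
                mismatches.push((pattern, input));
            }
        }
    }
    mismatches
}
```

In a real test the closures would compile and run the pattern with each backend, and the test would assert that the returned vector is empty.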

Regarding speed:

I do not see a significant difference in the benchmark between fancy-regex and Oniguruma when running without parallelism. Curiously, with parallelism enabled fancy-regex is 10%-15% slower.

This would warrant some further investigation. I'm not so sure about the quality of the benchmark. It uses the Python bindings which incur some additional overhead and it runs the entire pipeline. To investigate the slowdown I'd have to do the benchmark in Rust and check which part of the pipeline is slower.

But for now, I'm happy with just having both backends with the performance difference being "inconclusive" (and both being fast!).

robinst · 2021-02-22T02:44:27Z

Awesome, glad to hear :).

On 1., that's a limitation of regex: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl15-simple-loose-matches

On 2., that seems to be the desired behavior in regex: https://github.com/rust-lang/regex/blob/d5bf98f293b48174d5378471d01c2e0ef271bbbc/tests/unicode.rs#L12

Note that PCRE agrees with regex there:

$ perl -e 'print "matches" if "a" =~ /(?i)\p{Lu}/'
matches

bminixhofer added the enhancement, good first issue, and P2 labels on Feb 2, 2021

bminixhofer mentioned this issue Feb 19, 2021

Modularize regex backend, add fancy-regex support #36

Merged


bminixhofer closed this as completed in #36 Feb 21, 2021

Narsil mentioned this issue Feb 17, 2022

JS / WebAssembly binding planned ? huggingface/tokenizers#63

Closed
