switch regex engine from oniguruma to fancy-regex #18
Comments
I'd be curious to see if there's a speedup or not. Keep in mind that Oniguruma is highly optimized (for its type of regex engine). The situation where I could see fancy-regex winning is patterns that can be delegated to the regex crate whereas onig needs backtracking.
I did some early comparisons in the meantime and my findings are roughly consistent with trishume/syntect#34. I get a ~15% slowdown from switching to I also already pre-compute lots of regex matches (making speed pretty much irrelevant for those) but there is room for improvement there too which should decrease the overall % of time spent on regex matching. I definitely want to switch to fancy-regex. If things don't work out I might implement something like trishume/syntect#270 but in general I'm OK with slightly slower regex matching if the next release is in total faster than the previous release. |
If you have a particular regex that's delegated to regex and is unexpectedly slow, we can also ask burntsushi to have a look; he's very helpful.
There is now a PR for this: #36. I opted for the solution from trishume/syntect#270 which works quite well. I already had a wrapper around the Regex anyway for serialization so this didn't add a lot of complexity and was quite easy to implement. At the moment I still have some problems with mismatches between oniguruma and
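// Compare the regex crate and Oniguruma on the same case-insensitive pattern.
fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)ss|fi").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)ss|fi").unwrap();
    for text in &["ß", "ss", "fi", "fi"] {
        // Columns: input text, regex crate result, Oniguruma result.
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}
prints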
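// Compare how both engines treat a case-insensitive uppercase-letter class.
fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)\p{Lu}").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)\p{Lu}").unwrap();
    for text in &["A", "a"] {
        // Columns: input text, regex crate result, Oniguruma result.
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}
prints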
It is enough to reliably detect and disable regexes with 1.), since there are only a few, but I need a fix for 2.). I tried to sort of escape the
I think I can get around both of these issues by using regex-syntax to naively construct lowercase regexes by replacing e.g.
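A rough sketch of what such a naive expansion could look like for plain literals, using only the standard library rather than regex-syntax. The helper names `naive_case_insensitive` and `push_escaped` are hypothetical, and this is an assumption about the general idea, not the code from the PR. Each cased character with a one-to-one case mapping becomes a two-element class, so no `(?i)` flag is needed and classes like `\p{Lu}` elsewhere in a pattern would keep their case-sensitive meaning:

```rust
/// Hypothetical helper: escape `c` if it is a regex metacharacter, then append it.
fn push_escaped(out: &mut String, c: char) {
    if r"\.+*?()|[]{}^$".contains(c) {
        out.push('\\');
    }
    out.push(c);
}

/// Hypothetical helper: build a pattern that matches `literal` case-insensitively
/// without relying on the `(?i)` flag.
fn naive_case_insensitive(literal: &str) -> String {
    let mut pattern = String::new();
    for c in literal.chars() {
        let lower: Vec<char> = c.to_lowercase().collect();
        let upper: Vec<char> = c.to_uppercase().collect();
        // Only expand characters with a one-to-one case mapping.
        // Characters such as 'ß' (which uppercases to "SS") are left as they are.
        if lower.len() == 1 && upper.len() == 1 && lower[0] != upper[0] {
            pattern.push('[');
            pattern.push(lower[0]);
            pattern.push(upper[0]);
            pattern.push(']');
        } else {
            push_escaped(&mut pattern, c);
        }
    }
    pattern
}

fn main() {
    for literal in ["ss", "a.b", "ß"] {
        println!("{}\t{}", literal, naive_case_insensitive(literal));
    }
}
```

A real implementation would have to walk the parsed pattern instead of a raw literal, which is where regex-syntax comes in.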
This is now solved by:
In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that
Regarding speed: I do not see a significant difference in the benchmark between
This would warrant some further investigation. I'm not so sure about the quality of the benchmark. It uses the Python bindings which incur some additional overhead and it runs the entire pipeline. To investigate the slowdown I'd have to do the benchmark in Rust and check which part of the pipeline is slower. But for now, I'm happy with just having both backends with the performance difference being "inconclusive" (and both being fast!).
Awesome, glad to hear :).
On 1., that's a limitation of regex: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl15-simple-loose-matches
On 2., that seems to be the desired behavior in regex: https://github.com/rust-lang/regex/blob/d5bf98f293b48174d5378471d01c2e0ef271bbbc/tests/unicode.rs#L12
Note that PCRE agrees with regex there:
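Separately, for point 1: the multi-character mapping behind that limitation can be seen with the standard library alone. This is a minimal sketch, assuming std's case mapping is an acceptable stand-in for Unicode case folding (the two are related but not identical). `has_one_to_one_case_mapping` is a made-up helper name, not part of regex or onig:

```rust
/// Hypothetical helper: true if `c` maps to exactly one character in both
/// directions, i.e. a simple (one-to-one) case mapping is enough for it.
fn has_one_to_one_case_mapping(c: char) -> bool {
    c.to_uppercase().count() == 1 && c.to_lowercase().count() == 1
}

fn main() {
    for c in "sß".chars() {
        // Columns: character, its full uppercase mapping, whether it is one-to-one.
        println!(
            "{}\t{}\t{}",
            c,
            c.to_uppercase().collect::<String>(),
            has_one_to_one_case_mapping(c)
        );
    }
}
```

Since 'ß' only relates to "ss" through a one-to-many mapping, an engine limited to simple loose matches has no way to treat the two as equal under `(?i)`.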
I would like to switch from rust-onig to fancy-regex.
This would probably come with a speedup and remove the last non-Rust dependency. This is nice in general and would enable compiling to WebAssembly.
Changing this in NLPRule would be easy but it is currently blocked by fancy-regex/fancy-regex#59 and fancy-regex/fancy-regex#49.