Move all regex usage to separate module to add support for fancy-regex #270

robinst · 2019-11-25T12:19:36Z

This has the same goal as #34 but with a different approach.

Note that the fancy-regex implementation doesn't compile yet, but I thought it would be useful to get this reviewed earlier rather than later.

I haven't ported over the regex rewriting changes yet; I'm hoping that we can generate regexes that work on both onig and fancy-regex.

- Add std::error::Error impl for fancy-regex, see fancy-regex/fancy-regex#35 (Make Error implement std::error::Error trait)
- Release fancy-regex/fancy-regex#33 (Add limit for backtracking in case of catastrophic backtracking)
- Port regex rewriting changes

robinst · 2019-11-25T12:25:21Z

Cargo.toml

-default = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create"]
+default-onig = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-onig"]
+default-fancy = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-fancy"]
+default = ["default-onig"]


Not sure what the best way to structure the features is, any thoughts?


robinst · 2019-11-25T12:30:15Z

src/parsing/parser.rs

@@ -327,7 +323,7 @@ impl ParseState {
                let match_pat = pat_context.match_at(pat_index);

                if let Some(match_region) = self.search(
-                    line, start, match_pat, captures, search_cache, regions
+                    line, start, match_pat, captures, search_cache


Note that the previous code reused the regions. I should benchmark what the impact of this is, but I decided for the straightforward API for now because I didn't want to complicate things more.


robinst · 2019-11-25T12:33:24Z

src/parsing/regex.rs

+
+/// A region contains text positions for capture groups in a match result.
+#[derive(Clone, Debug, Eq, PartialEq)]
+pub struct Region {


So these types and their methods become public API. Not sure about the naming, it's currently similar to what onig uses, but different from the regex crate/fancy-regex.

robinst · 2019-11-25T12:36:19Z

src/parsing/syntax_definition.rs

 pub struct MatchPattern {
    pub has_captures: bool,
-    pub regex_str: String,
+    pub regex: Regex,


Note due to how the serialization/deserialization of Regex just delegates to the String inside it, this happens to work without changing the binary format for packs (i.e. no need to regenerate packs).

robinst · 2019-11-25T12:38:18Z

src/parsing/yaml_load.rs

@@ -424,25 +424,35 @@ impl SyntaxDefinition {
    }

    fn resolve_variables(raw_regex: &str, state: &ParserState<'_>) -> String {
-        state.variable_regex.replace_all(raw_regex, |caps: &Captures<'_>| {


This was the only use of replace_all. Putting that into the regex API would have been a bit complicated, so I rewrote this part to use search instead.
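A minimal sketch of what "use search instead of replace_all" can look like, written against plain `std` string search rather than the PR's regex API. The `{{name}}` variable syntax, the recursive expansion, and the choice to expand unknown variables to an empty string are assumptions made for illustration; the `HashMap` parameter is a hypothetical stand-in for `ParserState`.

```rust
use std::collections::HashMap;

/// Expands `{{name}}` references by repeatedly searching for the next
/// occurrence and copying the text in between, instead of relying on a
/// `replace_all` helper. Assumes variable definitions are not cyclic.
pub fn resolve_variables(raw: &str, vars: &HashMap<String, String>) -> String {
    let mut result = String::with_capacity(raw.len());
    let mut index = 0; // start of the text not yet copied into `result`
    let mut from = 0; // where the next search begins

    while let Some(rel) = raw[from..].find("{{") {
        let open = from + rel;
        let name_start = open + 2;
        // A variable name is a non-empty run of ASCII letters, digits or `_`.
        let name_len = raw[name_start..]
            .bytes()
            .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_')
            .count();
        let name_end = name_start + name_len;

        if name_len > 0 && raw[name_end..].starts_with("}}") {
            // Copy the text before the match, then the expanded variable.
            result.push_str(&raw[index..open]);
            let value = vars
                .get(&raw[name_start..name_end])
                .map(String::as_str)
                .unwrap_or("");
            result.push_str(&resolve_variables(value, vars));
            index = name_end + 2;
            from = index;
        } else {
            // Not a variable reference; keep scanning one byte further on.
            from = open + 1;
        }
    }

    // Copy whatever follows the last match.
    result.push_str(&raw[index..]);
    result
}
```

The loop keeps two cursors so that text which only looks like the start of a variable is copied through unchanged.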

trishume

Looks good! Some things I'd like before merging:

- Run the benchmarks with master, this PR on onig, and this PR on fancy-regex
- Add a bit of documentation mentioning fancy-regex and why you might use it (pure Rust!) somewhere (readme, Cargo.toml, doc comment, not sure).
- Fix the CI failure where the build doesn't work with no default features because of references to a missing regex_impl

trishume · 2019-11-26T03:57:58Z

src/parsing/metadata.rs

@@ -23,13 +22,6 @@ type Dict = serde_json::Map<String, Settings>;
 /// A String representation of a `ScopeSelectors` instance.
 type SelectorString = String;

-/// A simple regex pattern, used for checking indentation state.
-#[derive(Debug)]
-pub struct Pattern {


I like how this refactor gets rid of the duplicate implementation of regex laziness.

trishume · 2019-11-26T04:01:39Z

src/parsing/parser.rs

@@ -327,7 +323,7 @@ impl ParseState {
                let match_pat = pat_context.match_at(pat_index);

                if let Some(match_region) = self.search(
-                    line, start, match_pat, captures, search_cache, regions
+                    line, start, match_pat, captures, search_cache


I think this is plausibly an actually important optimization, perhaps especially with onig but allocating an extra Vec on every search with fancy-regex isn't great either. But I forget since it was a long time ago that I optimized this.

I would be happy if you ran the benchmarks with onig in this PR vs the ones in master so we can see if this makes a difference that matters. If so we can just refactor the regex module interfaces to get a region passed in, then the fancy-regex implementation can clear the Vec before adding to it again.
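A rough sketch of the "pass a region in and clear it" idea from the comment above. Everything here is a hypothetical stand-in: the internals of `Region`, the `search_literal` function, and the literal substring search that plays the role of a real regex engine. It only illustrates how a caller-owned buffer avoids allocating a fresh `Vec` on every search.

```rust
/// Hypothetical region type: capture group positions stored in a Vec
/// that the caller owns and reuses between searches.
#[derive(Clone, Debug, Default, Eq, PartialEq)]
pub struct Region {
    positions: Vec<Option<(usize, usize)>>,
}

impl Region {
    pub fn new() -> Self {
        Self::default()
    }

    /// Byte range of capture group `index`, if it took part in the match.
    pub fn pos(&self, index: usize) -> Option<(usize, usize)> {
        self.positions.get(index).copied().flatten()
    }

    /// Clears the positions but keeps the allocated capacity.
    fn reset(&mut self) {
        self.positions.clear();
    }
}

/// Toy "search": finds a literal needle at or after `begin` and records
/// the match as group 0 in the caller's region.
pub fn search_literal(needle: &str, text: &str, begin: usize, region: &mut Region) -> bool {
    region.reset();
    let haystack = match text.get(begin..) {
        Some(rest) => rest,
        None => return false,
    };
    match haystack.find(needle) {
        Some(offset) => {
            let start = begin + offset;
            region.positions.push(Some((start, start + needle.len())));
            true
        }
        None => false,
    }
}
```

A parser loop would create one `Region` up front and hand `&mut region` to every search, so the backing `Vec` is allocated once and then only cleared and refilled.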

trishume · 2019-11-26T04:11:30Z

Cargo.toml

-default = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create"]
+default-onig = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-onig"]
+default-fancy = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-fancy"]
+default = ["default-onig"]


The key thing about features is that they're additive so that if multiple crates specify different features, the union of them works for both crates. The other nice thing to do is to preserve compatibility with existing crates that depend on us without default features, although I'm okay with breaking that if there's no good way otherwise.

I can't see a clean way of making the features backwards-compatible, so maybe just change the cfg statements in the regex module so that if both regex-fancy and regex-onig are set then regex-fancy takes precedence, although I could see the precedence going the other way too, as long as it works with both set.
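One way to make the two features additive is to pick the backend module with mutually exclusive `cfg` conditions, so that enabling both still compiles. The sketch below lets `regex-onig` win, which is the direction a later commit on this branch settles on; moving the `not(...)` guard to the other module flips the precedence. The module bodies and the `"none"` fallback are hypothetical placeholders, not the crate's actual code.

```rust
// Selected whenever `regex-onig` is enabled, even if `regex-fancy` is too.
#[cfg(feature = "regex-onig")]
mod regex_impl {
    pub const NAME: &str = "onig";
}

// Selected only when `regex-fancy` is enabled and `regex-onig` is not.
#[cfg(all(feature = "regex-fancy", not(feature = "regex-onig")))]
mod regex_impl {
    pub const NAME: &str = "fancy-regex";
}

// Hypothetical fallback so this sketch also compiles with neither feature.
#[cfg(not(any(feature = "regex-onig", feature = "regex-fancy")))]
mod regex_impl {
    pub const NAME: &str = "none";
}

/// Reports which backend module was compiled in.
pub fn backend_name() -> &'static str {
    regex_impl::NAME
}
```

Because exactly one of the three conditions holds for any feature combination, the union of features requested by different dependent crates always resolves to a single `regex_impl`.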

trishume · 2019-11-26T04:16:49Z

Also I must say I love how this feature removes nearly as many lines as it adds, largely due to the regex abstraction removing a lot of duplication. Good work on that.

robinst · 2019-12-06T10:53:09Z

Ok, so there's a problem with the Java syntax with fancy-regex, which I've narrowed down to this bug that will need to be fixed: fancy-regex/fancy-regex#37

Adarma · 2020-03-04T14:59:08Z

Any progress on getting this fancy-regex feature?
Even if performance is worse, it is way better than a failed build on Windows.

Running cargo run --features="default-fancy" --example synstats still tries and fails to build onig_sys:

/c/CODE/Rust/syntect ((fe28a3c...))
$ cargo run --features="default-fancy" --example synstats
   Compiling onig_sys v69.2.0
error: failed to run custom build command for `onig_sys v69.2.0`

Caused by:
  process didn't exit successfully: `C:\CODE\Rust\syntect\target\debug\build\onig_sys-9c435766ff906277\build-script-build` (exit code: 101)
--- stdout
cargo:warning=couldn't execute `llvm-config --prefix` (error: The system cannot find the file specified. (os error 2))
cargo:warning=set the LLVM_CONFIG_PATH environment variable to a valid `llvm-config` executable

--- stderr
thread 'main' panicked at 'Unable to find libclang: "couldn\'t find any valid shared libraries matching: [\'clang.dll\', \'libclang.dll\'], set the `LIBCLANG_PATH` environment variable to a path where one of these files can be found (invalid: [])"', src\libcore\result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

gilescope · 2020-03-05T08:49:41Z

What can we do to help? Really keen on being able to land this.

trishume · 2020-03-07T21:33:50Z

One avenue that might be an easier path towards fixing common Windows woes is helping with rust-onig/rust-onig#126, which should make onig no longer require Clang on Windows, although it may still need to build C, I'm not sure. You could maybe make it so that the onig crate uses a binary Windows build of the oniguruma library.

Alternatively you can try checking out this branch and seeing if you can rebase it and get it passing CI, which may require updating fancy-regex and maybe some more fixes. I think @robinst may have mentioned at some point that he was still occasionally tinkering with this but not sure if that's still true.


Some of the regexes include `$` and expect it to match end of line. In fancy-regex, `$` means end of text by default. Adding `(?m)` activates multi-line mode which changes `$` to match end of line. This fixes a large number of the failed assertions with syntest.


With the regex crate and fancy-regex, `^` in multi-line mode also matches at the end of a string like "test\n". There are some regexes in the syntax definitions like `^\s*$`, which are intended to match a blank line only. So change `^` to `\A` which only matches at the beginning of text.


Always adding `(?m)` for the entire regex meant that `.` also changed meaning, which is not what we want. The safer option is to use `(?m:$)` for `$` only. That also means we don't have to bother with `\A`. But we do need to parse look-behinds because we can't use `(?m:$)` in it.
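A simplified sketch of rewriting only `$` to `(?m:$)` while leaving the rest of the pattern alone. The function name `rewrite_dollar` is hypothetical, and the scanner is deliberately small: it skips escaped characters and tracks bracket nesting so that a `$` inside a character class stays literal, but it does not handle every corner of regex syntax (for example a `]` placed first in a class).

```rust
/// Replaces each unescaped `$` outside character classes with `(?m:$)`,
/// so only the end-of-line assertion gets multi-line behaviour and `.`
/// keeps its original meaning.
pub fn rewrite_dollar(regex: &str) -> String {
    let mut out = String::with_capacity(regex.len());
    let mut chars = regex.chars();
    let mut class_depth = 0usize; // nesting level of `[...]`

    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                // Copy the escape and the character it escapes verbatim.
                out.push(c);
                if let Some(escaped) = chars.next() {
                    out.push(escaped);
                }
            }
            '[' => {
                class_depth += 1;
                out.push(c);
            }
            ']' if class_depth > 0 => {
                class_depth -= 1;
                out.push(c);
            }
            '$' if class_depth == 0 => out.push_str("(?m:$)"),
            _ => out.push(c),
        }
    }
    out
}
```

For example, `^\s*$` becomes `^\s*(?m:$)`, while `\$` and `[$]` are left untouched.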


gilescope · 2020-03-19T08:41:58Z

(Am trying to bring this PR up to date with PR 7


gilescope · 2020-03-21T07:56:42Z

Should the feature be called regex-rs rather than regex-fancy to match with dump-create-rs and dump-load-rs?

gilescope · 2020-03-21T08:24:22Z

Confirmed this branch builds fine on Windows 10 and OSX using cargo build --features regex-fancy --no-default-features with no onig installed. Maybe we could add a line to the readme that explicitly says how to build a pure Rust version. Aside from that I'm very happy with this PR.

gilescope · 2020-03-21T08:26:31Z

The sooner we land it, the sooner people can say cargo install cargo-expand and it will just work (tm). This will be really great for a lot of rust devs!

Adarma · 2020-03-21T14:46:46Z

This works fine for me on windows with the following Cargo.toml:

syntect = { git = "https://github.com/trishume/syntect", branch = "move-regex-use-to-module", default-features = false, features = ["default-fancy"] }

trishume

I looked over all this again and it looks excellent. I really like the new abstractions and all the new tests.

I ran the benchmarks and it looks like this doesn't really change performance in onig mode, and fancy-regex mode is about half the speed:

highlight/"highlight_test.erb"
                        time:   [3.8420 ms 3.9419 ms 4.1724 ms]
                        change: [+250.05% +314.05% +396.57%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
highlight/"InspiredGitHub.tmTheme"
                        time:   [36.775 ms 36.969 ms 37.223 ms]
                        change: [+97.910% +102.09% +106.71%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking highlight/"Ruby.sublime-syntax": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.1s or reduce sample count to 10
highlight/"Ruby.sublime-syntax"
                        time:   [214.52 ms 216.23 ms 219.43 ms]
                        change: [+369.60% +377.83% +385.88%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking highlight/"jquery.js": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 47.5s or reduce sample count to 10
highlight/"jquery.js"   time:   [808.28 ms 811.63 ms 820.02 ms]
                        change: [+91.568% +96.014% +100.90%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking highlight/"parser.rs": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 28.5s or reduce sample count to 10
highlight/"parser.rs"   time:   [494.35 ms 495.46 ms 497.29 ms]
                        change: [+85.262% +87.230% +89.678%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
highlight/"scope.rs"    time:   [45.487 ms 46.328 ms 47.719 ms]
                        change: [+87.907% +94.569% +100.73%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)

I think fancy-regex being substantially slower shouldn't block merging as long as it's not the default so I'm totally fine with merging at the current performance levels. If we ever want to make it the default I'd want to improve the speed first though.

I also pushed a commit that runs the tests for fancy-regex mode in CI. It looks like more syntax tests pass than before (good job!) but my tricky highlight_test.erb gets mis-highlighted, which causes an HTML comparison test to fail. Try running cargo run --features default-fancy --no-default-features --release --example syncat testdata/highlight_test.erb to see this.

If you're able to quickly dig in and fix that it would be awesome to launch fancy-regex with no known highlighting flaws. However, I can imagine it might be something that's time consuming to diagnose or fix, in which case I'd be willing to launch with that test disabled under fancy-regex and a warning in the readme that although fancy-regex works most of the time there are known cases where it messes up. Because it does work most of the time, and for the people who need fancy-regex something is better than nothing.

So yah only thing I can think of right now that this needs before merge is either disabling or fixing that test, and something in the readme, both of which I can take a stab at if @robinst is busy.

Amazing work @robinst, thanks so much I'm happy to see this finally being so close! Also thanks @raphlinus for the initial work on fancy-regex, it's looking like it'll finally get put to use! And thanks @gilescope for pushing to get this done :)

gilescope · 2020-03-22T22:54:56Z

Wow, that's some test case.


robinst · 2020-03-25T10:21:51Z

Yay, thanks for helping push this over the finish line :)!

The benchmarks match the shallow benchmarking that I've done. I've done some matching of single regexes (ones that don't need fancy features) and I've observed that the first N matches are slow, and only after some warmup do they get fast. It would be good to investigate this more to see if we can get more performance out of it.

For the test failure, I actually already looked into that and fixed it in the regex rewriting. All that was needed was updating the packs (which I put off until the end because it's hard to rebase)! Pushed that now and build should go green.

If you don't mind adding the bits to the readme @trishume, that would be awesome.

BurntSushi · 2020-03-25T11:38:27Z

@robinst If you're able to whittle down a simple benchmark for me, I'd be happy to take a closer look.

trishume · 2020-03-29T22:36:31Z

This is now released as v4.0.0! Thanks again everyone for your work, especially @robinst!

robinst requested review from trishume and keith-hall November 25, 2019 12:19

robinst mentioned this pull request Nov 25, 2019

[WIP] Kinda-working fancy-regex support #34

Closed

7 tasks


Keats mentioned this pull request Dec 11, 2019

Stop pinning syntect version getzola/zola#876

Closed

sharkdp mentioned this pull request Dec 22, 2019

cargo install failing due to a compilation problem of onig_sys crate sharkdp/bat#650

Closed

gilescope mentioned this pull request Mar 5, 2020

Is it possible to use a pure rust lib rather than onig? dtolnay/cargo-expand#74

Closed

robinst and others added 12 commits March 16, 2020 10:44

Move all regex usage to separate module

5901e1b

That way we can add fancy-regex support behind a feature.

Bump fancy-regex to 0.3.0

d05bde4

* Adds a std::error::Error impl for Error
* Adds a backtracking limit to mitigate catastrophic backtracking

Restore optimization of reusing Regions

df56e71

Without this, some parsing benchmarks took 30% longer to run.

Change feature cfg so that regex-onig wins if both features are enabled

f1af918

Add YAML parsing test

f37b17b

Replace POSIX character classes so that they match Unicode as well

5caa56a

In fancy-regex, POSIX character classes only match ASCII characters. Sublime's syntaxes expect them to match Unicode characters as well, so transform them to corresponding Unicode character classes.
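As an illustration of this kind of transformation, the sketch below swaps a few POSIX bracket expressions for Unicode property classes using plain string replacement. The function name and the specific mapping table are assumptions chosen for the example; the commit itself may cover a different set of classes or use different Unicode categories.

```rust
/// Replaces some POSIX character classes with Unicode property classes
/// so that they match non-ASCII text as well. The mapping is illustrative.
pub fn replace_posix_char_classes(regex: &str) -> String {
    regex
        .replace("[:alpha:]", r"\p{L}")
        .replace("[:alnum:]", r"\p{L}\p{N}")
        .replace("[:lower:]", r"\p{Ll}")
        .replace("[:upper:]", r"\p{Lu}")
        .replace("[:digit:]", r"\p{Nd}")
}
```

With this mapping, a pattern such as `[[:alpha:]_][[:alnum:]_]*` turns into `[\p{L}_][\p{L}\p{N}_]*`, which also accepts identifiers containing non-ASCII letters.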

Fix code that skips a character to work with unicode

d8eeff9

Note that this wasn't a problem with Oniguruma because it works on UTF-8 bytes, but fancy-regex works on characters.
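A small sketch of the difference between skipping one byte and skipping one character. The helper name `next_char_boundary` is hypothetical; it advances a byte offset past the whole UTF-8 encoded character at that position, so the next search never starts in the middle of a multi-byte sequence.

```rust
/// Returns the byte offset just past the character starting at `pos`.
/// Returns `pos` unchanged at the end of the text or if `pos` is not a
/// valid character boundary.
pub fn next_char_boundary(text: &str, pos: usize) -> usize {
    match text.get(pos..).and_then(|rest| rest.chars().next()) {
        Some(c) => pos + c.len_utf8(),
        None => pos,
    }
}
```

Using `pos + 1` instead would land inside a multi-byte character such as `é`, and slicing a `&str` at that offset panics.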

Remove special treatment of look-behind

fa92de0

Turns out `(?m:$)` works in look-behinds, just not `(?m)$(?-m)` which I was using before.

Bump fancy-regex to 0.3.2

a7045b1

Includes the fix for fancy-regex/fancy-regex#37 which caused a test failure with the Java syntax.

robinst force-pushed the move-regex-use-to-module branch from fe28a3c to a7045b1 Compare March 20, 2020 09:17

Only load regex module for features that need it

4f09143

Might be worth introducing a specific feature for this and then have other features depend on it? Not sure.

Test fancy-regex mode in CI

0a74d87

trishume approved these changes Mar 21, 2020


Make packs to fix html::tests::strings test for fancy

da2d4b5

This applies the regex rewriting fixes on this branch to the default packs.

Add section to Readme about new fancy-regex mode.

9cbe524

trishume merged commit c0efc8c into master Mar 29, 2020

trishume mentioned this pull request Mar 29, 2020

Build failed for onig #264

Closed

trishume deleted the move-regex-use-to-module branch March 29, 2020 22:41

k3d3 mentioned this pull request Apr 23, 2020

Wasm support #135

Closed

This was referenced Feb 15, 2021

switch regex engine from oniguruma to fancy-regex bminixhofer/nlprule#18

Closed

Modularize regex backend, add fancy-regex support bminixhofer/nlprule#36

Merged

josephrocca mentioned this pull request Feb 16, 2022

JS / WebAssembly binding planned ? huggingface/tokenizers#63

Closed
