Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider refactoring/removing regex conditionals to allow support in Shiki's JS engine #16

Open
slevithan opened this issue Dec 27, 2024 · 33 comments · May be fixed by #21
Open

Consider refactoring/removing regex conditionals to allow support in Shiki's JS engine #16

slevithan opened this issue Dec 27, 2024 · 33 comments · May be fixed by #21

Comments

@slevithan
Copy link

Context: Shiki offers a JavaScript regex engine that includes highly accurate Oniguruma emulation, and supports the vast majority of TextMate grammars. The JS engine allows lightweight use of TM grammars without requiring a large Oniguruma WASM bundle, which is especially beneficial in browsers. In the few cases where a grammar (like Swift) isn't supported, it's due to one of the following reasons:

  1. Bugs in the TM grammar (invalid Oniguruma regexes that fail silently in vscode-textmate, but the error is not ignored by the JS engine). (Applies to the Ada and Hack grammars.)
  2. Bugs in Oniguruma that the JS engine doesn't reproduce (leading to the JS engine rather than Oniguruma providing the intended highlighting). (Applies to the Kusto grammar.)
  3. Bugs in Oniguruma-To-ES which I maintain and is used under the hood by the JS engine. A lot of work has gone into making the emulation extremely accurate, so this is rare but possible.
  4. Use of a small number of rare regex features that are not yet emulated by Oniguruma-To-ES, or are not possible/feasible to emulate.

The Swift grammar falls under number 4, due to it's use of conditionals in 3 regexes. It is the only grammar of the 218 currently included with Shiki that uses conditionals. If these conditionals were refactored, the Swift grammar would be supported.

Some uses of conditionals can be refactored without any change in meaning. A simple example would be changing (<)?foo(?(1)>) to (?:(<)foo>|foo), and for cases as simple as that, perhaps future versions of Oniguruma-To-ES could be updated to translate them automatically. In other cases, deeper refactoring is required, or in some cases a regex might need to be split into multiple regexes.

As a real example, here's the first use of conditionals in the Swift grammar:

(((\#+)?)/)     # (1) for captures, (2) for matching end, (3) for conditionals
(?(3)|(?!/))   # is not a comment
(?(3)|(?!\s))  # does not start with a space or tab

Stripping the comments, we get (((\#+)?)/) (?(3)|(?!/)) (?(3)|(?!\s)), and this could be converted to (((\#+))/ | /(?![/\s])) with the same matches and same values for captures 1 and 3. Unfortunately, capture 2's behavior changes subtly, since it will now be nonparticipating (rather than matching an empty string) if the first alternative isn't matched, which affects backreferencing behavior in Oniguruma. However, this can probably be addressed by simply changing later uses of \2 to \2?. This is an example of refactoring that can't be done automatically, but might not present a major challenge if you were open to hand-editing the regexes to remove conditionals.

What do you think? Are you open to refactoring/removing conditionals in the grammar?

Although this would currently only add support for your grammar to Shiki's JS engine, the JS engine (which is new) has seen significant adoption as it has continued to improve to its current robust state. I also expect more libraries that work with TM grammars to adopt Oniguruma-To-ES in the future. Removing the conditionals would mean that the Swift grammar is supported by all of them.

CC @antfu

@jtbandes
Copy link
Owner

Thanks for reaching out and sharing the detailed background info! I definitely could be open to tweaking the grammar for improved compatibility, and already had to do so for VS Code (e.g. 711310c, b788c77, efe7774). If you’re interested in making a PR to propose specific changes, I could review it. I haven’t spent much time on this grammar in the last several months, but can try to do some with what little is left of the holiday season :)

Is there an easy way to test compatibility with Shiki? Like what would be the best way to iterate on the grammar and see how it looks with this test file?

I’m also generally curious about the approach of “transpiling” oniguruma patterns, rather than implementing/using an oniguruma engine—is it mostly for performance and bundle size reasons? Otherwise, I would think the oniguruma wasm approach would be fairly reliable?

@slevithan
Copy link
Author

slevithan commented Dec 27, 2024

Thanks for reaching out and sharing the detailed background info! I definitely could be open to tweaking the grammar for improved compatibility, and already had to do so for VS Code [...] If you’re interested in making a PR to propose specific changes, I could review it. I haven’t spent much time on this grammar in the last several months, but can try to do some with what little is left of the holiday season :)

Thanks! Unfortunately I don't think I'm the right person to refactor these regexes, since I don't have the requisite knowledge of the Swift language to make smart choices and test properly. So if you're able to do it that would be fantastic.

I’m also generally curious about the approach of “transpiling” oniguruma patterns, rather than implementing/using an oniguruma engine—is it mostly for performance and bundle size reasons? Otherwise, I would think the oniguruma wasm approach would be fairly reliable?

Yes, it's for bundle size and performance reasons. Shiki offers both a JS engine (which transpiles regexes) and WASM engine. The benefits of the JS engine are most relevant when running syntax highlighting in the browser, since oniguruma-to-es is less than 4% of the size of vscode-oniguruma.

@slevithan
Copy link
Author

slevithan commented Dec 28, 2024

Is there an easy way to test compatibility with Shiki? Like what would be the best way to iterate on the grammar and see how it looks with this test file?

I think you'd just use the latest/development version of Shiki as normal with a custom language (your modified grammar).

CC Shiki's author @antfu in case there is anything that would help with iterating on grammar development and testing with Shiki as you go.


Aside: Here's how you'd run Shiki's JS engine compatibility report:

  • Clone Shiki.
  • pnpm install
  • pnpm run build
  • pnpm run report-engine-js

That runs the script here, and its output goes to docs/references/engine-js-compat.md. For reference, the JS engine itself is here. But the above steps assume that the grammar update is already in tm-grammars and that the tm-grammars version has been bumped in Shiki's dependency list prior to running pnpm install.


Also possibly helpful is the oniguruma-to-es demo page here. You can enter an individual regex and see the transpiled output as you type. The main thing would be ensuring it doesn't error. To match Shiki's use of the library, enable the options listed here (they're collapsed under "More options" on the demo page).

@jtbandes
Copy link
Owner

Sitting down with this a little longer, you've definitely picked out the most salient example of where conditionals are used. (Aside: it looks like all the places I ended up using conditionals are in the patterns related to Swift's regex literals. They were definitely one of the trickier things to write patterns for, and this grammar certainly doesn't do it perfectly...if that's even possible, which I doubt it is.)

I kinda understand the alternatives you're suggesting, but it does get tougher in the context of the whole pattern. I'd be curious to get your thoughts on how to handle this. Apologies as what follows is a bit of a stream-of-consciousness rather than a concrete question...

Stripping the comments, we get (((\#+)?)/) (?(3)|(?!/)) (?(3)|(?!\s)), and this could be converted to (((\#+))/ | /(?![/\s])) with the same matches and same values for captures 1 and 3. Unfortunately, capture 2's behavior changes subtly, since it will now be nonparticipating (rather than matching an empty string) if the first alternative isn't matched, which affects backreferencing behavior in Oniguruma. However, this can probably be addressed by simply changing later uses of \2 to \2?.

The structure of the whole pattern looks like

(((\#+)?)/)
(?(3)|(?!/))   # is not a comment
(?(3)|(?!\s))  # does not start with a space or tab
...snip...
\\Q
  (?:(?!\\E)(?!/\2).)*+
  (?:\\E
    # A quoted sequence may not have a closing E, in which case it extends to the end of the regex
    | (?(3)|(?<!\s))(?=/\2)
  )
...snip...
# may end with a space only if it is an extended literal or contains only a single escaped space
(?(3)|(?(5)(?<!\s)))
(/\2)

The intent here is to handle:

  • if a regex starts with #/ then it must end with /# etc. (so replacing the (/\2) at the end with (/\2?) would break this, allowing #/ ... / to match)
  • iff the # is absent, // is interpreted as a line comment, not an empty regex, and the regex may not begin with whitespace
  • Spaces at the end are allowed only in certain situations: /... / is invalid, but /\ / is allowed and #/... /# is allowed.
  • As an example, I left in the part of the pattern above which matches PCRE-style \Q...\E "quotes". I used \\Q(?:(?!\\E)(?!/\2).)*+ to ensure the quote ends at the regex terminator — which may be / or /# or /## etc. (This is another place where replacing /\2 with /\2? would break the pattern by ending the match early in a case like #/\Q.../...\E/# — the / inside the \Q\E pair is supposed to be ignored; only /# would terminate this regex.)

Thinking about it a bit more, I suspect the first two (?(3)|(?!/)) might be able to be replaced with simple negative lookaheads that precede the (\#+)? pattern. This might also simplify the situation with \2 being nonparticipating vs matching an empty string.

The conditionals inside the \Q...\E and at the end are harder to fix. Both are related to the rule about ending with a space. Is there a way to achieve these conditional-like behaviors where the allowed whitespace that precedes the terminating delimiter of the regex depends on what the starting delimiter was?

One idea would be to restructure the whole pattern as (roughly)

/ ... (some rules about terminating whitespace here) /
 | (\#+)/ ... (some different rules about terminating whitespace here) /\1

but this requires duplicating the whole "guts" of the pattern in between the start+end delimiters, or possibly separating it into two nearly-identical patterns (assuming I don't discover some reason that it has to be a single pattern). Maybe that's the only option?

@slevithan
Copy link
Author

slevithan commented Dec 30, 2024

The difficulty I have with collaborating on refactoring these particular regexes for parsing Swift regex literals is that I don't know Swift's regex syntax (and don't know Swift, so it would be harder to test). Even if I read the whole documentation page you linked, I'm not optimistic it would cover all the edge cases (which regex syntax is usually chock full of). But, at a high level, I think the best path to avoid conditionals, given the need to match all of /...\/.../, #/.../.../#, ##/.../#.../##, etc. would be to split the regex in two: one that matches regex literals with any number of #s in the delimiters, and one for without #s.

For the regex that allows # in the delimiter, I imagine you'd continue to use backreferences to balance them: (?x)(?<hashes>\#+)/...../\k<hashes>.

Is there any way to embed a pattern within a pattern in a TM grammar? If so, that would avoid the need for redundancy. Obviously the redundancy could be avoided by using a script to generate the grammars, but I'd understand if you don't think it's appropriate to go down that road.

Splitting the regexes in two would not be all bad, since in addition to simplifying the delimiter logic, I imagine it would simplify at least some of the logic inside (given the change in meaning for unescaped /, /#, etc. within the pattern, and the change in allowing #/ \ /# but not / \ /). These simplifications would probably make it easier to understand and reason about edge cases in the regexes, plus make them easier to maintain (since, with regexes this complex, unintentionally introducing runaway backtracking is a real issue).

Another way to reduce redundancy would be to simplify the overall pattern. On that front, I think the following...

# we only support a fixed maximum number of braces because otherwise we can't balance the number of open and close braces
\{(?<g1>\{)?+(?<g2>\{)?+(?<g3>\{)?+(?<g4>\{)?+(?<g5>\{)?+
.+?
\}(?(<g1>)\})(?(<g2>)\})(?(<g3>)\})(?(<g4>)\})(?(<g5>)\})

...can be simplified to (?<balanced> { (?:[^{}] | \g<balanced>)* }) (unless I'm missing something). That also gets rid of five conditionals. 😊

jtbandes added a commit that referenced this issue Dec 30, 2024
Relates to #16 / cc
@slevithan

The conditional negative lookaheads were conditional based on absence of
`#` in the opening delimiter. This can be simplified by including `/` in
the negative lookaheads and moving them before the opening delimiter.

(The comment rule seems like it is not actually necessary since
`#comment` is given a higher priority, but I kept it for good measure.)
@slevithan
Copy link
Author

slevithan commented Dec 30, 2024

Oh, also, if it makes things easier, the regex doesn't literally need to be split in two. Instead it could be one long free-spaced regex with (?x)(?: ... | ... ) (unless I'm missing something about how TM grammar captures are used, and you can't just use the same handling for two different sets of capture indexes). Not sure whether this would be preferable.

@jtbandes
Copy link
Owner

jtbandes commented Dec 30, 2024

...can be simplified to (?<balanced> { (?:[^{}] | \g<balanced>)* }) (unless I'm missing something). That also gets rid of five conditionals. 😊

It's been a while, but I think the reason I didn't do this was because I didn't want to restrict the interior to [^{}]. For example, I believe this is valid (based on my interpretation of the docs): (?{{{abc{def}ghi}}})

@slevithan
Copy link
Author

It's been a while, but I think the reason I didn't do this was because I didn't want to restrict the interior to [^{}]. For example, I believe this is valid (based on my interpretation of the docs): (?{{{abc{def}ghi}}})

That example string is matched just fine by the recursive pattern. 😊

@jtbandes
Copy link
Owner

Sorry, bad example. I think this is also valid: (?{{{abc{def}}})

@slevithan
Copy link
Author

slevithan commented Dec 30, 2024

What about this?

\(\?
(?<b>{(?:[^{]{*|\g<b>)*?})
\)

That handles your example, and then some.

Here's the original sub-pattern I'm comparing it to:

\(\?
\{(?<g1>\{)?+(?<g2>\{)?+(?<g3>\{)?+(?<g4>\{)?+(?<g5>\{)?+
.+?
\}(?(<g1>)\})(?(<g2>)\})(?(<g3>)\})(?(<g4>)\})(?(<g5>)\})
\)

And here are my tests, which I ran against both the original and my new regex, with identical results:

Matches:

  • (?{{{abc{def}ghi}}})
  • (?{{{abc{def}}}) (?{{{abc}def}}}) — 2 separate matches
  • (?{{{a}bc{def}}}) (?{{{abc}def}}}) (?{{{abc}d}}ef}}}) (?{{{ab{c{def}}}) — 4 separate matches
  • (?{a{{a}) (?{a}}a}) — 2 separate matches
  • (?{{}}}})
  • (?{{a}}})
  • (?{a}a})
  • (?{a}})

Doesn't match:

  • (?{{{{}})
  • (?{)
  • (?})
  • (?{{a})
  • (?{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{) — testing for runaway backtracking

I believe the only difference is that the new regex matches recursive {...} without any chars in between (e.g. (?{}) and (?{{}})). If you want to prevent those from being matched, you could add a negative lookahead in front, like this:

\(\?
(?!(?<r>{\g<r>?})\))
(?<b>{(?:[^{]{*|\g<b>)*?})
\)

Edit: After additional testing, tweaked the regexes and removed my (invalid) concern about the .+? in the original sub-pattern.

@RedCMD
Copy link
Contributor

RedCMD commented Jan 4, 2025

- match: '(?x)(\\[gk](<)|\\[gk]'') (?: ((?!\d)\w+) (?:([+-])(\d+))? | ([+-]?\d+) (?:([+-])(\d+))? ) ((?(2)>|''))'

this is the easiest to fix
simply just duplicate the whole regex
with quotes ' assigned to one and <> the other

I found an interesting quirk of how backreferences work \\1
allowing for this interesting regex @(?>'()|<())abc(?>(?=\\1)'|(?=\\2)>)
image
@'abc' @<abc> @'abc> @<abc'
image

however it is definitely not compatible with JS
and you will notice that it works differently when backreferencing across begin/end


looking at the Callouts
https://github.com/swiftlang/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#callouts

(\{(?<g1>\{)?+(?<g2>\{)?+(?<g3>\{)?+(?<g4>\{)?+(?<g5>\{)?+) .+? \}(?(<g1>)\})(?(<g2>)\})(?(<g3>)\})(?(<g4>)\})(?(<g5>)\})

if Swift follows Oniguruma then

  • opening brackets are allowed in the middle and at the end (?{ab{c{})
  • closing brackets are allowed in the middle and at the end; as long as there are fewer of them in sequence than the initial opening brackets (?{{{a}b}}c}}})
  • closing brackets are not allowed in the middle and at the end; as long as there are an equal or more of them in sequence than the initial opening brackets (?{{a}}b}}}c}})

this will work (?>({(?:\g<-1>|(?!{)(?:}++??[^}]++)*?)}))
image
I particularly like this cursed part }++??
(?:}++??[^}]++) can just be replaced with .

however Oniguruma limits recursion to 20 levels


| (?(3)|(?<!\s))(?=/\2)

for the hashtag delimiters (?(3)
(?=(?<!\s)/|/(?=#))(?=\2)
image
must end with 1 or more #
OR
end with no pre-whitespace

(?(3)|(?(5)(?<!\s)))

same thing but put it with guts
(?>(?<guts>...)(?=(?<!\\s)/|/#))?+
image

or just duplicate the whole regex again
one with # and the other without

@jtbandes
Copy link
Owner

this is the easiest to fix
simply just duplicate the whole regex
with quotes ' assigned to one and <> the other

#19

Thanks as always for the cursed advice :)

@jtbandes
Copy link
Owner

jtbandes commented Jan 18, 2025

What about this?

\(\?
(?<b>{(?:[^{]{*|\g<b>)*?})
\)

I like this and it seems you are both heading in the same direction (using \g to handle nesting). However it looks like this incorrectly matches {{ab}c}: https://regex101.com/r/8i0M8x/1

this will work (?>({(?:\g<-1>|(?!{)(?:}++??[^}]++)*?)}))
I particularly like this cursed part }++??
(?:}++??[^}]++) can just be replaced with .

This does seem to work fine (?>({(?:\g<-1>|(?!{).*?)})) https://regex101.com/r/wV5AGq/2

I'm not sure why the (?> is needed as it seems to just as well work without it. Is it just an optimization? Same for (?:}++??[^}]++), what does that accomplish?

edit: I see that when surrounded with the \(\? ... [X<>]?\) delimiters, the (?> is preventing (?{abc } d}X) from matching incorrectly, but I don't fully understand why

however Oniguruma limits recursion to 20 levels

I think that's ok :)

@RedCMD
Copy link
Contributor

RedCMD commented Jan 18, 2025

(?>) just disables backtracking once a match is found
even if the rest of the regex fails, it won't backtrack into a new (bigger) match

(?:}++??[^}]++) is just a catastrophic backtracking optimization compared to .
however I don't think it's actually needed
unless maybe the callout syntax is invalid and hundreds of characters long

I don't think catastrophic backtracking is even possible
but it's faster anyway

@jtbandes
Copy link
Owner

#20

@slevithan
Copy link
Author

I like this and it seems you are both heading in the same direction (using \g to handle nesting). However it looks like this incorrectly matches {{ab}c}

Since you only enforced 6 levels of balancing in the original, I guess you could just code out each of the options:

\(\?{(?:[^{].*?|{[^{].*?}|{{[^{].*?}}|{{{[^{].*?}}}|{{{{[^{].*?}}}}|{{{{{.+?}}}}})}\)

Apply free-spacing as you wish, but this is significantly shorter than the original, and it works the same (for reals this time, apart from [^{] matching \n unlike the dot, but you could easily disallow that and it shouldn't be relevant within the context of a TM grammar). Due to the explicit [^{] in all but the last alternation option, there's no superlinear backtracking.

@jtbandes
Copy link
Owner

Hm - would you advocate for that over @RedCMD’s version? I’m not sure how to think about which is better other than some obscure performance issues that I’d probably have to spend another couple hours thinking about to grasp fully 😄

@slevithan
Copy link
Author

slevithan commented Jan 18, 2025

Well, leaving aside whatever the Swift callout syntax is that should be matched, if you compare your original regex and my latest regex (which both match the same strings, and neither of which use recursion) to \(\?(?>(\{(?:\g<-1>|(?!{).*?)\}))\), they don't match the same strings.

This version from @RedCMD with recursion doesn't match things like (?{{}}}}), (?{{a}}}), (?{a}a}), (?{a}}). And it doesn't match (?{{{{{{{a}}}}}}) like the original and my new equivalent do (it probably shouldn't match this one, but as @RedCMD mentioned Oniguruma caps out at 20 levels of recursion anyway, so either way recursion is capped.

Re: perf, I think they're both fine, but my version without recursion will often require less backtracking and perform at least slightly better. My version without recursion is in general easier to reason about (since it's less clever), and it's easy to visually verify that it will not suffer backtracking-related perf problems because all of the alternatives are mutually exclusive.

So yes, I would advocate for it unless you were also trying to change what the regex matched.

@slevithan
Copy link
Author

slevithan commented Jan 18, 2025

PS: Would be happy for @RedCMD to weigh in. For the record, I'm a fan of Red's regex/Oniguruma work.

@jtbandes
Copy link
Owner

jtbandes commented Jan 18, 2025

things like (?{{}}}}), (?{{a}}}), (?{a}a}), (?{a}})

Yeah, I'm honestly not sure if these are supposed to be matched or not 😬

It looks like Xcode currently rejects /(?{a}})/ /(?{a}b})/ /(?{{a}}})/ with a syntax error, and accepts /(?{{a}b}})/ – for some definition of "accepts"...it actually rejects it with "callout is not currently supported" 😅

I'm not going to claim the original patterns were 100% correct. I don't think we can even define 100% correct given Swift doesn't fully support all this syntax yet – they've just described it and will probably eventually support it someday.

Here's a set of test cases I came up with https://regex101.com/r/JzV2D5/2 and it seems like the version in #20 is working as expected so I'll probably just go with that for now.

jtbandes added a commit that referenced this issue Jan 18, 2025
jtbandes added a commit that referenced this issue Jan 18, 2025
)

More work towards #16

If we wanted to capture the `{` `}` delimiters with some scopes, I think
we might run into
microsoft/vscode-textmate#164 & related
issues. But for now we aren't highlighting them so it's fine.

---------

Co-authored-by: RedCMD <theredcmd@gmail.com>
@RedCMD
Copy link
Contributor

RedCMD commented Jan 18, 2025

I just copied how oniguruma parses its callouts

I don't think catastrophic backtracking is even possible since both sides of the group are mutually exclusive

(?:}++??[^}]++) is just able to skip over large sections of text that it can guarantee to not need to indivually check each char

@jtbandes
Copy link
Owner

I think this should do it: #21

@jtbandes
Copy link
Owner

jtbandes commented Jan 19, 2025

Looks like even with that change applied, there are some subtle differences between the results of Shiki's WASM & JS engines. And I'm now seeing Error: Cannot use multiple overlapping recursions. How do I determine which rule(s) this is coming from?

Screen.Recording.2025-01-18.at.4.33.39.PM.mov

edit: seems like it's the same pattern we were looking at (naturally).

Image Image

Any suggestions for what to do here?

@slevithan
Copy link
Author

slevithan commented Jan 19, 2025

It's hard to know for certain in advance how things will go after resolving errors (since errors can affect subsequent matches), but I think that if you're able to avoid the newly-added overlapping recursions, everything will work correctly. It's possible though that you'll need to use Shiki 1.27.1+ (which added support for some \G edge cases that the Swift grammar might use).

edit: seems like it's the same pattern we were looking at (naturally).

Yes, it's your addition of a class recursion within the guts recursion in #21, and the addition of the relative-index recursion (\{(?:\g<-1>|(?!{).*?)\}) within the guts recursion.

(Emulating overlapping recursion [which is extremely rarely used] within JS is technically possible but would lead to exponentially growing regex sizes and would be terrifying to debug.)

Any suggestions for what to do here?

Does guts need to be recursed in the first place? I haven't studied it closely, but it's mostly a list of different regex token types as alternatives within a quantified group. How is recursing this group helping?

@jtbandes
Copy link
Owner

Rather than just start & end patterns that look for / delimiters, I needed to figure out the boundaries of the whole regex up front to avoid ambiguities with operators, and help correctly enforce other details of the regex syntax. Once the boundaries are determined, simpler rules are used to provide detailed scopes/colors for all the different regex syntax. So you can think of this giant pattern as a combination of all the regex-related patterns in the grammar, but simplified, so its main job is to correctly determine what is and is not a regex literal.

That's why the pattern needed to include some recursion — it needs to ensure the coarse structure of the whole regex is correct, so it has to understand things like escapes, groups, character classes etc. (and support recursively nesting those).

@RedCMD
Copy link
Contributor

RedCMD commented Jan 19, 2025

mine is slightly faster :)
except for the first case for some reason
https://regex101.com/r/JzV2D5/2

tho the recursion does cause problems with es and when scoping the brackets

spiced up version of @slevithan's

\(\?(?>{{{{{.+?}}}}}|{{{{.+?}}}}|{{{.+?}}}|{{.+?}}|{.+?})\)

overlapping recursions

are you saying this doesn't work? :(

(?<object>{\\s*+(?>\"(?>[^\\\\\"]++|\\\\.)*+\"\\s*+:\\s*+(?>\\g<object>|(?<value>\\[\\s*+(?>(?>\\g<object>|\\g<value>)\\s*+,?+\\s*+)*+]|\"(?>[^\\\\\"]++|\\\\.)*+\"|true|false|null|-?+(?>0|[1-9][0-9]*+)(?>(?>\\.[0-9]++)?+(?>[eE][+-]?+[0-9]++)?+)?+))\\s*+,?+\\s*+)*+})

@jtbandes
Copy link
Owner

overlapping recursions

are you saying this doesn't work? :(

(?<object>{\\s*+(?>\"(?>[^\\\\\"]++|\\\\.)*+\"\\s*+:\\s*+(?>\\g<object>|(?<value>\\[\\s*+(?>(?>\\g<object>|\\g<value>)\\s*+,?+\\s*+)*+]|\"(?>[^\\\\\"]++|\\\\.)*+\"|true|false|null|-?+(?>0|[1-9][0-9]*+)(?>(?>\\.[0-9]++)?+(?>[eE][+-]?+[0-9]++)?+)?+))\\s*+,?+\\s*+)*+})

That one seems to work fine in the playground: https://slevithan.github.io/oniguruma-to-es/demo/

@RedCMD
Copy link
Contributor

RedCMD commented Jan 19, 2025

sorry that was json escaped

Image

(?<object>{\s*+(?>\"(?>[^\\\"]++|\\.)*+\"\s*+:\s*+(?>\g<object>|(?<value>\[\s*+(?>(?>\g<object>|\g<value>)\s*+,?+\s*+)*+]|\"(?>[^\\\"]++|\\.)*+\"|true|false|null|-?+(?>0|[1-9][0-9]*+)(?>(?>\.[0-9]++)?+(?>[eE][+-]?+[0-9]++)?+)?+))\s*+,?+\s*+)*+})

Image

@slevithan
Copy link
Author

slevithan commented Jan 20, 2025

[Reposted after several big edits with more details/options]

Nested callout braces solution comparison

mine is slightly faster :) except for the first case for some reason https://regex101.com/r/JzV2D5/2

Those tests show the number of backtracking steps for yours is higher. 😊 And it's possible to construct cases where the difference in steps is greater than in those examples. Backtracking steps is IMO much better to compare than microseconds of runtime, because steps doesn't suffer from all the usual issues that can affect microbenchmarks and that are out of our control. regex101 isn't even testing the right regex engine (PCRE2 vs Oniguruma), so given that it's definitely better to compare steps over runtime.

spiced up version of @slevithan's

\(\?(?>{{{{{.+?}}}}}|{{{{.+?}}}}|{{{.+?}}}|{{.+?}}|{.+?})\)

That matches different strings and uses more backtracking that the version I shared, so it's not a fair comparison. In any case, microbenchmarks are probably a distraction (my fault--I should have left it at "they're both fine"). But the difference in matched strings likely matters.

For example, unlike the version I shared:

  • It doesn't match (?{a}}a}), (?{{}}}}), (?{{a}}}), (?{a}a}), (?{a}}).
  • It does match (?{{{{}}), (?{{a}).

Overlapping recursions

Alas, lack of support for overlapping recursions is described in the oniguruma-to-es readme, and the reason for not allowing it is because it can get totally unmanageable very fast. Expanding a regex pattern with up to 20×20 expansions (since 20 is Oniguruma's depth limit and is oniguruma-to-es's default/max limit) is maybe on the edge of manageable, but as soon as you do something like (20×20)×(20×20) it seems like a very bad idea. In this case, where @jtbandes is doing two nested recursions (which don't overlap each other) within an outer (overlapping) recursion, it would create up to 20×(20+20) expansions.

Rather than just start & end patterns that look for / delimiters, I needed to figure out the boundaries of the whole regex up front to avoid ambiguities with operators, and help correctly enforce other details of the regex syntax. Once the boundaries are determined, simpler rules are used to provide detailed scopes/colors for all the different regex syntax. So you can think of this giant pattern as a combination of all the regex-related patterns in the grammar, but simplified, so its main job is to correctly determine what is and is not a regex literal.

I missed the obvious on my first look--that the guts recursion is for handling nested groups. And your explanation also helps a lot.

I see two paths to work around this, neither of which has to be negative on language support:

  1. Drop the inner recursions (for character classes and callouts).
    OR
  2. Drop the outer (guts) recursion. The two inner recursions are non-overlapping so they would then be fine.

Dropping the inner recursions

  • The regex I shared earlier matches callouts without using recursion, with 6 levels of depth handled accurately (and anything beyond that still handled reasonably, with the same matches compared to the prior, conditional-based solution).
  • The character class recursion can be replaced with a version that supports a fixed level of nesting. E.g. with 3 levels (which is probably enough for ~99.99% of real-world regexes):
\[ (?:
  \\. |
  [^\[\]\\] |
  (?: \[ (?:
    \\. |
    [^\[\]\\] |
    (?: \[ (?: \\. | [^\[\]\\] )+ \] )
  )+ \] )
)+ \]

Not pretty, but also not hard to reason about. Also note:

  • I added \\ to the negated classes. It isn't necessary but is more explicit and reduces backtracking a tiny bit.
  • This might need to get a bit more complex if Swift allows an unescaped ] as a literal char when it's the first char in a class.
    • If so, that support is also currently lacking in your recursion-based version.
    • To support this, you could just change the char-class-opener delimiters (the \[ tokens outside of the char classes) to \[\^?\]?.
  • It would be straightforward enough (though verbose, as you can see) to extend this to support any depth level you want, even up to Oniguruma's 20-level depth. I'd be happy to write an extended version if that helps!

Dropping the outer recursion

Given your explanation that this regex is only used to determine the regex's boundaries, as far as I can tell, accurately balancing parentheses is not important for that (unlike balancing character class brackets, which does affect regex boundaries since you're allowing unescaped literal / within character classes). You mentioned that, "Once the boundaries are determined, simpler rules are used to provide detailed scopes/colors for all the different regex syntax." So, it sounds like that could continue to highlight nested groups correctly even if you refactored/removed the guts recursion.

If dropping the outer recursion is feasible, it's probably the option I'd take.

@jtbandes
Copy link
Owner

Thanks for all the input. I think it might be prudent to add some real tests to this repo before committing to one of the solutions. Maybe Shiki (with the oniguruma engine) is a good tool to use for creating tests :)

@slevithan
Copy link
Author

slevithan commented Jan 21, 2025

Maybe Shiki (with the oniguruma engine) is a good tool to use for creating tests :)

Hopefully so! I'm not all that knowledgeable about Shiki apart from its JS engine, though. @RedCMD has written a bunch of TM grammars (including this one for highlighting TM grammars!) so might be knowledgeable about TM grammar test setup (but maybe you already have that down). 😊

In case it's helpful: If you just want a simple way to run vscode-oniguruma, you can use or copy from my very basic onig:match console script here.

BTW, I didn't realize a regex was using overlapping recursion when I opened this issue, since it was hidden by the error about conditionals in the same regex. 😓

@RedCMD
Copy link
Contributor

RedCMD commented Jan 21, 2025

TM grammar test setup

yes there's vscode-tmgrammar-test and textmate-tester
TypeScript-TmLanguage also has its own testing

I haven't actually used any before
I just take a visual look over a test file
makes it easy when the file your coding in is also a test file

@slevithan
Copy link
Author

slevithan commented Jan 21, 2025

Perhaps of interest: conditionals and overlapping recursions are the only two features not supported by oniguruma-to-es v2.3.0 (just released) that are present in any of the TM grammars provided by Shiki.

It's exciting that you're continuing to move this forward despite the challenges! Getting Swift working in Shiki's JS engine will not only mean that Swift syntax highlighting can have faster load times for many sites across the web, but will also be the last big step in the journey to 100% grammar support. More details about supporting all grammars are in this Shiki issue, if you're interested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants