-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider refactoring/removing regex conditionals to allow support in Shiki's JS engine #16
Comments
Thanks for reaching out and sharing the detailed background info! I definitely could be open to tweaking the grammar for improved compatibility, and already had to do so for VS Code (e.g. 711310c, b788c77, efe7774). If you’re interested in making a PR to propose specific changes, I could review it. I haven’t spent much time on this grammar in the last several months, but can try to do some with what little is left of the holiday season :) Is there an easy way to test compatibility with Shiki? Like what would be the best way to iterate on the grammar and see how it looks with this test file? I’m also generally curious about the approach of “transpiling” oniguruma patterns, rather than implementing/using an oniguruma engine—is it mostly for performance and bundle size reasons? Otherwise, I would think the oniguruma wasm approach would be fairly reliable? |
Thanks! Unfortunately I don't think I'm the right person to refactor these regexes, since I don't have the requisite knowledge of the Swift language to make smart choices and test properly. So if you're able to do it that would be fantastic.
Yes, it's for bundle size and performance reasons. Shiki offers both a JS engine (which transpiles regexes) and WASM engine. The benefits of the JS engine are most relevant when running syntax highlighting in the browser, since |
I think you'd just use the latest/development version of Shiki as normal with a custom language (your modified grammar). CC Shiki's author @antfu in case there is anything that would help with iterating on grammar development and testing with Shiki as you go. Aside: Here's how you'd run Shiki's JS engine compatibility report:
That runs the script here, and its output goes to Also possibly helpful is the |
Sitting down with this a little longer, you've definitely picked out the most salient example of where conditionals are used. (Aside: it looks like all the places I ended up using conditionals are in the patterns related to Swift's regex literals. They were definitely one of the trickier things to write patterns for, and this grammar certainly doesn't do it perfectly...if that's even possible, which I doubt it is.) I kinda understand the alternatives you're suggesting, but it does get tougher in the context of the whole pattern. I'd be curious to get your thoughts on how to handle this. Apologies as what follows is a bit of a stream-of-consciousness rather than a concrete question...
The structure of the whole pattern looks like
The intent here is to handle:
Thinking about it a bit more, I suspect the first two The conditionals inside the One idea would be to restructure the whole pattern as (roughly)
but this requires duplicating the whole "guts" of the pattern in between the start+end delimiters, or possibly separating it into two nearly-identical patterns (assuming I don't discover some reason that it has to be a single pattern). Maybe that's the only option? |
The difficulty I have with collaborating on refactoring these particular regexes for parsing Swift regex literals is that I don't know Swift's regex syntax (and don't know Swift, so it would be harder to test). Even if I read the whole documentation page you linked, I'm not optimistic it would cover all the edge cases (which regex syntax is usually chock full of). But, at a high level, I think the best path to avoid conditionals, given the need to match all of For the regex that allows Is there any way to embed a pattern within a pattern in a TM grammar? If so, that would avoid the need for redundancy. Obviously the redundancy could be avoided by using a script to generate the grammars, but I'd understand if you don't think it's appropriate to go down that road. Splitting the regexes in two would not be all bad, since in addition to simplifying the delimiter logic, I imagine it would simplify at least some of the logic inside (given the change in meaning for unescaped Another way to reduce redundancy would be to simplify the overall pattern. On that front, I think the following...
...can be simplified to |
Relates to #16 / cc @slevithan The conditional negative lookaheads were conditional based on absence of `#` in the opening delimiter. This can be simplified by including `/` in the negative lookaheads and moving them before the opening delimiter. (The comment rule seems like it is not actually necessary since `#comment` is given a higher priority, but I kept it for good measure.)
Oh, also, if it makes things easier, the regex doesn't literally need to be split in two. Instead it could be one long free-spaced regex with |
It's been a while, but I think the reason I didn't do this was because I didn't want to restrict the interior to |
That example string is matched just fine by the recursive pattern. 😊 |
Sorry, bad example. I think this is also valid: |
What about this?
That handles your example, and then some. Here's the original sub-pattern I'm comparing it to:
And here are my tests, which I ran against both the original and my new regex, with identical results: Matches:
Doesn't match:
I believe the only difference is that the new regex matches recursive
Edit: After additional testing, tweaked the regexes and removed my (invalid) concern about the |
swift-tmlanguage/Swift.tmLanguage.yaml Line 1895 in d0ccac7
this is the easiest to fix I found an interesting quirk of how backreferences work however it is definitely not compatible with JS looking at the Callouts swift-tmlanguage/Swift.tmLanguage.yaml Line 1994 in d0ccac7
if Swift follows Oniguruma then
this will work however Oniguruma limits recursion to 20 levels swift-tmlanguage/Swift.tmLanguage.yaml Line 1838 in d0ccac7
for the hashtag delimiters (?(3) (?=(?<!\s)/|/(?=#))(?=\2) must end with 1 or more # OR end with no pre-whitespace swift-tmlanguage/Swift.tmLanguage.yaml Line 1857 in d0ccac7
same thing but put it with guts (?>(?<guts>...)(?=(?<!\\s)/|/#))?+ or just duplicate the whole regex again |
→ #19 Thanks as always for the cursed advice :) |
I like this and it seems you are both heading in the same direction (using
This does seem to work fine I'm not sure why the edit: I see that when surrounded with the
I think that's ok :) |
I don't think catastrophic backtracking is even possible |
→ #20 |
Since you only enforced 6 levels of balancing in the original, I guess you could just code out each of the options:
Apply free-spacing as you wish, but this is significantly shorter than the original, and it works the same (for reals this time, apart from |
Hm - would you advocate for that over @RedCMD’s version? I’m not sure how to think about which is better other than some obscure performance issues that I’d probably have to spend another couple hours thinking about to grasp fully 😄 |
Well, leaving aside whatever the Swift callout syntax is that should be matched, if you compare your original regex and my latest regex (which both match the same strings, and neither of which use recursion) to This version from @RedCMD with recursion doesn't match things like Re: perf, I think they're both fine, but my version without recursion will often require less backtracking and perform at least slightly better. My version without recursion is in general easier to reason about (since it's less clever), and it's easy to visually verify that it will not suffer backtracking-related perf problems because all of the alternatives are mutually exclusive. So yes, I would advocate for it unless you were also trying to change what the regex matched. |
PS: Would be happy for @RedCMD to weigh in. For the record, I'm a fan of Red's regex/Oniguruma work. |
Yeah, I'm honestly not sure if these are supposed to be matched or not 😬 It looks like Xcode currently rejects I'm not going to claim the original patterns were 100% correct. I don't think we can even define 100% correct given Swift doesn't fully support all this syntax yet – they've just described it and will probably eventually support it someday. Here's a set of test cases I came up with https://regex101.com/r/JzV2D5/2 and it seems like the version in #20 is working as expected so I'll probably just go with that for now. |
) More work towards #16 If we wanted to capture the `{` `}` delimiters with some scopes, I think we might run into microsoft/vscode-textmate#164 & related issues. But for now we aren't highlighting them so it's fine. --------- Co-authored-by: RedCMD <theredcmd@gmail.com>
I just copied how oniguruma parses its callouts I don't think catastrophic backtracking is even possible since both sides of the group are mutually exclusive
|
I think this should do it: #21 |
It's hard to know for certain in advance how things will go after resolving errors (since errors can affect subsequent matches), but I think that if you're able to avoid the newly-added overlapping recursions, everything will work correctly. It's possible though that you'll need to use Shiki 1.27.1+ (which added support for some
Yes, it's your addition of a (Emulating overlapping recursion [which is extremely rarely used] within JS is technically possible but would lead to exponentially growing regex sizes and would be terrifying to debug.)
Does |
Rather than just That's why the pattern needed to include some recursion — it needs to ensure the coarse structure of the whole regex is correct, so it has to understand things like escapes, groups, character classes etc. (and support recursively nesting those). |
mine is slightly faster :) tho the recursion does cause problems with es and when scoping the brackets spiced up version of @slevithan's \(\?(?>{{{{{.+?}}}}}|{{{{.+?}}}}|{{{.+?}}}|{{.+?}}|{.+?})\)
are you saying this doesn't work? :( (?<object>{\\s*+(?>\"(?>[^\\\\\"]++|\\\\.)*+\"\\s*+:\\s*+(?>\\g<object>|(?<value>\\[\\s*+(?>(?>\\g<object>|\\g<value>)\\s*+,?+\\s*+)*+]|\"(?>[^\\\\\"]++|\\\\.)*+\"|true|false|null|-?+(?>0|[1-9][0-9]*+)(?>(?>\\.[0-9]++)?+(?>[eE][+-]?+[0-9]++)?+)?+))\\s*+,?+\\s*+)*+}) |
That one seems to work fine in the playground: https://slevithan.github.io/oniguruma-to-es/demo/ |
[Reposted after several big edits with more details/options] Nested callout braces solution comparison
Those tests show the number of backtracking steps for yours is higher. 😊 And it's possible to construct cases where the difference in steps is greater than in those examples. Backtracking steps is IMO much better to compare than microseconds of runtime, because steps doesn't suffer from all the usual issues that can affect microbenchmarks and that are out of our control. regex101 isn't even testing the right regex engine (PCRE2 vs Oniguruma), so given that it's definitely better to compare steps over runtime.
That matches different strings and uses more backtracking that the version I shared, so it's not a fair comparison. In any case, microbenchmarks are probably a distraction (my fault--I should have left it at "they're both fine"). But the difference in matched strings likely matters. For example, unlike the version I shared:
Overlapping recursionsAlas, lack of support for overlapping recursions is described in the
I missed the obvious on my first look--that the I see two paths to work around this, neither of which has to be negative on language support:
Dropping the inner recursions
\[ (?:
\\. |
[^\[\]\\] |
(?: \[ (?:
\\. |
[^\[\]\\] |
(?: \[ (?: \\. | [^\[\]\\] )+ \] )
)+ \] )
)+ \] Not pretty, but also not hard to reason about. Also note:
Dropping the outer recursionGiven your explanation that this regex is only used to determine the regex's boundaries, as far as I can tell, accurately balancing parentheses is not important for that (unlike balancing character class brackets, which does affect regex boundaries since you're allowing unescaped literal If dropping the outer recursion is feasible, it's probably the option I'd take. |
Thanks for all the input. I think it might be prudent to add some real tests to this repo before committing to one of the solutions. Maybe Shiki (with the oniguruma engine) is a good tool to use for creating tests :) |
Hopefully so! I'm not all that knowledgeable about Shiki apart from its JS engine, though. @RedCMD has written a bunch of TM grammars (including this one for highlighting TM grammars!) so might be knowledgeable about TM grammar test setup (but maybe you already have that down). 😊 In case it's helpful: If you just want a simple way to run BTW, I didn't realize a regex was using overlapping recursion when I opened this issue, since it was hidden by the error about conditionals in the same regex. 😓 |
yes there's vscode-tmgrammar-test and textmate-tester I haven't actually used any before |
Perhaps of interest: conditionals and overlapping recursions are the only two features not supported by It's exciting that you're continuing to move this forward despite the challenges! Getting Swift working in Shiki's JS engine will not only mean that Swift syntax highlighting can have faster load times for many sites across the web, but will also be the last big step in the journey to 100% grammar support. More details about supporting all grammars are in this Shiki issue, if you're interested. |
Context: Shiki offers a JavaScript regex engine that includes highly accurate Oniguruma emulation, and supports the vast majority of TextMate grammars. The JS engine allows lightweight use of TM grammars without requiring a large Oniguruma WASM bundle, which is especially beneficial in browsers. In the few cases where a grammar (like Swift) isn't supported, it's due to one of the following reasons:
vscode-textmate
, but the error is not ignored by the JS engine). (Applies to the Ada and Hack grammars.)The Swift grammar falls under number 4, due to it's use of conditionals in 3 regexes. It is the only grammar of the 218 currently included with Shiki that uses conditionals. If these conditionals were refactored, the Swift grammar would be supported.
Some uses of conditionals can be refactored without any change in meaning. A simple example would be changing
(<)?foo(?(1)>)
to(?:(<)foo>|foo)
, and for cases as simple as that, perhaps future versions of Oniguruma-To-ES could be updated to translate them automatically. In other cases, deeper refactoring is required, or in some cases a regex might need to be split into multiple regexes.As a real example, here's the first use of conditionals in the Swift grammar:
Stripping the comments, we get
(((\#+)?)/) (?(3)|(?!/)) (?(3)|(?!\s))
, and this could be converted to(((\#+))/ | /(?![/\s]))
with the same matches and same values for captures 1 and 3. Unfortunately, capture 2's behavior changes subtly, since it will now be nonparticipating (rather than matching an empty string) if the first alternative isn't matched, which affects backreferencing behavior in Oniguruma. However, this can probably be addressed by simply changing later uses of\2
to\2?
. This is an example of refactoring that can't be done automatically, but might not present a major challenge if you were open to hand-editing the regexes to remove conditionals.What do you think? Are you open to refactoring/removing conditionals in the grammar?
Although this would currently only add support for your grammar to Shiki's JS engine, the JS engine (which is new) has seen significant adoption as it has continued to improve to its current robust state. I also expect more libraries that work with TM grammars to adopt Oniguruma-To-ES in the future. Removing the conditionals would mean that the Swift grammar is supported by all of them.
CC @antfu
The text was updated successfully, but these errors were encountered: