Potential features based on oniguruma-to-es #18

Open
slevithan opened this issue Jan 3, 2025 · 11 comments

@slevithan

slevithan commented Jan 3, 2025

Context: oniguruma-to-es is an advanced Oniguruma to JavaScript transpiler that's written in JS. It was first released recently, and has quickly improved. It's used by Shiki's JS engine and supports more than 97% of TM grammars provided with Shiki (it's handling more than 99.9% of regexes in these grammars, but one unsupported or invalid regex removes support for the grammar). Some details are here about supporting the few remaining grammars, if you're interested.

Do you think there might be opportunities to enhance TmLanguage-Syntax-Highlighter using oniguruma-to-es? For example:

  • You could inform users when a grammar won't be supported by Shiki's JS engine.
  • You could show what a particular Oniguruma regex looks like when transpiled to JS (so people more familiar with JS regexes can understand where there are differences in meaning).
  • The error messages given by oniguruma-to-es for invalid Oniguruma patterns could potentially be helpful when writing/debugging grammars.

Happy to answer any questions. But feel free to close this without comment if you don't think it's a good fit.

@RedCMD
Owner

RedCMD commented Jan 4, 2025

I've been keeping an eye on oniguruma-to-es for a while now
is a very cool project indeed

I could add a feature to show what the onig regexes would look like in JS
using a hover/button/command
and extend the error reporting

do the error messages give the position of the error?

are there any plans for JS to onig?
then I could add a convert js regex into onig paste option

have you tried parsing the grammars in this repo?
I know I like to use conditionals and absents :)

would there be support for other versions of oniguruma?
cause VSCode uses oniguruma 6.9.8 and Apple's TextMate 2.0 uses v5.9.6 iirc

do you currently support all characters in group names? (as long as the first character is _a-zA-Z)
eg. (?<name@%_0-9>b)\g<name@%_0-9> is valid onig
but \k<name@%_0-9> is not valid

@slevithan
Author

slevithan commented Jan 4, 2025

I've been keeping an eye on oniguruma-to-es for a while now
is a very cool project indeed

Thanks, glad to hear it. 😊

I could add a feature to show what the onig regexes would look like in JS using a hover/button/command

I think that would be very cool.

For this, it might help to be aware of the avoidSubclass option. For example, you get the following results for the pattern .++:

With default options:

toDetails('.++')
/* →
{ pattern: '(?:(?=($E$[^\\n]+))\\1)',
  flags: 'v',
  options: {
    useEmulationGroups: true,
  },
}
*/

toRegExp('.++')
/* →
new EmulatedRegExp('(?:(?=($E$[^\\n]+))\\1)', 'v', {
  useEmulationGroups: true,
})
*/

With avoidSubclass:

toDetails('.++', {avoidSubclass: true})
/* →
{ pattern: '(?:(?=([^\\n]+))\\1)',
  flags: 'v',
}
*/

toRegExp('.++', {avoidSubclass: true})
/* →
/(?:(?=([^\n]+))\1)/v
*/

// Alternatively, even when not using `avoidSubclass` you can do...
toRegExp('.++').toString()
/* →
'/(?:(?=([^\\n]+))\\1)/v'
...or read the regexp's `.source` and `.flags`
*/

The latter values don't include the $E$ marker (or sometimes $N$E$, where N is an integer 1 or greater) used for injected "emulation groups". All of these results match exactly the same strings. The difference is only in the properties of match results. EmulatedRegExp does some fancy things to hide emulation groups from results (and in some cases to transfer captured values between subpattern results to match Oniguruma's handling).

Note that if you pass values from toDetails as arguments to the EmulatedRegExp constructor (or optionally to RegExp if there is no options property on the returned object), you get the same result as from toRegExp. There are some additional details in the docs for avoidSubclass and in shikijs/shiki#878.
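The lookahead-plus-backreference shape in the output above can be tried with plain JavaScript regexes. The following is a minimal sketch that uses only the built-in RegExp (not oniguruma-to-es itself) and swaps in `a+` for the inner pattern so the effect of possessiveness is visible; the variable names are illustrative.

```javascript
// Greedy quantifier: `a+` can give back one "a" so the trailing `a` can match.
const greedy = /a+a/;

// Emulated possessive `a++a`: the lookahead captures the longest run of "a",
// and the backreference \1 consumes exactly that text. Lookaheads are atomic
// in JS, so the engine can't backtrack into the run to free up an "a".
const possessive = /(?:(?=(a+))\1)a/;

console.log(greedy.test('aaa'));     // true
console.log(possessive.test('aaa')); // false

// The transpiled form of `.++` from the thread, without the emulation marker.
const dotPossessive = /(?:(?=([^\n]+))\1)/;
console.log(dotPossessive.exec('ab\ncd')[0]); // 'ab'
```

Note that the plain-RegExp version exposes the helper capture as group 1 in match results, which is the difference in match properties described above.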

@slevithan
Author

slevithan commented Jan 4, 2025

Reporting error positions

[...] and extend the error reporting
do the error messages give the position of the error?

No. That would be nice and I'd welcome contributions that enable this, but it would be difficult because in some cases (subroutines are an example) the generated results are fairly scrambled compared to the input, and errors can come from the tokenizer, parser, transformer, or code generator. But if you only wanted to know whether it's a valid Oniguruma regex (minus features that oniguruma-to-es doesn't yet support when generating its Oniguruma AST), then you only need to worry about toOnigurumaAst which only calls the tokenizer and parser. The tokenizer already includes a .raw property on tokens and as a result it would probably be easy to add a position property. Adding raw and position properties to the parser's AST output would presumably be significantly more work, but doable.

That said, maybe the errors are still useful without a position? In general, oniguruma-to-es's errors are more specific and understandable than the errors that the actual Oniguruma gives, and further improvements to error specificity/messages are certainly possible/welcome.
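As a rough illustration of the idea of deriving positions from each token's `.raw` value, here is a sketch. Only the `raw` property comes from the description above; the token `type` names, the `addPositions` helper, and the assumption that the tokens' `raw` strings are contiguous and cover the whole pattern are hypothetical.

```javascript
// Hypothetical helper: walks tokens in order and records where each one
// starts and ends, assuming the `raw` strings concatenate to the full pattern.
function addPositions(tokens) {
  let offset = 0;
  return tokens.map((token) => {
    const start = offset;
    offset += token.raw.length;
    return {...token, start, end: offset};
  });
}

// Made-up tokens for the pattern `(?<n>a)`; the type names are illustrative.
const tokens = [
  {type: 'GroupOpen', raw: '(?<n>'},
  {type: 'Character', raw: 'a'},
  {type: 'GroupClose', raw: ')'},
];

const positioned = addPositions(tokens);
console.log(positioned[1]); // { type: 'Character', raw: 'a', start: 5, end: 6 }
```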

JS RegExp → Oniguruma

are there any plans for JS to onig? then I could add a convert js regex into onig paste option

Not currently. JS RegExp to Oniguruma would be a cool feature, but it has more limited use cases that I personally don't have.

However, I would welcome it if you wanted to collaborate on this. Compared to going from an Oniguruma AST to a JS RegExp, going from a JS RegExp AST to an Oniguruma pattern would be dramatically simpler. So most of the complex work would be in building a JS RegExp AST. But then, there are of course existing JS RegExp AST builders. The best / most up to date one is probably eslint-community/regexpp. If you used that, going from JS RegExp to Oniguruma wouldn't need to be a huge project like oniguruma-to-es, at least for someone (like yourself) with preexisting in-depth knowledge of Oniguruma and JS RegExp syntax/behavior.

Aside: Eventually I'd love to create a lightweight AST builder for Regex+ syntax. Regex+ syntax is a strict superset of JS RegExp syntax with flag v, so by including support for Regex+'s syntax extensions via options in the parser (or some kind of a plugin system), you'd get a JS-RegExp-with-v parser for free. And it could be further simplified by only supporting RegExp syntax from the latest ES version.

Support for absence and conditionals

have you tried parsing the grammars in this repo?

No, but I generally know the currently-missing features. They're documented in oniguruma-to-es's readme, or at least hinted at (e.g. for \G it mentions that common uses are supported, and gives some examples of supported cases). Of course, I'd love to learn about anything I'm missing.

I know I like to use conditionals and absents :)

Absent repeaters and absent expressions can be emulated, and I plan to support them in future versions. See the tracking issue here: slevithan/oniguruma-to-es#13 😊

Some conditionals can be emulated. E.g. it would be pretty straightforward to change a basic case like (<)?foo(?(1)>) to (?:(<)foo>|foo). But this would be more complicated or break down in some other cases, sometimes for quite nuanced reasons. If only JS didn't make backreferences to nonparticipating groups match the empty string, there would be additional strategies for emulating conditionals (something I wrote about back in 2007). 😞 I don't currently plan to add support for conditionals, but contributions that add support for basic cases would be welcome.
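A quick check of the basic rewrite mentioned above, using only the built-in RegExp. The `^` and `$` anchors are added here just to make the test strings unambiguous. The second regex illustrates the JS behavior for backreferences to nonparticipating groups.

```javascript
// Oniguruma source (not valid JS syntax): ^(<)?foo(?(1)>)$
// Rewritten as an alternation of the two possible paths.
const rewritten = /^(?:(<)foo>|foo)$/;

console.log(rewritten.test('<foo>')); // true
console.log(rewritten.test('foo'));   // true
console.log(rewritten.test('<foo'));  // false
console.log(rewritten.test('foo>'));  // false

// In JS, a backreference to a group that didn't participate matches the
// empty string, so \1 can't be used as an "only if group 1 matched" check.
const nonparticipating = /^(a)?\1b$/;
console.log(nonparticipating.test('b')); // true
```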

Aside: Oniguruma edge cases make the (?(…)…) structure relatively complex to deal with comprehensively, since the first can be any arbitrary regex, and the second can be empty (turning the conditional into a backreference checker) or include any number of top-level | (other regex flavors restrict it to one | for then/else).

Emulating older versions of Oniguruma

would there be support for other versions of oniguruma? cause VSCode uses oniguruma 6.9.8 and Apple's TextMate 2.0 uses v5.9.6 iirc

Supporting older versions is not currently planned but is possible. I'd welcome contributions that add this in a maintainable way.

Invalid JS identifiers as group names

do you currently support all characters in group names? (as long as the first character is _a-zA-Z)
eg. (?<name@%_0-9>b)\g<name@%_0-9> is valid onig [...]
but \k<name@%_0-9> is not valid

oniguruma-to-es internally distinguishes between names that are valid in Oniguruma vs JS. Currently, it restricts to group/subroutine/backreference names that are valid in both Oniguruma and JS, which is noted in the readme under Supported features → Groups → Named capturing.

Supporting group/subroutine/backreference names that are invalid JS identifiers would require:

  1. Automatically changing or removing the names during transpilation.
  2. Special handling in EmulatedRegExp to add the original name to match results.

I'm not currently planning to support this since I consider it low priority (and I'd encourage TM grammar authors that use invalid JS identifiers as group names to update their regexes), but I'd welcome contributions that add support for this.
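The two steps listed above could look roughly like the following. This is a naive sketch, not how oniguruma-to-es works: it rewrites names with a regex replace instead of a real parser (so it would be fooled by escapes or character classes), the generated `_gN` names could collide with existing names, and `restoreGroupNames` is a standalone stand-in for the special handling that would live in EmulatedRegExp. Both function names are hypothetical.

```javascript
// JS identifier rule for group names (\p{IDS} = ID_Start, \p{IDC} = ID_Continue).
const JS_IDENTIFIER = /^[$_\p{IDS}][$\u200C\u200D\p{IDC}]*$/u;

// Step 1 (hypothetical): swap names that aren't valid JS identifiers for
// generated ones, remembering the mapping. Lookbehinds `(?<=` and `(?<!`
// are skipped because the name can't start with `=` or `!`.
function renameInvalidGroupNames(pattern) {
  const safeToOriginal = new Map();
  const originalToSafe = new Map();
  const toSafe = (name) => {
    if (JS_IDENTIFIER.test(name)) return name;
    if (!originalToSafe.has(name)) {
      const safe = `_g${originalToSafe.size}`;
      originalToSafe.set(name, safe);
      safeToOriginal.set(safe, name);
    }
    return originalToSafe.get(name);
  };
  const source = pattern.replace(
    /(\(\?<|\\k<)([^=!>][^>]*)>/g,
    (_, open, name) => `${open}${toSafe(name)}>`
  );
  return {source, safeToOriginal};
}

// Step 2 (hypothetical): put the original names back on the match's groups.
function restoreGroupNames(groups, safeToOriginal) {
  const restored = {};
  for (const [key, value] of Object.entries(groups ?? {})) {
    restored[safeToOriginal.get(key) ?? key] = value;
  }
  return restored;
}

const renamed = renameInvalidGroupNames('(?<=y)(?<a@b>x)(?<ok>z)');
const match = new RegExp(renamed.source).exec('yxz');
console.log(renamed.source); // (?<=y)(?<_g0>x)(?<ok>z)
console.log(restoreGroupNames(match.groups, renamed.safeToOriginal)); // { 'a@b': 'x', ok: 'z' }
```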

@slevithan
Author

slevithan commented Jan 4, 2025

Aside: It's obvious you have extremely in-depth and hard-won knowledge of Oniguruma's nuances and complexity. Even if you don't end up using oniguruma-to-es in this library, if you're ever interested to play with it, I'd find your feedback extremely valuable. 😊 The demo page hopefully makes that easier.


Edit: Thanks for all the fantastic and detailed issues you've filed!! They've now all been addressed, with fixes published in v1.0.0.

@slevithan
Author

do you currently support all characters in group names? (as long as the first character is _a-zA-Z) eg. (?<name@%_0-9>b)\g<name@%_0-9> is valid onig [...] but \k<name@%_0-9> is not valid

Adding to what I mentioned above...

Although the more restrictive rules for JS group names are understood and accurately handled (code here), I don't yet fully understand the Oniguruma group name rules (and the differences that govern names allowed in backreferences and subroutines). I took a brief look at the Oniguruma source code but group name parsing is not so straightforward. Anything more you can explain about the rules would be helpful!

What you mentioned above does not seem to be fully correct, based on testing Oniguruma 6.9.8 via vscode-oniguruma.

  • E.g., (?<日本語>…) is valid.
  • E.g., (?<a)b>…) is not valid.

@RedCMD
Owner

RedCMD commented Jan 17, 2025

there is an exception for named groups not allowing )
https://github.com/kkos/oniguruma/blob/master/src/regparse.c#L5051
https://github.com/kkos/oniguruma/blob/master/src/regparse.c#L5077-L5080

all allow using <name> or 'name' as brackets
<name' and 'name> are not allowed

named group (?<name>) doesn't have recursion levels

  • must start with a word character _a-zA-Z (non-ascii characters allowed)
  • followed by any number of any characters except ) and > (or ')

subroutine \g<name> doesn't have recursion levels

  • must start with a word character _a-zA-Z (non-ascii characters allowed)
  • followed by any number of any characters except > (or ')
    OR
  • leading 0's allowed
  • a valid number. 0 is allowed. limit of 1000??
    OR
  • start with + or -
  • leading 0's allowed
  • a valid relative number. 0 is not allowed. limit of 1000??

conditional (?(<name>)) and backreference \k<name> do allow recursion levels

  • must start with a word character _a-zA-Z (non-ascii characters allowed)
  • followed by any number of letters, underscores and numbers (non-ascii characters allowed)
    OR
  • leading 0's allowed
  • a valid group number. 0 is not allowed. limit of 1000??
    OR
  • start with + or -
  • leading 0's allowed
  • a valid relative number. 0 is not allowed. limit of 1000??
    Then: optional recursion level
  • start with + or -
  • leading 0's allowed
  • recursion level number. 0 is allowed. limit of 2^31-1?

@slevithan
Author

slevithan commented Jan 17, 2025

Thanks for breaking that down in detail!

must start with _a-zA-Z
[...] must start with a letter or underscore (non-ascii characters allowed)

Neither of these seem right.

E.g., as I pointed out above, (?<日本語>…) is valid. So the first char is not limited to [_a-zA-Z]. But it's also not simply allowing a Unicode equivalent:

  • It does seem to be allowing all connector punctuation rather than just underscore (ex: (?<⁀>…) is valid), but...
  • It's allowing non-ASCII decimal digits like ٤ (which are not matched by \p{Letter} or \p{Alphabetic}).
  • It's also allowing Unicode Marks (which is not so surprising), but then without seeing the exact rules it's possible some are not allowed.

So I still need to figure out exactly what the rules are.


For the record, oniguruma-to-es respects most of what you described, except:

  • Named groups currently require a valid JS identifier; matched by JS /^[$_\p{IDS}][$\u200C\u200D\p{IDC}]*$/u.
    • Eventually this should expand to follow Oniguruma's more permissive rules, but there are some complications to work through in order to do so in a comprehensive way.
  • Recursion levels in backreferences (ex: \k<n+level> or \k<name-level>) aren't supported.
    • Technically this is supportable, but since it's complicated, doesn't have good use cases, and is essentially never used, it's probably not worth it. (I've never seen recursion level used within backreferences or conditionals even once, outside of the Oniguruma docs.)
  • Relative positive indexes like \k<+1> aren't supported for backreferences, but are supported for subroutines. Relative negative indexes are supported for both.
    • An error is the best path for Oniguruma-To-ES for relative positive backreference numbers because ① most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), ② erroring matches the behavior of named backreferences (which aren't allowed to appear before the groups they reference), and ③ the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable.
  • Unenclosed backreferences are capped at \999.
    • Of course, escaped numbers might be something other than a backreference (null, octal, identity escape, error), depending on context.
  • Enclosed backreferences are not capped.
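To make the relative-index policy above concrete, here is a sketch of resolving a relative backreference number to an absolute group number. It assumes that `\k<-1>` refers to the nearest capturing group opened to the left of the backreference; the `resolveRelativeBackref` name and its parameters are hypothetical and not part of oniguruma-to-es.

```javascript
// `relative` is the signed number inside \k<...>; `groupsOpenedSoFar` is the
// count of capturing groups whose opening paren is to the left.
function resolveRelativeBackref(relative, groupsOpenedSoFar) {
  if (!Number.isInteger(relative) || relative === 0) {
    throw new Error('Relative index must be a nonzero integer');
  }
  if (relative > 0) {
    // Mirrors the policy described above: relative positive
    // backreference numbers are rejected with an error.
    throw new Error(`Relative positive backreference \\k<+${relative}> is not supported`);
  }
  const absolute = groupsOpenedSoFar + 1 + relative;
  if (absolute < 1) {
    throw new Error('Not enough capturing groups to the left');
  }
  return absolute;
}

console.log(resolveRelativeBackref(-1, 3)); // 3
console.log(resolveRelativeBackref(-3, 3)); // 1
```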

limit of 1000??

Backreferences and subroutines don't seem to error for numbers greater than 1000 if as many captures are defined to the left, but then, no regex with more than 999 captures works due to an apparent Oniguruma bug (it fails to match, with no error). So, for example, a pattern with 999 capturing groups in total, written as a long run of empty groups followed by a final group, ()()()…()(999), matches 999, but add an empty () on the end of the regex and it will no longer match anything.
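A test pattern like the one described can be generated rather than typed out. The sketch below builds it and runs it against JavaScript's built-in RegExp only, which just confirms the pattern is well-formed and has the intended number of captures; reproducing the Oniguruma behavior would need an Oniguruma binding (for example vscode-oniguruma), which is not shown. The `buildPattern` helper is hypothetical.

```javascript
// Builds a pattern with `captureCount` capturing groups in total:
// (captureCount - 1) empty groups followed by a final `(999)` group.
const buildPattern = (captureCount) => '()'.repeat(captureCount - 1) + '(999)';

const with999 = new RegExp(buildPattern(999)).exec('999');
console.log(with999.length - 1); // 999 (number of capturing groups)
console.log(with999[999]);       // '999'

// One extra empty group on the end gives 1000 captures. JS still matches
// here, unlike the Oniguruma behavior reported above.
const with1000 = new RegExp(buildPattern(999) + '()').exec('999');
console.log(with1000 !== null);   // true
console.log(with1000.length - 1); // 1000
```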

@RedCMD
Owner

RedCMD commented Jan 17, 2025

I've edited it a bit (prob still wrong)

the word characters seem to be defined by ONIGENC_IS_CODE_WORD()
https://github.com/kkos/oniguruma/blob/master/src/oniguruma.h#L318-L319
ONIGENC_IS_CODE_CTYPE() with ONIGENC_CTYPE_WORD
https://github.com/kkos/oniguruma/blob/master/src/oniguruma.h#L219

then is_code_ctype() is branched off into many different files
not sure which one

doesn't seem to be any of the iso files, since they just check for < 256
https://github.com/kkos/oniguruma/tree/master/src

static int
is_code_ctype(OnigCodePoint code, unsigned int ctype)
{
  if (code < 256)
    return ENC_IS_ISO_8859_6_CTYPE(code, ctype);
  else
    return FALSE;
}

euc_jp.c and sjis.c are the same

static int
is_code_ctype(OnigCodePoint code, unsigned int ctype)
{
  if (ctype <= ONIGENC_MAX_STD_CTYPE) {
    if (code < 128)
      return ONIGENC_IS_ASCII_CODE_CTYPE(code, ctype);
    else {
      if (CTYPE_IS_WORD_GRAPH_PRINT(ctype)) {
        return (code_to_mbclen(code) > 1 ? TRUE : FALSE);
      }
    }
  }
  else {
    ctype -= (ONIGENC_MAX_STD_CTYPE + 1);
    if (ctype >= (unsigned int )(sizeof(PropertyList)/sizeof(PropertyList[0])))
      return ONIGERR_TYPE_BUG;

    return onig_is_in_code_range((UChar* )PropertyList[ctype], code);
  }

  return FALSE;
}

CTYPE_IS_WORD_GRAPH_PRINT maybe?
https://github.com/kkos/oniguruma/blob/master/src/regenc.h#L103

#define CTYPE_IS_WORD_GRAPH_PRINT(ctype) \
  ((ctype) == ONIGENC_CTYPE_WORD || (ctype) == ONIGENC_CTYPE_GRAPH ||\
   (ctype) == ONIGENC_CTYPE_PRINT)

idk

@RedCMD
Owner

RedCMD commented Jan 17, 2025

code_to_mbclen(code) > 1 ? TRUE : FALSE

euc_jp.c excludes 8bit ascii?

static int
code_to_mbclen(OnigCodePoint code)
{
  if (ONIGENC_IS_CODE_ASCII(code)) return 1;
  else if ((code & 0xff0000) != 0) {
    if (EncLen_EUCJP[(int )(code >> 16) & 0xff] == 3)
      return 3;
  }
  else if ((code & 0xff00) != 0) {
    if (EncLen_EUCJP[(int )(code >> 8) & 0xff] == 2)
      return 2;
  }
  else if (code < 256) {
    if (EncLen_EUCJP[(int )(code & 0xff)] == 1)
      return 1;
  }

  return ONIGERR_INVALID_CODE_POINT_VALUE;
}

sjis.c allows some 8bit ascii??

static int
code_to_mbclen(OnigCodePoint code)
{
  if (code < 256) {
    if (EncLen_SJIS[(int )code] == 1)
      return 1;
  }
  else if (code < 0x10000) {
    if (EncLen_SJIS[(int )(code >>  8) & 0xff] == 2)
      return 2;
  }

  return ONIGERR_INVALID_CODE_POINT_VALUE;
}

@slevithan
Author

slevithan commented Jan 17, 2025

Thanks for looking. Yeah, I went down a similar rabbit hole (but I'm guessing you're more familiar with C than I am) and also ended up with "idk". 🤷🏻‍♂️

This has nevertheless been quite helpful, and I'll probably end up with the right validation or at least something much closer as a result.

followed by any number of any characters except ) and > (or ')

You're probably already aware of this, but > and ' are allowed as non-start chars in group names so long as they're in the alternative enclosure type.

@slevithan
Author

slevithan commented Jan 20, 2025

I did some spot tests, and thought maybe I'd figured it out and it was equivalent to JS /^[[\p{Alpha}\p{M}\p{Pc}\p{N}]--\d][^)]*$/v. This was based on tests for first char that included the following:

  • Allows some Alpha that are not included in L.
  • Allows some M that are not included in Alpha.
  • Allows some Nd (not including ASCII 0-9) and No.
    • This shows that the rule is not [[:word:]] minus 0-9, since word doesn't include No.
  • Does not allow any other categories I tested except Pc, M, L, N.

But then I saw that while it allows some No like ¼ (U+00BC), it doesn't allow other No like ௰ (U+0BF0). And this isn't about Unicode version since U+0BF0 was added in Unicode 1.1. Once again: 🤷🏻‍♂️.

Since the rules seem needlessly idiosyncratic and hard to copy perfectly, I updated oniguruma-to-es to just limit to the more sane /^[\p{Alpha}\p{Pc}][^)]*$/u.
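For reference, this is how the simplified rule behaves on some of the names discussed in this thread. The regex is the one quoted above; the `isAllowedGroupName` wrapper is just an illustrative name. Non-ASCII test characters are written as escapes so they survive copy and paste.

```javascript
// Simplified group name rule: first char is Alphabetic or connector
// punctuation; the rest can be anything except `)`.
const GROUP_NAME = /^[\p{Alpha}\p{Pc}][^)]*$/u;
const isAllowedGroupName = (name) => GROUP_NAME.test(name);

console.log(isAllowedGroupName('name@%_0-9'));         // true
console.log(isAllowedGroupName('\u65E5\u672C\u8A9E')); // true  (日本語)
console.log(isAllowedGroupName('\u2040'));             // true  (⁀, connector punctuation)
console.log(isAllowedGroupName('a)b'));                // false (contains `)`)
console.log(isAllowedGroupName('1abc'));               // false (starts with an ASCII digit)
console.log(isAllowedGroupName('\u0664'));             // false (٤, a non-ASCII decimal digit)
console.log(isAllowedGroupName('\u00BC'));             // false (¼, category No)
```

The last two cases are ones where this rule is stricter than what was observed in Oniguruma.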
