Potential features based on oniguruma-to-es
#18
I've been keeping an eye on
I could add a feature to show what the onig regexes would look like in JS
do the error messages give the position of the error?
are there any plans for JS to onig?
have you tried parsing the grammars in this repo?
would there be support for other versions of
do you currently support all characters in |
Thanks, glad to hear it. 😊
I think that would be very cool. For this, it might help to be aware of the
With default options:
toDetails('.++')
/* →
{ pattern: '(?:(?=($E$[^\\n]+))\\1)',
flags: 'v',
options: {
useEmulationGroups: true,
},
}
*/
toRegExp('.++')
/* →
new EmulatedRegExp('(?:(?=($E$[^\\n]+))\\1)', 'v', {
useEmulationGroups: true,
})
*/
With
toDetails('.++', {avoidSubclass: true})
/* →
{ pattern: '(?:(?=([^\\n]+))\\1)',
flags: 'v',
}
*/
toRegExp('.++', {avoidSubclass: true})
/* →
/(?:(?=([^\n]+))\1)/v
*/
// Alternatively, even when not using `avoidSubclass` you can do...
toRegExp('.++').toString()
/* →
'/(?:(?=([^\\n]+))\\1)/v'
...or read the regexp's `.source` and `.flags`
*/
The latter values don't include the Note that if you pass values from |
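To make the emulation shown above more concrete, here is a minimal sketch in plain JavaScript (no library calls) that compares an ordinary greedy quantifier with the lookahead-plus-backreference shape from the `.++` output. The trailing `a`, the `compare` helper, and the test string are illustrative assumptions, and the `u` flag is used instead of `v` only so the snippet also runs on older engines.

```javascript
// Greedy: `[^\n]+` can give characters back so the trailing `a` can match.
const greedy = /[^\n]+a/u;

// Possessive emulation, same shape as the generated pattern for `.++`:
// the lookahead captures the longest run, then `\1` consumes exactly that
// run, so nothing can be given back to the trailing `a`.
const possessive = /(?:(?=([^\n]+))\1)a/u;

// Hypothetical helper for side-by-side comparison.
function compare(str) {
  return { greedy: greedy.test(str), possessive: possessive.test(str) };
}

console.log(compare('aaa'));
// → { greedy: true, possessive: false }
```

The greedy version backtracks one character to let the final `a` match, while the emulated possessive version cannot, which is the behavior a possessive quantifier is meant to have.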
Reporting error positions
No. That would be nice and I'd welcome contributions that enable this, but it would be difficult because in some cases (subroutines are an example) the generated results are fairly scrambled compared to the input, and errors can come from the tokenizer, parser, transformer, or code generator. But if you only wanted to know whether it's a valid Oniguruma regex (minus features that That said, maybe the errors are still useful without a position? In general,
JS RegExp → Oniguruma
Not currently. JS RegExp to Oniguruma would be a cool feature, but it has more limited use cases that I personally don't have. However, I would welcome it if you wanted to collaborate on this. Compared to going from an Oniguruma AST to a JS RegExp, going from a JS RegExp AST to an Oniguruma pattern would be dramatically simpler. So most of the complex work would be in building a JS RegExp AST. But then, there are of course existing JS RegExp AST builders. The best / most up to date one is probably eslint-community/regexpp. If you used that, going from JS RegExp to Oniguruma wouldn't need to be a huge project like
Aside: Eventually I'd love to create a lightweight AST builder for Regex+ syntax. Regex+ syntax is a strict superset of JS RegExp syntax with flag
Support for absence and conditionals
No, but I generally know the currently-missing features. They're documented in
Absent repeaters and absent expressions can be emulated, and I plan to support them in future versions. See the tracking issue here: slevithan/oniguruma-to-es#13 😊 Some conditionals can be emulated. E.g. it would be pretty straightforward to change a basic case like
Aside: Oniguruma edge cases make the
Emulating older versions of Oniguruma
Supporting older versions is not currently planned but is possible. I'd welcome contributions that add this in a maintainable way.
Invalid JS identifiers as group names
Supporting group/subroutine/backreference names that are invalid JS identifiers would require:
I'm not currently planning to support this since I consider it low priority (and I'd encourage TM grammar authors that use invalid JS identifiers as group names to update their regexes), but I'd welcome contributions that add support for this. |
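As a rough illustration of what "invalid JS identifiers" means for group names, the sketch below flags names that a JS `RegExp` would reject. The helper name and the sample names are hypothetical, and the pattern is an assumption based on the JS rules for group names (an ID_Start character, `$`, or `_`, followed by ID_Continue characters, `$`, ZWNJ, or ZWJ); it is not taken from the library's source and says nothing about which names Oniguruma itself accepts.

```javascript
// Assumption: approximates the JS grammar for RegExp group names.
const jsIdentifier = /^[$_\p{IDS}][$\u200C\u200D\p{IDC}]*$/u;

// Hypothetical helper: returns the names that JS would reject as group names.
function findInvalidJsGroupNames(names) {
  return names.filter((name) => !jsIdentifier.test(name));
}

// Made-up sample names, for illustration only.
console.log(findInvalidJsGroupNames(['word', '$x', 'a-b', '1st']));
// → [ 'a-b', '1st' ]
```

A grammar tool could run a check like this over the group names in a grammar to point authors at the names that would need renaming.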
Aside: It's obvious you have extremely in-depth and hard-won knowledge of Oniguruma's nuances and complexity. Even if you don't end up using
Edit: Thanks for all the fantastic and detailed issues you've filed!! They've now all been addressed, with fixes published in v1.0.0. |
Adding to what I mentioned above... Although the more restrictive rules for JS group names are understood and accurately handled (code here), I don't yet fully understand the Oniguruma group name rules (and the differences that govern names allowed in backreferences and subroutines). I took a brief look at the Oniguruma source code but group name parsing is not so straightforward. Anything more you can explain about the rules would be helpful! What you mentioned above does not seem to be fully correct, based on testing Oniguruma 6.9.8 via
|
there is an exception for named groups not allowing all allow using named group
subroutine
conditional
|
Thanks for breaking that down in detail!
Neither of these seems right. E.g., as I pointed out above,
So I still need to figure out exactly what the rules are. For the record,
Backreferences and subroutines don't seem to error for numbers greater than 1000 if as many captures are defined to the left, but then, no regex with more than 999 captures works due to an apparent Oniguruma bug (it fails to match, with no error). So, for example, |
I've edited it a bit (prob still wrong) the word characters seem to be defined by then doesn't seem to be any of the
static int
is_code_ctype(OnigCodePoint code, unsigned int ctype)
{
  if (code < 256)
    return ENC_IS_ISO_8859_6_CTYPE(code, ctype);
  else
    return FALSE;
}
euc_jp.c and sjis.c are the same
static int
is_code_ctype(OnigCodePoint code, unsigned int ctype)
{
  if (ctype <= ONIGENC_MAX_STD_CTYPE) {
    if (code < 128)
      return ONIGENC_IS_ASCII_CODE_CTYPE(code, ctype);
    else {
      if (CTYPE_IS_WORD_GRAPH_PRINT(ctype)) {
        return (code_to_mbclen(code) > 1 ? TRUE : FALSE);
      }
    }
  }
  else {
    ctype -= (ONIGENC_MAX_STD_CTYPE + 1);
    if (ctype >= (unsigned int )(sizeof(PropertyList)/sizeof(PropertyList[0])))
      return ONIGERR_TYPE_BUG;
    return onig_is_in_code_range((UChar* )PropertyList[ctype], code);
  }
  return FALSE;
}
#define CTYPE_IS_WORD_GRAPH_PRINT(ctype) \
  ((ctype) == ONIGENC_CTYPE_WORD || (ctype) == ONIGENC_CTYPE_GRAPH ||\
   (ctype) == ONIGENC_CTYPE_PRINT)
idk |
code_to_mbclen(code) > 1 ? TRUE : FALSE
euc_jp.c excludes 8bit ascii?
static int
code_to_mbclen(OnigCodePoint code)
{
  if (ONIGENC_IS_CODE_ASCII(code)) return 1;
  else if ((code & 0xff0000) != 0) {
    if (EncLen_EUCJP[(int )(code >> 16) & 0xff] == 3)
      return 3;
  }
  else if ((code & 0xff00) != 0) {
    if (EncLen_EUCJP[(int )(code >> 8) & 0xff] == 2)
      return 2;
  }
  else if (code < 256) {
    if (EncLen_EUCJP[(int )(code & 0xff)] == 1)
      return 1;
  }
  return ONIGERR_INVALID_CODE_POINT_VALUE;
}
sjis.c allows some 8bit ascii??
static int
code_to_mbclen(OnigCodePoint code)
{
  if (code < 256) {
    if (EncLen_SJIS[(int )code] == 1)
      return 1;
  }
  else if (code < 0x10000) {
    if (EncLen_SJIS[(int )(code >> 8) & 0xff] == 2)
      return 2;
  }
  return ONIGERR_INVALID_CODE_POINT_VALUE;
}
|
Thanks for looking. Yeah, I went down a similar rabbit hole (but I'm guessing you're more familiar with C than I am) and also ended up with "idk". 🤷🏻♂️ This has nevertheless been quite helpful, and I'll probably end up with the right validation or at least something much closer as a result.
You're probably already aware of this, but |
I did some spot tests, and thought maybe I'd figured it out and it was equivalent to JS
But then I saw that while it allows some Since the rules seem needlessly idiosyncratic and hard to copy perfectly, I updated |
Context: oniguruma-to-es is an advanced Oniguruma to JavaScript transpiler that's written in JS. It was first released recently, and has quickly improved. It's used by Shiki's JS engine and supports more than 97% of TM grammars provided with Shiki (it's handling more than 99.9% of regexes in these grammars, but one unsupported or invalid regex removes support for the grammar). Some details are here about supporting the few remaining grammars, if you're interested.
Do you think there might be opportunities to enhance TmLanguage-Syntax-Highlighter using oniguruma-to-es? For example:
oniguruma-to-es for invalid Oniguruma patterns could potentially be helpful when writing/debugging grammars.
Happy to answer any questions. But feel free to close this without comment if you don't think it's a good fit.
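For the grammar-debugging idea mentioned above, a minimal sketch of collecting conversion errors without positions might look like the following. `checkPatterns` and `stubConvert` are hypothetical names; the sketch assumes the converter passed in (for example `toRegExp` from oniguruma-to-es) throws on patterns it considers invalid or unsupported, and it uses a plain `RegExp` stand-in so it runs without the library installed.

```javascript
// Hypothetical helper: `convert` is any function that throws on a pattern it
// cannot handle. Returns one entry per failing pattern, with the error message.
function checkPatterns(convert, patterns) {
  const problems = [];
  for (const pattern of patterns) {
    try {
      convert(pattern);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      problems.push({ pattern, message });
    }
  }
  return problems;
}

// Stand-in converter (an assumption for this sketch, not the library's API).
const stubConvert = (pattern) => new RegExp(pattern, 'u');

// '(' is an unterminated group, so it should be reported; 'a+' should not.
console.log(checkPatterns(stubConvert, ['a+', '(']));
```

Passing the converter in as a parameter keeps the sketch independent of any particular library version; the messages would come from whichever stage of the real converter raised the error.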