Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape posix bracket #23

Closed
RedCMD opened this issue Jan 5, 2025 · 7 comments
Closed

Escape posix bracket #23

RedCMD opened this issue Jan 5, 2025 · 7 comments

Comments

@RedCMD
Copy link

RedCMD commented Jan 5, 2025

Posix bracket can be escaped

details of posix bracket were changed in v6.9.9 #253

[[:-\:]] matches :

image

@slevithan
Copy link
Owner

Thanks for the report. I've previously seen your report at kkos/oniguruma#253 but the details are a little hard to follow.

The current handling in oniguruma-to-es's tokenizer starts here and continues here.

If I want to follow the handling from the latest version of Oniguruma (6.9.10), can you clarify what the change should be?

@RedCMD
Copy link
Author

RedCMD commented Jan 6, 2025

I see
v6.9.8 used to use str_exist_check_with_esc()
https://github.com/kkos/oniguruma/blob/v6.9.8/src/regparse.c#L5398
which is now changed in v6.9.9 to use is_posix_bracket_start()
https://github.com/kkos/oniguruma/blob/master/src/regparse.c#L5521

so VSCode allows escaping posix brackets [[:upper\\:]]
and I assume the latest oniguruma version now no longer supports posix escaping

@slevithan
Copy link
Owner

slevithan commented Jan 6, 2025

oniguruma-to-es emulates the latest version of Oniguruma (currently v6.9.10). So I'll go ahead and close this since it sounds like no change is needed for matching the behavior of v6.9.9+.

That said, it's helpful to have this documented here in case future versions of oniguruma-to-es add support for emulating older versions of Oniguruma (previously discussed in RedCMD/TmLanguage-Syntax-Highlighter#18 (comment)). But then, since the former behavior was considered an error by Oniguruma's author ("incorrect POSIX brackets" mentioned in the release notes) it might make sense to treat this as an error even if/when supporting older versions in the future, based on the reasoning described in #21 (comment).

@slevithan
Copy link
Owner

slevithan commented Jan 8, 2025

Although you're already aware of this @RedCMD, I'll note for other people that the changes in behavior for existing syntax between Oniguruma v6.9.8 (currently used by VS Code, Shiki, etc.) and Oniguruma v6.9.10 are extremely minor and are unlikely to cause problems for even a single TextMate grammar. And, of course, escaping POSIX classes in the obvious way (ex: [\[:word:\]], rather than escaping them from the inside out as this issue describes) works the same in all versions of Oniguruma.

If there were ever meaningful differences (that affected real grammars) between the version of Oniguruma used by VS Code and the latest Oniguruma, oniguruma-to-es would either add support for emulating older versions of Oniguruma or add options that allow opting into the old behavior for relevant features.

@tonco-miyazawa
Copy link

tonco-miyazawa commented Jan 17, 2025

Thank you so much for such a great tool!

The following is an explanation of Oniguruma 6.9.9 issue253.
Regarding Oniguruma issue253, there is no difference between 6.9.9 and 6.9.10.

The regular expressions that follow are written in String.raw() notation. \[a\] means String.raw`\[a\]` .


Below are examples of the case where it becomes COMPILE ERROR: -121: invalid POSIX bracket type.
[[:^:]]
[[:^u:]]
[[:^uper:]]
[[:u:]]
[[:upp:]]
[[:uppers:]]

Below are examples of the case where there are interpreted as standard nested classes.
[[:^1:]] // Equivalent to [[\:\^1\:]]
[[:1:]]
[[:upper :]]
[[:upper1:]]
[[:]]
[[::]]
[[:::]]
[[:abc[:upper:]def:]] // [:upper:] is a valid POSIX bracket.

And, Oniguruma test file (test/test_utf8.c)
https://github.com/kkos/oniguruma/blob/v6.9.10/test/test_utf8.c#L234-L248

The above determination is made by the following code:
https://github.com/kkos/oniguruma/blob/v6.9.10/src/regparse.c#L5231-L5265

The above code can be expressed as a regular expression as follows:

(?x) \[: \^? [[:alpha:]]+ :\] | \[: \^ :\]

If it doesn't match the regular expression above, it's standard nested classes.
Note) This is the current implementation, not the specification.

For the benefit of your users, you can treat edge cases as errors.


[[:alpha:]] matches different characters depending on the encoding used.
In Unicode, [[:alpha:]] matches some multi-byte characters.

Not Unicode Case: alpha alphabet
https://github.com/kkos/oniguruma/blob/v6.9.10/doc/RE#L223

Unicode Case: alpha Alphabetic
https://github.com/kkos/oniguruma/blob/v6.9.10/doc/RE#L241

ex. [[:\u212A:]]

In the case of ONIG_ENCODING_UTF8 : COMPILE ERROR: -121: invalid POSIX bracket type
In the case of ONIG_ENCODING_ASCII: This is standard nested classes. Equivalent to [[\:\u212A\:]]


In Oniguruma 6.9.9 and 6.9.10, there is no 'Character limit' for POSIX bracket validation.

In 6.9.8, this limit was 20.
https://github.com/kkos/oniguruma/blob/v6.9.8/src/regparse.c#L6563-L6564


About [. .] and [= =], There are no changes in Oniguruma issue 253.
In 6.9.10, these are still standard nested classes. These are not expected to change.


The subject of Oniguruma issue 253 is the following behavior.

[[:::] is treated as [\[:::] .

[[:[:[:[:upper:]] is treated as [\[:\[:\[:[:upper:]] . // [:upper:] is a valid POSIX bracket.

These [: should be interpreted as the start of nesting, but due to the bug these were interpreted as 'literal string' instead.
This was fixed just before the 6.9.8 release, but the fix was not included in 6.9.8 and was included in 6.9.9.

In 6.9.8, to avoid this bug, change [: to [\:, or avoid starting the nesting with :.
[[\:::]]
[[\:[\:[\:[:upper:]]]]]

[Edit] However, The patterns affected by this bug and its fixes are extremely rare.
Below is comment from K.Kosako (Oniguruma owner: @kkos) : kkos/oniguruma#253 (comment)

This change breaks compatibility, but only if the character classes contain "[:" and ":]" and they do not constitute a POSIX Bracket, which is not a problem in a practical sense. I think.


P.S. The demo of oniguruma-to-es (Now: ver 2.1.0)
[[:]]----------[:] // Error: Invalid POSIX class "[:]]----------[:]"

@slevithan slevithan reopened this Jan 17, 2025
@slevithan
Copy link
Owner

@tonco-miyazawa THANK YOU for the fantastic explanation, details, and examples! 🙇🏻‍♂️

I believe this is fixed now, but of course feel free to let me know if anything is still wrong.

Thanks to both you and @RedCMD for bringing your attention to detail and Oniguruma expertise to help improve this library.

@slevithan
Copy link
Owner

Fix published in v3.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants