[Syntax Highlighting] Invalid unicode regex match #78

Closed
lildude opened this issue Oct 24, 2019 · 0 comments · Fixed by #79
lildude commented Oct 24, 2019

As with #76, our grammar compiler has found another error introduced in #72. This time it's an invalid unicode regex match:

Invalid regex in grammar: `source.hack` (in `syntaxes/hack.json`) contains a malformed regex (regex "`(?xi)([a-z_\x{7f}-\x{7fffffff}]`...": character value in \x{} or \o{} is too large (at offset 30))

... and ...

Invalid regex in grammar: `source.hack` (in `syntaxes/hack.json`) contains a malformed regex (regex "`(?i)[a-z_\x{7f}-\x{7fffffff}][a-`...": character value in \x{} or \o{} is too large (at offset 27))

The line numbers have been truncated, but they correspond to...

"match": "(?xi)\n([a-z_\\x{7f}-\\x{7fffffff}][a-z0-9_\\x{7f}-\\x{7fffffff}]*) # Exception class\n((?:\\s*\\|\\s*[a-z_\\x{7f}-\\x{7fffffff}][a-z0-9_\\x{7f}-\\x{7fffffff}]*)*) # Optional additional exception classes\n\\s*\n((\\$+)[a-z_\\x{7f}-\\x{7fffffff}][a-z0-9_\\x{7f}-\\x{7fffffff}]*) # Variable",

... and ...

"match": "(?i)[a-z_\\x{7f}-\\x{7fffffff}][a-z0-9_\\x{7f}-\\x{7fffffff}]*",

... respectively.

I suspect the intent here was to cover all Unicode characters from 0x7F to the end of the range. However, 0x7FFFFFFF is no longer a valid code point in UTF-8: since 2003, the maximum is 0x10FFFF.

From https://en.wikipedia.org/wiki/UTF-8#History:

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

PR coming up to implement this change.
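
As a rough illustration of the change described above (not necessarily what the PR does), here is a minimal Python sketch that caps any `\x{...}` escape in a pattern at 0x10FFFF. The helper name `clamp_hex_escapes` and the constant `MAX_CODE_POINT` are made up for this example, and the input is the second pattern from this issue as it reads after JSON decoding.

```python
import re
import sys

# Upper bound for code points since RFC 3629 (assumed constant name).
MAX_CODE_POINT = 0x10FFFF

# Second pattern from the issue, after JSON decoding of the "match" value.
original = r"(?i)[a-z_\x{7f}-\x{7fffffff}][a-z0-9_\x{7f}-\x{7fffffff}]*"


def clamp_hex_escapes(pattern: str) -> str:
    # Hypothetical helper: rewrite every brace-style hex escape whose
    # value exceeds MAX_CODE_POINT so that it uses MAX_CODE_POINT instead.
    def fix(match: re.Match) -> str:
        value = int(match.group(1), 16)
        return r"\x{%x}" % min(value, MAX_CODE_POINT)

    return re.sub(r"\\x\{([0-9a-fA-F]+)\}", fix, pattern)


fixed = clamp_hex_escapes(original)
print(fixed)

# Python itself uses the same upper bound for its str type.
print(hex(sys.maxunicode))
```

Escapes that are already in range, such as `\x{7f}`, pass through unchanged, so only the two out-of-range bounds in each character class are rewritten.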
