Disallow surrogate halves in escape sequences of string and character literals #10443
Thanks @HertzDevil 🙏
So, this breaks markd.
Those seem to be valid use cases to me. Should we revert this PR?
…l-lang#10443)" This reverts commit 395d0bf.
The use case is probably fine. But I question whether unicode escape sequences should support invalid unicode characters. This change seems like a good restriction.
Now, regarding the use case: Strings and string literals can be invalid unicode (for now at least, see #2886). The affected literals can be written with byte escape sequences instead:
"[\xed\xa0\x80-\xed\xaf\xbf][\xed\xb0\x80-\xed\xbf\xbf]" # equivalent to "[\uD800-\uDBFF][\uDC00-\uDFFF]"
"\xed\xa0\xb5\xed\xb4\x84" # equivalent to "\uD835\uDD04" (Afr)
"\xed\xa0\xb5\xed\xb4\x9e" # equivalent to "\uD835\uDD1E" (afr)
ping @HertzDevil @icyleaf
This is how to fix string literals with surrogate halves after this change: byte escape sequences need to replace unicode escape sequences.
"[\uD800-\uDBFF][\uDC00-\uDFFF]".dump # => "[\xED\xA0\x80-\xED\xAF\xBF][\xED\xB0\x80-\xED\xBF\xBF]"
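For anyone who needs to convert such literals by hand, the byte escapes in the comments above follow from applying the generic three-byte UTF-8 layout to the surrogate code point. The following is a minimal sketch of that arithmetic; `three_byte_escape` is a hypothetical helper written for illustration, not something from Crystal's standard library or this PR.

```crystal
# Hypothetical helper (not part of Crystal's stdlib): applies the generic
# three-byte UTF-8 layout to a code point in U+0800..U+FFFF without rejecting
# surrogate halves, and returns the matching byte escape sequence as text.
def three_byte_escape(codepoint : Int32) : String
  unless (0x0800..0xFFFF).includes?(codepoint)
    raise ArgumentError.new("expected a code point in U+0800..U+FFFF")
  end

  bytes = [
    0xE0 | (codepoint >> 12),         # 1110xxxx: top 4 bits
    0x80 | ((codepoint >> 6) & 0x3F), # 10xxxxxx: middle 6 bits
    0x80 | (codepoint & 0x3F),        # 10xxxxxx: low 6 bits
  ]
  bytes.map { |byte| "\\x%02x" % byte }.join
end

puts three_byte_escape(0xD800) # prints \xed\xa0\x80
puts three_byte_escape(0xDD1E) # prints \xed\xb4\x9e
```

A surrogate pair such as `\uD835\uDD04` is then the concatenation of the two three-byte sequences, which matches the literals listed in the comment above.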
See #10440.
The lexer independently ensures the escaped character is a valid UTF-8 codepoint instead of relying on the exception raised by `Int#chr` (in other words the lexer could probably use `#unsafe_chr` here). `Int#chr` will belong in a different PR.
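As an illustration of the approach described above, here is a minimal sketch of a range check performed before the conversion. It is not the actual lexer code: `escape_to_char` and its error messages are hypothetical, and only `Int#unsafe_chr` is taken from the description.

```crystal
# Hypothetical helper, not the real lexer: validates the numeric value of a
# unicode escape itself, then converts it without relying on Int#chr raising.
def escape_to_char(codepoint : Int32) : Char
  if codepoint < 0 || codepoint > 0x10FFFF
    raise ArgumentError.new("invalid unicode codepoint (out of range)")
  elsif (0xD800..0xDFFF).includes?(codepoint)
    raise ArgumentError.new("invalid unicode codepoint (surrogate half)")
  end

  # The value was checked above, so the unchecked conversion is acceptable.
  codepoint.unsafe_chr
end

puts escape_to_char(0x41) # prints A

begin
  escape_to_char(0xD835)
rescue ex : ArgumentError
  puts ex.message # prints invalid unicode codepoint (surrogate half)
end
```

Keeping the check in one place means the rejection of surrogate halves does not depend on how `Int#chr` behaves.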