Disallow surrogate halves in escape sequences of string and character literals #10443

Merged

HertzDevil
Contributor

See #10440.

The lexer now checks on its own that the escaped character is a valid Unicode scalar value (that is, not a surrogate half) instead of relying on the exception raised by Int#chr; in other words, the lexer could probably use #unsafe_chr here. Changes to Int#chr will belong in a different PR.
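
As a rough illustration of the kind of check described above, the following Crystal sketch rejects surrogate halves before any conversion to Char takes place. The method name `unicode_scalar?` is hypothetical and this is not the lexer's actual code; it only assumes the standard rule that scalar values span U+0000..U+D7FF and U+E000..U+10FFFF.

```crystal
# Hypothetical helper for illustration only; not taken from the compiler.
# Returns true when the code point is a Unicode scalar value, i.e. it is in
# range and does not fall in the surrogate block U+D800..U+DFFF.
def unicode_scalar?(codepoint : Int32) : Bool
  return false if codepoint < 0 || codepoint > 0x10FFFF
  !(0xD800 <= codepoint <= 0xDFFF)
end

puts unicode_scalar?(0x0041) # 'A', an ordinary character
puts unicode_scalar?(0xD800) # high surrogate half
puts unicode_scalar?(0xE000) # first code point after the surrogate block
```

With a check of this shape in place, the conversion itself no longer has to raise for the lexer to report the error.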

@straight-shoota added the kind:bug and topic:compiler:parser labels on Feb 25, 2021
Member

@sdogruyol left a comment

Thanks @HertzDevil 🙏

@straight-shoota added this to the 1.0.0 milestone on Mar 17, 2021
@bcardiff merged commit 395d0bf into crystal-lang:master on Mar 19, 2021
@bcardiff
Member

@straight-shoota
Member

straight-shoota commented Mar 19, 2021

The use case is probably fine. But I question whether Unicode escape sequences should support invalid Unicode characters. This change seems like a good restriction.

Now, regarding the use case: strings and string literals can contain invalid Unicode (for now at least, see #2886). The \x escape sequence allows any byte value in a string literal. So I think the best solution would be to rewrite the literals without Unicode escape sequences:

"[\xed\xa0\x80-\xed\xaf\xbf][\xed\xb0\x80-\xed\xbf\xbf]" # equivalent to "[\uD800-\uDBFF][\uDC00-\uDFFF]"
"\xed\xa0\xb5\xed\xb4\x84" # equivalent to "\uD835\uDD04" (Afr)
"\xed\xa0\xb5\xed\xb4\x9e" # equivalent to "\uD835\uDD1E" (afr)
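
The byte sequences above appear to follow from applying the ordinary three-byte UTF-8 bit layout to each surrogate code point. A minimal Crystal sketch of that arithmetic, assuming a hypothetical helper named `byte_escapes` (it is not part of the standard library and only handles code points in U+0800..U+FFFF):

```crystal
# Hypothetical helper for illustration only. Applies the three-byte layout
# 1110xxxx 10xxxxxx 10xxxxxx to a code point in U+0800..U+FFFF (surrogate
# halves included) and renders the result as \x escape sequences.
def byte_escapes(codepoint : Int32) : String
  bytes = [
    0xE0 | (codepoint >> 12),         # leading byte carries the top 4 bits
    0x80 | ((codepoint >> 6) & 0x3F), # continuation byte: middle 6 bits
    0x80 | (codepoint & 0x3F),        # continuation byte: low 6 bits
  ]
  bytes.map { |byte| "\\x#{byte.to_s(16)}" }.join
end

puts byte_escapes(0xD800) # \xed\xa0\x80
puts byte_escapes(0xDBFF) # \xed\xaf\xbf
puts byte_escapes(0xDD1E) # \xed\xb4\x9e
```

The printed escapes match the hand-written literals in the comment, which is a quick way to double-check a manual rewrite.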

ping @HertzDevil @icyleaf

@HertzDevil deleted the bug/string-literal-surrogate branch on March 20, 2021 07:41
@straight-shoota
Member

straight-shoota commented Mar 20, 2021

This is how to fix string literals with surrogate halves after this change: replace the Unicode escape sequences with byte escape sequences. String#dump takes care of that: it automatically presents invalid Unicode codepoints as byte escape sequences.

"[\uD800-\uDBFF][\uDC00-\uDFFF]".dump # => "[\xED\xA0\x80-\xED\xAF\xBF][\xED\xB0\x80-\xED\xBF\xBF]"
