Disallow surrogate halves in escape sequences of string and character literals #10443

HertzDevil · 2021-02-25T09:48:26Z

The lexer now checks on its own that the escaped character is a valid Unicode codepoint, instead of relying on the exception raised by Int#chr (in other words, the lexer could probably use #unsafe_chr here). The corresponding change to Int#chr belongs in a separate PR.
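
As a rough illustration of the kind of check described above, an escaped codepoint can be rejected when it falls in the surrogate range U+D800..U+DFFF or lies beyond U+10FFFF. This is a sketch only: the lexer's actual code is Crystal and is not shown in this thread, the function name is hypothetical, and Python is used purely for illustration.

```python
# Hypothetical helper: illustrates the validity check only, not Crystal's lexer.
SURROGATE_MIN = 0xD800
SURROGATE_MAX = 0xDFFF
MAX_CODEPOINT = 0x10FFFF


def is_valid_escape_codepoint(codepoint: int) -> bool:
    """Return True if codepoint is a Unicode scalar value."""
    if codepoint < 0 or codepoint > MAX_CODEPOINT:
        return False
    # Surrogate halves are reserved for UTF-16 and are not characters.
    return not (SURROGATE_MIN <= codepoint <= SURROGATE_MAX)
```

Under this check, the escapes \uD800 through \uDFFF would be rejected while \uD7FF and \uE000 would still be accepted.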

sdogruyol

Thanks @HertzDevil 🙏

bcardiff · 2021-03-19T18:57:05Z

So, this breaks markd.

Those seem to be valid use cases to me.

Should we revert this PR?

straight-shoota · 2021-03-19T19:30:01Z

The use case is probably fine. But I question whether Unicode escape sequences should support invalid Unicode characters. This change seems like a good restriction.

Now, regarding the use case: strings and string literals can contain invalid Unicode (for now at least, see #2886). The \x escape sequence allows any byte value in a string literal. So I think the best solution is to rewrite the literals without Unicode escape sequences:

"[\xed\xa0\x80-\xed\xaf\xbf][\xed\xb0\x80-\xed\xbf\xbf]" # equivalent to "[\uD800-\uDBFF][\uDC00-\uDFFF]"
"\xed\xa0\xb5\xed\xb4\x84" # equivalent to "\uD835\uDD04" (Afr)
"\xed\xa0\xb5\xed\xb4\x9e" # equivalent to "\uD835\uDD1E" (afr)
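
The byte sequences above can be cross-checked outside Crystal. The following is a minimal sketch in Python, assuming its `surrogatepass` error handler, which encodes a lone surrogate with the generic three-byte UTF-8 pattern. The helper name is hypothetical.

```python
def surrogate_bytes(codepoint: int) -> bytes:
    """Encode a single codepoint, letting lone surrogates through."""
    return chr(codepoint).encode("utf-8", "surrogatepass")


# Range bounds used in the character classes.
assert surrogate_bytes(0xD800) == b"\xed\xa0\x80"
assert surrogate_bytes(0xDBFF) == b"\xed\xaf\xbf"
assert surrogate_bytes(0xDC00) == b"\xed\xb0\x80"
assert surrogate_bytes(0xDFFF) == b"\xed\xbf\xbf"

# The two surrogate pairs from the literals.
afr_upper = surrogate_bytes(0xD835) + surrogate_bytes(0xDD04)
afr_lower = surrogate_bytes(0xD835) + surrogate_bytes(0xDD1E)
assert afr_upper == b"\xed\xa0\xb5\xed\xb4\x84"
assert afr_lower == b"\xed\xa0\xb5\xed\xb4\x9e"
```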

ping @HertzDevil @icyleaf

straight-shoota · 2021-03-20T14:58:25Z

To fix string literals containing surrogate halves after this change, replace the Unicode escape sequences with byte escape sequences. String#dump takes care of that: it automatically renders invalid Unicode codepoints as byte escape sequences.

"[\uD800-\uDBFF][\uDC00-\uDFFF]".dump # => "[\xED\xA0\x80-\xED\xAF\xBF][\xED\xB0\x80-\xED\xBF\xBF]"
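
The idea behind that output can be sketched as follows: bytes that do not form valid UTF-8 are written as \xHH escapes, and everything else is kept or escaped as a codepoint. This is a simplified Python illustration, not a reimplementation of String#dump. The function name is hypothetical, and the handling of control and non-ASCII characters is an assumption made for brevity.

```python
def dump_like(data: bytes) -> str:
    """Render bytes as a quoted literal, using \\xHH for invalid UTF-8 bytes."""
    out = []
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\x%02X" % (cp - 0xDC00))  # invalid byte
        elif ch in ('"', "\\"):
            out.append("\\" + ch)                  # quote or backslash
        elif 0x20 <= cp < 0x7F:
            out.append(ch)                         # printable ASCII
        else:
            out.append("\\u{%X}" % cp)             # assumed simplification
    return '"' + "".join(out) + '"'


raw = b"[\xed\xa0\x80-\xed\xaf\xbf][\xed\xb0\x80-\xed\xbf\xbf]"
print(dump_like(raw))
```

Because a strict UTF-8 decoder rejects encoded surrogates, each of the three bytes in such a sequence is escaped individually.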

Disallow surrogate halves in string and char literals

2732a00

straight-shoota approved these changes Feb 25, 2021

straight-shoota added the kind:bug and topic:compiler:parser labels Feb 25, 2021

HertzDevil mentioned this pull request Feb 27, 2021

Make Int#chr reject surrogate halves #10451

Merged

sdogruyol approved these changes Mar 7, 2021

straight-shoota added this to the 1.0.0 milestone Mar 17, 2021

bcardiff merged commit 395d0bf into crystal-lang:master Mar 19, 2021

bcardiff pushed a commit to bcardiff/crystal that referenced this pull request Mar 19, 2021

Revert "Disallow surrogate halves in string and char literals (crysta…

bc9b6db

…l-lang#10443)" This reverts commit 395d0bf.

bcardiff mentioned this pull request Mar 19, 2021

Revert "Disallow surrogate halves in string and char literals (#10443)" #10524

Closed

bcardiff mentioned this pull request Mar 19, 2021

Avoid surrogate halves icyleaf/markd#34

Merged

HertzDevil deleted the bug/string-literal-surrogate branch March 20, 2021 07:41
