Unicode Escapes should cover more unicode planes #194

Pike · 2018-10-22T13:59:36Z

Now that we have more than the basic Unicode plane in the Fluent syntax, we should also support them in the Unicode escapes.

I suggest to use 4 or 6 digits, based on earlier conversations.

I wonder if we should exclude surrogate pairs at the same time, to prevent \uD83D\uDE02 in favor of \u01F602? The UTF-16 encoding these imply feels very implementation-dependent to me.
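
To make the surrogate-pair concern concrete, here is a minimal Python sketch of the UTF-16 arithmetic that \uD83D\uDE02 relies on. The helper name `combine_surrogates` is hypothetical and only illustrates the decoding step; it is not part of Fluent.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE02)
print(f"U+{code_point:04X}")  # U+1F602
```

An escape that names the code point directly avoids this step, which only makes sense for consumers that think in UTF-16 code units.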


stasm · 2018-10-22T14:17:18Z

Are there characters which are better written as escapes rather than actual glyphs?

> I suggest to use 4 or 6 digits, based on earlier conversations.

I'd like to throw the \u{…} proposal into the mix, too. The number of hexdigits between the braces can be between 1 and 6. Examples: \u{9}, \u{A0}, \u{1F602}.

stasm · 2018-10-23T12:56:00Z

> I suggest to use 4 or 6 digits, based on earlier conversations.

Do you mean this as two alternatives of the proposal, or a single proposal which accepts both 4- and 6-digit-long sequences?

Pike · 2018-10-23T13:17:50Z

logical operators, pff. Support 4, 6. Not support 5.

stasm · 2018-10-23T13:37:30Z

Would "\u00a0ff" be interpreted as <nbsp>ff or as ꃿ?
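
The two readings can be shown with a short Python sketch. It assumes, purely for illustration, a \u escape that accepts either exactly 4 or exactly 6 hex digits; the variable names are hypothetical.

```python
import re

text = r"\u00a0ff"

# Reading 1: the escape consumes exactly 4 hex digits.
four = re.match(r"\\u([0-9a-fA-F]{4})", text)
# Reading 2: the escape consumes 6 hex digits because 6 are available.
six = re.match(r"\\u([0-9a-fA-F]{6})", text)

reading_four = chr(int(four.group(1), 16)) + text[four.end():]
reading_six = chr(int(six.group(1), 16)) + text[six.end():]

print([f"U+{ord(c):04X}" for c in reading_four])  # ['U+00A0', 'U+0066', 'U+0066']
print([f"U+{ord(c):04X}" for c in reading_six])   # ['U+A0FF']
```

Both readings are valid under a "4 or 6 digits" rule, so the parser would have to pick one by convention.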

Pike · 2018-10-24T16:51:45Z

> Would "\u00a0ff" be interpreted as <nbsp>ff or as ꃿ?

Yeah, that's a problem. Maybe just \U00a0ff ?

unicode_escape      ::= "\\u" /[0-9a-fA-F]{4}/
        | "\\U" /[0-9a-fA-F]{6}/
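
A minimal Python sketch of how the grammar above could be applied when unescaping a string literal. The function name `unescape` and the choice to substitute U+FFFD for out-of-range values are assumptions made for this illustration, not something the thread specifies.

```python
import re

# Mirrors the grammar above: \u + 4 hex digits, or \U + 6 hex digits.
UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{6})")


def unescape(raw: str) -> str:
    """Replace \\uHHHH and \\UHHHHHH sequences with the characters they name."""
    def replace(match: "re.Match") -> str:
        code_point = int(match.group(1) or match.group(2), 16)
        if code_point > 0x10FFFF:
            # Six hex digits can exceed the Unicode range.
            # Substituting U+FFFD is an assumption of this sketch.
            return "\uFFFD"
        return chr(code_point)

    return UNICODE_ESCAPE.sub(replace, raw)


print(ascii(unescape(r"\u00a0ff")))  # '\xa0ff'
print(ascii(unescape(r"\U00a0ff")))  # '\ua0ff'
```

Because the prefix fixes the digit count, the ambiguity from the earlier question disappears.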

I'm not a fan of \u{}, for one because it gives {} a different meaning in that context. I'm also concerned about the amount of work we'd have to throw at it.

stasm · 2018-10-24T18:26:19Z

I like the \\U idea! That's how Python does it, right? Although in Python's case, it expects 8 hex digits after the \\U, for UTF-32 I suppose? I had to refresh my memory on how the different Unicode encodings worked (this SO answer was very helpful). IIUC, U+10FFFF is the highest code point which the Unicode standard defines, for compatibility with UTF-16. If that's the case, expecting 6 digits after \\U would make sense to me.
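
A quick Python check of the digit-count reasoning. The constant and helper names are hypothetical; the facts it relies on are that the Unicode code space ends at U+10FFFF and that Python's own \U escape takes eight hex digits.

```python
import sys

MAX_CODE_POINT = 0x10FFFF

# The highest code point needs six hex digits, so six are enough.
assert len(f"{MAX_CODE_POINT:X}") == 6
assert sys.maxunicode == MAX_CODE_POINT

# Python's \U escape takes eight digits; the first two are always 00.
face = "\U0001F602"
assert ord(face) == 0x1F602


def is_valid_code_point(value: int) -> bool:
    """True if value lies within the Unicode code space (0 to U+10FFFF)."""
    return 0 <= value <= MAX_CODE_POINT
```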

I agree with the point about imbuing more meaning into {}, especially if we go ahead with #123.

zbraniecki · 2018-10-26T11:18:13Z

@Manishearth - do you have any thoughts on this from Rust? In particular, should we go for 6 digits, or 8?

Manishearth · 2018-10-27T13:27:40Z

Overall languages seem to be moving towards \u{...} because it's unambiguous and less confusing -- \u vs \U is something you have to remember, and the precise variant of this changes across languages.

I'd avoid UTF16 if possible (though I guess it's okay as long as you validate that there aren't any lone surrogates -- and users coming from JS may expect this).

I would go with 6 digits if you pick \U though.
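
The lone-surrogate validation mentioned above could look roughly like this in Python. It is a sketch under the assumption that escapes have already been read into a list of 16-bit values; `find_lone_surrogates` is a hypothetical name.

```python
from typing import List


def find_lone_surrogates(code_units: List[int]) -> List[int]:
    """Return the indexes of surrogates that are not part of a high+low pair."""
    lone = []
    i = 0
    while i < len(code_units):
        unit = code_units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(code_units) and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            lone.append(i)
        i += 1
    return lone


print(find_lone_surrogates([0xD83D, 0xDE02]))  # []
print(find_lone_surrogates([0x41, 0xD83D, 0x42]))  # [1]
```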

stasm · 2018-10-29T09:24:40Z

> Overall languages seem to be moving towards \u{...} because it's unambiguous and less confusing -- \u vs \U is something you have to remember, and the precise variant of this changes across languages.

I've noticed this too and I like this trend. The \u{...} syntax is explicit and easier to remember than \u vs \U.

In case of Fluent, however, the braces {...} already have another meaning in the syntax; they stand for interpolation. And because we're designing the Fluent syntax with non-technical localizers in mind, we're trying to be careful to not reuse tokens and sigils in different contexts with different meanings.

Fluent also allows astral Unicode characters in its source files, so I expect there will be little need to use escape sequences for codepoints requiring more than 4 hex digits. Their addition has been proposed for completeness' sake and to make it possible to encode them without resorting to surrogate pairs.

I think we should go ahead with \UXXXXXX.

Manishearth · 2018-10-29T09:48:10Z

Rust allows for all code points in source files too, the reason escapes exist is to let people specify them explicitly, especially in cases where there are invisible code points.

You can also do something like \u[..], pick a brace syntax

stasm · 2018-10-29T11:59:19Z

To summarize: We could either have two syntaxes:

terms-u = Terms{"\u00A0"}and{"\u00A0"}Conditions
terms-U = Terms{"\U0000A0"}and{"\U0000A0"}Conditions

Or a single one using some kind of delimiters:

terms-brace = Terms{"\u{A0}"}and{"\u{A0}"}Conditions
terms-bracket = Terms{"\u[A0]"}and{"\u[A0]"}Conditions
terms-paren = Terms{"\u(A0)"}and{"\u(A0)"}Conditions
terms-angle = Terms{"\u<A0>"}and{"\u<A0>"}Conditions

Or perhaps just one always requiring 6 hex digits:

terms-one-u = Terms{"\u0000A0"}and{"\u0000A0"}Conditions

In the last case, we could even consider dropping the u prefix. The only other escape sequences which are currently supported are \\ and \". This would effectively reserve prefixes 0-9 and a-f.

terms-drop-u = Terms{"\0000A0"}and{"\0000A0"}Conditions

(The above could also be considered for the variants with delimiters.)

Taking a step back: the primary use-case of Unicode escape sequences is to be able to use invisible or whitespace characters in translations such that they are clearly visible to reviewers and other translators. For all visible characters or combinations of characters, localizers and developers should be encouraged to use the actual Unicode graphemes.

Given the above use-case, the syntax of escape sequences in Fluent doesn't have to be succinct, but it should be easily recognizable as something special. Localizers familiar with the concept of escape sequences will benefit from the syntax being similar to syntaxes they know from other languages. Other localizers will edit the translations around the escapes or copy them from other places.

Pike · 2018-10-29T13:06:10Z

We need the ability to combine Unicode escapes with regular text in call arguments, and possibly in variant names in the future, right?

stasm · 2018-10-29T13:22:23Z

Yes, but with a note that Unicode escapes are primarily intended to represent whitespace and invisible characters. In other words, using made-up examples of call arguments: JOIN($list, separator: "\u00A0") but: DECORATE($text, with: "✨").

jfkthame · 2018-10-30T13:33:56Z

> Yes, but with a note that Unicode escapes are primarily intended to represent whitespace and invisible characters.

Or for clarity when using characters whose glyphs may be visually ambiguous. If I see "–" in the source, I may be unsure exactly which dash it is; whereas "\u2013" is unquestionably an en-dash.
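
As an illustration of how many look-alike dashes exist, this small Python sketch prints the official names of a few of them using the standard `unicodedata` module. The selection of characters is arbitrary and chosen only for the example.

```python
import unicodedata

# Escapes are used so that each dash is unambiguous in this listing too.
dashes = ["\u002D", "\u2010", "\u2013", "\u2014", "\u2212"]
names = [unicodedata.name(dash) for dash in dashes]

for dash, name in zip(dashes, names):
    print(f"U+{ord(dash):04X}  {name}")
# U+002D  HYPHEN-MINUS
# U+2010  HYPHEN
# U+2013  EN DASH
# U+2014  EM DASH
# U+2212  MINUS SIGN
```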

Of the options above, I would favor either the "\uXXXX" and "\UXXXXXX" pair, or "\u{...}" with up to 6 digits. These are widely familiar from other contexts, which helps a lot with recognition. (Don't force the use of 6 digits in all cases; that would make familiar codepoints like the 20xx block look quite unfamiliar.)

stasm · 2018-11-05T19:22:12Z

Thanks, everyone, for your input. It looks like everyone agrees that we should base the syntax of the Unicode escapes on existing solutions to maximize the chance that localizers are familiar with them.

The choice between the \uXXXX and \UXXXXXX pair, and the \u{…} syntax is a hard one for me. I see benefits to using both. Re. the \u{…} syntax, I was initially worried that re-using the braces here would be confusing because they already have another meaning in Fluent, but now I could argue that it's just another special use for them. They're still special, which is OK to me.

I wanted to see both approaches in action, and I prepared two PRs.

I opened #201 which adds the \UXXXXXX syntax to the existing \uXXXX one.

character-A = {"\u0041"}
face-with-tears-of-joy = {"\U01F602"}
terms = Terms{"\u00A0"}and{"\u00A0"}Conditions
copy = © 1998{"\u2013"}2018

I also opened #202 which changes the syntax to \u{…}.

character-A = {"\u{41}"}
face-with-tears-of-joy = {"\u{1F602}"}
terms = Terms{"\u{00A0}"}and{"\u{00A0}"}Conditions
copy = © 1998{"\u{2013}"}2018

In case of \u{…}, I think we should encourage serializers to left-pad codepoints below 4 digits with zeros. This looks like a common practice, used even in the charts published by Unicode.

# Both are valid but `padded` is preferred.
short = {"\u{41}"}
padded = {"\u{0041}"}
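
A serializer following this recommendation might format code points as in the Python sketch below. The function name `serialize_braced_escape` is hypothetical, and the uppercase hex output is an assumption of the example rather than something settled in this thread.

```python
def serialize_braced_escape(code_point: int) -> str:
    """Format a code point as \\u{...}, left-padded to at least 4 hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    # :04X pads short values with zeros and leaves longer ones untouched.
    return "\\u{" + f"{code_point:04X}" + "}"


print(serialize_braced_escape(0x41))     # \u{0041}
print(serialize_braced_escape(0xA0))     # \u{00A0}
print(serialize_braced_escape(0x1F602))  # \u{1F602}
```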

The benefits of the \u{…} syntax are obvious when more characters are included in the StringLiteral. This might happen in function arguments, or in variant keys (#90), although there aren't currently many use-cases for it. Consequently, the examples below are contrived.

# A contrived example. This should use a numeric offset or an abbreviation.
now1 = It is {DATETIME($time, timezone: "Hawaii\u2013Aleutian Time Zone")} right now.
now2 = It is {DATETIME($time, timezone: "Hawaii\u{2013}Aleutian Time Zone")} right now.

# Another contrived example. A country code would be a better choice for the selector.
historic-countries1 = { $name ->
    ["Austria\u2013Hungary"] ...
}
historic-countries2 = { $name ->
    ["Austria\u{2013}Hungary"] ...
}

zbraniecki · 2018-11-05T20:44:27Z

I'm in favor of \u{XXXX} mainly because otherwise I'm afraid of \u2a2attention - trying to guess where the \u ends.

Pike · 2018-11-08T13:39:21Z

Looking at the tests of mishaps in #202, I extended them by an actual hex example:

num = \u{41}
msg = \u{a0}

This yields a NumberLiteral for num and a MessageReference for msg. Both parse fine; they just cause problems at runtime.

To me those fall out from the ambiguous use of {} if we use them as unicode escape delimiters.

For that, I prefer {"\u1324"} and {"\U123456"}.

Manishearth · 2018-11-08T15:47:04Z

Wait, why does it parse as a MessageReference?


stasm · 2018-11-08T15:51:33Z

Because Unicode escape sequences are not valid in text (they are only in quoted StringLiterals, #123) and because a0 is a valid identifier. msg = \u{a0} parses as a Pattern of two elements: TextElement {value: "\\u"} and Placeable {expression: {MessageReference {id: "a0"}}}.
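
The effect can be reproduced with a toy Python model of pattern parsing. This is a deliberately simplified sketch, not the real Fluent parser: `parse_pattern`, the dictionary shapes, and the "Unsupported" fallback are all hypothetical, and only bare numbers and identifiers inside braces are handled.

```python
import re
from typing import Dict, List

IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]*\Z")
NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?\Z")


def parse_pattern(source: str) -> List[Dict[str, str]]:
    """Split a pattern into text and placeables (toy model only)."""
    elements = []
    for text, inner in re.findall(r"([^{}]+)|\{([^{}]*)\}", source):
        expr = inner.strip()
        if text:
            # Backslashes are ordinary characters in text: no escape is recognized.
            elements.append({"type": "TextElement", "value": text})
        elif NUMBER.match(expr):
            elements.append({"type": "NumberLiteral", "value": expr})
        elif IDENTIFIER.match(expr):
            elements.append({"type": "MessageReference", "id": expr})
        else:
            elements.append({"type": "Unsupported", "value": inner})
    return elements


print(parse_pattern(r"\u{41}"))
print(parse_pattern(r"\u{a0}"))
```

In this model, 41 is classified as a number and a0 as an identifier, which matches the NumberLiteral and MessageReference results reported above.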

zbraniecki · 2018-11-09T00:09:29Z

I think we should error on both.

stasm · 2018-11-09T07:51:02Z

Would you want to make the backslash illegal in TextElements? Or something else?

Pike · 2018-11-09T13:47:15Z

One more data point: we already have strings with {"\u00a0"}, so keeping that logic and just adding \U will be easier to implement from a data compatibility point of view.

zbraniecki · 2018-11-09T18:27:00Z

> Would you want to make the backslash illegal in TextElements? Or something else?

I would make \u illegal in TextElements I think.

stasm · 2018-11-13T14:20:10Z

> I would make \u illegal in TextElements I think.

The big win of #123 is that the only special characters in TextElements are now the curly braces. I prefer to keep it that way and not introduce exceptions, like \u, which steepen the learning curve and hurt the discoverability of the syntax.

I'd like to go ahead with \uHHHH and \UHHHHHH. I see how the \u{...} syntax can help in some cases, but I predict that these cases will be very rare. In most cases where a Unicode escape is needed, it's to encode a single character for visibility purposes. Using a placeable is a great tool to achieve visibility: copy = © 1998{"\u2013"}2018 makes the escape sequence stand out. Adding two more characters to this syntax ({"\u{2013}"}) adds visual clutter for no significant benefit.

stasm · 2018-11-13T14:53:19Z

#201 is the PR adding the support for the \UHHHHHH escape sequence. I'll wait until Friday before merging it.

Pike added the syntax label Oct 22, 2018

This was referenced Nov 5, 2018

Recognize \UHHHHHH as an escape sequence #201

Merged

Only allow the \u{…} Unicode escapes #202

Closed

stasm mentioned this issue Nov 6, 2018

Store unescaped content in StringLiteral.value and raw content in StringLiteral.raw #203

Merged

stasm closed this as completed in #201 Nov 19, 2018
