Unicode Escapes should cover more unicode planes #194
Comments
Are there characters which are better written as escapes rather than actual glyphs?
I'd like to throw the |
Do you mean this as two alternatives of the proposal, or a single proposal which accepts both 4- and 6-digit-long sequences? |
logical operators, pff. Support |
Would |
Yeah, that's a problem. Maybe just
I'm not a fan of |
I like the I agree about the point about imbuing more meaning into |
@Manishearth - do you have any thoughts on this from Rust? In particular, should we go for 6 digits, or 8? |
Overall languages seem to be moving towards I'd avoid UTF16 if possible (though I guess it's okay as long as you validate that there aren't any lone surrogates -- and users coming from JS may expect this). I would go with 6 digits if you pick |
I've noticed this too and I like this trend. The In case of Fluent, however, the braces Fluent also allows astral Unicode characters in its source files, so I expect there will be little need to use escape sequences for codepoints requiring more than 4 hex digits. Their addition has been proposed for completeness' sake and to make it possible to encode them without resorting to surrogate pairs. I think we should go ahead with |
Rust allows for all code points in source files too, the reason escapes exist is to let people specify them explicitly, especially in cases where there are invisible code points. You can also do something like |
To summarize: We could either have two syntaxes:

terms-u = Terms{"\u00A0"}and{"\u00A0"}Conditions
terms-U = Terms{"\U0000A0"}and{"\U0000A0"}Conditions

Or a single one using some kind of delimiters:

terms-brace = Terms{"\u{A0}"}and{"\u{A0}"}Conditions
terms-bracket = Terms{"\u[A0]"}and{"\u[A0]"}Conditions
terms-paren = Terms{"\u(A0)"}and{"\u(A0)"}Conditions
terms-angle = Terms{"\u<A0>"}and{"\u<A0>"}Conditions

Or perhaps just one always requiring 6 hex digits:

terms-one-u = Terms{"\u0000A0"}and{"\u0000A0"}Conditions

In the last case, we could even consider dropping the u:

terms-drop-u = Terms{"\0000A0"}and{"\0000A0"}Conditions

(The above could also be considered for the variants with delimiters.)

Taking a step back: the primary use-case of Unicode escape sequences is to be able to use invisible or whitespace characters in translations such that they are clearly visible to reviewers and other translators. For all visible characters or combinations of characters, localizers and developers should be encouraged to use the actual Unicode graphemes.

Given the above use-case, the syntax of escape sequences in Fluent doesn't have to be succinct, but it should be easily recognizable as something special. Localizers familiar with the concept of escape sequences will benefit from the syntax being similar to syntaxes they know from other languages. Other localizers will edit the translations around the escapes or copy them from other places. |
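For illustration, here is a minimal Python sketch of how the two-syntax variant (4 digits after \u, 6 digits after \U) could be decoded inside a quoted string. The helper name `unescape` and the regular expression are assumptions made up for this example; they are not taken from any Fluent implementation, and other escapes such as an escaped backslash or quote are deliberately ignored.

```python
import re

# Hypothetical pattern: \u followed by exactly 4 hex digits,
# or \U followed by exactly 6 hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{6})")


def unescape(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXX sequences with the characters they name."""

    def replace(match: "re.Match[str]") -> str:
        digits = match.group(1) or match.group(2)
        code_point = int(digits, 16)
        # Six hex digits can express values above the Unicode range.
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range: {digits}")
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    # The no-break spaces become real characters in the output.
    print(repr(unescape(r"Terms\u00A0and\u00A0Conditions")))
```

Note that with fixed lengths the parser never needs a closing delimiter: it reads 4 or 6 digits depending on the case of the letter and then continues with regular text.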
We need the ability to have composed unicode escapes and regular text for call arguments, and possibly variant names in the future, right? |
Yes, but with a note that Unicode escapes are primarily intended to represent whitespace and invisible characters. In other words, using made-up examples of call arguments: |
Or for clarity when using characters whose glyphs may be visually ambiguous. If I see "–" in the source, I may be unsure exactly which dash it is; whereas "\u2013" is unquestionably an en-dash. Of the options above, I would favor either the "\uXXXX" and "\UXXXXXX" pair, or "\u{...}" with up to 6 digits. These are widely familiar from other contexts, which helps a lot with recognition. (Don't force the use of 6 digits in all cases; that would make familiar codepoints like the 20xx block look quite unfamiliar.) |
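To make the ambiguity point concrete, here is a small Python sketch that prints the code point and official Unicode name of a few characters that look alike or are invisible in many fonts. It only uses the standard `unicodedata` module; the helper name `describe` is made up for this example.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    # Three dashes that are hard to tell apart, plus an invisible space.
    for char in ["-", "\u2013", "\u2014", "\u00A0"]:
        print(describe(char))
```

A reviewer who sees the escape in the source gets the same certainty without having to run anything.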
Thanks, everyone, for your input. It looks like everyone agrees that we should base the syntax of the Unicode escapes on existing solutions to maximize the chance that localizers are familiar with them. The choice between the I wanted to see both approaches in action, and I prepared two PRs. I opened #201 which adds the

character-A = {"\u0041"}
face-with-tears-of-joy = {"\U01F602"}
terms = Terms{"\u00A0"}and{"\u00A0"}Conditions
copy = © 1998{"\u2013"}2018

I also opened #202 which changes the syntax to

character-A = {"\u{41}"}
face-with-tears-of-joy = {"\u{1F602}"}
terms = Terms{"\u{00A0}"}and{"\u{00A0}"}Conditions
copy = © 1998{"\u{2013}"}2018

In case of

# Both are valid but `padded` is preferred.
short = {"\u{41}"}
padded = {"\u{0041}"}

The benefits of the

# A contrived example. This should use a numeric offset or an abbreviation.
now1 = It is {DATETIME($time, timezone: "Hawaii\u2013Aleutian Time Zone")} right now.
now2 = It is {DATETIME($time, timezone: "Hawaii\u{2013}Aleutian Time Zone")} right now.

# Another contrived example. A country code would be a better choice for the selector.
historic-countries1 = { $name ->
["Austria\u2013Hungary"] ...
}
historic-countries2 = { $name ->
["Austria\u{2013}Hungary"] ...
} |
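As a rough comparison of the two proposals, the following Python sketch writes a code point in each form: the fixed-length form from #201 (4 digits after \u, 6 digits after \U) and the braced form from #202, padded to at least 4 digits as the comment in the example above prefers. The function names are hypothetical and exist only for this illustration.

```python
def escape_fixed(code_point: int) -> str:
    """Fixed-length form: \\uXXXX for the BMP, \\UXXXXXX for anything above it."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"
    return f"\\U{code_point:06X}"


def escape_braced(code_point: int) -> str:
    """Braced form: \\u{...}, padded with zeros to at least 4 hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return f"\\u{{{code_point:04X}}}"


if __name__ == "__main__":
    for cp in (0x41, 0xA0, 0x2013, 0x1F602):
        print(escape_fixed(cp), escape_braced(cp))
```

For code points in the basic plane the two forms differ only by the braces; the difference in shape shows up for astral code points such as U+1F602.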
I'm in favor of |
Looking at the tests of mishaps in #202, I extended them by an actual hex example:

num = \u{41}
msg = \u{a0}

yields num to be a NumberLiteral and msg to be a MessageReference. All of that parses fine, it just creates runtime situations. To me those fall out from the ambiguous use of {} if we use them as unicode escape delimiters. For that, I prefer {"\u1324"} and {"\U123456"}. |
Wait, why does it parse as a MessageReference? (In reply to Axel Hecht's comment of Nov 8, 2018.) |
Because Unicode escape sequences are not valid in text (they are only in quoted |
I think we should error on both. |
Would you want to make the backslash illegal in |
One more data point, we're already having strings with |
I would make |
The big win of #123 is that the only special characters in I'd like to go ahead with |
#201 is the PR adding the support for the |
Now that we have more than the basic Unicode plane in the Fluent syntax, we should also support the other planes in the Unicode escapes.
I suggest using 4 or 6 digits, based on earlier conversations.
I wonder if we should exclude surrogate pairs at the same time, to prevent \uD83D\uDE02 in favor of \u01F602? The UTF-16 encoding these imply feels very implementation-dependent to me.
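For readers less familiar with surrogates, here is a short Python sketch of the standard UTF-16 arithmetic that relates the pair D83D DE02 to the single code point 1F602, together with a range check that a parser could use to reject surrogate code points in escapes. The function names are hypothetical; only the arithmetic comes from the UTF-16 definition.

```python
def is_surrogate(code_point: int) -> bool:
    """True for code points reserved for UTF-16 surrogates (U+D800..U+DFFF)."""
    return 0xD800 <= code_point <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Combine a high and a low surrogate into the code point they encode."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print(hex(combine_surrogates(0xD83D, 0xDE02)))  # 0x1f602
    print(is_surrogate(0xD83D), is_surrogate(0x1F602))  # True False
```

A parser that rejects every escape for which `is_surrogate` is true would leave the 6-digit form as the only way to write astral code points.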