The standard as is right now is unfriendly / unusual for tech stacks that are "native utf-16" #895
Just for reference, the PR that introduced this requirement is #290 (from August 2022).
I agree that unpaired surrogates are invalid in UTFs. MessageFormat 2 is an advanced form of "take this string with markers inside, and replace the markers with something I give you at runtime". It should not be in the business of enforcing UTF correctness, or any other kind of correctness. If devs want to be strict, they can enforce it through linters, or in the storage format.
(as an individual contributor) I agree that encoding/UTF considerations don't belong in our specification, because we are not a storage format. Disallowing standalone surrogates is mostly a Good Thing, since non-Unicode encodings can't do anything with them (other than replace them) and UTF-8 can't encode them (except by exceptional pleading). In the course of fixing support for UTF-16 (really "UCS-2-like") implementations, we shouldn't make the mistake of requiring UTF-8 based or USV String based implementations to do hokey things. The spec is only marginally unfriendly to UCS-2 implementations, though. The ABNF and spec say that surrogate code points are not permitted in
(chair hat) I have tagged this for post-46.
100% agree.
Surrogates are excluded from Which means I can't even do And I can't have it in
I have no problem with that.
Be careful: this sword is sharp on both edges. I don't know what practical use a string like Allowing unpaired surrogates means requiring support for them in the productions in languages that use byte-oriented (e.g. UTF-8) strings. If we allow unpaired in I'm somewhat sympathetic to allowing unpaired surrogates in
I am not necessarily arguing to allow them in It is about "going against the grain" for some platforms. But we can shave this yak after LDML-46 :-)
In the 2024-10-07 call, we agreed that @mihnita would make a PR adding unpaired to
We agreed that I will create the PR. Not on the exact implementation.
I think that the last quote from EAO leaves that door open:
-- Scanning the notes it looks like the intent is really to allow surrogates in text, and not in code:
The rule for content-char currently looks like this:
That is unusual for languages that use UTF-16 natively, like JavaScript, Java, and even the "wide version" of the Windows C APIs (using wchar_t, which is 16 bits on Windows).
Such languages don't try to enforce UTF-16 correctness (the same way C/C++ don't try to enforce UTF-8 or any other kind of UTF correctness).
Any validation is done "at the edge", when data is ingested, if at all.
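To make the "at the edge" point concrete, here is a minimal Java sketch. It is an illustration only, not code from this issue: the class name EdgeValidation and its helper methods are hypothetical. It shows that a Java String holds an unpaired surrogate without complaint, and that the problem only surfaces when the string is encoded to UTF-8, where a strict encoder rejects it and the lenient String.getBytes path replaces it.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EdgeValidation {
    // Strict check at the boundary: a fresh CharsetEncoder reports malformed
    // input, so an unpaired surrogate makes encode() throw.
    static boolean isWellFormed(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient path: String.getBytes substitutes a replacement for the
    // unpaired surrogate, so the round trip does not restore the original.
    static boolean survivesUtf8RoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String lone = "abc\uD83Ddef";         // unpaired high surrogate
        String paired = "abc\uD83D\uDE00def"; // well-formed surrogate pair

        // The in-memory string is accepted as is; no exception is thrown.
        System.out.println("length of lone string: " + lone.length());

        System.out.println("lone well-formed:   " + isWellFormed(lone));
        System.out.println("paired well-formed: " + isWellFormed(paired));
        System.out.println("lone round-trips:   " + survivesUtf8RoundTrip(lone));
        System.out.println("paired round-trips: " + survivesUtf8RoundTrip(paired));
    }
}
```

The design choice this illustrates is the one argued for in the thread: the string type itself stays permissive, and strictness is opt-in at the point where data crosses an encoding boundary.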
Existing APIs that are similar to MessageFormat 2 work just fine with incorrect surrogate sequences.
The code above does not throw, and the result preserves the surrogates "as is".
The output looks like this:
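The snippet and output referred to here are not present in this copy of the thread. As a stand-in, the following sketch shows the same behavior with java.text.MessageFormat, which is assumed here to be one of the "existing APIs" meant; the class name LoneSurrogateFormat, the pattern, and the helper methods are made up for illustration and are not the original example.

```java
import java.text.MessageFormat;

// Hypothetical stand-in example, not the snippet from the original issue.
class LoneSurrogateFormat {
    // The pattern contains an unpaired high surrogate (U+D83D) in its literal text.
    static String greet(String name) {
        return MessageFormat.format("Hello \uD83D {0}!", name);
    }

    // Renders each UTF-16 code unit as four hex digits, so the unpaired
    // surrogate stays visible even if the terminal cannot display it.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String out = greet("world");
        // Formatting does not throw; the surrogate is copied through unchanged.
        System.out.println(toHex(out));
    }
}
```

In this sketch the code unit D83D appears in the hex dump at the same position it had in the pattern, which is the "preserved as is" behavior described above.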
The current restriction also contradicts what was agreed in this thread: