The standard as is right now is unfriendly / unusual for tech stacks that are "native utf-16" #895

mihnita · 2024-09-19T20:33:28Z

The rule for content-char currently looks like this:

content-char = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
             ...
             / %x3001-D7FF    ; omit surrogates
             / %xE000-10FFFF
             ...

That is unusual for platforms that use UTF-16 natively, such as JavaScript, Java, and even the "wide version" of the Windows C APIs (which use wchar_t, 16 bits wide on Windows).

Such languages don't try to enforce UTF-16 correctness (the same way C/C++ don't try to enforce UTF-8 or any other kind of UTF correctness).

Any validation is done "at the edge", when data is ingested, if at all.

Existing APIs that are similar to MessageFormat 2 work just fine with incorrect surrogate sequences.

@Test
public void testBadSurrogates() {
  dumpHex(String.format("\uda02 %d \udc02", 42));
  dumpHex(java.text.MessageFormat.format("\uda02 {0} \udc02", 42));
  dumpHex(com.ibm.icu.text.MessageFormat.format("\uda02 {0} \udc02", 42));
}

private void dumpHex(String str) {
  str.chars().forEach(c -> System.out.printf(" %04X", c));
  System.out.println();
}

The code above does not throw, and the result preserves the surrogates "as is".
The output looks like this:

 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02

The current restriction also contradicts what was agreed in this thread:

RGN: Surrogate code points. Those are code points reserved for representing code points in UTF-16 that are beyond the first plane (BMP) of 2^16 code points.

MIH: I understand what you mean. But we also implement this in C and Java, and so on. So what should we do if we receive a message with invalid UTF-8 code points? Do we expect to replace them with the replacement character, or do we just pass them through?

RGN: I think what you're asking about, using JavaScript as a concrete example, is that a JS string is allowed to have unpaired surrogates. So the question is a question for the JS adapter / implementation, but that's not a question for the standard itself.

MIH: So we leave it to the implementation?

RGN: Yes.

MIH: Okay, that is fine with me.

https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md


catamorphism · 2024-09-19T20:35:23Z

Just for reference, the PR that introduced this requirement is #290 (from August 2022).

mihnita · 2024-09-19T20:50:18Z

I agree that unpaired surrogates are invalid in UTFs.
But some programming languages don't care about that, and they make no guarantees that their strings are UTF-16 correct.

MessageFormat 2 is an advanced form of "take this string with markers inside, and replace the markers with something I give you at runtime".

It should not be in the business of enforcing UTF correctness, or any other kind of correctness.
We don't try to prevent the use of non-characters (U+?FFFE and U+?FFFF),
or of U+0001–U+0008, U+000B–U+000C, U+000E–U+001F just because they are invalid in XML 1.0.

If devs want to be strict, they can enforce it through linters, or in the storage format.
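To make the "enforce it through linters" option concrete, here is a minimal sketch of such a check in Java. The class and method names (`SurrogateLint`, `firstUnpairedSurrogate`) are made up for illustration and are not part of ICU or any MessageFormat API; the sketch only uses `Character.isHighSurrogate` and `Character.isLowSurrogate` from the JDK.

```java
final class SurrogateLint {
    /**
     * Hypothetical lint helper: returns the index of the first unpaired
     * surrogate code unit in s, or -1 if every surrogate is properly paired.
     */
    static int firstUnpairedSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (!paired) {
                    return i; // high surrogate with no low surrogate after it
                }
                i++; // well-formed pair: skip the low half
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate that was not preceded by a high one
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Lone high surrogate at index 6.
        System.out.println(firstUnpairedSurrogate("Hello \uD800 world!"));
        // Properly paired surrogates: nothing to report.
        System.out.println(firstUnpairedSurrogate("Hello \uD83D\uDE00 world!"));
    }
}
```

A tool like this could run over resource files at build time, leaving the formatter itself free to pass strings through untouched.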

aphillips · 2024-09-19T21:48:03Z

(as an individual contributor)

I agree that encoding/UTF considerations don't belong in our specification, because we are not a storage format. Disallowing standalone surrogates is mostly a Good Thing, since non-Unicode encodings can't do anything (other than replace them) and UTF-8 can't encode them (except by exceptional pleading). In the course of fixing UTF-16 (really "UCS-2-like") support, we shouldn't make the mistake of requiring UTF-8 based or USV String based implementations to do hokey things.

The spec is only marginally unfriendly to UCS-2 implementations, though. The ABNF and spec say that surrogate code points are not permitted in text or literal. I suspect the best approach here would be to allow "UCS-2" implementations to not enforce unpaired surrogate restrictions (or, more importantly, to not require them to check for unpaired surrogates in text). Along the lines of:

Implementations are not required to check for unpaired surrogate code points in text or literals.

[!NOTE]
Some implementations, in languages such as Java or JavaScript, use strings composed of
16-bit code units.
See for example Infra.
Such implementations do not check for unpaired surrogate code points,
even though these do not validly encode any character.
Such implementations are conformant, even though the grammar does not permit
these code points.

aphillips · 2024-09-19T21:49:05Z

(chair hat) I have tagged this for post-46.

mihnita · 2024-09-23T05:37:01Z

Disallowing standalone surrogates is mostly a Good Thing

100% agree.
This is something I would definitely enforce in a lint rule.
But not in this kind of spec.

The spec is only marginally unfriendly to UCS-2 implementations, though.
The ABNF and spec say that surrogate code points are not permitted in text or literal.

Surrogates are excluded from content-char and name-start
Meaning simple-start-char and text-char and quoted-char
So they are excluded from simple-message & pattern.

Which means I can't even do Hello \uD800 world!

And I can't have it in name and identifier.
So I am not really sure where can I have it.
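As a rough illustration of why the grammar rejects such a message: when a Java string is read as code points, a lone surrogate comes through as a code point in D800–DFFF, which is exactly the gap between the two ranges quoted from the ABNF (`%x3001-D7FF` and `%xE000-10FFFF`). The class and method names below are hypothetical, and the sketch checks only that gap, not the full content-char rule.

```java
final class SurrogateGap {
    // True when cp lies between the two ABNF ranges quoted in this issue,
    // i.e. above D7FF and below E000.
    static boolean inSurrogateGap(int cp) {
        return cp >= 0xD800 && cp <= 0xDFFF;
    }

    // String.codePoints() combines well-formed pairs into one supplementary
    // code point and passes unpaired surrogates through unchanged.
    static boolean hasCodePointInGap(String s) {
        return s.codePoints().anyMatch(SurrogateGap::inSurrogateGap);
    }

    public static void main(String[] args) {
        System.out.println(hasCodePointInGap("Hello \uD800 world!"));       // lone surrogate
        System.out.println(hasCodePointInGap("Hello \uD83D\uDE00 world!")); // paired
    }
}
```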

mihnita · 2024-09-23T05:37:23Z

(chair hat) I have tagged this for post-46.

I have no problem with that.
Thank you!

aphillips · 2024-09-23T14:49:43Z

Surrogates are excluded from content-char and name-start Meaning simple-start-char and text-char and quoted-char So they are excluded from simple-message & pattern.

Which means I can't even do Hello \uD800 world!

And I can't have it in name and identifier. So I am not really sure where can I have it.

Be careful: this sword is sharp on both edges.

I don't know what practical use a string like Hello \uD800 world! has. Any message with unpaired surrogates faces ruin if it meets a UTF-8 encoder (such as (de)serializing it to/from a resource file) or in any number of tools. It doesn't mean anything different from and displays just like Hello \uFFFD world!.
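The UTF-8 point can be sketched in Java using only JDK calls (`String.getBytes(Charset)`, `CharsetEncoder.canEncode`). The class and method names are hypothetical. `getBytes` silently substitutes the charset's replacement bytes for anything it cannot encode, which for the JDK's UTF-8 encoder is a `?`, so the lone surrogate does not survive a round trip through bytes.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Encode to UTF-8 bytes and decode again, as a resource-file
    // save/load cycle would.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True when the string can be encoded to UTF-8 without substitution.
    static boolean survivesUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String msg = "Hello \uD800 world!";
        System.out.println(survivesUtf8(msg));          // false
        System.out.println(roundTrip(msg));             // surrogate replaced by '?'
        System.out.println(roundTrip(msg).equals(msg)); // false
    }
}
```

Other stacks differ in the details (some substitute U+FFFD, some throw), but none of them can carry the unpaired surrogate through well-formed UTF-8.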

Allowing unpaired surrogates means requiring support for them in the productions in languages that use byte-oriented (e.g. UTF-8) strings. If we allow unpaired surrogates in name, then one has the problem of referring to values such as $\uD800 or invoking functions like :\uD800.

I'm somewhat sympathetic to allowing unpaired surrogates in text or, rather, to not checking if any appear in text. Permitting (which means requiring support for) their use elsewhere seems like something I'd rather impose on UTF-16 implementations.

mihnita · 2024-09-23T17:05:09Z

I am not necessarily arguing to allow them in name.
Or that they have a good use case.

It is about "going against the grain" for some platforms.
Even byte-oriented languages (often C/C++) treat strings as "a bunch of bytes, which just happen (or not) to be utf-8".

But we can shave this yak after LDML-46 :-)

aphillips · 2024-10-07T18:25:21Z

In the 2024-10-07 call, we agreed that @mihnita would make a PR adding unpaired to text-char with appropriate wording.

mihnita · 2024-10-11T20:02:17Z

In the 2024-10-07 call, we agreed that @mihnita would make a PR adding unpaired to text-char with appropriate wording.

We agreed that I will create the PR. Not on the exact implementation.

text-char or something else is an implementation detail.

I think that the last quote from EAO leaves that door open:

Let’s see what MIH comes up with and go from there

--

Scanning the notes it looks like the intent is really to allow surrogates in text, and not in code:

AAP: ... change at least the content-char in text to allow for unpaired surrogate values in there.

AAP: ... But disallowing them in names and other things is responsible.

EAO: allow for unpaired surrogates in content-char but only there

RCH: Mostly I wanted it nailed down. ... Nailing down names is acceptable to me, I don’t know why someone would want the names to be non-conforming,

mihnita mentioned this issue Sep 19, 2024

ICU-22890 MF2: Add lone surrogate test to parser unicode-org/icu#3167

Merged

aphillips added the Preview-Feedback Feedback gathered during the technical preview label Sep 19, 2024

aphillips added syntax Issues related with MF Syntax LDML46.1 MF2.0 Draft Candidate labels Sep 19, 2024

aphillips added the Agenda+ Requested for upcoming teleconference label Sep 29, 2024

aphillips added Action-Item Action item assigned by the WG and removed Agenda+ Requested for upcoming teleconference labels Oct 7, 2024

aphillips assigned mihnita Oct 7, 2024
