passStyleOf
must validate a string isWellFormed
in order to be Passable
#1739
Comments
In XS, which uses CESU-8 for builtin strings for us at our request, how much more expensive would it be to convert two strings to UTF-8 (or rather, UTF-8 together with only unpaired surrogates) and then compare those? Attn @phoddie @patrick-soquet My sense is that we should NOT ask Moddable to switch us back to their UTF-8 support as the representation of builtin strings, as that support will never be consistent with JS. What happens when appending a string that ends with an unpaired surrogate with a string that starts with an unpaired surrogate? With CESU-8, we maintain reasonable compat with JS. With UTF-8 for builtin strings, we cannot. Besides, if we relied on native UTF-8 string representation from the engine, our code would not be portable to other JS engines. So really, our only choice is conversion on comparison in userland, possibly with some userland finite memo (since we cannot use weakness. Or perhaps a transient memo?). This is the best we should try to do if we want to repair the sort-order issue towards Unicode. If it is cheap enough, we should do that. If it is not cheap enough, we need to figure out what we want and what to negotiate with ocapn. |
Hi @gibson042 I assigned this to the two of us, as we seem to have the most interest -- or patience! -- with this topic ;) |
Can't we leverage a host implemented compare function? With a fallback to a JS implementation if an optimized host compare is not available. I think at the end of the day we should make sure none of our logic relies on the internal encoding of strings. I guess there is the question of what should the behavior be if the encoding allows invalid Unicode codepoints. |
If I'm understanding this correctly... Cases where the result is different from the standard, here as a result of the unusual string encoding choice, are conformance bugs. They should be fixed in the engine. As for strings "on the wire", the situation is the same as with any engine:
FWIW – the CESU-8 encoding, if I recall correctly, has better conformance as measured by test262 than the UTF-8 encoding. I don't think we should move backwards. |
The crux of this particular issue is that the concern of key order and comparison needs to be consistent and priced right for both the memory and wire models, although those representations necessarily differ, and be a good compromise between multiple languages. I do not think we need to transcode in memory to compare. |
I fully agree. That's the point I was making with the string append example. With UTF-8, there's no way the behavior of that example would be compat with JS. I am very grateful that our internal encoding is CESU-8. |
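The append example can be checked directly in any conformant JavaScript engine. A minimal sketch (the variable names are illustrative only): concatenating a lone high surrogate with a lone low surrogate yields a string indistinguishable from the corresponding supplementary character, which is the behavior the thread credits CESU-8 with preserving.

```javascript
const high = '\uD83D'; // lone high (leading) surrogate
const low = '\uDCA9'; // lone low (trailing) surrogate
const joined = high + low;

// The two halves combine into a single supplementary code point.
console.log(joined === '\u{1F4A9}'); // true
console.log(joined.length); // 2 (still two UTF-16 code units)
console.log(joined.codePointAt(0).toString(16)); // 1f4a9
```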
That sounds good to me. Or better, if TextEncoder/TextDecoder is fast enough to use to prepare the data for comparison, then the remaining need would be for a fast comparison by byte data, which we will eventually need anyway. (No rush on that.)
What does TextEncoder/TextDecoder do with invalid Unicode codepoints (i.e., lone surrogates)? What are they supposed to do? |
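On the lone surrogate question: per the WHATWG Encoding Standard, `TextEncoder` takes its input as a scalar-value string, so each lone surrogate is replaced with U+FFFD before encoding, and a non-fatal `TextDecoder` likewise emits U+FFFD for byte sequences that are not valid UTF-8. A small sketch, assuming a host that provides both globals (a locked-down XS compartment may not):

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder(); // UTF-8, non-fatal by default

// A lone surrogate is encoded as U+FFFD (bytes EF BF BD).
const loneHighBytes = Array.from(encoder.encode('\uD800'));
const loneLowBytes = Array.from(encoder.encode('\uDFFF'));
console.log(loneHighBytes); // [ 239, 191, 189 ]

// Distinct ill-formed strings therefore collapse to identical bytes.
console.log(loneHighBytes.join() === loneLowBytes.join()); // true

// CESU-8-style bytes for a lone surrogate (ED A0 80) are not valid UTF-8,
// so decoding yields only replacement characters.
const decoded = decoder.decode(new Uint8Array([0xed, 0xa0, 0x80]));
console.log([...decoded].every(ch => ch === '\uFFFD')); // true
```

So comparing `TextEncoder` output cannot distinguish ill-formed strings from each other, or from strings that contain a literal U+FFFD.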
I can only imagine TextEncoder and TextDecoder as performing well enough as a stopgap for having an as-if-UTF-8 native comparison function. Avoiding allocation is usually the first big win when performance comes to roost. |
Right, a comparison function should not allocate. Also, to be pedantic, we need a Unicode code point comparison. I believe an advantage of UTF-8 encoding is that comparing bytes gives the same result as comparing code points, but we don't need to encode to UTF-8 to perform that comparison. Also, given that there is no comparison for byte arrays, I don't think TextEncoder really helps here. Iterating code units is probably sufficient, as long as we can handle both CESU-8 and UCS-2. |
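One way to iterate code units without allocating is the well-known surrogate "fix-up" trick: remap each differing code unit so that surrogates sort above the rest of the BMP. The sketch below is an illustration, not code from this repository; the names `fixUp` and `compareWellFormedByCodePoint` are hypothetical. It assumes both inputs are well-formed: a lone surrogate would sort above U+E000 through U+FFFF here, which differs from ordering by `codePointAt`.

```javascript
// Remap a UTF-16 code unit so that surrogates (0xD800-0xDFFF) sort above
// the remaining BMP code units (0xE000-0xFFFF), matching code point order.
const fixUp = unit => {
  if (unit < 0xd800) return unit;
  return unit >= 0xe000 ? unit - 0x800 : unit + 0x2000;
};

// Assumes well-formed strings (no unpaired surrogates).
const compareWellFormedByCodePoint = (left, right) => {
  const limit = Math.min(left.length, right.length);
  for (let i = 0; i < limit; i += 1) {
    const l = left.charCodeAt(i);
    const r = right.charCodeAt(i);
    if (l !== r) {
      // The first differing code unit decides the order.
      return fixUp(l) < fixUp(r) ? -1 : 1;
    }
  }
  // One string is a prefix of the other, or they are equal.
  return Math.sign(left.length - right.length);
};

const sorted = ['\uFF4A', '\u{1D306}', '\u26A0'].sort(
  compareWellFormedByCodePoint,
);
console.log(sorted.map(s => s.codePointAt(0).toString(16)));
// [ '26a0', 'ff4a', '1d306' ]
```

The loop reads only code units and numbers, so it creates no intermediate strings or byte arrays.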
I just gotta gripe for a moment: UTF-8 does preserve Unicode ordering. If someone knows the history of the decision, I'd be curious. Was this considered? Had 16 bit Unicode already made commitments that prevented UTF-16 from preserving 21 bit Unicode order? |
As I understand it, the short answer is "yes". Unicode 1.1 had already claimed the top of the range, defining U+FFFD REPLACEMENT CHARACTER and annotating U+FFFE and U+FFFF as "Not character codes" (the former of which is the transposition of the U+FEFF byte order mark used for endianness identification in UTF-16 and possibly in its UCS-2 predecessor), so any 16-bit scheme that preserved lexicographic ordering by code point would need to start with a U+FFFF prefix followed by lower code units and thereby run afoul of the "overlap" problems mentioned in the encoding forms FAQ (barring a backwards-incompatible move that wasn't going to be palatable after what RFC 2781 refers to as the "Korean mess"). |
Given the flow of this conversation, I feel like I'm overlooking a key assumption. I don't see how a reasonable behavior can be achieved solely by script-level patches. That would work to provide a different default comparison function for sort. But it won't change the comparison operators ( |
@phoddie The contributors to this repository don't need ECMAScript built-in methods or operators to deviate from the specification (and in fact they must not), but we do need to come to some explicit decisions (for producing, always sort by code unit vs. always sort by code point, and for consuming, insist on sorting by code unit vs. insist on sorting by code point vs. normalize at ingress) and consistent implementation thereof to support interchange. I think you were mentioned to weigh in on the best available strategy for comparing ECMAScript strings lexicographically by code point in XS and how it performs relative to native operators comparing by code unit. |
Can someone confirm my understanding of the following:
|
@gibson042 Thank you for the notes. I understand now that the concern here is with standard JavaScript language behavior, not a conformance issue with the XS implementation. What you want is a comparison function that sorts as if the strings were UTF-8 encoded. That will be used to ease interoperability with systems that are not JavaScript-based. For efficiency, this UTF-8 sort should be implemented in native code when running on XS. That will be fast and not increase the burden on the GC. Even straightforward, unoptimized C will perform better. |
They are conformant. That was among the benefits of moving to CESU-8. |
Now that I think I understand the goal (thank you for your patience), a solution would look something like this?
My quick search for an implementation of a UTF-8 comparison function in JavaScript came up empty. We can surely write a C version from scratch, but a working algorithm as a starting point would be preferred. |
I very much doubt the existence of a code-point-order comparator for JavaScript strings. If we can’t find one (and please forgive me if this is obvious), we can probably convert a UTF-8 encoder into a comparator that co-iterates over a pair of string internals. |
Your intuition matches my search results. So.
Yes, something like that would be straightforward. There's no problem to do that, just looking to avoid reinvention if possible. That's probably optimistic at this point. |
const compareByCodePoints = (left, right) => {
for (let i = 0; ; ) {
const leftCodepoint = left.codePointAt(i);
const rightCodepoint = right.codePointAt(i);
if (leftCodepoint === undefined && rightCodepoint === undefined) {
return 0;
} else if (leftCodepoint === undefined) {
// left is a prefix of right.
return -1;
} else if (rightCodepoint === undefined) {
// right is a prefix of left
return 1;
} else if (leftCodepoint < rightCodepoint) {
return -1;
} else if (leftCodepoint > rightCodepoint) {
return 1;
}
i += leftCodepoint <= 0xffff ? 1 : 2;
}
};
[ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ].sort(compareByCodePoints);
// => [ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ]
[ '⚠', '𝌆', '💩', '🔥', 'ſt', 'j' ].sort(compareByCodePoints);
// => [ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ]
[ '⚠', '𝌆', '💩', '🔥', 'ſt', 'j' ].sort((a, b) => compareByCodePoints(b, a));
// => [ '🔥', '💩', '𝌆', 'j', 'ſt', '⚠' ]
[ '⚠', '𝌆', '💩', '🔥', 'ſt', 'j' ]
.flatMap((a, i, arr) => [...arr.map(b => b + a), a])
.sort(compareByCodePoints);
/* =>
[
'⚠', '⚠⚠', '⚠ſt', '⚠j', '⚠𝌆',
'⚠💩', '⚠🔥', 'ſt', 'ſt⚠', 'ſtſt',
'ſtj', 'ſt𝌆', 'ſt💩', 'ſt🔥', 'j',
'j⚠', 'jſt', 'jj', 'j𝌆', 'j💩',
'j🔥', '𝌆', '𝌆⚠', '𝌆ſt', '𝌆j',
'𝌆𝌆', '𝌆💩', '𝌆🔥', '💩', '💩⚠',
'💩ſt', '💩j', '💩𝌆', '💩💩', '💩🔥',
'🔥', '🔥⚠', '🔥ſt', '🔥j', '🔥𝌆',
'🔥💩', '🔥🔥'
]
*/ |
Perfect. Thank you! |
@gibson042 is it safe to assume that the left and right indices will always be the same, since the algorithm short-circuits at a divergence? |
Yes, good observation. Updated, and also added a demonstration for strings of more than one code point. |
@gibson042 What’s the status of this issue? Did we make a choice? |
Skimming back over the above thread, the meeting of minds seems to be that we want In any case, the plan would be to use a native implementation of |
It is less clear to me whether we have a meeting of the minds on unpaired surrogates. |
Obviously, to be coordinated with OCapN effort as well. Note: OCapN meeting this Tuesday. Topic for agenda? |
It looks like a native implementation of
The code from @gibson042 is clear on that. But, maybe you are referring to how that is defined in the spec of the wire format? |
Looked. It relies on codePointAt, so I looked at https://tc39.es/ecma262/multipage/text-processing.html#sec-string.prototype.codepointat which says
Given that, yes it is clear and is in any case my preferred semantics for unpaired surrogates.
I think so. Any objections? |
Jan 2024 OCapN meeting notes record that we agreed that strings can only be well-formed Unicode, i.e., cannot contain unpaired surrogates. For JavaScript, if a string does not pass the In light of that, I'm renaming this issue to keep track of that need. |
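A well-formedness check of the kind agreed on can be sketched as follows. This is an illustration under stated assumptions, not the actual `passStyleOf` code: the helper names `hasLoneSurrogate` and `isWellFormedString` are hypothetical, and the sketch prefers the ES2024 `String.prototype.isWellFormed` method when the engine provides it, falling back to a code unit scan otherwise.

```javascript
// Fallback scan: true if any surrogate code unit is not part of a
// high-then-low pair.
const hasLoneSurrogate = str => {
  for (let i = 0; i < str.length; i += 1) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return true;
      i += 1; // skip the low surrogate that completes the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // A low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
};

// Prefer the native method where available (ES2024).
const isWellFormedString = str =>
  typeof str.isWellFormed === 'function'
    ? str.isWellFormed()
    : !hasLoneSurrogate(str);

console.log(isWellFormedString('\u{1F4A9}')); // true
console.log(isWellFormedString('\uD83D')); // false
console.log(isWellFormedString('a\uDCA9')); // false
```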
Reopening, because endojs/endo#2002 itself hides the feature behind a feature flag that we currently have disabled by default. Once we change this to default to enabled, then we can close this issue. |
Describe the bug
JavaScript represents strings as sequences of 16-bit integers, each ostensibly representing a UTF-16 code unit, and the native relational operators and Array.prototype.sort compare strings as such rather than by code point:
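A minimal illustration of the discrepancy, using two arbitrarily chosen characters (not taken from the original report): a BMP character near the top of the range has a smaller code point than any supplementary character, but a larger leading code unit.

```javascript
const bmp = '\uFF4A'; // U+FF4A, one code unit: 0xFF4A
const astral = '\u{1D306}'; // U+1D306, two code units: 0xD834 0xDF06

// By code point, the BMP character comes first...
console.log(bmp.codePointAt(0) < astral.codePointAt(0)); // true

// ...but the relational operators and the default sort compare code units,
// and 0xFF4A > 0xD834.
console.log(bmp < astral); // false
console.log([bmp, astral].sort()[0] === astral); // true
```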
This leaks into places like the wire formats associated with CapData and passable encodings, and also into internal representations derived from them and relevant to e.g. pattern matching (all of which are relevant for interoperability, cf. #1587).
Steps to reproduce
Expected behavior
I guess we need to decide on whether the necessary sorting is based on 16-bit code units or on code points, or alternatively that it is a responsibility of decoding (but that requires understanding the tag, because this lack of user-specifiable ordering is a special feature of CopySets, CopyBags, and CopyMaps, and in the latter case requires coordinating parallel key/value arrays).
Or alternatively, we could put something into the encoding layer below tagged typing to explicitly expose sets and/or maps in the data model such that equivalence ignores wire-format key ordering—perhaps in CapData, any array with a privileged first element is a set (e.g.,
{ "body": "#[\"(\", \"foo\", \"bar\", \"baz\"]", "slots": [] }
might encode the logical set of "foo", "bar", and "baz" using smallcaps). But note that the wire format of a map or bag would need to be something like an array of [key, value] entries rather than independent (and thus independently reorderable) collections of keys and values. And either way, that decision should propagate into the OCapN discussion.