fix(marshal)!: compare strings by codepoint
erights committed Jan 29, 2024
1 parent 193e403 commit 7a3a43a
Showing 7 changed files with 146 additions and 4 deletions.
8 changes: 8 additions & 0 deletions packages/marshal/NEWS.md
@@ -1,5 +1,13 @@
User-visible changes in `@endo/marshal`:

# next release

- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF-16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode Code Point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits***.
- This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.
- The key order of strings defined by the `@endo/patterns` module is still defined to be the same as the rank ordering of those strings. So this release changes key order among strings to also be lexicographic comparison of Unicode Code Points. To accommodate this change, you may need to adapt applications that relied on key-order being the same as JS native order. This could include the use of any patterns expressing key inequality tests, like `M.gte(string)`.
- These string ordering changes bring Endo into conformance with any string ordering components of the OCapN standard.
- To accommodate these changes, you may need to adapt applications that relied on rank-order or key-order being the same as JS native order. You may need to re-sort any data that had previously been rank sorted using the prior `compareRank` function. You may need to revisit any use of patterns like `M.gte(string)` expressing inequalities over strings.
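
As a rough illustration of the migration note above, the sketch below checks whether an array that was sorted under the JS-native order is still sorted under code point order, and re-sorts a copy if it is not. The names `toCodePoints`, `compareByCodePointsSketch`, and `resortIfNeeded` are hypothetical and exist only for this example; real code would presumably call `compareRank` from `@endo/marshal` instead of the stand-in comparator.

```javascript
// Hypothetical stand-in for the string case of `compareRank`.
// Array.from on a string iterates by code point, so a well-formed
// surrogate pair becomes a single element.
const toCodePoints = str => Array.from(str, ch => ch.codePointAt(0));

const compareByCodePointsSketch = (left, right) => {
  const l = toCodePoints(left);
  const r = toCodePoints(right);
  const len = Math.min(l.length, r.length);
  for (let i = 0; i < len; i += 1) {
    if (l[i] !== r[i]) return l[i] < r[i] ? -1 : 1;
  }
  // One string is a prefix of the other, or they are equal.
  return Math.sign(l.length - r.length);
};

// Returns the input unchanged when it is already in code point order,
// otherwise returns a re-sorted copy.
const resortIfNeeded = strs => {
  const alreadySorted = strs.every(
    (s, i) => i === 0 || compareByCodePointsSketch(strs[i - 1], s) <= 0,
  );
  return alreadySorted ? strs : [...strs].sort(compareByCodePointsSketch);
};

// The default sort uses UTF-16 code unit order, like `<`.
const nativeSorted = ['a', '\u{10002}', '\u{ff61}'].sort();
const migrated = resortIfNeeded(nativeSorted);
console.log(migrated.map(s => s.codePointAt(0).toString(16)));
// prints [ '61', 'ff61', '10002' ]
```

Data made up only of characters in the Basic Multilingual Plane, with no surrogates, sorts identically under both orders, so `resortIfNeeded` returns such arrays untouched.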

# v0.8.1 (2022-12-23)

- Remote objects now reflect methods present on their prototype chain.
1 change: 1 addition & 0 deletions packages/marshal/index.js
@@ -17,6 +17,7 @@ export {

export {
trivialComparator,
compareByCodePoints,
assertRankSorted,
compareRank,
isRankSorted,
43 changes: 41 additions & 2 deletions packages/marshal/src/rankOrder.js
@@ -46,9 +46,46 @@ const { entries, fromEntries, setPrototypeOf, is } = Object;
*/
const sameValueZero = (x, y) => x === y || is(x, y);

/**
 * @param {any} left
 * @param {any} right
 * @returns {RankComparison}
 */
export const trivialComparator = (left, right) =>
  // eslint-disable-next-line no-nested-ternary, @endo/restrict-comparison-operands
  left < right ? -1 : left === right ? 0 : 1;
harden(trivialComparator);

// Apparently eslint is confused about whether the function can ever exit
// without an explicit return.
// eslint-disable-next-line jsdoc/require-returns-check
/**
 * @param {string} left
 * @param {string} right
 * @returns {RankComparison}
 */
export const compareByCodePoints = (left, right) => {
  const leftIter = left[Symbol.iterator]();
  const rightIter = right[Symbol.iterator]();
  for (;;) {
    const { value: leftChar } = leftIter.next();
    const { value: rightChar } = rightIter.next();
    if (leftChar === undefined && rightChar === undefined) {
      return 0;
    } else if (leftChar === undefined) {
      // left is a prefix of right.
      return -1;
    } else if (rightChar === undefined) {
      // right is a prefix of left.
      return 1;
    }
    const leftCodepoint = /** @type {number} */ (leftChar.codePointAt(0));
    const rightCodepoint = /** @type {number} */ (rightChar.codePointAt(0));
    if (leftCodepoint < rightCodepoint) return -1;
    if (leftCodepoint > rightCodepoint) return 1;
  }
};
harden(compareByCodePoints);

/**
* @typedef {Record<PassStyle, { index: number, cover: RankCover }>} PassStyleRanksRecord
@@ -140,8 +177,7 @@ export const makeComparatorKit = (compareRemotables = (_x, _y) => 0) => {
return 0;
}
case 'boolean':
-case 'bigint':
-case 'string': {
+case 'bigint': {
// Within each of these passStyles, the rank ordering agrees with
// JavaScript's relational operators `<` and `>`.
if (left < right) {
@@ -151,6 +187,9 @@ export const makeComparatorKit = (compareRemotables = (_x, _y) => 0) => {
return 1;
}
}
case 'string': {
return compareByCodePoints(left, right);
}
case 'symbol': {
return comparator(
nameForPassableSymbol(left),
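
The string iterator that `compareByCodePoints` relies on yields one element per code point: a well-formed surrogate pair arrives as a single two-unit string, while a lone surrogate arrives by itself. The sketch below prints what `<` sees (code units) next to what the iterator-based loop sees (code points) for two of the strings used in the tests. The helpers `hex`, `codeUnits`, and `codePoints` are made up for this illustration and are not part of the package.

```javascript
// Hypothetical helpers, for illustration only.
const hex = n => n.toString(16);

// UTF-16 code units, which is what `<` compares.
const codeUnits = str =>
  Array.from({ length: str.length }, (_, i) => hex(str.charCodeAt(i)));

// Code points, which is what the string iterator yields.
const codePoints = str => Array.from(str, ch => hex(ch.codePointAt(0)));

const pair = '\u{d800}\u{dc02}'; // well-formed surrogate pair, U+10002
const lone = '\u{d800}X'; // lone high surrogate followed by 'X'

console.log(codeUnits(pair), codePoints(pair));
// prints [ 'd800', 'dc02' ] [ '10002' ]
console.log(codeUnits(lone), codePoints(lone));
// prints [ 'd800', '58' ] [ 'd800', '58' ]
```

Because the pair collapses to the single value 0x10002, it ranks after U+FF61 under code point order, whereas its leading code unit 0xd800 ranks before 0xff61 under the native order.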
4 changes: 2 additions & 2 deletions packages/marshal/test/test-encodePassable.js
@@ -67,7 +67,7 @@ const encodePassableInternal = makeEncodePassable({
encodeError: er => encodeThing('!', er),
});

-const encodePassable = passable => {
+export const encodePassable = passable => {
resetBuffers();
return encodePassableInternal(passable);
};
@@ -78,7 +78,7 @@ const decodePassableInternal = makeDecodePassable({
decodeError: e => decodeThing('!', e),
});

-const decodePassable = encoded => {
+export const decodePassable = encoded => {
resetCursors();
return decodePassableInternal(encoded);
};
51 changes: 51 additions & 0 deletions packages/marshal/test/test-string-rank-order.js
@@ -0,0 +1,51 @@
import { test } from './prepare-test-env-ava.js';

import { compareRank } from '../src/rankOrder.js';
import { encodePassable } from './test-encodePassable.js';

test('unicode code point order', t => {
  // Test case from
  // https://icu-project.org/docs/papers/utf16_code_point_order.html
  const str0 = '\u{ff61}';
  const str3 = '\u{d800}\u{dc02}';

  // str1 and str2 become impossible examples once we prohibit
  // non-well-formed strings.
  // See https://github.com/endojs/endo/pull/2002
  const str1 = '\u{d800}X';
  const str2 = '\u{d800}\u{ff61}';

  // harden to ensure it is not sorted in place, just for sanity
  const strs = harden([str0, str1, str2, str3]);

  /**
   * @param {string} left
   * @param {string} right
   * @returns {import('../src/types.js').RankComparison}
   */
  const nativeComp = (left, right) =>
    // eslint-disable-next-line no-nested-ternary
    left < right ? -1 : left > right ? 1 : 0;

  const nativeSorted = strs.toSorted(nativeComp);

  t.deepEqual(nativeSorted, [str1, str3, str2, str0]);

  const rankSorted = strs.toSorted(compareRank);

  t.deepEqual(rankSorted, [str1, str2, str0, str3]);

  const nativeEncComp = (left, right) =>
    nativeComp(encodePassable(left), encodePassable(right));

  const nativeEncSorted = strs.toSorted(nativeEncComp);

  t.deepEqual(nativeEncSorted, nativeSorted);

  const rankEncComp = (left, right) =>
    compareRank(encodePassable(left), encodePassable(right));

  const rankEncSorted = strs.toSorted(rankEncComp);

  t.deepEqual(rankEncSorted, rankSorted);
});
5 changes: 5 additions & 0 deletions packages/patterns/NEWS.md
@@ -1,5 +1,10 @@
User-visible changes in `@endo/patterns`:

# next release

- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF-16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareKeys` and associated functions compared strings using this JavaScript-native comparison. Now `compareKeys` and associated functions compare strings by lexicographic Unicode Code Point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits***.
- See the NEWS.md of `@endo/marshal` for more on this change.
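
To make the effect on inequality tests concrete, here is a hedged sketch of a "greater than or equal" check under both orders. `gteNative` and `gteCodePoint` are hypothetical helpers, not the `@endo/patterns` implementation; they only mimic what a string inequality would conclude under the JS-native order and under code point order respectively.

```javascript
// Hypothetical helpers, not the @endo/patterns implementation.

// JS-native order: compares UTF-16 code units.
const gteNative = (specimen, bound) => specimen >= bound;

// Code point order: compares whole code points.
const gteCodePoint = (specimen, bound) => {
  const s = Array.from(specimen, ch => ch.codePointAt(0));
  const b = Array.from(bound, ch => ch.codePointAt(0));
  for (let i = 0; i < Math.min(s.length, b.length); i += 1) {
    if (s[i] !== b[i]) return s[i] > b[i];
  }
  // Equal up to the shorter length: the longer string ranks later.
  return s.length >= b.length;
};

const bound = '\u{ff61}'; // U+FF61, inside the Basic Multilingual Plane
const specimen = '\u{1f600}'; // U+1F600, a supplementary character

console.log(gteNative(specimen, bound)); // false: 0xd83d < 0xff61
console.log(gteCodePoint(specimen, bound)); // true: 0x1f600 > 0xff61
```

For strings without supplementary characters or surrogates, such as `'b'` against `'a'`, both helpers agree, which matches the note that only such strings are affected.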

# v0.2.6 (2023-09-11)

- Adds support for CopyMap patterns (e.g., `matches(specimen, makeCopyMap([]))`).
38 changes: 38 additions & 0 deletions packages/patterns/test/test-string-key-order.js
@@ -0,0 +1,38 @@
// modeled on test-string-rank-order.js

import { test } from './prepare-test-env-ava.js';

import { compareKeys } from '../src/keys/compareKeys.js';

test('unicode code point order', t => {
  // Test case from
  // https://icu-project.org/docs/papers/utf16_code_point_order.html
  const str0 = '\u{ff61}';
  const str3 = '\u{d800}\u{dc02}';

  // str1 and str2 become impossible examples once we prohibit
  // non-well-formed strings.
  // See https://github.com/endojs/endo/pull/2002
  const str1 = '\u{d800}X';
  const str2 = '\u{d800}\u{ff61}';

  // harden to ensure it is not sorted in place, just for sanity
  const strs = harden([str0, str1, str2, str3]);

  /**
   * @param {string} left
   * @param {string} right
   * @returns {import('../src/types.js').KeyComparison}
   */
  const nativeComp = (left, right) =>
    // eslint-disable-next-line no-nested-ternary
    left < right ? -1 : left > right ? 1 : 0;

  const nativeSorted = strs.toSorted(nativeComp);

  t.deepEqual(nativeSorted, [str1, str3, str2, str0]);

  const keySorted = strs.toSorted(compareKeys);

  t.deepEqual(keySorted, [str1, str2, str0, str3]);
});
