ICU-22707 Unicode 16 beta jun04 #3028

markusicu · 2024-06-05T15:35:04Z

new short aliases ID_Status, ID_Type
Unicode 16 beta data as of 2024-jun-04, including
- CLDR-17226 UCA 16 beta jun05 cldr#3783

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22707
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

markusicu · 2024-06-05T22:07:36Z

@eggrobin I have the latest Unicode 16 data here. Locally, tests pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while. Can you please update the segmentation code & data as needed?

@echeran FYI

markusicu · 2024-06-05T22:21:05Z

> Locally, tests pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while.

Done. Locally, only intltest rbbi fails now.

eggrobin · 2024-06-05T22:38:49Z

> Can you please update the segmentation code & data as needed?

In this branch, or in a separate PR? (As discussed, I will want to do that with several commits, both to separate the proposals and because I want to keep a record of the steps of the LB25 derivation.)

markusicu · 2024-06-05T23:09:21Z

> Can you please update the segmentation code & data as needed?

> In this branch, or in a separate PR?

This pull request here is set up to allow multiple commits, and when it's done I will rebase-and-merge them, not squash them.

I assume that it would be easiest for you to add commits here directly for segmentation.
Otherwise we would need to disable the failing rbbi tests as well before merging this PR into main.
It feels like the risk from disabling tests is higher for rbbi than it is for idna.

eggrobin · 2024-06-05T23:12:35Z

Sounds reasonable, I will add commits into this one then.

eggrobin · 2024-06-21T13:58:53Z

Oh, this is fun:
createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 292, column 5

This is the set [$IS & [\p{ea=F}\p{ea=W}\p{ea=H}]] which got emptied by UTC-179-C30:

> [179-C30] Consensus: Change the Line_Break assignment of U+FE10 ︐ PRESENTATION FORM FOR VERTICAL COMMA to Close_Punctuation (CL), and that of U+FE13 ︓ PRESENTATION FORM FOR VERTICAL COLON and U+FE14 ︔ PRESENTATION FORM FOR VERTICAL SEMICOLON to Nonstarter (NS), to match their FULLWIDTH counterparts U+FF0C, U+FF1A, and U+FF1B. For Unicode Version 16.0. See document L2/24-064 item 5.7.

The set previously contained exactly these three characters: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BU15.1%3Alb%3DIS%7D+%26+%5B%5Cp%7BU15.1%3Aea%3DF%7D%5Cp%7BU15.1%3Aea%3DW%7D%5Cp%7BU15.1%3Aea%3DH%7D%5D%5D&g=&i=.

That item in the PAG report reads:

> I only spotted that because of extremely obscure interactions between line breaking
> rules in the optimized ICU implementation.

So I will now have to remove those extremely obscure lines from the rules, a welcome change from my usual routine of adding extremely obscure lines.
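The emptied set can be illustrated with a small Python sketch. It is not how ICU evaluates the rule: the Line_Break values below are a hand-picked sample typed in by hand (an assumption, not read from the UCD), and only East_Asian_Width comes from Python's `unicodedata` module. The names `LB_15_1`, `LB_16_0`, and `east_asian_is` are hypothetical.

```python
import unicodedata

# East_Asian_Width values matched by [\p{ea=F}\p{ea=W}\p{ea=H}].
EAST_ASIAN = {"F", "W", "H"}

# Hand-picked sample of Line_Break values (assumption: typed in, not read from the UCD).
LB_15_1 = {
    ",": "IS", ".": "IS", ":": "IS", ";": "IS",
    "\uFE10": "IS", "\uFE13": "IS", "\uFE14": "IS",
}
# 179-C30 for Unicode 16.0: U+FE10 becomes CL, U+FE13 and U+FE14 become NS.
LB_16_0 = {**LB_15_1, "\uFE10": "CL", "\uFE13": "NS", "\uFE14": "NS"}


def east_asian_is(lb_table):
    """Return the sampled characters in [$IS & [\\p{ea=F}\\p{ea=W}\\p{ea=H}]]."""
    return sorted(
        f"U+{ord(c):04X}"
        for c, lb in lb_table.items()
        if lb == "IS" and unicodedata.east_asian_width(c) in EAST_ASIAN
    )


before = east_asian_is(LB_15_1)
after = east_asian_is(LB_16_0)
print("15.1:", before)
print("16.0:", after)
```

Within this sample, the intersection holds the three vertical presentation forms under the 15.1 values and nothing under the 16.0 values, which is the situation that U_BRK_RULE_EMPTY_SET reports.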

icu4c/source/data/brkitr/rules/line.txt

markusicu · 2024-06-21T17:26:47Z

Hi @eggrobin, thanks for making progress here!
It sounds like this is still WIP, and I see that a number of the CI checks are unhappy.
Are you going to consolidate the commits into fewer/chunkier ones?

eggrobin · 2024-06-21T17:33:04Z

> It sounds like this is still WIP, and I see that a number of the CI checks are unhappy.

Yes; I have brought in all the work that was already done, but as expected I need to appease the new monkeys. (And some clang warnings, etc.)

> Are you going to consolidate the commits into fewer/chunkier ones?

Mostly, no: things have already been consolidated (compare eggrobin/icu@unicode-org:icu:main...uax14-integration). What remains is split by UTC decision, and, e.g., the work on UTC-179-C35 is in turn split into the steps documented in the background section of item 5.15 of the report, plus the post-UTC correction; I want to retain these steps in the history of line.txt and friends.

I expect that I will coalesce whatever additional work remains to be done into one or two commits though.

markusicu · 2024-06-28T23:25:52Z

Hi @eggrobin FYI @echeran now has two pending PRs that add support for new properties, which want to go in after this PR here...

eggrobin · 2024-06-28T23:28:41Z

Yes, I somehow got distracted from ICU4[CJ] matters last week and dropped this ball. I intend to get back to this on Monday, please poke me with a sharp stick if I don’t.

eggrobin · 2024-07-01T14:10:38Z

Exciting Development: While testing the new monkeys, I came across a string which exposes a bug in my rules for LB19a.
Somehow the old monkeys never came up with such a string over days of testing.

This seems completely tractable in ICU, and should not require a change on the UAX14 side, so this is not an all-hands-on-deck emergency. But it is still uncomfortably exciting.

The string in question is ︷ \U00016FF1\u302B⸠ᅛᆅ, where \U00016FF1\u302B are East_Asian_Width=Wide combining marks.
That \U00016FF1, lb=CM and ea=W, being after a space, gets treated as lb=AL, but remains ea=W, so LB19a should not apply.
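A minimal Python sketch of this reasoning follows. It paraphrases LB10 and the "before the quotation mark" halves of LB19a rather than implementing UAX #14, it takes lb=CM for U+16FF1 from the comment above instead of looking it up, and the helper names (`resolve_cm`, `lb19a_forbids_break_before_qu`) are hypothetical. Only East_Asian_Width and General_Category come from `unicodedata`.

```python
import unicodedata

EAST_ASIAN = {"F", "W", "H"}
# Classes a combining mark does not attach to under LB9 (None stands for start of text).
NO_ATTACH = {None, "BK", "CR", "LF", "NL", "SP", "ZW"}


def resolve_cm(prev_lb, lb):
    """LB10 sketch: a CM with nothing to attach to is treated as AL."""
    return "AL" if lb == "CM" and prev_lb in NO_ATTACH else lb


def is_east_asian(ch):
    return unicodedata.east_asian_width(ch) in EAST_ASIAN


def lb19a_forbids_break_before_qu(before, after):
    """Paraphrase of '[^$EastAsian] x QU' and 'x QU ([^$EastAsian] | eot)'."""
    return (not is_east_asian(before)) or after is None or (not is_east_asian(after))


mark = "\U00016FF1"   # lb=CM according to the comment above
hangul = "\u115B"     # the character after the quotation mark in the test string

resolved_lb = resolve_cm("SP", "CM")
for ch in (mark, "\u302B"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.east_asian_width(ch))
print("resolved lb:", resolved_lb)
print("LB19a forbids break:", lb19a_forbids_break_before_qu(mark, hangul))
print("with Latin 'a' before:", lb19a_forbids_break_before_qu("a", hangul))
```

The point of the sketch is that resolving the mark to AL changes only its line break class; its East_Asian_Width stays W, so the quotation mark is still surrounded by East Asian characters.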

In ICU, LB19a was implemented in a slightly strange way: LB19 was unchanged, and the complement of LB19a is given break rules (this is to avoid having to add a profusion of rules for overlapping context spanning more than two code points).
For a lb=CM following a break, the lb=CM-as-AL and ea=W case is handled by the rule

^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]]                    / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX          / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];

But in this case, the lb=CM-as-AL does not follow a break, because LB14 applied.

The solution should be to copy the existing rules that end with $CM+ $AL_FOLLOW, namely

$OP $CM* $SP+ $CM+ $AL_FOLLOW?;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM*  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;

once with $CM+ $AL_FOLLOW? replaced by
[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM],
once with that replaced by
[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM].

This test case is sufficiently treacherous that it should be added both to rbbitst.txt and to the UCD’s own LineBreakTest.txt.
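The rule copying proposed above is a mechanical text substitution, which can be sketched in Python as below. The six rules and the two replacement tails are copied from this comment; whether the rules eventually committed to line.txt have exactly this spelling is not confirmed by the thread, and the variable names are hypothetical.

```python
# Tail shared by the six existing rules quoted above.
TAIL = "$CM+ $AL_FOLLOW?;"

EA = r"[\p{ea=F}\p{ea=W}\p{ea=H}]"
CONTEXT = r"[\p{Pi} & $QU] $CM* [ " + EA + " - $CM]"

# The two replacement tails, each with a look-ahead context after '/'.
REPLACEMENTS = [
    "[$CM & " + EA + "] / " + CONTEXT + ";",
    "[$CM & " + EA + "] $CM* $CMX / " + CONTEXT + ";",
]

EXISTING = [
    r"$OP $CM* $SP+ $CM+ $AL_FOLLOW?;",
    r"($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;",
    r"^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;",
    r"$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;",
    r"$CAN_CM $CM*  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;",
    r"^$CM+  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;",
]

# Every existing rule must end with the tail that is being swapped out.
assert all(rule.endswith(TAIL) for rule in EXISTING)

# One copy of each rule per replacement: 6 rules x 2 tails = 12 new rules.
new_rules = [
    rule[: -len(TAIL)] + replacement
    for replacement in REPLACEMENTS
    for rule in EXISTING
]

for rule in new_rules:
    print(rule)
```

Only the final `$CM+ $AL_FOLLOW?;` is sliced off, so an earlier `$CM+` inside a rule (as in the last one) is left alone.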

eggrobin · 2024-07-02T23:01:37Z

@markusicu Status report: 70089cd is green (except for clang warnings which I am fixing in the next commit), so if this is blocking too many things you could run with it.
It is however wrong, as the old monkeys demonstrate if they run for long enough. It is wrong in a way I understand and have documented in line.txt, and I think I know how to fix that, though it will involve writing some truly disgusting regular expressions.

Also note that so far this PR does not upgrade any of the tailored copies of the line breaking algorithm (which should receive the same changes as the default). I don’t want to do that before I get the changes to the default right.

markusicu · 2024-07-02T23:09:17Z

> @markusicu Status report: 70089cd is green (except for clang warnings which I am fixing in the next commit),

Great, thanks! 🎉

> so if this is blocking too many things you could run with it. It is however wrong, as the old monkeys demonstrate if they run for long enough.

Given the US holiday and your and Elango's travel schedules, I suggest that we keep this PR open for now. If you have more time to work on it, you can make progress right here. It would be nice if it was still "green" next week. At that point I (and maybe Andy) could look it over for plausibility and code changes, and merge. And then I might try to rebase Elango's InCB PR -- or I might just wait for his return. Separately I could start fixing ICU UTS46 code for 16 once this PR is in.

markusicu · 2024-07-02T23:10:09Z

Added Andy as a reviewer for the segmentation changes. (incomplete, see comments above and separate email)

jira-pull-request-webhook · 2024-07-16T12:29:22Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin · 2024-07-16T13:56:47Z

Ten days ago I had written:

> I am only partway through https://unicode-org.github.io/icu/userguide/dev/rules_update.html; the Tailorings step and the ICU4J steps will still need to be done (but I can do them in a subsequent PR).

I have now done the Tailorings step.
ICU4J segmentation is still at 15.1, but that can be dealt with in a subsequent PR (its copy of the old monkeys will probably need similarly substantial changes).

eggrobin · 2024-07-17T12:16:47Z

> but that can be dealt with in a subsequent PR

As discussed, that would probably not work once we regenerate ICU4J data.

eggrobin · 2024-07-17T15:36:21Z

@markusicu, the ICU4J section of https://unicode-org.github.io/icu/userguide/dev/rules_update.html points me to icu4c/source/data/icu4j-readme.txt; following those instructions, I am able to generate icu4j\main\shared\data\icudata.jar, icu4j\main\shared\data\icutzdata.jar, and icu4j\main\shared\data\testdata.jar; but those are not a thing anymore. What should I actually be doing to regenerate ICU4J data?

markusicu · 2024-07-17T20:50:42Z

@eggrobin I just pushed a commit with the regenerated ICU4J binary .brk files.

jira-pull-request-webhook · 2024-07-18T13:52:18Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

markusicu · 2024-07-18T18:45:37Z

Hi @eggrobin

I am going to rubber-stamp the line breaking changes...
Are you going to coalesce some of the commits? For example, the first two for UTC-179-C35? Or the two for UTC-179-C28?

eggrobin · 2024-07-18T18:59:20Z

> Are you going to coalesce some of the commits? For example, the first two for UTC-179-C35? Or the two for UTC-179-C28?

Not those at least; see #3028 (comment). In general I think the commits should be sensible here; I have consistently been squashing minor tweaks into the relevant commits (hence the many force-pushes).

eggrobin · 2024-07-18T19:04:50Z

(Approving the changes in https://github.com/unicode-org/icu/pull/3028/files/1026f7464ec3966e49f263ead9215802d100ff05.)

markusicu assigned echeran Jun 5, 2024

eggrobin force-pushed the uni16-beta-jun04 branch from 2f69ca9 to c466f45 Compare June 21, 2024 13:17

eggrobin force-pushed the uni16-beta-jun04 branch from 0e71e57 to d118c70 Compare June 21, 2024 13:27

eggrobin force-pushed the uni16-beta-jun04 branch from fbac93c to b68325e Compare June 21, 2024 13:43

eggrobin reviewed Jun 21, 2024

icu4c/source/data/brkitr/rules/line.txt (outdated)

echeran added a commit to echeran/icu that referenced this pull request Jun 28, 2024

ICU-22721 Redo ppucd.txt after rebasing on top of PR unicode-org#3028

1148283

This was referenced Jun 28, 2024

ICU-22503 Add support for property Indic_Conjunct_Break #3049

Merged

ICU-22707 Add support for property Modifier_Combining_Mark #3051

Merged

eggrobin force-pushed the uni16-beta-jun04 branch from 4f87a48 to 9782d0d Compare July 2, 2024 13:14

eggrobin force-pushed the uni16-beta-jun04 branch from da1ebfc to 70089cd Compare July 2, 2024 22:12

markusicu requested a review from aheninger July 2, 2024 23:09

eggrobin added 3 commits July 16, 2024 13:46

ICU-22707 UTC-179-A102 Consider using a macro throughout the rules for [\p{ea=F}\p{ea=W}\p{ea=H}].

793a0db

ICU-22707 Patch tailored rules (manually for hunks 1 and 6 on loose(_phrase)?_cj)

336da08

ICU-22707 Patch tailored new monkeys, manually for the last hunk on line_(loose|normal)_cj

73602b2

eggrobin force-pushed the uni16-beta-jun04 branch from 87933b2 to 793a0db Compare July 16, 2024 11:58

ICU-22707 UTC-179-C28 Improved expectation

cf1cbbb

eggrobin force-pushed the uni16-beta-jun04 branch from 398a85a to cf1cbbb Compare July 16, 2024 12:29

ICU-22707 generate ICU4J .brk files

169023a

eggrobin added 2 commits July 18, 2024 13:22

ICU-22707 Copy data-driven test file to ICU4J

1479ae1

ICU-22707 Port the line monkey partition to ICU4J

413bd01

eggrobin force-pushed the uni16-beta-jun04 branch from 615b13e to 4d77ea8 Compare July 18, 2024 13:49

eggrobin added 3 commits July 18, 2024 15:51

ICU-22707 Port the old monkey rule changes to ICU4J

9a91499

ICU-22707 Fix an ancient bug in moveIndex32

a6ff66e

ICU-22707 Copy new monkey rules to ICU4J

4ad3566

eggrobin force-pushed the uni16-beta-jun04 branch from 4d77ea8 to 4ad3566 Compare July 18, 2024 13:52

eggrobin marked this pull request as ready for review July 18, 2024 15:36

eggrobin approved these changes Jul 18, 2024

markusicu merged commit 4acb472 into unicode-org:main Jul 18, 2024
104 checks passed

echeran added a commit to echeran/icu that referenced this pull request Jul 23, 2024

ICU-22721 Redo ppucd.txt after rebasing on top of PR unicode-org#3028

9036736

eggrobin mentioned this pull request Sep 4, 2024

Support Unicode 15.1 for line segmenter unicode-org/icu4x#5218

Merged
