Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22707 Unicode 16 beta jun04 #3028

Merged
merged 27 commits into from
Jul 18, 2024
Merged

Conversation

markusicu
Copy link
Member

@markusicu markusicu commented Jun 5, 2024

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22707
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

@markusicu
Copy link
Member Author

@eggrobin I have the latest Unicode 16 data here. Locally, test pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while. Can you please update the segmentation code & data as needed?

@echeran FYI

@markusicu
Copy link
Member Author

Locally, test pass except for intltest rbbi and intltest idna. I will probably disable the failing idna (UTS46) tests for a while.

Done. Locally, only intltest rbbi fails now.

@eggrobin
Copy link
Member

eggrobin commented Jun 5, 2024

Can you please update the segmentation code & data as needed?

In this branch, or in a separate PR? (As discussed, I will want to do that with several commits, both to separate the proposals and because I want to keep a record of the steps of the LB25 derivation.)

@markusicu
Copy link
Member Author

Can you please update the segmentation code & data as needed?

In this branch, or in a separate PR?

This pull request here is set up to allow multiple commits, and when it's done I will rebase-and-merge them, not squash them.

I assume that it would be easiest for you to add commits here directly for segmentation.
Otherwise we would need to disable the failing rbbi tests as well before merging this PR into main.
It feels like the risk from disabling tests is higher for rbbi than it is for idna.

@eggrobin
Copy link
Member

eggrobin commented Jun 5, 2024

Sounds reasonable, I will add commits into this one then.

@jira-pull-request-webhook

This comment was marked as resolved.

@jira-pull-request-webhook

This comment was marked as resolved.

@jira-pull-request-webhook

This comment was marked as resolved.

@eggrobin
Copy link
Member

Oh, this is fun:
createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 292, column 5

This is the set [$IS & [\p{ea=F}\p{ea=W}\p{ea=H}]] which got emptied by UTC-179-C30:

[179-C30] Consensus: Change the Line_Break assignment of U+FE10 ︐ PRESENTATION FORM FOR VERTICAL COMMA to Close_Punctuation (CL), and that of U+FE13 ︓ PRESENTATION FORM FOR VERTICAL COLON and U+FE14 ︔ PRESENTATION FORM FOR VERTICAL SEMICOLON to Nonstarter (NS), to match their FULLWIDTH counterparts U+FF0C, U+FF1A, and U+FF1B. For Unicode Version 16.0. See document L2/24-064 item 5.7.

The set previously contained exactly these three characters: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BU15.1%3Alb%3DIS%7D+%26+%5B%5Cp%7BU15.1%3Aea%3DF%7D%5Cp%7BU15.1%3Aea%3DW%7D%5Cp%7BU15.1%3Aea%3DH%7D%5D%5D&g=&i=.

That item in the PAG report reads:

I only spotted that because of extremely obscure interactions between line breaking
rules in the optimized ICU implementation.

So I will now have to remove those extremely obscure lines from the rules, a welcome change from my usual routine of adding extremely obscure lines.

@markusicu
Copy link
Member Author

Hi @eggrobin, thanks for making progress here!
It sounds like this is still WIP, and I see that a number of the CI checks are unhappy.
Are you going to consolidate the commits into fewer/chunkier ones?

@eggrobin
Copy link
Member

eggrobin commented Jun 21, 2024

It sounds like this is still WIP, and I see that a number of the CI checks are unhappy.

Yes; I have brought in all the work that was already done, but as expected I need to appease the new monkeys. (And some clang warnings, etc.)

Are you going to consolidate the commits into fewer/chunkier ones?

Mostly, no: things have already been consolidated (compare eggrobin/icu@unicode-org:icu:main...uax14-integration). What remains is split by UTC decision, and, e.g., the work on UTC-179-C35 is in turn split into the steps documented in the background section of item 5.15 of the report, plus the post UTC correction; I want to retain these steps in the history of line.txt and friends.

I expect that I will coalesce whatever additional work remains to be done into one or two commits though.

@markusicu
Copy link
Member Author

Hi @eggrobin FYI @echeran now has two pending PRs that add support for new properties, which want to go in after this PR here...

@eggrobin
Copy link
Member

Yes, I somehow got distracted from ICU4[CJ] matters last week and dropped this ball. I intend to get back to this on Monday, please poke me with a sharp stick if I don’t.

@eggrobin
Copy link
Member

eggrobin commented Jul 1, 2024

Exciting Development: While testing the new monkeys, I came across a string which exposes a bug in my rules for LB19a.
Somehow the old monkeys never came up with such a string over days of testing.

This seems completely tractable in ICU, and should not require a change on the UAX14 side, so this is not an all-hands-on-deck emergency. But it is still uncomfortably exciting.

The string in question is ︷ \U00016FF1\u302B⸠ᅛᆅ, where \U00016FF1\u302B are East_Asian_Width=Wide combining marks.
That \U00016FF1, lb=CM and ea=W, being after a space, gets treated as lb=AL, but remains ea=W, so LB19a should not apply.

In ICU, LB19a was implemented in a slightly strange way: LB19 was unchanged, and the complement of LB19a is given break rules (this is to avoid having to add a profusion of rules for overlapping context spanning more than two code points).
For a lb=CM following a break, the lb=CM-as-AL and ea=W case is handled by the rule

^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]]                    / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];
^[$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX          / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM];

But in this case, the lb=CM-as-AL does not follow a break, because LB14 applied.

The solution should be to copy the existing rules that end with $CM+ $AL_FOLLOW, namely

$OP $CM* $SP+ $CM+ $AL_FOLLOW?;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM*  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+  [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
  1. once with $CM+ $AL_FOLLOW? replaced by
    [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM],
  2. once with that replaced by
    [$CM & [\p{ea=F}\p{ea=W}\p{ea=H}]] $CM* $CMX / [\p{Pi} & $QU] $CM* [ [\p{ea=F}\p{ea=W}\p{ea=H}] - $CM].

This test case is sufficiently treacherous that it should be added both to rbbitst.txt and to the UCD’s own LineBreakTest.txt.

@jira-pull-request-webhook

This comment was marked as outdated.

@jira-pull-request-webhook

This comment was marked as outdated.

@eggrobin
Copy link
Member

eggrobin commented Jul 2, 2024

@markusicu Status report: 70089cd is green (except for clang warnings which I am fixing in the next commit), so if this is blocking too many things you could run with it.
It is however wrong, as the old monkeys demonstrate if they run for long enough. It is wrong in a way I understand and have documented in line.txt, and I think I know how to fix that, though it will involve writing some truly disgusting regular expressions.

Also note that so far this PR does not upgrade any of the tailored copies of the line breaking algorithm (which should receive the same changes as the default). I don’t want to do that before I get the changes to the default right.

@markusicu
Copy link
Member Author

@markusicu Status report: 70089cd is green (except for clang warnings which I am fixing in the next commit),

Great, thanks! 🎉

so if this is blocking too many things you could run with it. It is however wrong, as the old monkeys demonstrate if they run for long enough.

Given the US holiday and your and Elango's travel schedules, I suggest that we keep this PR open for now. If you have more time to work on it, you can make progress right here. It would be nice if it was still "green" next week. At that point I (and maybe Andy) could look it over for plausibility and code changes, and merge. And then I might try to rebase Elango's InCB PR -- or I might just wait for his return. Separately I could start fixing ICU UTS46 code for 16 once this PR is in.

@markusicu markusicu requested a review from aheninger July 2, 2024 23:09
@markusicu
Copy link
Member Author

Added Andy as a reviewer for the segmentation changes. (incomplete, see comments above and separate email)

@jira-pull-request-webhook

This comment was marked as outdated.

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Copy link
Member

Ten days ago I had written:

I am only partway through https://unicode-org.github.io/icu/userguide/dev/rules_update.html; the Tailorings step and the ICU4J steps will still need to be done (but I can do them in a subsequent PR).

I have now done the Tailorings step.
ICU4J segmentation is still at 15.1, but that can be dealt with in a subsequent PR (its copy of the old monkeys will probably need similarly substantial changes).

@eggrobin
Copy link
Member

but that can be dealt with in a subsequent PR

As discussed, that would probably not work once we regenerate ICU4J data.

@eggrobin
Copy link
Member

@markusicu, the ICU4J section of https://unicode-org.github.io/icu/userguide/dev/rules_update.html points me to icu4c/source/data/icu4j-readme.txt; following those instructions, I am able to generate icu4j\main\shared\data\icudata.jar, icu4j\main\shared\data\icutzdata.jar, and icu4j\main\shared\data\testdata.jar; but those are not a thing anymore. What should I actually be doing to regenerate ICU4J data?

@markusicu
Copy link
Member Author

@eggrobin I just pushed a commit with the regenerated ICU4J binary .brk files.

@jira-pull-request-webhook

This comment was marked as outdated.

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin eggrobin marked this pull request as ready for review July 18, 2024 15:36
@markusicu
Copy link
Member Author

Hi @eggrobin

  • I am going to rubber-stamp the line breaking changes...
  • Are you going to coalesce some of the commits? For example, the first two for UTC-179-C35? Or the two for UTC-179-C28?

@eggrobin
Copy link
Member

Are you going to coalesce some of the commits? For example, the first two for UTC-179-C35? Or the two for UTC-179-C28?

Not those at least; see #3028 (comment). In general I think the commits should be sensible here, I have consistently been squashing minor tweaks into the relevant commits (hence the many force-pushes).

@eggrobin
Copy link
Member

@markusicu markusicu merged commit 4acb472 into unicode-org:main Jul 18, 2024
104 checks passed
echeran added a commit to echeran/icu that referenced this pull request Jul 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants