ICU-22404 Unicode 15.1 beta data files & API constants #2492

echeran · 2023-06-05T22:51:23Z

- Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- Required: Each commit message must be prefixed with a JIRA Issue number.
- Issue accepted (done by Technical Committee after discussion)
- Tests included, if applicable
- API docs and/or User Guide docs changed or added, if applicable

markusicu · 2023-06-12T05:33:35Z

Hi @eggrobin this branch/PR has pretty much all of the Unicode 15.1 beta changes for ICU: New property values (though not yet new properties), updated data & test data, collation...

Could you please try to implement the BreakIterator rule changes and add commits to this branch?

https://unicode-org.github.io/icu/userguide/dev/rules_update.html

The biggest change is the one for orthographic syllables:

- Line breaking at orthographic syllable boundaries unicodetools#422
- See also the ticket ICU-22039 and @aheninger's prototype at aheninger/icu@ortholb^...ortholb
- Partial notes from Andy last year:
  - No ICU4J
  - No LB tailorings
  - RBBIMonkeyTest not updated. (There are two ICU RBBI monkey tests; the other one is updated.)
- PAG-internal: https://groups.google.com/a/unicode.org/g/properties/c/IZ484tXVOHM
- ICU-internal: https://groups.google.com/a/unicode.org/g/icu-team/c/qapZjIMFFso

There is also:

- Line breaking around « quotation marks » unicodetools#456

Was there something else in this area?
(I think we don't need to change ICU for the updated grapheme break rules because ICU already implemented those from CLDR.)

FYI @FrankYFTang

eggrobin · 2023-07-07T13:38:09Z

Could you […] add commits to this branch?

I cannot, probably because I have write access neither to this repository nor to @echeran’s. echeran#48 has the changes.

markusicu · 2023-07-10T23:27:47Z

@echeran could you please rebase, and resolve the conflicts (jar files)? We should try to finish this soon.

jira-pull-request-webhook · 2023-07-11T18:11:57Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin · 2023-07-11T20:38:12Z

    [junit] Running com.ibm.icu.dev.test.rbbi.RBBIMonkeyTest
    [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.226 sec
    [junit] TEST com.ibm.icu.dev.test.rbbi.RBBIMonkeyTest FAILED

… on my machine it worked.

I’ll look at this once ICU-TC lets me in on Thursday.

markusicu · 2023-07-12T18:03:38Z

I’ll look at this once ICU-TC lets me in on Thursday.

You should not need to wait for us to add you to the GitHub team before you start debugging.

markusicu · 2023-07-12T19:30:57Z

@eggrobin it turned out that we just needed to refresh the Unicode 15.1 data in the ICU4J data jar file. Now, at least locally, we no longer see RBBI test failures. There are a few other failures that I think (hope) are not related to BreakIterator.

eggrobin · 2023-07-12T20:11:19Z

There are a few other failures that I think (hope) are not related to BreakIterator.

The ants seem happier, but Windows still seems sad.

Interestingly, I am now unable to build this branch on Windows:

29>NMAKE : fatal error U1077: 'C:\Users\robin\Projects\Unicode\icu\icu4c\source\extra\uconv\..\..\..\bin64\pkgdata.EXE' : return code '0xc0000135'
30>NMAKE : fatal error U1077: 'C:\Users\robin\Projects\Unicode\icu\icu4c\bin64\pkgdata.EXE' : return code '0xc0000135'

I do not see that error on CI, but I see this, which I would naïvely think suggests that something went wrong when building:

D:\a\1\s\icu4c\source\test\intltest\rbbiapts.cpp:1054 Built rules and rebuilt rules are different line

(I have no idea where to look beyond that.)

markusicu · 2023-07-12T22:30:51Z

@eggrobin @mihnita On a hunch, Elango and I fixed some charset encoding issues in C++ code, but most of the Windows builds still have rbbi test failures. For example, MSVC x64 Debug (VS2019) / Run x64 Debug Tests.

Linux & macOS pass. Java passes.
Windows MSVC on ARM passes.
Cygwin gcc fails, and the other Windows builds also fail.

Our best guess is still that it somehow has to do with Windows defaulting to the windows-1252 charset rather than UTF-8, and that maybe Windows-on-ARM and Robin's Windows machine are set to use UTF-8 as the default charset. But we apparently failed to stare down the problem in the code diffs. (If the charset is the problem, then somewhere a non-ASCII char * string is incorrectly assumed to be in UTF-8, or a non-ASCII char16_t * string or character is used where char * is needed.)

@rp9-next
@aheninger FYI
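
To make the charset hypothesis above concrete, here is a minimal stand-alone sketch (not ICU code) of what happens when the same bytes are read once as UTF-8 and once as a single-byte codepage. The function names `decodeUtf8`, `decodeSingleByte`, and `checkGuillemet` are hypothetical, and the single-byte decoder only matches windows-1252 for bytes outside 0x80..0x9F.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-8 into code points. Sketch only: input is not fully
// validated; a stray or truncated sequence becomes U+FFFD.
std::vector<char32_t> decodeUtf8(const std::string &bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp;
        int trail;
        if (lead < 0x80)                { cp = lead;        trail = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trail = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trail = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trail = 3; }
        else { out.push_back(0xFFFD); continue; }
        for (; trail > 0; --trail) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i++]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Reads one character per byte, the way a single-byte codepage does.
// For bytes 0xA0..0xFF windows-1252 agrees with ISO 8859-1, so the code
// point equals the byte value. Bytes 0x80..0x9F are NOT mapped faithfully.
std::vector<char32_t> decodeSingleByte(const std::string &bytes) {
    std::vector<char32_t> out;
    for (char c : bytes) out.push_back(static_cast<unsigned char>(c));
    return out;
}

// The UTF-8 bytes of U+00AB (left-pointing double angle quotation mark)
// read under both assumptions.
void checkGuillemet() {
    const std::string bytes = "\xC2\xAB";
    const std::vector<char32_t> asUtf8 = decodeUtf8(bytes);
    const std::vector<char32_t> asCodepage = decodeSingleByte(bytes);
    assert(asUtf8.size() == 1 && asUtf8[0] == 0x00AB);
    // Two characters instead of one: U+00C2 followed by U+00AB.
    assert(asCodepage.size() == 2 && asCodepage[0] == 0x00C2 &&
           asCodepage[1] == 0x00AB);
}
```

If a rule file containing such a character is read under the wrong assumption, the rule source seen by the builder differs from the intended one, which would be consistent with failures that appear only where the default codepage is not UTF-8.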

markusicu · 2023-07-12T22:50:06Z

I do not see that error on CI, but I see this, which I would naïvely think suggests that something went wrong when building:

D:\a\1\s\icu4c\source\test\intltest\rbbiapts.cpp:1054 Built rules and rebuilt rules are different line

If this were indeed something about how the data files are built, then someone who has access to a macOS or Linux machine as well as a Windows machine with the windows-1252 default codepage could check whether the binary files (mostly .brk files) are the same.
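
The comparison suggested above could be done with any binary diff tool; as a rough sketch, assuming the two builds' .brk files have been copied to one machine, something like the following would report the first differing byte. The helper names `readBinary` and `firstDifference` are hypothetical and not part of ICU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Reads a whole file as raw bytes. Binary mode matters on Windows, where
// text mode would translate line endings.
std::string readBinary(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Returns the offset of the first differing byte, or std::string::npos if
// the buffers are identical. A length mismatch counts as a difference at
// the end of the shorter buffer.
std::size_t firstDifference(const std::string &a, const std::string &b) {
    const std::size_t common = std::min(a.size(), b.size());
    const std::size_t offset = static_cast<std::size_t>(
        std::mismatch(a.begin(), a.begin() + common, b.begin()).first -
        a.begin());
    assert(offset <= common);
    if (offset < common || a.size() != b.size()) return offset;
    return std::string::npos;
}
```

A call such as `firstDifference(readBinary("linux/line.brk"), readBinary("windows/line.brk"))` (both paths are placeholders) returning `std::string::npos` would indicate that the two builds produced identical data.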

eggrobin · 2023-07-13T00:46:07Z

Windows MSVC on ARM passes.

That one doesn’t have the Run Tests step, which is the one that fails, so I think that is a red herring.

I still have not managed to build this branch on my Windows machines (obviously it used to build before the rebase, with the RBBI tests passing), but I also can’t seem to build main, so I suspect I may have some local configuration that still thinks it is 73; my local issues are probably unrelated to the CI issues.

eggrobin · 2023-07-13T08:34:35Z

my local issues are probably unrelated to the CI issues.

Figured that out: the tools in icu4c\bin64 are neither replaced by new builds nor removed by cleaning. Manually removing that directory fixed it. I should fix the configuration of the Custom Build Steps that copy from icu4c\source\tools\whatever\x64\Release to icu4c\bin64 one of these days.

Having managed to build this branch… on my machine(s) it works:

[All tests passed successfully...]
Elapsed Time: 00:00:14.878
-
-
-
============================================================
Summary: ICU in "C:\Users\robin\Projects\Unicode\icu\icu4c\source\allinone\"\..\..  arch=x64 type=Release
-
Tests Run    :  icuinfo intltest iotest cintltst
" - All Passed!"

[All tests passed successfully...]
Elapsed Time: 00:00:11.474
-
-
-
============================================================
Summary: ICU in "C:\Users\robin\Projects\Unicode\icu\icu4c\source\allinone\"\..\..  arch=x64 type=Release
-
Tests Run    :  icuinfo intltest iotest cintltst
" - All Passed!"

maybe […] Robin's Windows machine[s] are set to use UTF-8 as the default charset

They are.

as well as a Windows machine with the windows-1252 default codepage

Hm. I can try messing with that setting.

eggrobin · 2023-07-13T10:07:26Z

Alright, setting my system to not-UTF-8, I can reproduce the failures:

Errors in total: 124.
            TestRoundtripRules
         RBBIAPITest
            TestUnicodeFiles
            TestMonkey
         RBBITest
            testMonkey
         RBBIMonkeyTest
      rbbi

--------------------------------------
Elapsed Time: 00:00:01.911

eggrobin@9c6cc1c (which you will have to cherry-pick since I cannot push commits to this branch) fixes the tests.

We should file a follow-up ticket to read the rule files as UTF-8; they already contain non-ASCII UTF-8 text (the comments are not ASCII), which is currently interpreted in whichever codepage the system uses.
The fix to genbrk would be easy, see below; but something in TestRoundtripRules seems to need to be fixed separately.

--- a/icu4c/source/tools/genbrk/genbrk.cpp
+++ b/icu4c/source/tools/genbrk/genbrk.cpp
@@ -234,8 +234,11 @@ int  main(int argc, char **argv) {
     if (U_FAILURE(status)) {
         exit(status);
     }
-    if(encoding!=nullptr ){
-        ruleSourceC  += signatureLength;
+    if (encoding == nullptr) {
+        // In the absence of a BOM, assume the rule file is in UTF-8.
+        encoding = "UTF-8";
+    } else {
+        ruleSourceC += signatureLength;
         ruleFileSize -= signatureLength;
     }

Having hopefully figured that one out, I will put my system back to UTF-8 before something breaks.

mihnita · 2023-07-13T11:09:24Z

I've checked the icu4c/source/tools/genbrk/genbrk.cpp fix.
I can confirm that it works: all tests pass, and it does not look like TestRoundtripRules needs any fixing.

jira-pull-request-webhook · 2023-07-13T16:31:50Z

Notice: the branch changed across the force-push!

icu4c/source/tools/genbrk/genbrk.cpp is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

echeran · 2023-07-13T18:45:31Z

I applied the patch that @eggrobin included in his previous comment to modify genbrk (the alternative to editing the break rule files by escaping non-ASCII characters).

After the change, TestRoundtripRules is the only remaining test that fails.

eggrobin · 2023-07-13T21:24:15Z

I think I have a fix (tested by once again setting my machine to codepage 1252 😩), but I still seem to be unable to push commits to this branch, so have another diff:

--- a/icu4c/source/test/intltest/rbbiapts.cpp
+++ b/icu4c/source/test/intltest/rbbiapts.cpp
@@ -1030,7 +1030,7 @@ void RBBIAPITest::RoundtripRule(const char *dataFile) {
     parseError.offset = 0;
     LocalUDataMemoryPointer data(udata_open(U_ICUDATA_BRKITR, "brk", dataFile, &status));
     uint32_t length;
-    const char *builtSource;
+    UnicodeString builtSource;
     const uint8_t *rbbiRules;
     const uint8_t *builtRules;

@@ -1040,12 +1040,13 @@ void RBBIAPITest::RoundtripRule(const char *dataFile) {
     }

     builtRules = (const uint8_t *)udata_getMemory(data.getAlias());
-    builtSource = (const char *)(builtRules + ((RBBIDataHeader*)builtRules)->fRuleSource);
+    builtSource = UnicodeString::fromUTF8(
+        (const char *)(builtRules + ((RBBIDataHeader *)builtRules)->fRuleSource));
     LocalPointer<RuleBasedBreakIterator> brkItr (new RuleBasedBreakIterator(builtSource, parseError, status));
     if (U_FAILURE(status)) {
         errln("%s:%d createRuleBasedBreakIterator: ICU Error \"%s\"  at line %d, column %d\n",
                 __FILE__, __LINE__, u_errorName(status), parseError.line, parseError.offset);
-        errln(UnicodeString(builtSource));
+        errln(builtSource);
         return;
     }
     rbbiRules = brkItr->getBinaryRules(length);

jira-pull-request-webhook · 2023-07-13T21:41:08Z

Notice: the branch changed across the force-push!

icu4c/source/test/intltest/rbbiapts.cpp is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin · 2023-07-13T21:45:23Z

Note that this PR fixes ICU-22039; I don’t know whether we have a scheme for marking a commit as related to multiple issues.

jira-pull-request-webhook · 2023-07-13T22:07:28Z

Notice: the branch changed across the force-push!

icu4c/source/test/intltest/rbbiapts.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

markusicu · 2023-07-13T23:56:21Z

"All checks have passed" -- super!! thanks everyone!!

markusicu · 2023-07-13T23:58:44Z

Note that this PR fixes ICU-22039; I don’t know whether we have a scheme for marking a commit as related to multiple issues.

I just closed that with resolution "Fixed by Other Ticket" and a comment referring back here.

markusicu · 2023-07-14T00:11:59Z

icu4c/source/tools/genbrk/genbrk.cpp

+    if (encoding == nullptr) {
+        // In the absence of a BOM, assume the rule file is in UTF-8.
+        encoding = "UTF-8";


So this is a change in behavior of this tool.
In a follow-up change, we should document (genbrk usage output, maybe User Guide) that genbrk looks for a Unicode signature byte sequence and otherwise assumes UTF-8.

https://unicode.org/faq/utf_bom.html
https://www.unicode.org/glossary/#unicode_signature
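
As an illustration of the behavior to be documented, here is a stand-alone sketch of "look for a Unicode signature, otherwise assume UTF-8". It does not call ICU and is not the genbrk code; the names `DetectedEncoding` and `detectEncoding` are hypothetical, and only the UTF-8, UTF-16, and UTF-32 signatures are covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct DetectedEncoding {
    std::string name;             // encoding to decode the file with
    std::size_t signatureLength;  // bytes to skip at the start of the file
};

// Looks for a Unicode signature (BOM) at the start of `data`. Without
// one, falls back to UTF-8 with nothing to skip.
DetectedEncoding detectEncoding(const unsigned char *data, std::size_t size) {
    assert(data != nullptr || size == 0);
    // The UTF-32 signatures are tested before UTF-16 because
    // FF FE 00 00 also starts with the UTF-16LE signature FF FE.
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) {
        return {"UTF-32LE", 4};
    }
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) {
        return {"UTF-32BE", 4};
    }
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return {"UTF-8", 3};
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return {"UTF-16BE", 2};
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return {"UTF-16LE", 2};
    }
    // No signature: assume UTF-8 rather than the system codepage.
    return {"UTF-8", 0};
}
```

The point of the fallback is that the result no longer depends on the machine's default codepage: a rule file without a signature is read the same way on every platform.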

markusicu

lgtm great!!

echeran assigned markusicu Jun 5, 2023

echeran marked this pull request as draft June 5, 2023 22:52

echeran requested review from macchiati and pedberg-icu June 5, 2023 22:53

markusicu reviewed Jun 8, 2023

icu4c/source/common/localefallback_data.h (outdated)

echeran added a commit to echeran/icu that referenced this pull request Jul 11, 2023

ICU-22404 Unicode 15.1 beta data files & API constants

84decc8

See unicode-org#2492

echeran force-pushed the ICU-22404-pt1 branch from f61e9ca to 84decc8 Compare July 11, 2023 18:11

echeran marked this pull request as ready for review July 11, 2023 18:11

echeran added a commit to echeran/icu that referenced this pull request Jul 13, 2023

ICU-22404 Unicode 15.1 beta data files & API constants

f5c2981

See unicode-org#2492 Co-authored-by: Andy Heninger <andy.heninger@gmail.com> Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>

echeran force-pushed the ICU-22404-pt1 branch from dd8740f to f5c2981 Compare July 13, 2023 16:31

echeran added a commit to echeran/icu that referenced this pull request Jul 13, 2023

ICU-22404 Unicode 15.1 beta data files & API constants

7241822

See unicode-org#2492 Co-authored-by: Andy Heninger <andy.heninger@gmail.com> Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>

echeran force-pushed the ICU-22404-pt1 branch from f5c2981 to 7241822 Compare July 13, 2023 21:40

eggrobin reviewed Jul 13, 2023

icu4c/source/test/intltest/rbbiapts.cpp (outdated)

ICU-22404 Unicode 15.1 beta data files & API constants

4f507fa

See unicode-org#2492 Co-authored-by: Andy Heninger <andy.heninger@gmail.com> Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>

echeran force-pushed the ICU-22404-pt1 branch from 7241822 to 4f507fa Compare July 13, 2023 22:07

echeran requested a review from markusicu July 13, 2023 23:08

markusicu reviewed Jul 14, 2023

markusicu approved these changes Jul 14, 2023

echeran merged commit 2e45e6e into unicode-org:main Jul 14, 2023
101 checks passed

eggrobin mentioned this pull request Jul 24, 2023

ICU-22404 Improve documentation of segmentation rules #2532 (merged)

catamorphism pushed a commit to catamorphism/icu that referenced this pull request Nov 1, 2023

ICU-22404 Unicode 15.1 beta data files & API constants

5fe9981

See unicode-org#2492 Co-authored-by: Andy Heninger <andy.heninger@gmail.com> Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>
