Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22404 Unicode 15.1 beta data files & API constants #2492

Merged
merged 1 commit into from
Jul 14, 2023

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Jun 5, 2023

  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@echeran echeran marked this pull request as draft June 5, 2023 22:52
@markusicu
Copy link
Member

Hi @eggrobin this branch/PR has pretty much all of the Unicode 15.1 beta changes for ICU: New property values (though not yet new properties), updated data & test data, collation...

Could you please try to implement the BreakIterator rule changes and add commits to this branch?

The biggest change is the one for orthographic syllables:

There is also: Line breaking around « quotation marks »

Was there something else in this area?
(I think we don't need to change ICU for the updated grapheme break rules because ICU already implemented those from CLDR.)

FYI @FrankYFTang

@eggrobin
Copy link
Member

eggrobin commented Jul 7, 2023

Could you […] add commits to this branch?

I cannot, probably because I do not have write access either to this repository nor to @echeran’s. echeran#48 has the changes.

@markusicu
Copy link
Member

@echeran could you please rebase, and resolve the conflicts (jar files)? We should try to finish this soon.

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@echeran echeran marked this pull request as ready for review July 11, 2023 18:11
@eggrobin
Copy link
Member

eggrobin commented Jul 11, 2023

    [junit] Running com.ibm.icu.dev.test.rbbi.RBBIMonkeyTest
    [junit] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.226 sec
    [junit] TEST com.ibm.icu.dev.test.rbbi.RBBIMonkeyTest FAILED

… on my machine it worked.

I’ll look at this once ICU-TC lets me in on Thursday.

@markusicu
Copy link
Member

I’ll look at this once ICU-TC lets me in on Thursday.

You should not need to wait with debugging before we add you to the GitHub team.

@markusicu
Copy link
Member

@eggrobin it turned out that we just needed to refresh the Unicode 15.1 data in the ICU4J data jar file. Now, at least locally, we no longer see RBBI test failures. There are a few other failures that I think (hope) are not related to BreakIterator.

@eggrobin
Copy link
Member

There are a few other failures that I think (hope) are not related to BreakIterator.

The ants seem happier, but Windows still seems sad.

Interestingly, I am now unable to build this branch on Windows:

29>NMAKE : fatal error U1077: 'C:\Users\robin\Projects\Unicode\icu\icu4c\source\extra\uconv\..\..\..\bin64\pkgdata.EXE' : return code '0xc0000135'
30>NMAKE : fatal error U1077: 'C:\Users\robin\Projects\Unicode\icu\icu4c\bin64\pkgdata.EXE' : return code '0xc0000135'

I do not see that error on CI, but I see this, which I would naïvely think suggests that something went wrong when building:

D:\a\1\s\icu4c\source\test\intltest\rbbiapts.cpp:1054 Built rules and rebuilt rules are different line

(I have no idea where to look beyond that.)

@markusicu
Copy link
Member

@eggrobin @mihnita On a hunch, Elango and I fixed some charset encoding issues in C++ code, but most of the Windows builds still have rbbi test failures. For example, MSVC x64 Debug (VS2019) / Run x64 Debug Tests.

Linux & macOS pass. Java passes.
Windows MSVC on ARM passes.
Cygwin gcc fails, and the other Windows builds also fail.

Our best guess is still that it somehow has to do with Windows defaulting to the windows-1252 charset rather than UTF-8, and that maybe Windows-on-ARM and Robin's Windows machine are set to use UTF-8 as the default charset. But we apparently failed to stare down the problem in the code diffs. (If the charset is the problem, then somewhere a non-ASCII char * string is incorrectly assumed to be in UTF-8, or a non-ASCII char16_t * string or character is used where char * is needed.)

@rp9-next
@aheninger FYI

@markusicu
Copy link
Member

I do not see that error on CI, but I see this, which I would naïvely think suggests that something went wrong when building:

D:\a\1\s\icu4c\source\test\intltest\rbbiapts.cpp:1054 Built rules and rebuilt rules are different line

If this were indeed something about how the data files are built, then someone who has access to both a macOS or Linux machine, as well as a Windows machine with the windows-1252 default codepage, could check if the binary files (mostly .brk files) are the same.

@eggrobin
Copy link
Member

Windows MSVC on ARM passes.

That one doesn’t have the Run Tests step, which is the one that fails, so I think that is a red herring.

I still have not managed to build this branch on my Windows machines (obviously it used to build before the rebase, with the RBBI tests passing), but I also can’t seem to build main, so I suspect I may have some local configuration that still thinks it is 73; my local issues are probably unrelated to the CI issues.

@eggrobin
Copy link
Member

eggrobin commented Jul 13, 2023

my local issues are probably unrelated to the CI issues.

Figured that out, the tools in icu4c\bin64 do not get replaced by new builds nor removed by cleaning. Manually removing that directory fixed it. I should fix the configuration of the Custom Build Steps that copy from icu4c\source\tools\whatever\x64\Release to icu4c\bin64 one of these days.

Having managed to build this branch… on my machine(s) it works:

[All tests passed successfully...]
Elapsed Time: 00:00:14.878
-
-
-
============================================================
Summary: ICU in "C:\Users\robin\Projects\Unicode\icu\icu4c\source\allinone\"\..\..  arch=x64 type=Release
-
Tests Run    :  icuinfo intltest iotest cintltst
" - All Passed!"
[All tests passed successfully...]
Elapsed Time: 00:00:11.474
-
-
-
============================================================
Summary: ICU in "C:\Users\robin\Projects\Unicode\icu\icu4c\source\allinone\"\..\..  arch=x64 type=Release
-
Tests Run    :  icuinfo intltest iotest cintltst
" - All Passed!"

maybe […] Robin's Windows machine[s] are set to use UTF-8 as the default charset

They are.

as well as a Windows machine with the windows-1252 default codepage

Hm. I can try messing with that setting.

@eggrobin
Copy link
Member

Alright, setting my system to not-UTF-8, I can reproduce the failures:

Errors in total: 124.
            TestRoundtripRules
         RBBIAPITest
            TestUnicodeFiles
            TestMonkey
         RBBITest
            testMonkey
         RBBIMonkeyTest
      rbbi

--------------------------------------
Elapsed Time: 00:00:01.911

eggrobin@9c6cc1c (which you will have to cherry-pick since I cannot push commits to this branch) fixes the tests.

We should file a follow-up ticket to read the rule files as UTF-8; they are already non-ASCII UTF-8 interpreted as whichever system codepage (the comments are not ASCII).
The fix to genbrk would be easy, see below; but something in TestRoundtripRules seems to need to be fixed separately.

--- a/icu4c/source/tools/genbrk/genbrk.cpp
+++ b/icu4c/source/tools/genbrk/genbrk.cpp
@@ -234,8 +234,11 @@ int  main(int argc, char **argv) {
     if (U_FAILURE(status)) {
         exit(status);
     }
-    if(encoding!=nullptr ){
-        ruleSourceC  += signatureLength;
+    if (encoding == nullptr) {
+        // In the absence of a BOM, assume the rule file is in UTF-8.
+        encoding = "UTF-8";
+    } else {
+        ruleSourceC += signatureLength;
         ruleFileSize -= signatureLength;
     }

Having hopefully figured that one out, I will put my system back to UTF-8 before something breaks.

@mihnita
Copy link
Contributor

mihnita commented Jul 13, 2023

I've checked the icu4c/source/tools/genbrk/genbrk.cpp fix.
Confirm that it works, all tests pass, it does not look like TestRoundtripRules needs any fixing.

echeran added a commit to echeran/icu that referenced this pull request Jul 13, 2023
See unicode-org#2492

Co-authored-by: Andy Heninger <andy.heninger@gmail.com>
Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/tools/genbrk/genbrk.cpp is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@echeran
Copy link
Contributor Author

echeran commented Jul 13, 2023

I applied the patch that @eggrobin included in his previous comment to modify genbrk (the alternative to editing the break rule files by escaping non-ASCII characters).

After the change, TestRoundtripRules is the only remaining test that fails.

@eggrobin
Copy link
Member

eggrobin commented Jul 13, 2023

I think I have a fix (tested by once again setting my machine to codepage 1252 😩), but I still seem to be unable to push commits to this branch, so have another diff:

--- a/icu4c/source/test/intltest/rbbiapts.cpp
+++ b/icu4c/source/test/intltest/rbbiapts.cpp
@@ -1030,7 +1030,7 @@ void RBBIAPITest::RoundtripRule(const char *dataFile) {
     parseError.offset = 0;
     LocalUDataMemoryPointer data(udata_open(U_ICUDATA_BRKITR, "brk", dataFile, &status));
     uint32_t length;
-    const char *builtSource;
+    UnicodeString builtSource;
     const uint8_t *rbbiRules;
     const uint8_t *builtRules;

@@ -1040,12 +1040,13 @@ void RBBIAPITest::RoundtripRule(const char *dataFile) {
     }

     builtRules = (const uint8_t *)udata_getMemory(data.getAlias());
-    builtSource = (const char *)(builtRules + ((RBBIDataHeader*)builtRules)->fRuleSource);
+    builtSource = UnicodeString::fromUTF8(
+        (const char *)(builtRules + ((RBBIDataHeader *)builtRules)->fRuleSource));
     LocalPointer<RuleBasedBreakIterator> brkItr (new RuleBasedBreakIterator(builtSource, parseError, status));
     if (U_FAILURE(status)) {
         errln("%s:%d createRuleBasedBreakIterator: ICU Error \"%s\"  at line %d, column %d\n",
                 __FILE__, __LINE__, u_errorName(status), parseError.line, parseError.offset);
-        errln(UnicodeString(builtSource));
+        errln(builtSource);
         return;
     }
     rbbiRules = brkItr->getBinaryRules(length);

echeran added a commit to echeran/icu that referenced this pull request Jul 13, 2023
See unicode-org#2492

Co-authored-by: Andy Heninger <andy.heninger@gmail.com>
Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/rbbiapts.cpp is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Copy link
Member

Note that this PR fixes ICU-22039; I don’t know whether we have a scheme for marking a commit as related to multiple issues.

See unicode-org#2492

Co-authored-by: Andy Heninger <andy.heninger@gmail.com>
Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/rbbiapts.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@echeran echeran requested a review from markusicu July 13, 2023 23:08
@markusicu
Copy link
Member

"All checks have passed" -- super!! thanks everyone!!

@markusicu
Copy link
Member

markusicu commented Jul 13, 2023

Note that this PR fixes ICU-22039; I don’t know whether we have a scheme for marking a commit as related to multiple issues.

I just closed that with resolution "Fixed by Other Ticket" and a comment referring back here.

Comment on lines +237 to +239
if (encoding == nullptr) {
// In the absence of a BOM, assume the rule file is in UTF-8.
encoding = "UTF-8";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a change in behavior of this tool.
In a follow-up change, we should document (genbrk usage output, maybe User Guide) that genbrk looks for a Unicode signature byte sequence and otherwise assumes UTF-8.

https://unicode.org/faq/utf_bom.html
https://www.unicode.org/glossary/#unicode_signature

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm great!!

@echeran echeran merged commit 2e45e6e into unicode-org:main Jul 14, 2023
101 checks passed
catamorphism pushed a commit to catamorphism/icu that referenced this pull request Nov 1, 2023
See unicode-org#2492

Co-authored-by: Andy Heninger <andy.heninger@gmail.com>
Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants