ICU-22941 Revert ICU-22112, untailoring root word break #3249

eggrobin · 2024-10-21T13:43:30Z

This brings the colon back into MidLetter (with no tailoring on top of the UCD), instead of its inclusion in MidLetter being an fi & sv tailoring.

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22941
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

eggrobin · 2024-10-21T14:02:40Z

@markusicu, I am running into the same problem as in #3028 (comment): the documentation only tells me how to regenerate the old icudata.jar, not the ICU4J .brk files.

Can you perform the same thaumaturgy you did in 169023a? (At some point it would be good to update the documentation, too…)

eggrobin · 2024-10-25T17:42:24Z

@markusicu post-UTW poke

eggrobin · 2024-11-04T21:45:33Z

@markusicu in fcd04fc I tried copying over the .brk files that get generated when I rebuild on my machine; these don’t seem to work:

java.lang.AssertionError
	at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:574)
	at com.ibm.icu.impl.RBBIDataWrapper.get(RBBIDataWrapper.java:295)

But genbrk.cpp does not seem to have any options for output format, so I don’t understand how I can be generating brk files that work for ICU4C but not for ICU4J.

markusicu · 2024-11-05T03:20:10Z

I just fetched your branch and ran the rbbi tests in Eclipse. 66 tests, 2 failed. So 64 tests worked :-)

The code fails to load "brkitr/word_fi_sv.brk". ICUBinary.getData() tries to find it two ways but ends up returning null because the file is not there, and it doesn't throw an exception because the caller didn't ask for one. This is a bug --> ICU-22960

I see that you updated that .brk file, but I don't see why ICU can't find it :-(

markusicu · 2024-11-05T03:33:25Z

Oh, wait, you are deleting that file...

markusicu · 2024-11-05T03:47:30Z

I refreshed all of the ICU4J data on my Linux box. It still fails for me in Eclipse because it still tries to load the word_fi_sv file. I don't see where it still has that registered. Pushing my files to your branch in the hope that my Eclipse is just wedged...

markusicu · 2024-11-05T03:50:13Z

If this works, then I suspect that updating the res_index.res file did the trick.
The failing BreakIteratorTest.TestT5615() loops over all locales; if there is anything anywhere leftover that refers to the old file then we have a problem.

markusicu · 2024-11-05T04:01:56Z

I got it to work locally. You deleted the ICU4C brkitr/fi.txt & sv.txt files, but the repo still had the ICU4J .res versions. So when asked for Finnish word breaks, it found & loaded fi.res, which referred to the deleted word_fi_sv.brk file.

Hopefully this is it.
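
Illustration of the failure mode described above: a per-locale resource that survives the deletion of the binary file it names. This is only a minimal sketch, not ICU code. The class `DanglingReferenceCheck`, its method, and the in-memory map standing in for the contents of the .res files are all hypothetical; only the file names fi.res, sv.res, word.brk and word_fi_sv.brk come from this thread, and `root.res` is assumed for contrast.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of ICU: reports resources that name a file
// which is no longer present in the data directory.
class DanglingReferenceCheck {
    // references: resource file -> break rules file it points at (assumed model).
    // presentFiles: the files that actually exist.
    static List<String> findDangling(Map<String, String> references,
                                     Set<String> presentFiles) {
        List<String> dangling = new ArrayList<>();
        // TreeMap gives a stable, sorted report order.
        for (Map.Entry<String, String> e : new TreeMap<>(references).entrySet()) {
            if (!presentFiles.contains(e.getValue())) {
                dangling.add(e.getKey() + " -> " + e.getValue());
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        // State of the branch before the stale .res files were removed
        // (root.res is an assumed name used only for contrast).
        Map<String, String> references = Map.of(
                "root.res", "word.brk",
                "fi.res", "word_fi_sv.brk",
                "sv.res", "word_fi_sv.brk");
        Set<String> presentFiles = Set.of("word.brk");
        System.out.println(findDangling(references, presentFiles));
    }
}
```

A check of this kind would flag fi.res and sv.res as soon as word_fi_sv.brk is deleted, rather than leaving the problem to surface in a test that loops over all locales.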

markusicu · 2024-11-05T05:53:44Z

Bingo! 🎉
Please squash all but the first commit before you get ready to merge.

jira-pull-request-webhook · 2024-11-05T13:57:22Z

Notice: the branch changed across the force-push!

icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/res_index.res is no longer changed in the branch
icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word_POSIX.brk is different
icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word.brk is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin · 2024-11-05T15:25:32Z

Thanks a lot. I squashed it all, dropping ca87360 (but keeping 081efef) to see if the brk files I generated were fine after all—it looks like it works, so I shouldn’t need to ask you to turn the crank next time.

eggrobin · 2024-11-05T17:18:36Z

@markusicu As discussed over virtual tea, you might still want to flip the bytes.

markusicu · 2024-11-05T20:06:22Z

I regenerated the data and pushed the updated files. When the tests pass, please squash again.

Explanation for others: We want to keep the Java data in big-endian format, so that different people generating the data don't flip-flop on no-op data changes, and wonder why what they are doing affects BreakIterator data files.
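
To make the "no-op data change" concrete, here is a small sketch using only `java.nio` (nothing ICU-specific, and the class name `EndianDemo` is made up for this illustration). It shows that the same 32-bit value serializes to different bytes depending on byte order, so a file regenerated in the other byte order differs byte for byte even though it encodes identical data.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustration only: same value, two byte orders, different bytes on disk.
class EndianDemo {
    // Serialize a 32-bit int in the given byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Read a 32-bit int back, interpreting the bytes in the given order.
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        int value = 0x12345678;
        byte[] big = encode(value, ByteOrder.BIG_ENDIAN);
        byte[] little = encode(value, ByteOrder.LITTLE_ENDIAN);

        // The byte sequences differ, so a binary diff reports a change...
        System.out.println("bytes equal:  " + Arrays.equals(big, little));
        // ...although both decode to the same value when read in their own order.
        System.out.println("values equal: "
                + (decode(big, ByteOrder.BIG_ENDIAN)
                   == decode(little, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

Committing the files in a single agreed byte order keeps such spurious diffs out of the repository history.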


jira-pull-request-webhook · 2024-11-05T21:15:40Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin · 2024-11-05T21:15:54Z

> please squash again

Squished.

eggrobin requested a review from markusicu October 21, 2024 14:02

eggrobin assigned markusicu Oct 21, 2024

eggrobin force-pushed the intestine branch from 081efef to d0813c0 Compare November 5, 2024 13:57

ICU-22941 Revert "ICU-22112 word break updates for @,colon; colon tailorings for fi,sv" (646c5c8)

This reverts commit 49d192f.

eggrobin force-pushed the intestine branch from 2a47c2c to 646c5c8 Compare November 5, 2024 21:15

markusicu approved these changes Nov 5, 2024


eggrobin merged commit 8d86ca1 into unicode-org:main Nov 5, 2024
101 checks passed

eggrobin mentioned this pull request Nov 19, 2024

ICU-22127 Remove obsolete WordBreakTest.txt known issues #3271

Merged
