Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22941 Revert ICU-22112, untailoring root word break #3249

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

eggrobin
Copy link
Member

This brings the colon back into MidLetter (with no tailoring on top of the UCD), instead of its inclusion in MidLetter being an fi & sv tailoring.

Checklist

  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22941
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@eggrobin
Copy link
Member Author

@markusicu, I am running into the same problem as in #3028 (comment): the documentation only tells me how to regenerate the old icudata.jar, not the ICU4J .brk files.

Can you perform the same thaumaturgy you did in 169023a? (At some point it would be good to update the documentation, too…)

@eggrobin
Copy link
Member Author

@markusicu post-UTW poke

@eggrobin
Copy link
Member Author

eggrobin commented Nov 4, 2024

@markusicu in fcd04fc I tried copying over the .brk files that get generated when I rebuild on my machine; these don’t seem to work:

java.lang.AssertionError
	at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:574)
	at com.ibm.icu.impl.RBBIDataWrapper.get(RBBIDataWrapper.java:295)

But genbrk.cpp does not seem to have any options for output format, so I don’t understand how I can be generating brk files that work for ICU4C but not for ICU4J.

@markusicu
Copy link
Member

I just fetched your branch and ran the rbbi tests in Eclipse. 66 tests, 2 failed. So 64 tests worked :-)

The code fails to load "brkitr/word_fi_sv.brk". ICUBinary.getData() tries to find it two ways but ends up returning null because it's not there, and it doesn't throw an exception because the caller didn't ask for it. This is a bug --> ICU-22960

I see that you updated that .brk file, but I don't see why ICU can't find it :-(

@markusicu
Copy link
Member

Oh, wait, you are deleting that file...

@markusicu
Copy link
Member

I refreshed all of the ICU4J data on my Linux box. It still fails for me in Eclipse because it still tries to load the word_fi_sv file. I don't see where it still has that registered. Pushing my files to your branch in the hope that my Eclipse is just wedged...

@markusicu
Copy link
Member

If this works, then I suspect that updating the res_index.res file did the trick.
The failing BreakIteratorTest.TestT5615() loops over all locales; if there is anything anywhere leftover that refers to the old file then we have a problem.

@markusicu
Copy link
Member

I got it to work locally. You deleted the ICU4C brkitr/fi.txt & sv.txt files, but the repo still had the ICU4J .res versions. So when asked for Finnish word breaks, it found & loaded fi.res which referred to the deleted word_fi_sv.res file.

Hopefully this is it.

@markusicu
Copy link
Member

Bingo! 🎉
Please squash all but the first commit before you get ready to merge.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/res_index.res is no longer changed in the branch
  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word_POSIX.brk is different
  • icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr/word.brk is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Copy link
Member Author

eggrobin commented Nov 5, 2024

Thanks a lot. I squashed it all, dropping ca87360 (but keeping 081efef) to see if the brk files I generated were fine after all—it looks like it works, so I shouldn’t need to ask you to turn the crank next time.

@eggrobin
Copy link
Member Author

eggrobin commented Nov 5, 2024

@markusicu As discussed over virtual tea, you might still want to flip the bytes.

@markusicu
Copy link
Member

I regenerated the data and pushed the updated files. When the tests pass, please squash again.

Explanation for others: We want to keep the Java data in big-endian format, so that different people generating the data don't flip-flop on no-op data changes, and wonder why what they are doing affects BreakIterator data files.

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Copy link
Member Author

eggrobin commented Nov 5, 2024

please squash again

Squished.

@eggrobin eggrobin merged commit 8d86ca1 into unicode-org:main Nov 5, 2024
101 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants