Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Unicode 15.0 line breaking #4389

Merged
merged 31 commits into from
Dec 1, 2023
Merged

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented Nov 30, 2023

The current implementation was attempting the LB25 tailoring recommended in Example 7 of Section 8.2 in UAX14 version 15.0; however, this requires more than one code point of lookahead* because of (PR | PO) × ( OP | HY )? NU, which the current implementation of the line segmenter cannot do. Instead this pull request goes back to the untailored LB25 from Unicode 15.0.

The implementation was tested with two million test cases; I last encountered a failure somewhere in the nine thousands. I should probably do an overnight run. Only 200 test cases are included here; as usual, anyone working on the rules should try very long monkey test runs.

This fixes #4146.


* This will be needed for 15.1 line segmentation too. While we have that capability in the other segmenters, used in the sentence segmenter (the relevant rules are called intermediate match rules or interm(ediate) break states in this implementation), straightforwardly reusing that code would run into into issues as we have so many states in line breaking that we cannot dedicate a whole bit to that property of the state. This can probably be worked around (as far as I can tell we use the sign bit for a property of two special states, so we could probably be a bit more sparing), but will come later.

@eggrobin eggrobin changed the title Fix 15.0 line breaking Fix Unicode 15.0 line breaking Nov 30, 2023
@eggrobin
Copy link
Member Author

The tests are failing because LineBreakTest.txt tests for LB25 as tailored by Example 7, not for the untailored version.
I know how that particular sausage is made, so I’ll head over to unicode-org/unicodetools to generate a LineBreakTest.txt for the untailored line breaking algorithm…

@eggrobin eggrobin marked this pull request as ready for review December 1, 2023 12:45
Comment on lines +16 to +17
# https://github.com/eggrobin/icu/tree/export-monkeys-15.0-untailor-lb
# (specifically, at ea318304775e0a5194785f51120672aafac7b2bd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please get this change checked in. I'm not able to review the raw rule changes, so the test passing is all I can go by to confirm that they are correct. We also want to be able to regenerate this in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot check this in, because it is in the past (forked from Unicode 15.0), and because removing the tailoring is not something we want to do in ICU.

Once we get to parity with ICU, this will be generated from the real ICU at its release tag; but I do not want to go from where we are to 15.1 with all ICU tailorings in one giant step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course the change that makes it possible to export the monkey tests will be checked in, as part of unicode-org/icu#2637 (whose reviewers I really need to poke); but that will be on top of 15.1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tailorings could be behind a flag, and it could be on the 73 maintenance branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly that is not really feasible, as discussed.

# LineBreakTest-15.0.0.txt
# Date: 2022-02-26, 00:38:39 GMT
# © 2022 Unicode®, Inc.
# THIS IS NOT LineBreakTest-15.0.0.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these manual changes to the test file? Please document its provenance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented (though you won’t like it :-p).

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏳️‍🌈🏳️‍🌈🏳️‍🌈

@eggrobin eggrobin merged commit e080ecd into unicode-org:main Dec 1, 2023
29 checks passed
eggrobin added a commit that referenced this pull request Dec 1, 2023
…g and unused (#4400)

Something should still be done about #1637 eventually, but since the
higher states have been renumbered by #4389, let’s not leave them around
for someone to trip over. Since we have not yet migrated those
properties to 15.1, the states that correspond to LB property values
have not changed.
@robertbastian
Copy link
Member

this is changelog worthy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

icu_segmenter::LineSegmenter incorrectly applies rule LB8a
2 participants