Fix Unicode 15.0 line breaking #4389

eggrobin · 2023-11-30T15:38:00Z

The current implementation attempted the LB25 tailoring recommended in Example 7 of Section 8.2 of UAX #14 version 15.0. However, that tailoring requires more than one code point of lookahead* because of the rule (PR | PO) × ( OP | HY )? NU, and the current implementation of the line segmenter cannot look ahead that far. This pull request therefore goes back to the untailored LB25 from Unicode 15.0.
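To make the lookahead problem concrete, here is a minimal sketch in Rust. It is not the ICU4X implementation: the `Lb` enum and both function names are hypothetical, only the part of LB25 that starts with PR or PO is modelled, and every other line breaking rule (including combining mark and ZWJ handling) is ignored. The untailored pairs can be decided from the two adjacent classes, whereas the tailored rule has to inspect the class after an OP or HY.

```rust
/// A tiny, hypothetical subset of the UAX #14 line break classes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Lb {
    Pr,
    Po,
    Op,
    Hy,
    Nu,
    Al,
}

/// Untailored LB25 (Unicode 15.0), restricted to the pairs PR × OP,
/// PR × NU, PO × OP and PO × NU. Only the first class after the
/// candidate break position is inspected.
fn untailored_forbids_break(before: Lb, after: &[Lb]) -> bool {
    matches!(
        (before, after.first()),
        (Lb::Pr | Lb::Po, Some(Lb::Op | Lb::Nu))
    )
}

/// Tailored rule (PR | PO) × ( OP | HY )? NU. When the next class is
/// OP or HY, the decision depends on the class after it, so one class
/// of lookahead is not enough.
fn tailored_forbids_break(before: Lb, after: &[Lb]) -> bool {
    if !matches!(before, Lb::Pr | Lb::Po) {
        return false;
    }
    match after {
        // (PR | PO) × NU
        [Lb::Nu, ..] => true,
        // (PR | PO) × (OP | HY) NU: needs the second class after the break.
        [Lb::Op | Lb::Hy, Lb::Nu, ..] => true,
        _ => false,
    }
}
```

For the sequence PR OP AL, `untailored_forbids_break(Lb::Pr, &[Lb::Op, Lb::Al])` is true, while `tailored_forbids_break` returns false for the same input and true for PR OP NU. The two sequences only differ at the second class after the candidate break.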

The implementation was tested with two million test cases; I last encountered a failure somewhere in the nine thousands. I should probably do an overnight run. Only 200 test cases are included here; as usual, anyone working on the rules should try very long monkey test runs.

This fixes #4146.

—
* This will be needed for 15.1 line segmentation too. We already have that capability in the other segmenters, where it is used by the sentence segmenter (the relevant rules are called intermediate match rules or interm(ediate) break states in this implementation). However, straightforwardly reusing that code would run into issues, as we have so many states in line breaking that we cannot dedicate a whole bit to that property of the state. This can probably be worked around (as far as I can tell we use the sign bit for a property of two special states, so we could probably be a bit more sparing), but that will come later.
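The bit budget problem can be illustrated with a short sketch. Everything here is an assumption made for illustration: the 8-bit signed entry, the `INTERMEDIATE_FLAG` constant, and the `pack`/`unpack` helpers are hypothetical and do not describe the actual table layout. The point is only that each bit reserved for a flag halves the number of state indices that fit in an entry.

```rust
/// Hypothetical flag bit marking an "intermediate match" entry.
const INTERMEDIATE_FLAG: i8 = 0x40;

/// Number of state indices that fit in a non-negative `i8` entry
/// (7 value bits) once `flag_bits` of those bits are reserved for flags.
fn max_states(flag_bits: u32) -> u32 {
    1 << (7 - flag_bits)
}

/// Packs a state index and the flag into one entry, or returns `None`
/// if the index does not fit next to the flag bit.
fn pack(state: u8, intermediate: bool) -> Option<i8> {
    if u32::from(state) >= max_states(1) {
        return None;
    }
    let base = state as i8;
    Some(if intermediate { base | INTERMEDIATE_FLAG } else { base })
}

/// Splits a non-negative entry back into (state index, flag).
fn unpack(entry: i8) -> (u8, bool) {
    (
        (entry & !INTERMEDIATE_FLAG) as u8,
        entry & INTERMEDIATE_FLAG != 0,
    )
}
```

Under these assumptions, `max_states(0)` is 128 and `max_states(1)` is 64, so a rule set with more than 64 states cannot afford a dedicated flag bit: `pack(64, false)` returns `None`.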

eggrobin · 2023-11-30T15:59:51Z

The tests are failing because LineBreakTest.txt tests for LB25 as tailored by Example 7, not for the untailored version.
I know how that particular sausage is made, so I’ll head over to unicode-org/unicodetools to generate a LineBreakTest.txt for the untailored line breaking algorithm…

robertbastian · 2023-12-01T13:19:33Z

components/segmenter/tests/testdata/LineBreakExtraTest.txt

+# https://github.com/eggrobin/icu/tree/export-monkeys-15.0-untailor-lb
+# (specifically, at ea318304775e0a5194785f51120672aafac7b2bd)


Please get this change checked in. I'm not able to review the raw rule changes, so the test passing is all I can go by to confirm that they are correct. We also want to be able to regenerate this in the future.

We cannot check this in, because it is in the past (forked from Unicode 15.0), and because removing the tailoring is not something we want to do in ICU.

Once we get to parity with ICU, this will be generated from the real ICU at its release tag; but I do not want to go from where we are to 15.1 with all ICU tailorings in one giant step.

Of course the change that makes it possible to export the monkey tests will be checked in, as part of unicode-org/icu#2637 (whose reviewers I really need to poke); but that will be on top of 15.1.

The tailorings could be behind a flag, and it could be on the 73 maintenance branch?

Sadly that is not really feasible, as discussed.

robertbastian · 2023-12-01T13:21:29Z

components/segmenter/tests/testdata/LineBreakTest.txt

-# LineBreakTest-15.0.0.txt
-# Date: 2022-02-26, 00:38:39 GMT
-# © 2022 Unicode®, Inc.
+# THIS IS NOT LineBreakTest-15.0.0.txt


Are these manual changes to the test file? Please document its provenance.

Documented (though you won’t like it :-p).

robertbastian

🏳️‍🌈🏳️‍🌈🏳️‍🌈

…g and unused (#4400) Something should still be done about #1637 eventually, but since the higher states have been renumbered by #4389, let’s not leave them around for someone to trip over. Since we have not yet migrated those properties to 15.1, the states that correspond to LB property values have not changed.

robertbastian · 2023-12-12T10:50:46Z

this is changelog worthy

#4389

eggrobin added 25 commits November 21, 2023 14:06

traces

88c80b4

The first monkey passes

cb3307f

More traces

cf91a15

trace the right thing

48eb682

Maybe we don’t need that LB25_HY state?

ec17694

What a complete mess

6592f33

This is going to be tedious, isn’t it

be70044

LB8a

6b6e173

Keep hammering at the ZWJ CM case

7bfd662

Surprisingly this moves me slightly further into the pile of tests

6f3cd7c

Onto the next test.

93f7186

Back to completely untailored LB25, the recommended tailoring needs lookahead

c1a12d8

Any and then some

147bf00

handle ZWJ for OP in extended context

1982d90

RI_RI_ZWJ

a9b2e85

HL_ZWJ in extended context in LB21

3fb0288

HL HY CM, more left Any for failure on test 387

ca04538

ID_CN_ZWJ for test case 2077 😭

6bcb745

Handle ZWJ ZWJ, pushing the failure to test case 3556

ff737aa

Push the failure to 5441

2056364

Now fails on 6437

055f1eb

9227...

a7e012c

twenty kilotests passing.

23df846

Remove traces

dd7471c

Check in a few tests

6616daf

eggrobin changed the title ~~Fix 15.0 line breaking~~ Fix Unicode 15.0 line breaking Nov 30, 2023

eggrobin added 3 commits November 30, 2023 17:47

Untailor LineBreakTest.txt

31e23d4

An attempt at reducing spurious changes in that untailored LineBreakTest

d985de2

Try to remove even more spurious diffs

184540a

cargo make testdata

442e00c

eggrobin marked this pull request as ready for review December 1, 2023 12:45

eggrobin requested review from sffc, robertbastian, Manishearth, aethanyc, makotokato and a team as code owners December 1, 2023 12:45

robertbastian reviewed Dec 1, 2023

eggrobin added 2 commits December 1, 2023 14:34

Document how the sausage was made

ea5ddc7

doc test for unicode-org#4146

8a1cb70

eggrobin requested a review from robertbastian December 1, 2023 14:34

robertbastian approved these changes Dec 1, 2023

eggrobin merged commit e080ecd into unicode-org:main Dec 1, 2023
29 checks passed

eggrobin mentioned this pull request Dec 1, 2023

Remove the hardcoded line breaking state constants that are both wrong and unused #4400

Merged

eggrobin mentioned this pull request Dec 4, 2023

Using an enum for rule break state #4401

Merged

sffc added a commit that referenced this pull request Dec 27, 2023

Update CHANGELOG.md

3a20e22

#4389

sffc mentioned this pull request Dec 27, 2023

Update CHANGELOG.md #4499

Merged

eggrobin mentioned this pull request Jan 15, 2024

Linebreak generated before CL #4523

Closed

YDX-2147483647 mentioned this pull request Jan 16, 2024

Chinese punctuation is placed at the beginning of the line in some cases typst/typst#3082

Closed

1 task

eggrobin mentioned this pull request Jan 24, 2024

Don't break word by MinNumLet with Extend. #4550

Merged

eggrobin mentioned this pull request Jun 6, 2024

Word segmentation is incorrect #5015

Open

makotokato mentioned this pull request Jul 10, 2024

Support Unicode 15.1 for line segmenter #5218

Merged
