FYI: Proposals for changes to rules Unicode Standard Annexes 14 and 29 #3255

Closed
eggrobin opened this issue Apr 4, 2023 · 8 comments
Labels
blocked A dependency must be resolved before this is actionable C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@eggrobin
Member

eggrobin commented Apr 4, 2023

The Properties and Algorithms Group plans to recommend the following proposals to Unicode Technical Committee #‌175 later this month. If they are accepted, the changes would be published as part of Unicode Version 15.1, in September.

UAX #‌14:

  • L2/23-063, Line breaking around quotation marks.
  • L2/23-072, Proposed changes for line breaking on orthographic syllables.
    • Note that this involves new property values for the Line_Break property.

UAX #‌29:

  • (No proposal paper, this will be part of L2/23-079.) Upstream the CLDR root tailoring for grapheme clusters, that is, add a new rule GB9c LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant, where:
    • Virama=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Virama}]
    • LinkingConsonant=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Consonant}]
    • ExtCccZwj=[[\p{gcb=Extend}-\p{ccc=0}] \p{gcb=ZWJ}]
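To make the shape of the proposed GB9c rule concrete, here is a minimal sketch in Rust. It assumes code points have already been classified; the `Class` enum and the function `gb9c_no_break_before` are hypothetical names invented for this illustration and are not part of ICU4X or CLDR. As a simplification, a virama is always given the class `Virama`, even though real viramas also fall in ExtCccZwj.

```rust
/// Hypothetical, simplified classes for illustrating GB9c.
/// Assumption: a virama is always classified as `Virama` here,
/// even though real viramas are also members of ExtCccZwj.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Class {
    LinkingConsonant,
    Virama,
    ExtCccZwj,
    Other,
}

/// Returns true if the rule
/// `LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant`
/// forbids a grapheme cluster break immediately before `classes[i]`.
fn gb9c_no_break_before(classes: &[Class], i: usize) -> bool {
    if i == 0 || i >= classes.len() || classes[i] != Class::LinkingConsonant {
        return false;
    }
    // Walk left over the run of ExtCccZwj and Virama, remembering
    // whether at least one Virama was seen.
    let mut seen_virama = false;
    let mut j = i;
    while j > 0 {
        j -= 1;
        match classes[j] {
            Class::Virama => seen_virama = true,
            Class::ExtCccZwj => {}
            // The run must start at a LinkingConsonant and contain a Virama.
            Class::LinkingConsonant => return seen_virama,
            Class::Other => return false,
        }
    }
    false
}
```

With this sketch, a consonant, virama, consonant sequence such as Devanagari KA, VIRAMA, SSA (classified by hand) reports no break before the second consonant, whereas replacing the virama with some other ExtCccZwj mark does not trigger the rule.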
@eggrobin eggrobin added the C-segmentation Component: Segmentation label Apr 4, 2023
@sffc
Member

sffc commented Apr 4, 2023

@makotokato @aethanyc

@sffc sffc added S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality labels Apr 4, 2023
@sffc
Member

sffc commented Apr 20, 2023

@aethanyc or @makotokato can you take this issue? Probably for 1.x Priority.

@sffc
Member

sffc commented May 11, 2023

Discussion: Longer term, we would like it if the upstreamed TOML files would be updated along with the specification, so that ICU4X does not need to do anything more than pulling in updates from upstream.

@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone May 11, 2023
@sffc sffc added the blocked A dependency must be resolved before this is actionable label May 11, 2023
@eggrobin
Member Author

eggrobin commented May 17, 2023

Looking at the TOML files, my impression is that they define a state machine that transitions on each code point: a [[tables]] record defines a transition from its left state to its name state when the next code point has the class right. The breaks at each step are then determined by the [[rules]] whose left state matches the current state and whose right class matches the next code point, that is, with one code point of lookahead.
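The model described in this comment can be sketched roughly as follows. This is a toy under stated assumptions, not the ICU4X implementation: `Machine`, `toy_machine`, and the field names are hypothetical, states and classes are plain strings, and the only thing it tries to capture is "transition on (left state, right class), decide the break from (left state, right class)".

```rust
use std::collections::{HashMap, HashSet};

type State = &'static str;
type Class = &'static str;

/// Hypothetical stand-in for the data in the TOML files.
struct Machine {
    /// (left state, class of next code point) -> resulting state,
    /// loosely mirroring a [[tables]] record (left, right, name).
    transitions: HashMap<(State, Class), State>,
    /// (left state, class of next code point) pairs with no break between
    /// them, loosely mirroring [[rules]] records.
    no_break: HashSet<(State, Class)>,
}

impl Machine {
    /// Returns every index i such that a break is allowed before classes[i].
    fn breaks(&self, classes: &[Class]) -> Vec<usize> {
        let mut out = Vec::new();
        let Some(&first) = classes.first() else {
            return out;
        };
        let mut state: State = first;
        for (i, &right) in classes.iter().enumerate().skip(1) {
            // The decision only sees the current state and ONE class ahead.
            if !self.no_break.contains(&(state, right)) {
                out.push(i);
            }
            // Without an explicit transition, the state becomes the class itself.
            state = *self.transitions.get(&(state, right)).unwrap_or(&right);
        }
        out
    }
}

/// A toy table: AL absorbs a following CM, and AL AL is unbreakable.
fn toy_machine() -> Machine {
    let mut transitions = HashMap::new();
    transitions.insert(("AL", "CM"), "AL");
    let mut no_break = HashSet::new();
    no_break.insert(("AL", "AL"));
    no_break.insert(("AL", "CM"));
    Machine { transitions, no_break }
}
```

In this toy, `toy_machine().breaks(&["AL", "CM", "AL", "ID"])` only permits a break before the final `ID`, because the `AL CM` transition returns to the `AL` state. The point is that each decision consults a single class to the right of the candidate break.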

The following new line breaking rules require more lookahead than that:

  • × [\p{Pf}&QU] ( SP | GL | WJ | CL | QU | CP | EX | IS | SY | BK | CR | LF | NL | eot)
  • (AK | ◌ | AS) × (AK | ◌ | AS) VF

These require looking at two code points to the right of the (non-)break, plus any intervening CM (since these are after LB9).
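To illustrate why one class of lookahead is not enough, here is a minimal sketch of the second rule, `(AK | ◌ | AS) × (AK | ◌ | AS) VF`, written directly in Rust. The `Lb` enum and both functions are hypothetical names for this example; the handling of CM is a simplification of LB9 (it ignores ZWJ and the start-of-text case).

```rust
/// Hypothetical, reduced set of line break classes for this example.
/// `DottedCircle` stands for U+25CC, written ◌ in the rule.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Lb {
    AK,
    DottedCircle,
    AS,
    VF,
    CM,
    Other,
}

fn is_aksara_base(c: Lb) -> bool {
    matches!(c, Lb::AK | Lb::DottedCircle | Lb::AS)
}

/// Returns true if `(AK | ◌ | AS) × (AK | ◌ | AS) VF` forbids a break
/// before `classes[i]`. Simplifying assumption: CM is skipped on both
/// sides, as a stand-in for LB9.
fn no_break_before(classes: &[Lb], i: usize) -> bool {
    if i == 0 || i >= classes.len() {
        return false;
    }
    // Left context: nearest non-CM class before the candidate break.
    let left = classes[..i].iter().rev().copied().find(|&c| c != Lb::CM);
    // Second class of lookahead: first non-CM class after classes[i].
    let second = classes[i + 1..].iter().copied().find(|&c| c != Lb::CM);
    matches!(left, Some(c) if is_aksara_base(c))
        && is_aksara_base(classes[i])
        && second == Some(Lb::VF)
}
```

Note that the answer for position `i` depends on `classes[i]` and on the next non-CM class after it, which is exactly the second code point of lookahead that a (left state, right class) rule cannot see.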

@hsivonen
Member

Gecko bug

@eggrobin
Member Author

Henri, this is interesting.

In your comment you correctly identified what LB15a and LB15b are trying to do, and why they need to do that (instead of treating Pi as LB=OP and Pf as LB=CL: that would mess with German, Finnish, etc. usage of Pf initially or Pi finally).

However, these new rules do not help with the Chinese issue at hand, since there are no spaces (there may visually appear to be space, but that is because U+2018 etc. have ambiguous width; here they are wide). This has recently come to the attention of the Properties and Algorithms Group of the UTC; it may be possible to do something about it in the ID QU ID case.
I will mention that issue in that discussion. Nothing will happen on that front before Unicode 16.0 in September 2024 though.

@aethanyc
Contributor

We still need to update the line segmenter to Unicode 15.1; @makotokato is working on it.

@aethanyc aethanyc assigned makotokato and unassigned aethanyc May 17, 2024
@eggrobin
Member Author

eggrobin commented Jun 4, 2024

I am experimenting with moving LB8a and LB9 into the code of the line segmenter, as

  1. the combination of these rules makes the state table extraordinarily painful to maintain (and it makes it large), as every state needs to be replicated: X ZWJ is different from X for most X since there is no break after ZWJ per LB8a, but X ZWJ CM brings you back to the X state, so the X ZWJ states cannot be merged;
  2. these rules cannot be tailored (so there is no reason to allow for custom data to change their behaviour), and are in practice reasonably stable: they last changed in Unicode 11 (2018), following up on some earlier Unicode 9 (2016) changes for emoji ZWJ sequences; contrast the other rules that have been changing wildly every year.
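One way to picture handling LB8a and LB9 in code rather than in the state table is the sketch below. It is only an illustration of the idea under simplifying assumptions: the `Lb` enum is a reduced, hypothetical set of classes, `table_allows_break` is a hypothetical stand-in for the table lookup, and the actual change in the line segmenter may be structured quite differently.

```rust
/// Hypothetical, reduced set of line break classes for this example.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Lb {
    AL,
    SP,
    ID,
    CM,
    ZWJ,
    BK,
}

/// Returns every index i such that a break is allowed before classes[i].
/// LB8a and LB9 (with LB10) are applied here, so the table only ever
/// sees base classes and needs no X_ZWJ states.
fn breaks(classes: &[Lb], table_allows_break: impl Fn(Lb, Lb) -> bool) -> Vec<usize> {
    let mut out = Vec::new();
    let mut left: Option<Lb> = None; // effective class to the left
    let mut after_zwj = false; // was the previous code point a ZWJ?
    for (i, &c) in classes.iter().enumerate() {
        let is_mark = matches!(c, Lb::CM | Lb::ZWJ);
        // LB9: X (CM | ZWJ)* is treated as X, unless X is SP, BK, etc.
        let attaches = is_mark && matches!(left, Some(l) if !matches!(l, Lb::SP | Lb::BK));
        if !attaches {
            // LB10: a CM or ZWJ with nothing to attach to acts as AL.
            let right = if is_mark { Lb::AL } else { c };
            if let Some(l) = left {
                // LB8a: never break after ZWJ, whatever the table says.
                if !after_zwj && table_allows_break(l, right) {
                    out.push(i);
                }
            }
            left = Some(right);
        }
        after_zwj = c == Lb::ZWJ;
    }
    out
}
```

With a toy table that allows a break before `ID` and after `SP` or `ID` (but never before `SP`), the sequence `AL ZWJ ID CM SP AL` only breaks before the final `AL`, while `AL ID CM SP AL` also breaks before `ID`. The table itself never needs to know about ZWJ or CM.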

eggrobin added a commit that referenced this issue Jun 6, 2024
Hopefully no functional change.

Last time I attempted to look at Unicode 15.1 line breaking, that was
made impractical by the need, for every new state X, to add an X_ZWJ
state, transitions X CM → X, X ZWJ → X_ZWJ, X_ZWJ CM → X, as well as
X_ZWJ Y → Z for every transition X Y → Z, and to add or update rules to
prevent breaks after X_ZWJ.

Hopefully this will make that upgrade a little more tractable.
(Incidentally it makes the state table a bit smaller.)

Tested with 200 000 monkeys (recall that only 200 are checked in).

Related to #3255; see my comment there for the rationale.

Aside: While looking at this, it came to my attention that the
`LineBreakStrictness::Anywhere` option does not do what the standard
says, cf.
https://drafts.csswg.org/css-text-3/#valdef-line-break-anywhere and
https://drafts.csswg.org/css-text-3/#typographic-character-unit
referenced therein. Of course, we _do_ have a correct implementation of
`line-break: anywhere`, since we have a grapheme cluster segmenter.

5 participants