icu_segmenter::LineSegmenter incorrectly applies rule LB8a #4146

Closed
tingerrr opened this issue Oct 12, 2023 · 2 comments · Fixed by #4389
Labels: C-segmentation (Component: Segmentation), T-bug (Type: Bad behavior, security, privacy)

tingerrr commented Oct 12, 2023

The following code using icu_segmenter = "1.3.2" returns 3 breaks:

  • LineBreak::Unknown at 0
  • LineBreak::ZWJ at 10
  • LineBreak::Ideographic at 14

use icu_segmenter::LineSegmenter;

fn main() {
    assert_eq!(
        vec![0, 10, 14],
        LineSegmenter::new_auto()
            .segment_str("🏳️‍🌈")
            .collect::<Vec<_>>()
    );
}

According to the documentation, the segmenter should only return LB3 and LB7, not LB8a (LineBreak::ZWJ).

sffc commented Oct 12, 2023

@sffc sffc added the C-segmentation Component: Segmentation label Oct 12, 2023
eggrobin commented

I see what’s going on: the way the state machine works effectively means that we apply LB9 before LB8a.
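
To illustrate why rule order matters here, the following is a toy sketch only, not the ICU4X state machine. It assumes a three-class model (`Id`, `Cm`, `Zwj`) in which every character other than U+FE0F and U+200D is treated as ID; `Class`, `classify`, `toy_breaks`, and the `honor_lb8a` flag are hypothetical names introduced for this example. When the joiner is merely absorbed into the preceding base character (LB9) and the "no break after ZWJ" check (LB8a) is never consulted, the two emoji look like ID followed by ID, and a break is reported between them.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Id,  // ideographic; in this toy model, everything that is not CM or ZWJ
    Cm,  // combining mark, e.g. U+FE0F VARIATION SELECTOR-16
    Zwj, // U+200D ZERO WIDTH JOINER
}

// Hypothetical classifier: a drastic simplification of the real line break classes.
fn classify(c: char) -> Class {
    match c {
        '\u{200D}' => Class::Zwj,
        '\u{FE0F}' => Class::Cm,
        _ => Class::Id,
    }
}

// Returns byte offsets of break opportunities, including 0 and the end of the
// string, mirroring the shape of the output quoted in the issue.
// With `honor_lb8a == false`, only the LB9-style absorption is applied, so the
// ZWJ no longer protects the character that follows it.
fn toy_breaks(s: &str, honor_lb8a: bool) -> Vec<usize> {
    let mut breaks = vec![0];
    let mut prev_raw: Option<Class> = None;
    for (i, c) in s.char_indices() {
        let cur = classify(c);
        if let Some(prev) = prev_raw {
            // LB9 (simplified): never break before CM or ZWJ.
            let attached = matches!(cur, Class::Cm | Class::Zwj);
            // LB8a (simplified): never break after ZWJ.
            let after_zwj = honor_lb8a && prev == Class::Zwj;
            if !attached && !after_zwj {
                // Toy stand-in for the default break between two ID characters.
                breaks.push(i);
            }
        }
        prev_raw = Some(cur);
    }
    breaks.push(s.len());
    breaks
}
```

Under these assumptions, `toy_breaks("\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}", false)` yields `[0, 10, 14]`, matching the reported output, while passing `true` yields `[0, 14]`.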

@sffc sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Oct 19, 2023
@eggrobin eggrobin added the T-bug Type: Bad behavior, security, privacy label Oct 19, 2023
@eggrobin eggrobin changed the title icu_segmenter::LineSegmenter returns breaks other than LB3 and LB7 icu_segmenter::LineSegmenter incorrectly applies rule LB8a Nov 30, 2023
eggrobin added a commit to eggrobin/icu4x that referenced this issue Dec 1, 2023
eggrobin added a commit that referenced this issue Dec 1, 2023
The current implementation was attempting the LB25 tailoring recommended
in Example 7 of [Section
8.2](https://www.unicode.org/reports/tr14/tr14-49.html#Examples) in
UAX14 version 15.0; however, this requires more than one code point of
lookahead* because of `(PR | PO) × ( OP | HY )? NU`, which the current
implementation of the line segmenter cannot do. Instead this pull
request goes back to the untailored LB25 from Unicode 15.0.
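
As a rough illustration of the lookahead problem, here is a minimal sketch of the single rule `(PR | PO) × ( OP | HY )? NU` taken in isolation. It is not the ICU4X implementation; the `Lb` enum and the function `prefix_rule_forbids_break` are hypothetical names, and the input is assumed to be an already classified slice. Deciding whether this rule applies between a PR or PO and a following OP or HY requires inspecting the class two positions ahead.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Lb {
    Pr, // prefix numeric
    Po, // postfix numeric
    Op, // open punctuation
    Hy, // hyphen
    Nu, // numeric
    Al, // alphabetic (stands in for "anything else" in this sketch)
}

// Returns true if `(PR | PO) × ( OP | HY )? NU`, considered on its own,
// forbids a break between classes[i] and classes[i + 1].
fn prefix_rule_forbids_break(classes: &[Lb], i: usize) -> bool {
    if !matches!(classes.get(i), Some(Lb::Pr | Lb::Po)) {
        return false;
    }
    match classes.get(i + 1) {
        // One class of lookahead is enough when NU follows directly.
        Some(Lb::Nu) => true,
        // With an optional OP or HY in between, a second class must be inspected.
        Some(Lb::Op | Lb::Hy) => matches!(classes.get(i + 2), Some(Lb::Nu)),
        _ => false,
    }
}
```

For example, the rule applies at index 0 of `[Pr, Op, Nu]` but not at index 0 of `[Pr, Op, Al]`, and the two cases cannot be told apart by looking only at the OP.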

The implementation was tested with two million test cases; I last
encountered a failure somewhere in the nine thousands. I should probably
do an overnight run. Only 200 test cases are included here; as usual,
anyone working on the rules should try very long monkey test runs.

This fixes #4146.

—
\* This will be needed for 15.1 line segmentation too. While we have
that capability in the other segmenters, used in the sentence segmenter
(the relevant rules are called intermediate match rules or
interm(ediate) break states in this implementation), straightforwardly
reusing that code would run into issues as we have so many states
in line breaking that we cannot dedicate a whole bit to that property of
the state. This can probably be worked around (as far as I can tell we
use the sign bit for a property of two special states, so we could
probably be a bit more sparing), but will come later.
mrobinson added a commit to mrobinson/servo that referenced this issue May 23, 2024
Emoji clusters such as '🏳️‍🌈' do not render properly in Servo.
This is because xi-unicode is inserting a linebreak opportunity between
components of the cluster (see xi-editor/xi-editor#1322). This change
adds a workaround for this issue.

`xi-unicode` is fast, but supports an older version of the Unicode
standard than libraries like `icu4x`. In addition, `icu4x` does not
support non-contiguous segmentation, which Servo currently depends on.
Finally, the currently released version of `icu4x` has the same issue
(unicode-org/icu4x#4146).