Linebreak generated before CL #4523

Enter-tainer · 2024-01-13T17:45:54Z

The following code, run with icu = 1.4.0:

use icu::segmenter::LineSegmenter;

fn main() {
    let segmenter = LineSegmenter::new_auto();
    let test_str = "念姐遠米巴急（abcd），松黃貫誰。";
    let breakpoints: Vec<usize> = segmenter.segment_str(test_str).collect();
    println!("breakpoints: {:?}", breakpoints);
    // pretty print test str and break points
    for (i, c) in test_str.chars().enumerate() {
        if breakpoints.contains(&i) {
            print!("|");
        }
        print!("{}", c);
    }
    print!("|");
}

produces

breakpoints: [0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念姐遠|米巴急|（ab|cd）|，松黃|貫誰。|

That is, a breakpoint is produced before ，.

， is the fullwidth comma, U+FF0C. It belongs to line break class CL (Close Punctuation). Per rule LB13 (× CL), no break should be produced before it.

Update: It seems that this bug happens with some strings, but not all of them. 念姐遠米巴急（abcd），松黃貫誰。 is a randomly generated one.


Enter-tainer · 2024-01-13T17:46:22Z

Related downstream issue: typst/typst#3082

YDX-2147483647 · 2024-01-14T14:47:21Z

(Copied from typst/typst#3082 (comment))

    // pretty print test str and break points
    for (i, c) in test_str.chars().enumerate() {
        if breakpoints.contains(&i) {

Well, it seems that the breakpoints are byte offsets (usize) into the UTF-8 string, but i is a character index from chars().enumerate(). This explains why there are more breakpoints than |s.
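
As a minimal sketch of that point, the original pretty-printing loop can be kept almost unchanged by iterating with char_indices(), which yields byte offsets instead of character counts. The helper name mark_breaks is hypothetical, and the breakpoint list is hard-coded from the output reported above rather than computed by the segmenter, so this only needs the standard library.

```rust
/// Renders `text` with a `|` at every byte offset listed in `breakpoints`.
/// `mark_breaks` is a hypothetical helper, not part of ICU4X.
fn mark_breaks(text: &str, breakpoints: &[usize]) -> String {
    let mut out = String::new();
    // char_indices() pairs each char with its byte offset, which is the
    // unit the reported breakpoints use.
    for (byte_offset, c) in text.char_indices() {
        if breakpoints.contains(&byte_offset) {
            out.push('|');
        }
        out.push(c);
    }
    // A breakpoint equal to the string length marks the end of the text.
    if breakpoints.contains(&text.len()) {
        out.push('|');
    }
    out
}

fn main() {
    let text = "念姐遠米巴急（abcd），松黃貫誰。";
    // Byte offsets copied from the output reported in this thread.
    let breakpoints = [0usize, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46];
    println!("{}", mark_breaks(text, &breakpoints));
}
```

With byte offsets on both sides of the comparison, every reported breakpoint gets its own |.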

The following version might be better.

use icu_segmenter::LineSegmenter;

fn main() {
    let examples = vec![
        "念姐遠米巴急（abcd），松黃貫誰。",
        "念姐遠米巴急（abc0），松黃貫誰。",
        "念姐遠米巴急（0000），松黃貫誰。",
        "念姐遠米巴急（8888），松黃貫誰。",
    ];

    let segmenter = LineSegmenter::new_auto();

    examples.iter().for_each(|line| {
        let breakpoints: Vec<usize> = segmenter.segment_str(line).collect();
        println!("{}\n{:?}", line, breakpoints);

        for i in 1..breakpoints.len() {
            print!(
                "|{}",
                line.get(breakpoints[i - 1]..breakpoints[i])
                    .expect("Breakpoints should be at characters' boundaries")
            );
        }
        println!("|");
    });
}

念姐遠米巴急（abcd），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（abcd），|松|黃|貫|誰。|
念姐遠米巴急（abc0），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 28, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（abc0）|，|松|黃|貫|誰。|
念姐遠米巴急（0000），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 28, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（0000）|，|松|黃|貫|誰。|
念姐遠米巴急（8888），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 28, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（8888）|，|松|黃|貫|誰。|
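
Breaks like the one before ， in the last three examples can also be flagged mechanically. The following is only a sketch: breaks_before_cl and SAMPLE_CL are hypothetical names, SAMPLE_CL is a deliberately incomplete hand-picked sample of class CL rather than the real Unicode property data, and the breakpoint lists are copied from the output above instead of being computed by the segmenter.

```rust
/// Hypothetical, deliberately incomplete sample of class CL characters
/// (fullwidth comma U+FF0C and ideographic full stop U+3002).
const SAMPLE_CL: [char; 2] = ['，', '。'];

/// Returns the byte offsets in `breakpoints` that sit directly before a
/// character from SAMPLE_CL, i.e. candidate violations of LB13 (× CL).
fn breaks_before_cl(text: &str, breakpoints: &[usize]) -> Vec<usize> {
    breakpoints
        .iter()
        .copied()
        .filter(|&offset| {
            // get() returns None for offsets that are out of range or not
            // on a char boundary, so such offsets are simply skipped.
            text.get(offset..)
                .and_then(|rest| rest.chars().next())
                .map_or(false, |c| SAMPLE_CL.contains(&c))
        })
        .collect()
}

fn main() {
    // Byte offsets copied from the outputs reported in this thread.
    let letters = [0usize, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46];
    let digits = [0usize, 3, 6, 9, 12, 15, 18, 28, 31, 34, 37, 40, 46];
    println!("{:?}", breaks_before_cl("念姐遠米巴急（abcd），松黃貫誰。", &letters));
    println!("{:?}", breaks_before_cl("念姐遠米巴急（abc0），松黃貫誰。", &digits));
}
```

For the first string no offset is flagged; for the second, offset 28 (the byte position of ，) is reported.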

sffc · 2024-01-15T21:02:43Z

@eggrobin @aethanyc @makotokato

eggrobin · 2024-01-15T21:42:06Z

As noted in #4523 (comment), "念姐遠米巴急（abcd），松黃貫誰。" is segmented just fine; the snippet in the OP simply confuses code point indices with UTF-8 code unit indices. Indeed, it is broken correctly in the screenshot in the downstream issue.

But @YDX-2147483647 does show examples of bad segmentation, such as |念|姐|遠|米|巴|急|（abc0）|，|松|黃|貫|誰。| .

with icu=1.4.0

That release is dated Nov 16, 2023. #4389 was merged on Dec 1, 2023, so 1.4.0 predates that change and we know that line breaking is broken in 1.4.0.

At main the example from #4523 (comment) prints

念姐遠米巴急（abcd），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（abcd），|松|黃|貫|誰。|
念姐遠米巴急（abc0），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（abc0），|松|黃|貫|誰。|
念姐遠米巴急（0000），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（0000），|松|黃|貫|誰。|
念姐遠米巴急（8888），松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|（8888），|松|黃|貫|誰。|

so this has been fixed by #4389.

(I suspect what we are seeing, namely a break between ） and ， after digits but not after letters, is likely a consequence of the attempt at implementing the tailoring from https://www.unicode.org/reports/tr14/tr14-49.html#Examples Example 7 prior to #4389.)

Enter-tainer · 2024-01-16T02:02:53Z

Thank you! So I think this issue can be closed once a new version is released? (Or it can be closed now, because it is already fixed in master.)

sffc · 2024-01-16T18:46:42Z

I'll close this as fixed in 1.5. Thank you @eggrobin!

If you need the functionality sooner, you can use ICU4X from Git in your Cargo.toml.
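
A minimal sketch of such a Git dependency follows. The repository URL and the crate name icu (taken from the original report) are assumptions not spelled out in this comment, and tracking the default branch is only one option.

```toml
[dependencies]
# Assumption: pulls the `icu` crate from the default branch of the ICU4X
# repository; adding a `rev = "..."` key pins a specific commit instead.
icu = { git = "https://github.com/unicode-org/icu4x" }
```

Switching back to a crates.io version requirement once a release containing the fix is published restores a normal dependency.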

sffc added the C-segmentation Component: Segmentation label Jan 15, 2024

YDX-2147483647 mentioned this issue Jan 16, 2024

Chinese punctuation is placed at the beginning of the line in some cases typst/typst#3082

sffc added this to the 1.5 Blocking ⟨P1⟩ milestone Jan 16, 2024

sffc closed this as completed Jan 16, 2024
