Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linebreak generated before CL #4523

Closed
Enter-tainer opened this issue Jan 13, 2024 · 6 comments
Closed

Linebreak generated before CL #4523

Enter-tainer opened this issue Jan 13, 2024 · 6 comments
Labels
C-segmentation Component: Segmentation

Comments

@Enter-tainer
Copy link

Enter-tainer commented Jan 13, 2024

This code with icu=1.4.0.

use icu::segmenter::LineSegmenter;

fn main() {
    let segmenter = LineSegmenter::new_auto();
    let test_str = "念姐遠米巴急(abcd),松黃貫誰。";
    let breakpoints: Vec<usize> = segmenter.segment_str(test_str).collect();
    println!("breakpoints: {:?}", breakpoints);
    // pretty print test str and break points
    for (i, c) in test_str.chars().enumerate() {
        if breakpoints.contains(&i) {
            print!("|");
        }
        print!("{}", c);
    }
    print!("|");
}

produces

breakpoints: [0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念姐遠|米巴急|(ab|cd)|,松黃|貫誰。|

, where a breakpoint is produced before .

is the full width comma, U+FF0C. It belongs to CL: Close Punctuation. Per LB13 × CL, we shouldn't produce that breakpoint.


Update: It seems that this bug happens on some string, but not all of them. 念姐遠米巴急(abcd),松黃貫誰。 is a ramdomly generated one.

@Enter-tainer
Copy link
Author

Related downstream issue: typst/typst#3082

@YDX-2147483647
Copy link

(Copied from typst/typst#3082 (comment))

    // pretty print test str and break points
    for (i, c) in test_str.chars().enumerate() {
        if breakpoints.contains(&i) {

Well, it seems that breakpoints are counted in bytes (usize), but i represents chars(). This explains why there're more breakpoints than |s.

The following version might be better.

use icu_segmenter::LineSegmenter;

fn main() {
    let examples = vec![
        "念姐遠米巴急(abcd),松黃貫誰。",
        "念姐遠米巴急(abc0),松黃貫誰。",
        "念姐遠米巴急(0000),松黃貫誰。",
        "念姐遠米巴急(8888),松黃貫誰。",
    ];

    let segmenter = LineSegmenter::new_auto();

    examples.iter().for_each(|line| {
        let breakpoints: Vec<usize> = segmenter.segment_str(line).collect();
        println!("{}\n{:?}", line, breakpoints);

        for i in 1..breakpoints.len() {
            print!(
                "|{}",
                line.get(breakpoints[i - 1]..breakpoints[i])
                    .expect("Breakpoints should be at characters' boundaries")
            );
        }
        println!("|");
    });
}
念姐遠米巴急(abcd),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]    
|念|姐|遠|米|巴|急|(abcd),|松|黃|貫|誰。|    
念姐遠米巴急(abc0),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 28, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|(abc0)|,|松|黃|貫|誰。|   
念姐遠米巴急(0000),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 28, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|(0000)|,|松|黃|貫|誰。|   
念姐遠米巴急(8888),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 28, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|(8888)|,|松|黃|貫|誰。|   

@sffc
Copy link
Member

sffc commented Jan 15, 2024

@sffc sffc added the C-segmentation Component: Segmentation label Jan 15, 2024
@eggrobin
Copy link
Member

eggrobin commented Jan 15, 2024

As noted in #4523 (comment), "念姐遠米巴急(abcd),松黃貫誰。" is segmented just fine, the snippet in the OP is just confused between code point and UTF-8 code unit indices. Indeed it gets broken fine in the screenshot in the downstream issue.


But @YDX-2147483647 does show examples of bad segmentation, such as |念|姐|遠|米|巴|急|(abc0)|,|松|黃|貫|誰。| .

with icu=1.4.0

That release is dated Nov 16, 2023. #4389 was merged on Dec 1, 2023, so we know that line breaking is broken in 1.4.0.

At main the example from #4523 (comment) prints

念姐遠米巴急(abcd),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|(abcd),|松|黃|貫|誰。|
念姐遠米巴急(abc0),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|(abc0),|松|黃|貫|誰。|
念姐遠米巴急(0000),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|(0000),|松|黃|貫|誰。|
念姐遠米巴急(8888),松黃貫誰。
[0, 3, 6, 9, 12, 15, 18, 31, 34, 37, 40, 46]
|念|姐|遠|米|巴|急|(8888),|松|黃|貫|誰。|

so this has been fixed by #4389.

(I suspect what we are seeing, namely a break between and after digits but not after letters, is likely a consequence of the attempt at implementing the tailoring from https://www.unicode.org/reports/tr14/tr14-49.html#Examples Example 7 prior to #4389.)

@Enter-tainer
Copy link
Author

thank you! so i think this issue can be closed once a new version is released? (or it can be closed because it is already fixed in master)

@sffc
Copy link
Member

sffc commented Jan 16, 2024

I'll close this as fixed in 1.5. Thank you @eggrobin!

If you need the functionality sooner, you can use ICU4X from Git in your Cargo.toml.

@sffc sffc closed this as completed Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation
Projects
None yet
Development

No branches or pull requests

4 participants