Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LocaleData parameter for word/sentence segmenter #5318

Merged
merged 12 commits into from
Sep 3, 2024

Conversation

makotokato
Copy link
Member

Add LocaleData parameter for word/sentence segmenter

This is a part of #3284.

ICU4C has some language break rules for word and sentence segmenter, so this fix adds some rules to ICU4X per locale.

This adds LocaleData argument to all constructors. Also, locale difference is small and 2 data only, I add the override table data marker for machine state property.

robertbastian
robertbastian previously approved these changes Jul 30, 2024
components/segmenter/src/provider/mod.rs Outdated Show resolved Hide resolved
components/segmenter/src/sentence.rs Outdated Show resolved Hide resolved
@@ -15,5 +15,7 @@ segmenter/lstm/wl_auto@1, und/Burmese_codepoints_exclusive_model4_heavy, 91369B,
segmenter/lstm/wl_auto@1, und/Khmer_codepoints_exclusive_model4_heavy, 74669B, b25f5219c4b970f2
segmenter/lstm/wl_auto@1, und/Lao_codepoints_exclusive_model4_heavy, 72164B, 7e0c3ea7801791bd
segmenter/lstm/wl_auto@1, und/Thai_codepoints_exclusive_model4_heavy, 72331B, c46e2e0c098c1fc1
segmenter/sentence/override@1, <singleton>, 332B, 745d858a06d47385
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: these sizes look much better :)

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I approve of this architecturally, despite the extra branch at runtime. I think this is the right data model, and if it turns out to be a performance issue, we can tweak the runtime algorithm a bit.

);

#[doc = icu_provider::gen_any_buffer_unstable_docs!(UNSTABLE, Self::new)]
pub fn try_new_unstable<D>(provider: &D) -> Result<Self, DataError>
pub fn try_new_unstable<D>(provider: &D, locale: &DataLocale) -> Result<Self, DataError>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This function signature is still not what we agreed in the issue, but it can be fixed later (but before 2.0)

components/segmenter/src/sentence.rs Outdated Show resolved Hide resolved

fn is_default_rule(locale: &DataLocale) -> bool {
let lang = locale.language();
lang != language!("fi") && lang != language!("sv")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: please just try to load from data.

It would be okay to skip und, but please note that in the issue we agreed that the function should take an optional locale, and you should skip loading when the locale is None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please just always check the data provider. Don't rely on a hard-coded list of locales.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments were not fixed.

@sffc
Copy link
Member

sffc commented Aug 11, 2024

Example code for how to optionally load the locale data:

if let Some(locale) = options.content_locale {
    let request = // ...
    match provider.load(request) {
        Ok(response) => {
            self.locale_specific_data = response.payload;
        }
        Err(DataError {
            kind: DataErrorKind::IdentifierNotFound,
            ..
        }) => {
            // fall through
        }
        Err(e) => return Err(e),
    }
}

@makotokato
Copy link
Member Author

My understand is that current macro (gen_any_buffer_data_constructors) doesn't support optional locale parameter.

try_new_unstable<D>(provider: &D) should be try_new_unstalbe<D>((provider: &D, Option<LocaleData>& locale), add try_new_for_locale_unstable(...) (used by datetime), add try_new_with_option_unstable(...), or something?

@makotokato
Copy link
Member Author

#3284 (comment) explains about try_new_with_options_unstable

@makotokato makotokato requested a review from sffc August 28, 2024 21:43
components/segmenter/src/sentence.rs Outdated Show resolved Hide resolved

fn is_default_rule(locale: &DataLocale) -> bool {
let lang = locale.language();
lang != language!("fi") && lang != language!("sv")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please just always check the data provider. Don't rely on a hard-coded list of locales.

ffi/capi/src/segmenter_sentence.rs Outdated Show resolved Hide resolved
ffi/capi/src/segmenter_word.rs Outdated Show resolved Hide resolved
@makotokato makotokato requested a review from sffc August 30, 2024 13:31
sffc
sffc previously approved these changes Aug 30, 2024
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work!

ffi/capi/src/segmenter_sentence.rs Outdated Show resolved Hide resolved
ffi/capi/src/segmenter_word.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@aethanyc aethanyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all the previous review comments have been addressed.

@makotokato makotokato merged commit 9d45c5f into unicode-org:main Sep 3, 2024
28 checks passed
sffc pushed a commit that referenced this pull request Sep 3, 2024
For #5482 and
#5484, I have to update
segmenter's generated code due to
#5318
@makotokato makotokato deleted the uax29-locale-2 branch September 4, 2024 00:11
robertbastian added a commit that referenced this pull request Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants