Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use key attributes for segmenter #4985

Merged
merged 3 commits into from
Jun 3, 2024
Merged

Conversation

robertbastian
Copy link
Member

@sffc
Copy link
Member

sffc commented Jun 2, 2024

I think we should remap the names from the lstm_word_segmentation repo to something more concise

Long Names Short Names
Burmese_codepoints_exclusive_model4_heavy my-cp-model4
Khmer_codepoints_exclusive_model4_heavy km-cp-model4
Lao_codepoints_exclusive_model4_heavy lo-cp-model4
Thai_codepoints_exclusive_model4_heavy th-cp-model4
Thai_graphclust_model4_heavy th-gc-model4

@sffc
Copy link
Member

sffc commented Jun 2, 2024

Also, wdyt about making the dictionaries named according to the languages contained in them. thaidict becomes th; cjdict becomes ja-zh

) -> Result<Option<DataPayload<M>>, DataError> {
match provider.load(DataRequest {
locale: &locale.into(),
key_attributes: &model.parse().unwrap(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We shouldn't need to unwrap anything; I'm surprised Clippy isn't complaining since this is library code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an infallible FromStr, which clippy apparently realises. The whole DataKeyAttributes construction will be overhauled anyway once we are clearer about requirements. Ideally we'd construct from static str here.

provider/datagen/src/transform/segmenter/dictionary.rs Outdated Show resolved Hide resolved
Comment on lines -440 to -441
#[cfg(feature = "provider")]
fn lstm_model_name_to_data_locale(name: &str) -> Option<DataLocale> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: It's nice that we can get rid of these functions. I suggested remapping with new names, but maybe the new names should be applied deep inside DatagenProvider such that users of DatagenDriver also use the new names.

Comment on lines -192 to -193
let model = crate::lstm_data_locale_to_model_name(req.locale)
.ok_or(DataErrorKind::MissingLocale.with_req(LstmForWordLineAutoV1Marker::KEY, req))?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good place to perform the mapping. Maybe if a long name isn't found, we fall back to the long name? This way it could be easier for customers to add custom models.

@robertbastian
Copy link
Member Author

I'd prefer not to maintain a mapping. Can we change the source repo to use more concise names, or declare short names for each model?

@robertbastian robertbastian requested a review from sffc June 3, 2024 10:37
@robertbastian robertbastian merged commit 05b220f into unicode-org:main Jun 3, 2024
28 checks passed
@robertbastian robertbastian deleted the seg branch June 3, 2024 17:13
robertbastian added a commit that referenced this pull request Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants