Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use auxiliary keys instead of fixed locales for segmenter #4511

Closed
robertbastian opened this issue Jan 4, 2024 · 9 comments · Fixed by #5207
Closed

Use auxiliary keys instead of fixed locales for segmenter #4511

robertbastian opened this issue Jan 4, 2024 · 9 comments · Fixed by #5207
Assignees
Labels
C-segmentation Component: Segmentation

Comments

@robertbastian
Copy link
Member

Segmenter locales are not locales in the sense that they participate in fallback, or that they are even exposed to the user. Instead of using hardcoded locales to look up segmenter payloads, we should use auxiliary keys.

@robertbastian robertbastian added this to the ICU4X 2.0 milestone Jan 4, 2024
@aethanyc aethanyc added the C-segmentation Component: Segmentation label Jan 10, 2024
@sffc
Copy link
Member

sffc commented Mar 14, 2024

Please propose:

  1. What are the aux keys exactly?
  2. How do they fall back between each other?

Note: adding aux keys is nice because then we can introduce fallback between different models, but then we need to define how that fallback works.

@robertbastian
Copy link
Member Author

robertbastian commented Mar 15, 2024

This would basically remove these mappings:

fn model_name_to_data_locale(name: &str) -> Option<DataLocale> {
match name {
"khmerdict" => Some(langid!("km").into()),
"cjdict" => Some(langid!("ja").into()),
"laodict" => Some(langid!("lo").into()),
"burmesedict" => Some(langid!("my").into()),
"thaidict" => Some(langid!("th").into()),
_ => None,
}
}
pub(crate) fn data_locale_to_model_name(locale: &DataLocale) -> Option<&'static str> {
match locale.get_langid() {
id if id == langid!("km") => Some("khmerdict"),
id if id == langid!("ja") => Some("cjdict"),
id if id == langid!("lo") => Some("laodict"),
id if id == langid!("my") => Some("burmesedict"),
id if id == langid!("th") => Some("thaidict"),
_ => None,
}
}
,
fn model_name_to_data_locale(name: &str) -> Option<DataLocale> {
match name {
"Burmese_codepoints_exclusive_model4_heavy" => Some(langid!("my").into()),
"Khmer_codepoints_exclusive_model4_heavy" => Some(langid!("km").into()),
"Lao_codepoints_exclusive_model4_heavy" => Some(langid!("lo").into()),
"Thai_codepoints_exclusive_model4_heavy" => Some(langid!("th").into()),
_ => None,
}
}
pub(crate) fn data_locale_to_model_name(locale: &DataLocale) -> Option<&'static str> {
match locale.get_langid() {
id if id == langid!("my") => Some("Burmese_codepoints_exclusive_model4_heavy"),
id if id == langid!("km") => Some("Khmer_codepoints_exclusive_model4_heavy"),
id if id == langid!("lo") => Some("Lao_codepoints_exclusive_model4_heavy"),
id if id == langid!("th") => Some("Thai_codepoints_exclusive_model4_heavy"),
_ => None,
}
}
, and we'd use aux keys instead.

We would have these combinations

  • DictionaryForWordOnlyAutoV1Marker, aux key cjdict
  • DictionaryForWordLineExtendedV1Marker, aux keys khmerdict, laodict, burmesedict, thaidict
  • LstmForWordLineAutoV1Marker, aux keys Burmese_codepoints_exclusive_model4_heavy, Khmer_codepoints_exclusive_model4_heavy, Lao_codepoints_exclusive_model4_heavy, Thai_codepoints_exclusive_model4_heavy

If we support more models in the future, the segmenter constructor might do its own fallback between these.

@sffc
Copy link
Member

sffc commented Mar 15, 2024

We could shorten Burmese_codepoints_exclusive_model4_heavy as follows:

  1. Burmese should probably be either my (if langauge-specific) or Mymr (if script-specific)
  2. codepoints is probably okay to write as cp
  3. exclusive should be inferred by the key. The other mode is inclusive of Latin text, which the segmenter doesn't expect with this key. So I would say to drop this.
  4. model4 is probably good to keep as-is
  5. heavy can probably be dropped. We should only have heavy-trained models. Light-trained models could be called dev or something.

@sffc
Copy link
Member

sffc commented Jun 3, 2024

@robertbastian said:

I'd prefer not to maintain a mapping. Can we change the source repo to use more concise names, or declare short names for each model?

I do think it's ICU4X's job to choose the names that make the most sense for the specific context in which those names are being used. But, if this gets us unblocked, I'm not opposed to adding a short name upstream, similar to how we get short names for properties from icuexportdata.

@robertbastian
Copy link
Member Author

In that case I think we should stick with the upstream names. Neither the size of the keys nor the time it takes the binary search over them are significant in the context of segmentation data/runtime.

@sffc
Copy link
Member

sffc commented Jun 3, 2024

There has never been time put into figuring out what good names would be. These model names were invented by @SahandFarhoodi and I don't think they were intended to be publicly facing names.

@sffc
Copy link
Member

sffc commented Jun 3, 2024

In that case I think we should stick with the upstream names. Neither the size of the keys nor the time it takes the binary search over them are significant in the context of segmentation data/runtime.

In terms of specific issues, the names are not BCP47-compatible, and I think we should try to keep our key attributes being BCP47-compatible.

@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 6, 2024
@SahandFarhoodi
Copy link
Contributor

Sorry for my late response.
Yes, I just chose those names to be able to distinguish between multiple experiments we were doing at the time, they were not supposed to be public-facing names. Shane, I agree with your suggestions for shortening names, just one additional thing is that you can also drop codepoints maybe? It was initially there because we had other models that were trained on grapheme and grapheme clusters, but I think you are going to use those models anywhere anyway (correct me if I'm wrong).

@robertbastian robertbastian removed their assignment Jun 20, 2024
@robertbastian
Copy link
Member Author

Discussion:

  • Keep model names, as an extra mapping in any location is not worth the trouble
    • Potentially remove the _heavy suffix, because all models are heavy
  • During data loading, use prefix matching on marker attributes, i.e. Lao_ to load the first Lao model that appears in the data
    • This needs prefix matching support in load(DataRequest), we can add a match_attribute_prefix flag to DataRequestMetadata (trivial for binary search, ZeroTrieCursor::probe)
  • Users can then pass any of model names from the lstm_word_segmentation repo to datagen, and they will work

LGTM: @robertbastian @sffc

@robertbastian robertbastian self-assigned this Jun 20, 2024
@robertbastian robertbastian removed the discuss-priority Discuss at the next ICU4X meeting label Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants