Use auxiliary keys instead of fixed locales for segmenter #4511

robertbastian · 2024-01-04T10:28:35Z

Segmenter locales are not locales in the sense that they participate in fallback, or that they are even exposed to the user. Instead of using hardcoded locales to look up segmenter payloads, we should use auxiliary keys.

sffc · 2024-03-14T17:18:18Z

Please propose:

What are the aux keys exactly?
How do they fall back between each other?

Note: adding aux keys is nice because then we can introduce fallback between different models, but then we need to define how that fallback works.

robertbastian · 2024-03-15T15:34:52Z

This would basically remove these mappings:

icu4x/provider/datagen/src/transform/segmenter/dictionary.rs

Lines 19 to 39 in 9491b63

    
           fn model_name_to_data_locale(name: &str) -> Option<DataLocale> { 
        
               match name { 
        
                   "khmerdict" => Some(langid!("km").into()), 
        
                   "cjdict" => Some(langid!("ja").into()), 
        
                   "laodict" => Some(langid!("lo").into()), 
        
                   "burmesedict" => Some(langid!("my").into()), 
        
                   "thaidict" => Some(langid!("th").into()), 
        
                   _ => None, 
        
               } 
        
           } 
        
           pub(crate) fn data_locale_to_model_name(locale: &DataLocale) -> Option<&'static str> { 
        
               match locale.get_langid() { 
        
                   id if id == langid!("km") => Some("khmerdict"), 
        
                   id if id == langid!("ja") => Some("cjdict"), 
        
                   id if id == langid!("lo") => Some("laodict"), 
        
                   id if id == langid!("my") => Some("burmesedict"), 
        
                   id if id == langid!("th") => Some("thaidict"), 
        
                   _ => None, 
        
               } 
        
           }

,

icu4x/provider/datagen/src/transform/segmenter/lstm.rs

Lines 184 to 202 in 9491b63

    
           fn model_name_to_data_locale(name: &str) -> Option<DataLocale> { 
        
               match name { 
        
                   "Burmese_codepoints_exclusive_model4_heavy" => Some(langid!("my").into()), 
        
                   "Khmer_codepoints_exclusive_model4_heavy" => Some(langid!("km").into()), 
        
                   "Lao_codepoints_exclusive_model4_heavy" => Some(langid!("lo").into()), 
        
                   "Thai_codepoints_exclusive_model4_heavy" => Some(langid!("th").into()), 
        
                   _ => None, 
        
               } 
        
           } 
        
           pub(crate) fn data_locale_to_model_name(locale: &DataLocale) -> Option<&'static str> { 
        
               match locale.get_langid() { 
        
                   id if id == langid!("my") => Some("Burmese_codepoints_exclusive_model4_heavy"), 
        
                   id if id == langid!("km") => Some("Khmer_codepoints_exclusive_model4_heavy"), 
        
                   id if id == langid!("lo") => Some("Lao_codepoints_exclusive_model4_heavy"), 
        
                   id if id == langid!("th") => Some("Thai_codepoints_exclusive_model4_heavy"), 
        
                   _ => None, 
        
               } 
        
           }

, and we'd use aux keys instead.

We would have these combinations

DictionaryForWordOnlyAutoV1Marker, aux key cjdict
DictionaryForWordLineExtendedV1Marker, aux keys khmerdict, laodict, burmesedict, thaidict
LstmForWordLineAutoV1Marker, aux keys Burmese_codepoints_exclusive_model4_heavy, Khmer_codepoints_exclusive_model4_heavy, Lao_codepoints_exclusive_model4_heavy, Thai_codepoints_exclusive_model4_heavy

If we support more models in the future, the segmenter constructor might do its own fallback between these.

sffc · 2024-03-15T15:43:50Z

We could shorten Burmese_codepoints_exclusive_model4_heavy as follows:

Burmese should probably be either my (if langauge-specific) or Mymr (if script-specific)
codepoints is probably okay to write as cp
exclusive should be inferred by the key. The other mode is inclusive of Latin text, which the segmenter doesn't expect with this key. So I would say to drop this.
model4 is probably good to keep as-is
heavy can probably be dropped. We should only have heavy-trained models. Light-trained models could be called dev or something.

#4511

sffc · 2024-06-03T16:03:46Z

@robertbastian said:

I'd prefer not to maintain a mapping. Can we change the source repo to use more concise names, or declare short names for each model?

I do think it's ICU4X's job to choose the names that make the most sense for the specific context in which those names are being used. But, if this gets us unblocked, I'm not opposed to adding a short name upstream, similar to how we get short names for properties from icuexportdata.

robertbastian · 2024-06-03T16:21:36Z

In that case I think we should stick with the upstream names. Neither the size of the keys nor the time it takes the binary search over them are significant in the context of segmentation data/runtime.

sffc · 2024-06-03T16:24:18Z

There has never been time put into figuring out what good names would be. These model names were invented by @SahandFarhoodi and I don't think they were intended to be publicly facing names.

sffc · 2024-06-03T16:25:37Z

In that case I think we should stick with the upstream names. Neither the size of the keys nor the time it takes the binary search over them are significant in the context of segmentation data/runtime.

In terms of specific issues, the names are not BCP47-compatible, and I think we should try to keep our key attributes being BCP47-compatible.

SahandFarhoodi · 2024-06-08T17:23:43Z

Sorry for my late response.
Yes, I just chose those names to be able to distinguish between multiple experiments we were doing at the time, they were not supposed to be public-facing names. Shane, I agree with your suggestions for shortening names, just one additional thing is that you can also drop codepoints maybe? It was initially there because we had other models that were trained on grapheme and grapheme clusters, but I think you are going to use those models anywhere anyway (correct me if I'm wrong).

robertbastian · 2024-06-20T12:00:07Z

Discussion:

Keep model names, as an extra mapping in any location is not worth the trouble
- Potentially remove the _heavy suffix, because all models are heavy
During data loading, use prefix matching on marker attributes, i.e. Lao_ to load the first Lao model that appears in the data
- This needs prefix matching support in load(DataRequest), we can add a match_attribute_prefix flag to DataRequestMetadata (trivial for binary search, ZeroTrieCursor::probe)
Users can then pass any of model names from the lstm_word_segmentation repo to datagen, and they will work

LGTM: @robertbastian @sffc

Fixes #4511

robertbastian added this to the ICU4X 2.0 milestone Jan 4, 2024

aethanyc added the C-segmentation Component: Segmentation label Jan 10, 2024

sffc assigned robertbastian Mar 14, 2024

This was referenced Jun 1, 2024

Use auxiliary keys for segmenter models #4348

Closed

Use key attributes for segmenter #4985

Merged

robertbastian added a commit that referenced this issue Jun 3, 2024

Use key attributes for segmenter (#4985)

05b220f

#4511

robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 6, 2024

robertbastian removed their assignment Jun 20, 2024

robertbastian self-assigned this Jun 20, 2024

robertbastian removed the discuss-priority Discuss at the next ICU4X meeting label Jun 20, 2024

robertbastian mentioned this issue Jul 9, 2024

Add marker attribute prefix matching #5207

Merged

robertbastian closed this as completed in #5207 Jul 10, 2024

robertbastian added a commit that referenced this issue Jul 10, 2024

Add marker attribute prefix matching (#5207)

6ebbcb2

Fixes #4511

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use auxiliary keys instead of fixed locales for segmenter #4511

Use auxiliary keys instead of fixed locales for segmenter #4511

robertbastian commented Jan 4, 2024

sffc commented Mar 14, 2024

robertbastian commented Mar 15, 2024 •

edited

Loading

sffc commented Mar 15, 2024

sffc commented Jun 3, 2024

robertbastian commented Jun 3, 2024

sffc commented Jun 3, 2024

sffc commented Jun 3, 2024

SahandFarhoodi commented Jun 8, 2024

robertbastian commented Jun 20, 2024

Use auxiliary keys instead of fixed locales for segmenter #4511

Use auxiliary keys instead of fixed locales for segmenter #4511

Comments

robertbastian commented Jan 4, 2024

sffc commented Mar 14, 2024

robertbastian commented Mar 15, 2024 • edited Loading

sffc commented Mar 15, 2024

sffc commented Jun 3, 2024

robertbastian commented Jun 3, 2024

sffc commented Jun 3, 2024

sffc commented Jun 3, 2024

SahandFarhoodi commented Jun 8, 2024

robertbastian commented Jun 20, 2024

robertbastian commented Mar 15, 2024 •

edited

Loading