Tweak how locales are generated in icu_segmenter datagen #3408

sffc · 2023-05-03T16:34:02Z

Currently, the segmentation locales are filtered with the same machinery as languages. However, we should figure out a better way to filter them, since you usually want to carry segmentation models for all languages since you might encounter text in those languages.

For example, the th model should be included even if only building for English because we need to be able to segment th text, which can occur even in English documents.

The text was updated successfully, but these errors were encountered:

robertbastian · 2023-05-03T18:53:12Z

So don't filter at all? Or make segmentation languages a new option separate from locales?

sffc · 2023-05-11T18:24:18Z

Discuss with:

Optional:

aethanyc · 2023-05-18T05:41:54Z

I feel it is reasonable to decouple segmenter languages from locale. That is, always generate all segmenter languages regardless of the locales.

If it's desirable for some users to reduce the data size (e.g. they only ever want to use lstm but never dictionary), we probably should provide another datagen option to generate only partial data that matches the segmenter constructor. For example, the value for the option can be all (default), auto, lstm, and dictionary.

sffc · 2023-07-05T09:57:16Z

@Manishearth - We've generally seen datagen flags as a minus
@sffc - I think it's okay to have a big centralized list of flags so long as they are optional for most i18n correctness cases
@Manishearth - We could think about these identifiers as being model identifiers. Right now we have th and maybe in the future we'll have th-better.

Proposal:

Add a flag --segmenter-langids with the following options:
- auto (default): includes all of the models required for the code
- none: do not include any models
- explicit list of language identifiers, which must currently be in the list (ja, th, km, lo, my)

@eggrobin - Conceptually this is a mapping from for example Thai script to the th model
@sffc - The models should probably be stored as und-Thai instead of th. That's a known issue that we should fix in 2.0
@sffc - We should use auxiliary keys to store the extra models. For example, what we currently use is essentially und-Thai/th-model4
@eggrobin - It should be zxx-Thai if we're not intending to convey a language
@sffc - Should our other keys like week info also use zxx?
@eggrobin - ...
@robertbastian - I think it would be bad to add these as languages to the API now and then later change them to zxx-Thai

Updated proposal:

Add a flag --segmenter-models with the following options:
- auto (default): includes the recommended models for the code
- none: do not include any models
- explicit ordered list of model identifiers, which must currently be in the list:
  - cjdict
  - burmesedict
  - khmerdict
  - laodict
  - thaidict
  - Burmese_codepoints_exclusive_model4_heavy
  - Khmer_codepoints_exclusive_model4_heavy
  - Lao_codepoints_exclusive_model4_heavy
  - Thai_codepoints_exclusive_model4_heavy

In the API, this is passed in as a enum SegmenterModels { Auto, Explicit(Vec<String>) }; the name should be bikeshed along with the rest of the datagen API shipping in 1.3

Note: we can support Thai_graphclust_model4_heavy, as well as other ones like model5 and model7, through an auxiliary key.

LGTM: @sffc @robertbastian @Manishearth @eggrobin

sffc added discuss Discuss at a future ICU4X-SC meeting C-segmentation Component: Segmentation labels May 3, 2023

sffc mentioned this issue May 3, 2023

New datagen API #3386

Merged

sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label May 25, 2023

robertbastian added this to the 1.3 Blocking ⟨P1⟩ milestone Jun 27, 2023

sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 5, 2023

sffc assigned robertbastian Jul 5, 2023

sffc added T-core Type: Required functionality S-small Size: One afternoon (small bug fix or enhancement) labels Jul 5, 2023

robertbastian mentioned this issue Jul 12, 2023

Adding segmenter model option to datagen #3669

Merged

robertbastian closed this as completed in #3669 Jul 22, 2023

robertbastian mentioned this issue Nov 21, 2023

Use auxiliary keys for segmenter models #4348

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tweak how locales are generated in icu_segmenter datagen #3408

Tweak how locales are generated in icu_segmenter datagen #3408

sffc commented May 3, 2023

robertbastian commented May 3, 2023

sffc commented May 11, 2023

aethanyc commented May 18, 2023

sffc commented Jul 5, 2023 •

edited

Loading

Tweak how locales are generated in icu_segmenter datagen #3408

Tweak how locales are generated in icu_segmenter datagen #3408

Comments

sffc commented May 3, 2023

robertbastian commented May 3, 2023

sffc commented May 11, 2023

aethanyc commented May 18, 2023

sffc commented Jul 5, 2023 • edited Loading

sffc commented Jul 5, 2023 •

edited

Loading