Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweak how locales are generated in icu_segmenter datagen #3408

Closed
sffc opened this issue May 3, 2023 · 4 comments · Fixed by #3669
Closed

Tweak how locales are generated in icu_segmenter datagen #3408

sffc opened this issue May 3, 2023 · 4 comments · Fixed by #3669
Assignees
Labels
C-segmentation Component: Segmentation S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented May 3, 2023

Currently, the segmentation locales are filtered with the same machinery as languages. However, we should figure out a better way to filter them, since you usually want to carry segmentation models for all languages since you might encounter text in those languages.

For example, the th model should be included even if only building for English because we need to be able to segment th text, which can occur even in English documents.

@sffc sffc added discuss Discuss at a future ICU4X-SC meeting C-segmentation Component: Segmentation labels May 3, 2023
@sffc sffc mentioned this issue May 3, 2023
@robertbastian
Copy link
Member

So don't filter at all? Or make segmentation languages a new option separate from locales?

@sffc
Copy link
Member Author

sffc commented May 11, 2023

Discuss with:

Optional:

@aethanyc
Copy link
Contributor

I feel it is reasonable to decouple segmenter languages from locale. That is, always generate all segmenter languages regardless of the locales.

If it's desirable for some users to reduce the data size (e.g. they only ever want to use lstm but never dictionary), we probably should provide another datagen option to generate only partial data that matches the segmenter constructor. For example, the value for the option can be all (default), auto, lstm, and dictionary.

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label May 25, 2023
@robertbastian robertbastian added this to the 1.3 Blocking ⟨P1⟩ milestone Jun 27, 2023
@sffc
Copy link
Member Author

sffc commented Jul 5, 2023

  • @Manishearth - We've generally seen datagen flags as a minus
  • @sffc - I think it's okay to have a big centralized list of flags so long as they are optional for most i18n correctness cases
  • @Manishearth - We could think about these identifiers as being model identifiers. Right now we have th and maybe in the future we'll have th-better.

Proposal:

  • Add a flag --segmenter-langids with the following options:
    • auto (default): includes all of the models required for the code
    • none: do not include any models
    • explicit list of language identifiers, which must currently be in the list (ja, th, km, lo, my)

  • @eggrobin - Conceptually this is a mapping from for example Thai script to the th model
  • @sffc - The models should probably be stored as und-Thai instead of th. That's a known issue that we should fix in 2.0
  • @sffc - We should use auxiliary keys to store the extra models. For example, what we currently use is essentially und-Thai/th-model4
  • @eggrobin - It should be zxx-Thai if we're not intending to convey a language
  • @sffc - Should our other keys like week info also use zxx?
  • @eggrobin - ...
  • @robertbastian - I think it would be bad to add these as languages to the API now and then later change them to zxx-Thai

Updated proposal:

  • Add a flag --segmenter-models with the following options:
    • auto (default): includes the recommended models for the code
    • none: do not include any models
    • explicit ordered list of model identifiers, which must currently be in the list:
      • cjdict
      • burmesedict
      • khmerdict
      • laodict
      • thaidict
      • Burmese_codepoints_exclusive_model4_heavy
      • Khmer_codepoints_exclusive_model4_heavy
      • Lao_codepoints_exclusive_model4_heavy
      • Thai_codepoints_exclusive_model4_heavy

In the API, this is passed in as a enum SegmenterModels { Auto, Explicit(Vec<String>) }; the name should be bikeshed along with the rest of the datagen API shipping in 1.3

Note: we can support Thai_graphclust_model4_heavy, as well as other ones like model5 and model7, through an auxiliary key.

LGTM: @sffc @robertbastian @Manishearth @eggrobin

@sffc sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 5, 2023
@sffc sffc added T-core Type: Required functionality S-small Size: One afternoon (small bug fix or enhancement) labels Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants