Support slicing likely subtags and adding extended likely subtags #2903

Closed
sffc opened this issue Dec 20, 2022 · 3 comments · Fixed by #3197
Labels
C-data-infra (Component: provider, datagen, fallback, adapters)
C-locale (Component: Locale identifiers, BCP47)
S-medium (Size: Less than a week (larger bug fix or enhancement))
T-core (Type: Required functionality)

Comments

Member

sffc commented Dec 20, 2022

CLDR is planning to expand the number of entries in likely subtags data from the current ~1000 to ~7000. This isn't sustainable with our current use of likely subtags for locale fallback.

I suggest we take an approach similar to the one we took with Japanese eras: the default key contains only the essentials, and an additional "extended" key contains the full set. The caller chooses which one to load at the constructor level.

Based on a discussion with @macchiati, CLDR can add some spec text to inform ICU4X of how to correctly slice the likely subtags data.

I would also like to use this opportunity to possibly consolidate the two copies of likely subtags: the one for fallback only, and the one for the LocaleExpander.

CC @zbraniecki @dminor who have worked on this component.

sffc added the T-core (Type: Required functionality), C-locale (Component: Locale identifiers, BCP47), C-data-infra (Component: provider, datagen, fallback, adapters), and S-medium (Size: Less than a week (larger bug fix or enhancement)) labels Dec 20, 2022
sffc added this to the ICU4X 1.2 milestone Dec 20, 2022
sffc self-assigned this Dec 20, 2022
Member Author

sffc commented Dec 22, 2022

Member Author

sffc commented Feb 23, 2023

Based on the discussion in #3022, we should instead favor an approach more like Normalizer, where the extended subtags are a supplemental key that sits on top of the core key without duplicating the data in the core key.
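To make the supplemental-key idea concrete, here is a minimal sketch of a lookup that consults a core payload first and falls through to an optional extended payload, so no entry has to be stored twice. Everything in it is an assumption for illustration: `LanguageMap`, `LikelySubtags`, `new_core`, `new_extended`, and `script_and_region` are hypothetical names, not the ICU4X API, and a `HashMap` stands in for whatever data struct the real keys would use.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one data key's payload: maps a language
/// subtag to its likely (script, region). Not the real ICU4X data struct.
type LanguageMap = HashMap<&'static str, (&'static str, &'static str)>;

/// Hypothetical expander that always holds the core payload and may
/// additionally hold a supplemental payload with the extended entries.
struct LikelySubtags {
    core: LanguageMap,
    extended: Option<LanguageMap>,
}

impl LikelySubtags {
    /// Loads only the core key.
    fn new_core(core: LanguageMap) -> Self {
        Self { core, extended: None }
    }

    /// Loads the core key plus the supplemental key. The supplement is
    /// assumed to hold only entries that are absent from the core payload.
    fn new_extended(core: LanguageMap, extended: LanguageMap) -> Self {
        Self { core, extended: Some(extended) }
    }

    /// Looks in the core payload first, then in the supplement if loaded.
    fn script_and_region(&self, language: &str) -> Option<(&'static str, &'static str)> {
        self.core
            .get(language)
            .or_else(|| self.extended.as_ref().and_then(|ext| ext.get(language)))
            .copied()
    }
}
```

With this shape, a caller that only needs the core locales pays nothing for the extended entries, and the choice is still made at the constructor.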

Member Author

sffc commented Feb 24, 2023

I would like to end up with three data keys:

  1. Likely subtags needed for fallback, in core locales only:
    • L → R + S
    • L + R → S
    • L + S → R
  2. Remainder of likely subtags, also in core locales only:
    • S → L + R
    • S + R → L
    • R → L + S
  3. Extended data with non-core locales and all six mapping types
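The split above can be sketched as a small classifier that assigns each likely-subtags entry to one of the three keys based on which subtags appear on its input side. This is only an illustration under stated assumptions: `Input`, `DataKey`, and `key_for` are hypothetical names, and whether a locale counts as "core" is passed in as a flag because the thread does not define that rule here.

```rust
/// Which subtags are present on the input side of a likely-subtags entry
/// (hypothetical type for this sketch).
#[derive(Clone, Copy)]
struct Input {
    language: bool,
    script: bool,
    region: bool,
}

/// The three proposed data keys (hypothetical names).
#[derive(Debug, PartialEq)]
enum DataKey {
    /// Key 1: L, L + R, and L + S inputs in core locales.
    Fallback,
    /// Key 2: S, S + R, and R inputs in core locales.
    Remainder,
    /// Key 3: all six mapping types for non-core locales.
    Extended,
}

/// Returns the key an entry belongs to, or `None` if the input shape is not
/// one of the six mapping types listed above. `is_core_locale` is an
/// assumption: how core locales are determined is outside this sketch.
fn key_for(input: Input, is_core_locale: bool) -> Option<DataKey> {
    // Inputs that include a language subtag feed fallback.
    let needed_for_fallback = match (input.language, input.script, input.region) {
        (true, false, false) | (true, false, true) | (true, true, false) => true,
        (false, true, false) | (false, true, true) | (false, false, true) => false,
        // Empty input or a fully specified L + S + R is not a listed type.
        _ => return None,
    };
    Some(if !is_core_locale {
        DataKey::Extended
    } else if needed_for_fallback {
        DataKey::Fallback
    } else {
        DataKey::Remainder
    })
}
```

Under this split, a fallback-only consumer would load just the first key, a full expander for core locales would load the first two, and only callers that need non-core locales would add the third.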

In addition, I would like to pick some low-hanging fruit to reduce the data size of each of these bundles. I opened https://unicode-org.atlassian.net/browse/CLDR-16427 for one change that should help a lot.
