Globaldata CLDR locales set #3538

Closed
sffc opened this issue Jun 14, 2023 · 10 comments
Labels
A-design Area: Architecture or design C-data-infra Component: provider, datagen, fallback, adapters

Comments

@sffc
Member

sffc commented Jun 14, 2023

What set of locales should we use in globaldata?

Discuss with:

Optional:

Plan to add this to an upcoming ICU-TC call.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 14, 2023
@robertbastian
Member

I want to include at least modern, but can be convinced to include larger sets. We should not define a smaller set, as that will become a canonical Unicode approved locale set. Clients that want specific sets can run datagen.

@Manishearth
Member

If size isn't a problem I'm also in favor of modern

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 15, 2023
@eggrobin
Member

I second @robertbastian’s comment about standard proliferation:

We should not define a smaller set, as that will become a canonical Unicode approved locale set.

I have no opinion as to the actual set (for tiny sets of locales whose sole purpose is a good coverage of I18N issues I might have something to say, but as far as I can tell this is not the use case here).

@sffc
Member Author

sffc commented Jun 16, 2023

What defines the CLDR sets is not the usage or the need but rather the amount of data that happens to be collected for a particular locale. When a locale is in "modern", it is a stronger reflection of the well-connectedness of influencers in that locale than it is of whether that locale is a good choice for being a default locale.

@robertbastian
Copy link
Member

Do you have a proposal? I'm happy to use a smaller pre-existing set, I just don't want to define a new one.

I don't see a big problem with locales being modern because someone needed the locale and did the work to get it included in CLDR. Adding extra locales has very limited runtime impact (logarithmic impact on constructors, which often do heavier work), so it's mainly a code size thing. Do we agree that code-size-sensitive clients should use datagen, or do you want to reach these clients with baked data?

I also don't mind including all CLDR locales, including basic and moderate. This increases (postcard) size by 17%.

@sffc
Member Author

sffc commented Jun 27, 2023

According to territoryInfo.json, there are 487 language-region pairs that are "official", "official_regional", or "de_facto_official":

>>> import json
>>> ti = json.load(open("Downloads/territoryInfo.json"))
>>> official_locales = [(lang, region) for (region,regionInfo) in ti["supplemental"]["territoryInfo"].items() for (lang,langInfo) in regionInfo.get("languagePopulation", {}).items() if langInfo.get("_officialStatus", None) is not None]
>>> ["%s_%s" % (lang, region) for (lang, region) in official_locales]
['ca_AD', 'ar_AE', 'fa_AF', 'ps_AF', 'tk_AF', 'uz_Arab_AF', 'en_AG', 'en_AI', 'sq_AL', 'hy_AM', 'pt_AO', 'es_AR', 'en_AS', 'sm_AS', 'de_AT', 'hr_AT', 'hu_AT', 'sl_AT', 'en_AU', 'nl_AW', 'pap_AW', 'sv_AX', 'az_AZ', 'az_Cyrl_AZ', 'bs_BA', 'bs_Cyrl_BA', 'hr_BA', 'sr_BA', 'sr_Latn_BA', 'en_BB', 'bn_BD', 'de_BE', 'fr_BE', 'nl_BE', 'fr_BF', 'bg_BG', 'ar_BH', 'en_BI', 'fr_BI', 'rn_BI', 'fr_BJ', 'fr_BL', 'en_BM', 'ms_BN', 'ms_Arab_BN', 'ay_BO', 'es_BO', 'qu_BO', 'nl_BQ', 'pt_BR', 'vec_BR', 'en_BS', 'dz_BT', 'en_BW', 'tn_BW', 'be_BY', 'ru_BY', 'en_BZ', 'chp_CA', 'cr_CA', 'den_CA', 'dgr_CA', 'en_CA', 'fr_CA', 'gwi_CA', 'iu_CA', 'iu_Latn_CA', 'en_CC', 'fr_CD', 'kg_CD', 'ln_CD', 'lua_CD', 'sw_CD', 'fr_CF', 'sg_CF', 'fr_CG', 'de_CH', 'fr_CH', 'gsw_CH', 'it_CH', 'rm_CH', 'fr_CI', 'en_CK', 'es_CL', 'en_CM', 'fr_CM', 'bo_CN', 'ko_CN', 'mn_Mong_CN', 'ug_CN', 'za_CN', 'zh_CN', 'es_CO', 'en_CQ', 'es_CR', 'es_CU', 'pt_CV', 'nl_CW', 'pap_CW', 'en_CX', 'el_CY', 'tr_CY', 'cs_CZ', 'de_DE', 'frr_DE', 'en_DG', 'ar_DJ', 'fr_DJ', 'da_DK', 'de_DK', 'kl_DK', 'en_DM', 'es_DO', 'ar_DZ', 'fr_DZ', 'es_EA', 'es_EC', 'qu_EC', 'et_EE', 'ar_EG', 'ar_EH', 'ar_ER', 'en_ER', 'ti_ER', 'ast_ES', 'ca_ES', 'es_ES', 'eu_ES', 'gl_ES', 'oc_ES', 'am_ET', 'fi_FI', 'sms_FI', 'sv_FI', 'en_FJ', 'fj_FJ', 'hif_FJ', 'en_FK', 'en_FM', 'fo_FO', 'fr_FR', 'fr_GA', 'cy_GB', 'en_GB', 'ga_GB', 'gd_GB', 'en_GD', 'ab_GE', 'ka_GE', 'os_GE', 'fr_GF', 'en_GG', 'ak_GH', 'ee_GH', 'en_GH', 'gaa_GH', 'en_GI', 'kl_GL', 'en_GM', 'fr_GN', 'fr_GP', 'es_GQ', 'fr_GQ', 'pt_GQ', 'el_GR', 'es_GT', 'quc_GT', 'ch_GU', 'en_GU', 'pt_GW', 'en_GY', 'en_HK', 'zh_Hant_HK', 'es_HN', 'hr_HR', 'it_HR', 'vec_HR', 'fr_HT', 'ht_HT', 'hu_HU', 'es_IC', 'id_ID', 'en_IE', 'ga_IE', 'ar_IL', 'he_IL', 'en_IM', 'gv_IM', 'as_IN', 'bn_IN', 'en_IN', 'gu_IN', 'hi_IN', 'kha_IN', 'kn_IN', 'kok_IN', 'ks_IN', 'mai_IN', 'ml_IN', 'mr_IN', 'ne_IN', 'or_IN', 'pa_IN', 'sa_IN', 'sat_IN', 'sd_IN', 'sd_Deva_IN', 'ta_IN', 'te_IN', 'ur_IN', 'en_IO', 'ar_IQ', 'az_Arab_IQ', 'ckb_IQ', 
'fa_IR', 'is_IS', 'fr_IT', 'it_IT', 'vec_IT', 'en_JE', 'en_JM', 'ar_JO', 'ja_JP', 'en_KE', 'sw_KE', 'ky_KG', 'ru_KG', 'km_KH', 'en_KI', 'gil_KI', 'ar_KM', 'fr_KM', 'wni_KM', 'zdj_KM', 'en_KN', 'ko_KP', 'ko_KR', 'ar_KW', 'en_KY', 'kk_KZ', 'ru_KZ', 'lo_LA', 'ar_LB', 'en_LC', 'de_LI', 'gsw_LI', 'si_LK', 'ta_LK', 'en_LR', 'en_LS', 'st_LS', 'lt_LT', 'de_LU', 'fr_LU', 'lb_LU', 'lv_LV', 'ar_LY', 'ar_MA', 'fr_MA', 'tzm_MA', 'fr_MC', 'ro_MD', 'sr_Latn_ME', 'fr_MF', 'en_MG', 'fr_MG', 'mg_MG', 'en_MH', 'mh_MH', 'mk_MK', 'sq_MK', 'fr_ML', 'my_MM', 'mn_MN', 'pt_MO', 'zh_Hant_MO', 'en_MP', 'fr_MQ', 'ar_MR', 'en_MS', 'en_MT', 'mt_MT', 'en_MU', 'fr_MU', 'dv_MV', 'en_MW', 'ny_MW', 'es_MX', 'vec_MX', 'ms_MY', 'pt_MZ', 'en_NA', 'fr_NC', 'fr_NE', 'en_NF', 'en_NG', 'yo_NG', 'es_NI', 'fy_NL', 'nl_NL', 'nb_NO', 'nn_NO', 'no_NO', 'se_NO', 'ne_NP', 'en_NR', 'na_NR', 'en_NU', 'niu_NU', 'en_NZ', 'mi_NZ', 'ar_OM', 'es_PA', 'es_PE', 'qu_PE', 'fr_PF', 'ty_PF', 'en_PG', 'ho_PG', 'tpi_PG', 'ceb_PH', 'en_PH', 'fil_PH', 'hil_PH', 'ilo_PH', 'mdh_PH', 'pag_PH', 'tsg_PH', 'war_PH', 'en_PK', 'ur_PK', 'csb_PL', 'de_PL', 'lt_PL', 'pl_PL', 'fr_PM', 'en_PN', 'en_PR', 'es_PR', 'ar_PS', 'pt_PT', 'en_PW', 'pau_PW', 'es_PY', 'gn_PY', 'ar_QA', 'fr_RE', 'ro_RO', 'hr_RS', 'hu_RS', 'ro_RS', 'sk_RS', 'sr_RS', 'sr_Latn_RS', 'uk_RS', 'ady_RU', 'av_RU', 'az_Cyrl_RU', 'ba_RU', 'ce_RU', 'inh_RU', 'kbd_RU', 'koi_RU', 'krc_RU', 'kum_RU', 'kv_RU', 'lbe_RU', 'lez_RU', 'mdf_RU', 'myv_RU', 'ru_RU', 'sah_RU', 'tt_RU', 'tyv_RU', 'udm_RU', 'en_RW', 'fr_RW', 'rw_RW', 'ar_SA', 'en_SB', 'en_SC', 'fr_SC', 'ar_SD', 'en_SD', 'fi_SE', 'sv_SE', 'en_SG', 'ms_SG', 'ta_SG', 'zh_SG', 'en_SH', 'sl_SI', 'vec_SI', 'nb_SJ', 'sk_SK', 'en_SL', 'it_SM', 'bjt_SN', 'bsc_SN', 'dyo_SN', 'ff_SN', 'fr_SN', 'knf_SN', 'mey_SN', 'mfv_SN', 'sav_SN', 'snf_SN', 'srr_SN', 'tnr_SN', 'wo_SN', 'ar_SO', 'so_SO', 'nl_SR', 'en_SS', 'pt_ST', 'es_SV', 'en_SX', 'nl_SX', 'ar_SY', 'fr_SY', 'en_SZ', 'ss_SZ', 'en_TC', 'ar_TD', 'fr_TD', 'fr_TG', 'th_TH', 'tg_TJ', 'en_TK', 
'tkl_TK', 'pt_TL', 'tet_TL', 'tk_TM', 'ar_TN', 'fr_TN', 'en_TO', 'to_TO', 'tr_TR', 'en_TT', 'en_TV', 'tvl_TV', 'zh_Hant_TW', 'en_TZ', 'sw_TZ', 'ru_UA', 'uk_UA', 'en_UG', 'sw_UG', 'en_UM', 'en_US', 'es_US', 'haw_US', 'es_UY', 'uz_UZ', 'uz_Cyrl_UZ', 'it_VA', 'en_VC', 'es_VE', 'en_VG', 'en_VI', 'vi_VN', 'bi_VU', 'en_VU', 'fr_VU', 'fr_WF', 'en_WS', 'sm_WS', 'sq_XK', 'sr_XK', 'sr_Latn_XK', 'ar_YE', 'fr_YT', 'af_ZA', 'en_ZA', 'nr_ZA', 'nso_ZA', 'ss_ZA', 'st_ZA', 'tn_ZA', 'ts_ZA', 've_ZA', 'xh_ZA', 'zu_ZA', 'en_ZM', 'en_ZW', 'nd_ZW', 'sn_ZW']

There are currently 93 modern locales in coverageLevels.json:

>>> cl = json.load(open("Downloads/coverageLevels.json"))
>>> modern = [x for (x,y) in cl["coverageLevels"].items() if y=="modern"]
>>> modern
['af', 'am', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'bs', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'ga', 'gd', 'gl', 'gu', 'ha', 'he', 'hi', 'hi-Latn', 'hr', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'ja', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'kok', 'ky', 'lo', 'lt', 'lv', 'mk', 'ml', 'mn', 'mr', 'ms', 'my', 'ne', 'nl', 'nn', 'no', 'or', 'pa', 'pcm', 'pl', 'ps', 'pt', 'ro', 'ru', 'sd', 'si', 'sk', 'sl', 'so', 'sq', 'sr', 'sr-Latn', 'sv', 'sw', 'ta', 'te', 'th', 'tk', 'tr', 'uk', 'ur', 'uz', 'vi', 'yo', 'yue', 'yue-Hans', 'zh', 'zh-Hant', 'zu']

There are 7 locales that are modern but not official:

'pcm', 'yue', 'hi-Latn', 'ha', 'ig', 'yue-Hans', 'jv'

There are 132 that are official but not modern:

'lb', 'bjt', 'tvl', 'dz', 'bs-Cyrl', 'kg', 'mi', 'ln', 'fj', 'gil', 'xh', 'kv', 'ce', 'mfv', 'ak', 'sa', 'az-Arab', 'kum', 'dv', 'koi', 'den', 'st', 'wo', 'bsc', 'dyo', 'ks', 'vec', 'fy', 'pag', 'gsw', 'lua', 'ht', 'sg', 'ba', 'ilo', 'za', 'mey', 'gv', 'pap', 'qu', 'inh', 'ss', 'av', 'fo', 'kha', 'kbd', 'ti', 'ee', 'tt', 'nr', 'mh', 'mdf', 'tnr', 'cr', 'uz-Arab', 'mn-Mong', 'oc', 'ts', 'sat', 'bi', 'tg', 'bo', 'ny', 'csb', 'udm', 'sah', 'lez', 'hil', 'sn', 'war', 'sms', 'az-Cyrl', 'ho', 'dgr', 'tyv', 'tet', 'na', 'myv', 've', 'gn', 'chp', 'tsg', 'ty', 'tn', 'zdj', 'ay', 'frr', 'mai', 'rw', 'sd-Deva', 'iu-Latn', 'gaa', 'pau', 'srr', 'ms-Arab', 'quc', 'mdh', 'ug', 'mt', 'to', 'nd', 'ady', 'sm', 'ab', 'ast', 'ff', 'ceb', 'niu', 'haw', 'lbe', 'iu', 'nb', 'kl', 'krc', 'mg', 'ckb', 'tzm', 'tkl', 'wni', 'rn', 'tpi', 'se', 'gwi', 'hif', 'snf', 'uz-Cyrl', 'knf', 'nso', 'os', 'sav', 'rm', 'ch'
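
The script that produced the two difference lists is not shown above, so here is a minimal sketch of how they could be computed with plain set operations. The helper names (`official_languages`, `languages_at`) and the tiny sample data are made up for illustration. The sketch assumes that territoryInfo keys such as `uz_Arab` should be compared as `uz-Arab`, which matches how the lists above are spelled.

```python
def official_languages(territory_info: dict) -> set:
    """Collect every language (with script, if any) that has an _officialStatus."""
    return {
        lang.replace("_", "-")  # 'uz_Arab' -> 'uz-Arab', to match coverageLevels keys
        for region_info in territory_info.values()
        for lang, lang_info in region_info.get("languagePopulation", {}).items()
        if lang_info.get("_officialStatus") is not None
    }


def languages_at(coverage_levels: dict, levels: set) -> set:
    """Locales whose coverage level is one of `levels`."""
    return {locale for locale, level in coverage_levels.items() if level in levels}


# Made-up sample in the same shape as ti["supplemental"]["territoryInfo"]
# and cl["coverageLevels"]; it is not real CLDR data.
sample_territory_info = {
    "AF": {"languagePopulation": {
        "fa": {"_officialStatus": "official"},
        "uz_Arab": {"_officialStatus": "official"},
        "bal": {},
    }},
    "NG": {"languagePopulation": {
        "en": {"_officialStatus": "official"},
        "pcm": {},
    }},
}
sample_coverage = {"fa": "modern", "en": "modern", "pcm": "modern", "bal": "basic"}

official = official_languages(sample_territory_info)
modern = languages_at(sample_coverage, {"modern"})

modern_not_official = sorted(modern - official)
official_not_modern = sorted(official - modern)
union = sorted(modern | official)

print(modern_not_official, official_not_modern, union)
```

With the real files, `territory_info` would be `ti["supplemental"]["territoryInfo"]` and `coverage_levels` would be `cl["coverageLevels"]`, as loaded earlier in this comment.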

Also note that the coverage levels do not seem to consider region, only language and script.

Perhaps we take the union of modern and official?

@sffc
Member Author

sffc commented Jun 27, 2023

Languages that are modern/moderate/basic but not official:

'brx', 'su', 'jv', 'dsb', 'br', 'hsb', 'chr', 'bgc', 'pcm', 'cv', 'yue', 'bho', 'kea', 'ig', 'ff-Adlm', 'hi-Latn', 'ha', 'mni', 'kgp', 'sc', 'yrl', 'doi', 'ks-Deva', 'yue-Hans', 'raj', 'ia'

Languages that are official but not modern/moderate/basic:

'lb', 'bjt', 'tvl', 'dz', 'kg', 'ln', 'fj', 'gil', 'kv', 'ce', 'mfv', 'ak', 'az-Arab', 'kum', 'dv', 'koi', 'den', 'st', 'bsc', 'dyo', 'vec', 'fy', 'pag', 'gsw', 'lua', 'ht', 'sg', 'ba', 'ilo', 'za', 'mey', 'gv', 'pap', 'inh', 'ss', 'av', 'kha', 'kbd', 'ee', 'nr', 'mh', 'mdf', 'tnr', 'cr', 'uz-Arab', 'mn-Mong', 'oc', 'ts', 'bi', 'bo', 'ny', 'csb', 'udm', 'sah', 'lez', 'hil', 'sn', 'war', 'sms', 'az-Cyrl', 'ho', 'dgr', 'tyv', 'tet', 'na', 'myv', 've', 'gn', 'chp', 'tsg', 'ty', 'tn', 'zdj', 'ay', 'frr', 'rw', 'iu-Latn', 'gaa', 'pau', 'srr', 'ms-Arab', 'quc', 'mdh', 'ug', 'mt', 'nd', 'ady', 'sm', 'ab', 'ff', 'niu', 'haw', 'lbe', 'iu', 'nb', 'kl', 'krc', 'mg', 'ckb', 'tzm', 'tkl', 'wni', 'rn', 'tpi', 'se', 'gwi', 'hif', 'snf', 'knf', 'nso', 'os', 'sav', 'ch'

Above list but with regions:

'sm-AS', 'pap-AW', 'rn-BI', 'ay-BO', 'vec-BR', 'dz-BT', 'tn-BW', 'chp-CA', 'cr-CA', 'den-CA', 'dgr-CA', 'gwi-CA', 'iu-CA', 'kg-CD', 'ln-CD', 'lua-CD', 'sg-CF', 'gsw-CH', 'bo-CN', 'ug-CN', 'za-CN', 'pap-CW', 'frr-DE', 'kl-DK', 'oc-ES', 'sms-FI', 'fj-FJ', 'hif-FJ', 'ab-GE', 'os-GE', 'ak-GH', 'ee-GH', 'gaa-GH', 'kl-GL', 'quc-GT', 'ch-GU', 'vec-HR', 'ht-HT', 'gv-IM', 'kha-IN', 'ckb-IQ', 'vec-IT', 'gil-KI', 'wni-KM', 'zdj-KM', 'gsw-LI', 'st-LS', 'lb-LU', 'tzm-MA', 'mg-MG', 'mh-MH', 'mt-MT', 'dv-MV', 'ny-MW', 'vec-MX', 'fy-NL', 'nb-NO', 'se-NO', 'na-NR', 'niu-NU', 'ty-PF', 'ho-PG', 'tpi-PG', 'hil-PH', 'ilo-PH', 'mdh-PH', 'pag-PH', 'tsg-PH', 'war-PH', 'csb-PL', 'pau-PW', 'gn-PY', 'ady-RU', 'av-RU', 'ba-RU', 'ce-RU', 'inh-RU', 'kbd-RU', 'koi-RU', 'krc-RU', 'kum-RU', 'kv-RU', 'lbe-RU', 'lez-RU', 'mdf-RU', 'myv-RU', 'sah-RU', 'tyv-RU', 'udm-RU', 'rw-RW', 'vec-SI', 'nb-SJ', 'bjt-SN', 'bsc-SN', 'dyo-SN', 'ff-SN', 'knf-SN', 'mey-SN', 'mfv-SN', 'sav-SN', 'snf-SN', 'srr-SN', 'tnr-SN', 'ss-SZ', 'tkl-TK', 'tet-TL', 'tvl-TV', 'haw-US', 'bi-VU', 'sm-WS', 'nr-ZA', 'nso-ZA', 'ss-ZA', 'st-ZA', 'tn-ZA', 'ts-ZA', 've-ZA', 'nd-ZW', 'sn-ZW'

@sffc
Member Author

sffc commented Jun 27, 2023

There are only 38 languages that are moderate/basic (fewer than half as many as are modern), so I don't see a good reason to exclude them.
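
For anyone who wants to re-check that count, a rough sketch of tallying locales per coverage level follows. The function names are hypothetical and the sample dictionary is invented; in particular, the levels assigned to `br` and `ia` here are placeholders, not their actual CLDR levels.

```python
from collections import Counter


def count_by_level(coverage_levels: dict) -> Counter:
    """Number of locales at each coverage level."""
    return Counter(coverage_levels.values())


def locales_at(coverage_levels: dict, levels: set) -> list:
    """Sorted locales whose level is in `levels`."""
    return sorted(loc for loc, lvl in coverage_levels.items() if lvl in levels)


# Invented sample in the shape of cl["coverageLevels"].
sample_levels = {
    "en": "modern",
    "fr": "modern",
    "ja": "modern",
    "br": "moderate",  # placeholder level
    "ia": "basic",     # placeholder level
}

counts = count_by_level(sample_levels)
below_modern = locales_at(sample_levels, {"moderate", "basic"})
print(counts, below_modern)
```

Running the same two functions on `cl["coverageLevels"]` from the earlier comment should give the real per-level totals.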

One can argue that it simply does not make a lot of sense to include locales that don't have good coverage. However, even sub-basic locales often have some coverage.

In terms of regional variants, we discussed a few options, which can be flags:

  1. Include all regional variants for languages that are included (enumerating CLDR data to identify them)
  2. Only include regional variants that are explicitly specified, and parents of them
  3. Include regional variants that are present in territoryInfo.json in one form or another (based on population, official status, etc)

I think we can start with 1 and 2, adding 3 later if someone asks for it.
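
To make options 1 and 2 concrete, here is a small sketch of the two selection modes over a list of available locale identifiers. It is not the datagen implementation: the function names are hypothetical, and parents are found by naive subtag truncation, whereas real CLDR inheritance also has explicit parent overrides that this ignores. Matching is done on language plus script, on the assumption that `sr` and `sr-Latn` are treated as separate entries, as in the coverage lists.

```python
def language_script(locale: str) -> str:
    """Drop a trailing region subtag: 'sr-Latn-BA' -> 'sr-Latn', 'en-GB' -> 'en'."""
    parts = locale.split("-")
    last = parts[-1]
    # Region subtags are two letters ('GB') or three digits ('001').
    is_region = (len(last) == 2 and last.isalpha()) or (len(last) == 3 and last.isdigit())
    if len(parts) > 1 and is_region:
        parts = parts[:-1]
    return "-".join(parts)


def parents(locale: str) -> list:
    """Naive truncation parents: 'sr-Latn-BA' -> ['sr-Latn', 'sr']."""
    parts = locale.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def mode_1_all_variants(languages: set, available: list) -> list:
    """Option 1: every available locale whose language(-script) is included."""
    return sorted(loc for loc in available if language_script(loc) in languages)


def mode_2_explicit(requested: list, available: list) -> list:
    """Option 2: only the requested locales plus their parents."""
    wanted = set(requested)
    for loc in requested:
        wanted.update(parents(loc))
    return sorted(wanted & set(available))


# Illustrative stand-in for the locales found by enumerating CLDR data.
available = ["en", "en-GB", "en-US", "fr", "fr-CA", "sr", "sr-Latn", "sr-Latn-BA", "ccp"]

all_variants = mode_1_all_variants({"en", "sr-Latn"}, available)
explicit_only = mode_2_explicit(["en-GB", "sr-Latn-BA"], available)
print(all_variants)
print(explicit_only)
```

Option 3 would need territoryInfo.json as an extra input, so it is left out of the sketch.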

One issue is that we seem to lose our current testdata locale Chakma (ccp). We mainly include it to test supplementary code points. The only language in modern/moderate/basic that seems to use supplementary code points is ff-Adlm. In all, here are the locales with supplementary code points in their exemplar sets:

'rhg-Rohg', 'ff-Adlm-GW', 'ff-Adlm-NE', 'ccp-IN', 'ff-Adlm-GM', 'ff-Adlm-GH', 'ccp', 'hnj-Hmnp', 'rhg', 'ff-Adlm-SN', 'ff-Adlm-NG', 'hnj', 'ff-Adlm-MR', 'en-Dsrt', 'ff-Adlm-CM', 'en-Shaw', 'ff-Adlm-BF', 'rhg-Rohg-BD', 'ff-Adlm-LR', 'osa', 'ff-Adlm', 'ff-Adlm-SL'
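
A check like this can be sketched in a few lines, assuming the exemplar characters for each locale are available as a plain string. The function names and the sample mapping below are illustrative only, and the sample strings are short stand-ins rather than real exemplar sets.

```python
def has_supplementary_code_points(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def locales_with_supplementary(exemplars: dict) -> list:
    """Sorted locales whose exemplar string contains a supplementary code point."""
    return sorted(
        loc for loc, chars in exemplars.items() if has_supplementary_code_points(chars)
    )


# Stand-in exemplar strings, not real CLDR exemplar sets.
sample_exemplars = {
    "en": "abc",
    "el": "\u03b1\u03b2",                # Greek, inside the BMP
    "ccp": "\U00011103\U00011104",       # Chakma block, U+11100..U+1114F
    "ff-Adlm": "\U0001E900\U0001E901",   # Adlam block, U+1E900..U+1E95F
}

found = locales_with_supplementary(sample_exemplars)
print(found)
```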

@sffc
Member Author

sffc commented Jun 27, 2023

Conclusions:

  • The default set of languages for globaldata is modern, moderate, basic
  • We support both region selection modes 1 and 2, defaulting to 1

LGTM: @Manishearth @sffc @eggrobin @robertbastian

Furthermore:

  • Add --locales recommended and the corresponding datagen API LocaleInclude::Recommended
  • Make this the default value on the CLI

LGTM: @Manishearth @eggrobin @skius @robertbastian @sffc

@sffc
Member Author

sffc commented Jun 27, 2023

@macchiati also approved of the above scheme.

@sffc sffc added A-design Area: Architecture or design C-meta Component: Relating to ICU4X as a whole C-data-infra Component: provider, datagen, fallback, adapters and removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band C-meta Component: Relating to ICU4X as a whole labels Jun 28, 2023
@sffc sffc closed this as completed Oct 5, 2023
@sffc sffc added this to the 1.3 Blocking ⟨P1⟩ milestone Oct 5, 2023