Globaldata CLDR locales set #3538

sffc · 2023-06-14T22:56:00Z

What set of locales should we use in globaldata?

Discuss with:

@sffc
@robertbastian
@eggrobin
Someone from CLDR and/or ICU

Optional:

@Manishearth

Plan to add this to an upcoming ICU-TC call.

robertbastian · 2023-06-15T09:26:57Z

I want to include at least modern, but can be convinced to include larger sets. We should not define a smaller set, as that will become a canonical Unicode-approved locale set. Clients that want specific sets can run datagen.

Manishearth · 2023-06-15T11:47:58Z

If size isn't a problem, I'm also in favor of modern.

eggrobin · 2023-06-16T10:19:20Z

I second @robertbastian’s comment about standard proliferation:

"We should not define a smaller set, as that will become a canonical Unicode-approved locale set."

I have no opinion as to the actual set (for tiny sets of locales whose sole purpose is a good coverage of I18N issues I might have something to say, but as far as I can tell this is not the use case here).

sffc · 2023-06-16T15:32:46Z

What defines the CLDR sets is not the usage or the need but rather the amount of data that happens to be collected for a particular locale. When a locale is in "modern", it is a stronger reflection of the well-connectedness of influencers in that locale than it is of whether that locale is a good choice for being a default locale.

robertbastian · 2023-06-19T09:15:47Z

Do you have a proposal? I'm happy to use a smaller pre-existing set, I just don't want to define a new one.

I don't see a big problem with locales being modern because someone needed the locale and did the work to get it included in CLDR. Adding extra locales has very limited runtime impact (logarithmic impact on constructors, which often do heavier work), so it's mainly a code-size concern. Do we agree that code-size-sensitive clients should use datagen, or do you want to reach these clients with baked data?

I also don't mind including all CLDR locales, including basic and moderate. This increases (postcard) size by 17%.
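To illustrate the "logarithmic impact" mentioned above, here is a minimal sketch. It assumes a locale is resolved by binary search over a sorted key table; the thread only says "logarithmic" and does not describe the actual lookup. `find_locale` and `max_probes` are hypothetical names used for illustration only.

```python
from bisect import bisect_left


def find_locale(sorted_locales, wanted):
    """Return the index of `wanted` in `sorted_locales`, or None if absent."""
    i = bisect_left(sorted_locales, wanted)
    if i < len(sorted_locales) and sorted_locales[i] == wanted:
        return i
    return None


def max_probes(n):
    """Upper bound on binary-search comparisons for a table of n >= 1 entries."""
    return n.bit_length()


# Assumption: keys are stored sorted so that binary search applies.
table = sorted(["de", "en", "fr", "ja"])
print(find_locale(table, "fr"))
# Doubling the table size adds a single probe to the worst case.
print(max_probes(93), max_probes(186))
```

Under that assumption, doubling the number of locales costs one more comparison per lookup, which is why the cost is mostly a matter of data size rather than runtime.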

sffc · 2023-06-27T07:49:49Z

According to territoryInfo.json, there are 487 language-region pairs that are "official", "official_regional", or "de_facto_official":

>>> import json
>>> ti = json.load(open("Downloads/territoryInfo.json"))
>>> # Collect (language, region) pairs whose languagePopulation entry has any _officialStatus
>>> official_locales = [(lang, region) for (region, regionInfo) in ti["supplemental"]["territoryInfo"].items() for (lang, langInfo) in regionInfo.get("languagePopulation", {}).items() if langInfo.get("_officialStatus") is not None]
>>> ["%s_%s" % (lang, region) for (lang, region) in official_locales]

['ca_AD', 'ar_AE', 'fa_AF', 'ps_AF', 'tk_AF', 'uz_Arab_AF', 'en_AG', 'en_AI', 'sq_AL', 'hy_AM', 'pt_AO', 'es_AR', 'en_AS', 'sm_AS', 'de_AT', 'hr_AT', 'hu_AT', 'sl_AT', 'en_AU', 'nl_AW', 'pap_AW', 'sv_AX', 'az_AZ', 'az_Cyrl_AZ', 'bs_BA', 'bs_Cyrl_BA', 'hr_BA', 'sr_BA', 'sr_Latn_BA', 'en_BB', 'bn_BD', 'de_BE', 'fr_BE', 'nl_BE', 'fr_BF', 'bg_BG', 'ar_BH', 'en_BI', 'fr_BI', 'rn_BI', 'fr_BJ', 'fr_BL', 'en_BM', 'ms_BN', 'ms_Arab_BN', 'ay_BO', 'es_BO', 'qu_BO', 'nl_BQ', 'pt_BR', 'vec_BR', 'en_BS', 'dz_BT', 'en_BW', 'tn_BW', 'be_BY', 'ru_BY', 'en_BZ', 'chp_CA', 'cr_CA', 'den_CA', 'dgr_CA', 'en_CA', 'fr_CA', 'gwi_CA', 'iu_CA', 'iu_Latn_CA', 'en_CC', 'fr_CD', 'kg_CD', 'ln_CD', 'lua_CD', 'sw_CD', 'fr_CF', 'sg_CF', 'fr_CG', 'de_CH', 'fr_CH', 'gsw_CH', 'it_CH', 'rm_CH', 'fr_CI', 'en_CK', 'es_CL', 'en_CM', 'fr_CM', 'bo_CN', 'ko_CN', 'mn_Mong_CN', 'ug_CN', 'za_CN', 'zh_CN', 'es_CO', 'en_CQ', 'es_CR', 'es_CU', 'pt_CV', 'nl_CW', 'pap_CW', 'en_CX', 'el_CY', 'tr_CY', 'cs_CZ', 'de_DE', 'frr_DE', 'en_DG', 'ar_DJ', 'fr_DJ', 'da_DK', 'de_DK', 'kl_DK', 'en_DM', 'es_DO', 'ar_DZ', 'fr_DZ', 'es_EA', 'es_EC', 'qu_EC', 'et_EE', 'ar_EG', 'ar_EH', 'ar_ER', 'en_ER', 'ti_ER', 'ast_ES', 'ca_ES', 'es_ES', 'eu_ES', 'gl_ES', 'oc_ES', 'am_ET', 'fi_FI', 'sms_FI', 'sv_FI', 'en_FJ', 'fj_FJ', 'hif_FJ', 'en_FK', 'en_FM', 'fo_FO', 'fr_FR', 'fr_GA', 'cy_GB', 'en_GB', 'ga_GB', 'gd_GB', 'en_GD', 'ab_GE', 'ka_GE', 'os_GE', 'fr_GF', 'en_GG', 'ak_GH', 'ee_GH', 'en_GH', 'gaa_GH', 'en_GI', 'kl_GL', 'en_GM', 'fr_GN', 'fr_GP', 'es_GQ', 'fr_GQ', 'pt_GQ', 'el_GR', 'es_GT', 'quc_GT', 'ch_GU', 'en_GU', 'pt_GW', 'en_GY', 'en_HK', 'zh_Hant_HK', 'es_HN', 'hr_HR', 'it_HR', 'vec_HR', 'fr_HT', 'ht_HT', 'hu_HU', 'es_IC', 'id_ID', 'en_IE', 'ga_IE', 'ar_IL', 'he_IL', 'en_IM', 'gv_IM', 'as_IN', 'bn_IN', 'en_IN', 'gu_IN', 'hi_IN', 'kha_IN', 'kn_IN', 'kok_IN', 'ks_IN', 'mai_IN', 'ml_IN', 'mr_IN', 'ne_IN', 'or_IN', 'pa_IN', 'sa_IN', 'sat_IN', 'sd_IN', 'sd_Deva_IN', 'ta_IN', 'te_IN', 'ur_IN', 'en_IO', 'ar_IQ', 'az_Arab_IQ', 'ckb_IQ', 
'fa_IR', 'is_IS', 'fr_IT', 'it_IT', 'vec_IT', 'en_JE', 'en_JM', 'ar_JO', 'ja_JP', 'en_KE', 'sw_KE', 'ky_KG', 'ru_KG', 'km_KH', 'en_KI', 'gil_KI', 'ar_KM', 'fr_KM', 'wni_KM', 'zdj_KM', 'en_KN', 'ko_KP', 'ko_KR', 'ar_KW', 'en_KY', 'kk_KZ', 'ru_KZ', 'lo_LA', 'ar_LB', 'en_LC', 'de_LI', 'gsw_LI', 'si_LK', 'ta_LK', 'en_LR', 'en_LS', 'st_LS', 'lt_LT', 'de_LU', 'fr_LU', 'lb_LU', 'lv_LV', 'ar_LY', 'ar_MA', 'fr_MA', 'tzm_MA', 'fr_MC', 'ro_MD', 'sr_Latn_ME', 'fr_MF', 'en_MG', 'fr_MG', 'mg_MG', 'en_MH', 'mh_MH', 'mk_MK', 'sq_MK', 'fr_ML', 'my_MM', 'mn_MN', 'pt_MO', 'zh_Hant_MO', 'en_MP', 'fr_MQ', 'ar_MR', 'en_MS', 'en_MT', 'mt_MT', 'en_MU', 'fr_MU', 'dv_MV', 'en_MW', 'ny_MW', 'es_MX', 'vec_MX', 'ms_MY', 'pt_MZ', 'en_NA', 'fr_NC', 'fr_NE', 'en_NF', 'en_NG', 'yo_NG', 'es_NI', 'fy_NL', 'nl_NL', 'nb_NO', 'nn_NO', 'no_NO', 'se_NO', 'ne_NP', 'en_NR', 'na_NR', 'en_NU', 'niu_NU', 'en_NZ', 'mi_NZ', 'ar_OM', 'es_PA', 'es_PE', 'qu_PE', 'fr_PF', 'ty_PF', 'en_PG', 'ho_PG', 'tpi_PG', 'ceb_PH', 'en_PH', 'fil_PH', 'hil_PH', 'ilo_PH', 'mdh_PH', 'pag_PH', 'tsg_PH', 'war_PH', 'en_PK', 'ur_PK', 'csb_PL', 'de_PL', 'lt_PL', 'pl_PL', 'fr_PM', 'en_PN', 'en_PR', 'es_PR', 'ar_PS', 'pt_PT', 'en_PW', 'pau_PW', 'es_PY', 'gn_PY', 'ar_QA', 'fr_RE', 'ro_RO', 'hr_RS', 'hu_RS', 'ro_RS', 'sk_RS', 'sr_RS', 'sr_Latn_RS', 'uk_RS', 'ady_RU', 'av_RU', 'az_Cyrl_RU', 'ba_RU', 'ce_RU', 'inh_RU', 'kbd_RU', 'koi_RU', 'krc_RU', 'kum_RU', 'kv_RU', 'lbe_RU', 'lez_RU', 'mdf_RU', 'myv_RU', 'ru_RU', 'sah_RU', 'tt_RU', 'tyv_RU', 'udm_RU', 'en_RW', 'fr_RW', 'rw_RW', 'ar_SA', 'en_SB', 'en_SC', 'fr_SC', 'ar_SD', 'en_SD', 'fi_SE', 'sv_SE', 'en_SG', 'ms_SG', 'ta_SG', 'zh_SG', 'en_SH', 'sl_SI', 'vec_SI', 'nb_SJ', 'sk_SK', 'en_SL', 'it_SM', 'bjt_SN', 'bsc_SN', 'dyo_SN', 'ff_SN', 'fr_SN', 'knf_SN', 'mey_SN', 'mfv_SN', 'sav_SN', 'snf_SN', 'srr_SN', 'tnr_SN', 'wo_SN', 'ar_SO', 'so_SO', 'nl_SR', 'en_SS', 'pt_ST', 'es_SV', 'en_SX', 'nl_SX', 'ar_SY', 'fr_SY', 'en_SZ', 'ss_SZ', 'en_TC', 'ar_TD', 'fr_TD', 'fr_TG', 'th_TH', 'tg_TJ', 'en_TK', 
'tkl_TK', 'pt_TL', 'tet_TL', 'tk_TM', 'ar_TN', 'fr_TN', 'en_TO', 'to_TO', 'tr_TR', 'en_TT', 'en_TV', 'tvl_TV', 'zh_Hant_TW', 'en_TZ', 'sw_TZ', 'ru_UA', 'uk_UA', 'en_UG', 'sw_UG', 'en_UM', 'en_US', 'es_US', 'haw_US', 'es_UY', 'uz_UZ', 'uz_Cyrl_UZ', 'it_VA', 'en_VC', 'es_VE', 'en_VG', 'en_VI', 'vi_VN', 'bi_VU', 'en_VU', 'fr_VU', 'fr_WF', 'en_WS', 'sm_WS', 'sq_XK', 'sr_XK', 'sr_Latn_XK', 'ar_YE', 'fr_YT', 'af_ZA', 'en_ZA', 'nr_ZA', 'nso_ZA', 'ss_ZA', 'st_ZA', 'tn_ZA', 'ts_ZA', 've_ZA', 'xh_ZA', 'zu_ZA', 'en_ZM', 'en_ZW', 'nd_ZW', 'sn_ZW']

There are currently 93 modern locales in coverageLevels.json:

>>> cl = json.load(open("Downloads/coverageLevels.json"))
>>> modern = [x for (x,y) in cl["coverageLevels"].items() if y=="modern"]

['af', 'am', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'bs', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'ga', 'gd', 'gl', 'gu', 'ha', 'he', 'hi', 'hi-Latn', 'hr', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'ja', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'kok', 'ky', 'lo', 'lt', 'lv', 'mk', 'ml', 'mn', 'mr', 'ms', 'my', 'ne', 'nl', 'nn', 'no', 'or', 'pa', 'pcm', 'pl', 'ps', 'pt', 'ro', 'ru', 'sd', 'si', 'sk', 'sl', 'so', 'sq', 'sr', 'sr-Latn', 'sv', 'sw', 'ta', 'te', 'th', 'tk', 'tr', 'uk', 'ur', 'uz', 'vi', 'yo', 'yue', 'yue-Hans', 'zh', 'zh-Hant', 'zu']

There are 7 languages that are modern but not official:

'pcm', 'yue', 'hi-Latn', 'ha', 'ig', 'yue-Hans', 'jv'

There are 132 that are official but not modern:

'lb', 'bjt', 'tvl', 'dz', 'bs-Cyrl', 'kg', 'mi', 'ln', 'fj', 'gil', 'xh', 'kv', 'ce', 'mfv', 'ak', 'sa', 'az-Arab', 'kum', 'dv', 'koi', 'den', 'st', 'wo', 'bsc', 'dyo', 'ks', 'vec', 'fy', 'pag', 'gsw', 'lua', 'ht', 'sg', 'ba', 'ilo', 'za', 'mey', 'gv', 'pap', 'qu', 'inh', 'ss', 'av', 'fo', 'kha', 'kbd', 'ti', 'ee', 'tt', 'nr', 'mh', 'mdf', 'tnr', 'cr', 'uz-Arab', 'mn-Mong', 'oc', 'ts', 'sat', 'bi', 'tg', 'bo', 'ny', 'csb', 'udm', 'sah', 'lez', 'hil', 'sn', 'war', 'sms', 'az-Cyrl', 'ho', 'dgr', 'tyv', 'tet', 'na', 'myv', 've', 'gn', 'chp', 'tsg', 'ty', 'tn', 'zdj', 'ay', 'frr', 'mai', 'rw', 'sd-Deva', 'iu-Latn', 'gaa', 'pau', 'srr', 'ms-Arab', 'quc', 'mdh', 'ug', 'mt', 'to', 'nd', 'ady', 'sm', 'ab', 'ast', 'ff', 'ceb', 'niu', 'haw', 'lbe', 'iu', 'nb', 'kl', 'krc', 'mg', 'ckb', 'tzm', 'tkl', 'wni', 'rn', 'tpi', 'se', 'gwi', 'hif', 'snf', 'uz-Cyrl', 'knf', 'nso', 'os', 'sav', 'rm', 'ch'

Also note that coverageLevels.json does not seem to consider region, only language and script.

Perhaps we take the union of modern and official?
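The comparisons above reduce to set operations once the region is dropped from each official pair. The following is a sketch on small excerpts of the lists quoted above, not the full data; `language_of` is a hypothetical helper, and replacing `_` with `-` is an assumption made to match the spelling used in coverageLevels.json.

```python
def language_of(pair):
    """Drop the region from a (lang, region) pair; normalise 'uz_Arab' to 'uz-Arab'."""
    lang, _region = pair
    return lang.replace("_", "-")


# Excerpts only: a handful of entries taken from the lists in this comment.
official_locales = [("ca", "AD"), ("uz_Arab", "AF"), ("lb", "LU"), ("en", "US"), ("zu", "ZA")]
modern = ["ca", "en", "zu", "pcm", "yue"]

official = {language_of(p) for p in official_locales}
modern_set = set(modern)

modern_not_official = sorted(modern_set - official)
official_not_modern = sorted(official - modern_set)
union = sorted(modern_set | official)

print(modern_not_official)
print(official_not_modern)
print(union)
```

Run against the full files, the same three expressions would give the two difference lists and the proposed union.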

sffc · 2023-06-27T11:30:05Z

Languages that are modern/moderate/basic but not official:

'brx', 'su', 'jv', 'dsb', 'br', 'hsb', 'chr', 'bgc', 'pcm', 'cv', 'yue', 'bho', 'kea', 'ig', 'ff-Adlm', 'hi-Latn', 'ha', 'mni', 'kgp', 'sc', 'yrl', 'doi', 'ks-Deva', 'yue-Hans', 'raj', 'ia'

Languages that are official but not modern/moderate/basic:

'lb', 'bjt', 'tvl', 'dz', 'kg', 'ln', 'fj', 'gil', 'kv', 'ce', 'mfv', 'ak', 'az-Arab', 'kum', 'dv', 'koi', 'den', 'st', 'bsc', 'dyo', 'vec', 'fy', 'pag', 'gsw', 'lua', 'ht', 'sg', 'ba', 'ilo', 'za', 'mey', 'gv', 'pap', 'inh', 'ss', 'av', 'kha', 'kbd', 'ee', 'nr', 'mh', 'mdf', 'tnr', 'cr', 'uz-Arab', 'mn-Mong', 'oc', 'ts', 'bi', 'bo', 'ny', 'csb', 'udm', 'sah', 'lez', 'hil', 'sn', 'war', 'sms', 'az-Cyrl', 'ho', 'dgr', 'tyv', 'tet', 'na', 'myv', 've', 'gn', 'chp', 'tsg', 'ty', 'tn', 'zdj', 'ay', 'frr', 'rw', 'iu-Latn', 'gaa', 'pau', 'srr', 'ms-Arab', 'quc', 'mdh', 'ug', 'mt', 'nd', 'ady', 'sm', 'ab', 'ff', 'niu', 'haw', 'lbe', 'iu', 'nb', 'kl', 'krc', 'mg', 'ckb', 'tzm', 'tkl', 'wni', 'rn', 'tpi', 'se', 'gwi', 'hif', 'snf', 'knf', 'nso', 'os', 'sav', 'ch'

Above list but with regions:

'sm-AS', 'pap-AW', 'rn-BI', 'ay-BO', 'vec-BR', 'dz-BT', 'tn-BW', 'chp-CA', 'cr-CA', 'den-CA', 'dgr-CA', 'gwi-CA', 'iu-CA', 'kg-CD', 'ln-CD', 'lua-CD', 'sg-CF', 'gsw-CH', 'bo-CN', 'ug-CN', 'za-CN', 'pap-CW', 'frr-DE', 'kl-DK', 'oc-ES', 'sms-FI', 'fj-FJ', 'hif-FJ', 'ab-GE', 'os-GE', 'ak-GH', 'ee-GH', 'gaa-GH', 'kl-GL', 'quc-GT', 'ch-GU', 'vec-HR', 'ht-HT', 'gv-IM', 'kha-IN', 'ckb-IQ', 'vec-IT', 'gil-KI', 'wni-KM', 'zdj-KM', 'gsw-LI', 'st-LS', 'lb-LU', 'tzm-MA', 'mg-MG', 'mh-MH', 'mt-MT', 'dv-MV', 'ny-MW', 'vec-MX', 'fy-NL', 'nb-NO', 'se-NO', 'na-NR', 'niu-NU', 'ty-PF', 'ho-PG', 'tpi-PG', 'hil-PH', 'ilo-PH', 'mdh-PH', 'pag-PH', 'tsg-PH', 'war-PH', 'csb-PL', 'pau-PW', 'gn-PY', 'ady-RU', 'av-RU', 'ba-RU', 'ce-RU', 'inh-RU', 'kbd-RU', 'koi-RU', 'krc-RU', 'kum-RU', 'kv-RU', 'lbe-RU', 'lez-RU', 'mdf-RU', 'myv-RU', 'sah-RU', 'tyv-RU', 'udm-RU', 'rw-RW', 'vec-SI', 'nb-SJ', 'bjt-SN', 'bsc-SN', 'dyo-SN', 'ff-SN', 'knf-SN', 'mey-SN', 'mfv-SN', 'sav-SN', 'snf-SN', 'srr-SN', 'tnr-SN', 'ss-SZ', 'tkl-TK', 'tet-TL', 'tvl-TV', 'haw-US', 'bi-VU', 'sm-WS', 'nr-ZA', 'nso-ZA', 'ss-ZA', 'st-ZA', 'tn-ZA', 'ts-ZA', 've-ZA', 'nd-ZW', 'sn-ZW'

sffc · 2023-06-27T12:38:03Z

There are only 38 languages that are moderate/basic (less than half as many as are modern). So I think there's not a good reason to exclude them.

One can argue that it simply does not make a lot of sense to include locales that don't have good coverage. However, even sub-basic locales often have some coverage.

In terms of regional variants, we discussed a few options, which can be flags:

1. Include all regional variants for languages that are included (enumerating CLDR data to identify them)
2. Only include regional variants that are explicitly specified, and parents of them
3. Include regional variants that are present in territoryInfo.json in one form or another (based on population, official status, etc.)

I think we can start with 1 and 2, adding 3 later if someone asks for it.
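As a rough sketch of how options 1 and 2 differ, the following uses a hypothetical list of available locales. The function names are illustrative and are not datagen APIs. Parents are computed here by simple subtag truncation, which is a simplification: CLDR parent relationships are not always plain truncation.

```python
def language_part(locale):
    """Return the language subtag of a locale id such as 'sr-Latn-BA'."""
    return locale.split("-")[0]


def parents(locale):
    """Yield truncation parents: 'sr-Latn-BA' -> 'sr-Latn', then 'sr'."""
    parts = locale.split("-")
    for n in range(len(parts) - 1, 0, -1):
        yield "-".join(parts[:n])


def expand_all_variants(languages, available):
    """Option 1: every available locale whose language is selected."""
    return sorted(loc for loc in available if language_part(loc) in languages)


def explicit_with_parents(requested):
    """Option 2: only the requested locales plus their parents."""
    selected = set()
    for loc in requested:
        selected.add(loc)
        selected.update(parents(loc))
    return sorted(selected)


# Hypothetical inventory standing in for the locales enumerated from CLDR data.
available = ["en", "en-GB", "en-US", "fr", "fr-CA", "sr", "sr-Latn", "sr-Latn-BA"]

print(expand_all_variants({"en"}, available))
print(explicit_with_parents(["sr-Latn-BA"]))
```

Option 1 grows with whatever CLDR happens to contain for a language, while option 2 stays bounded by what the caller lists.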

One issue is that we seem to lose our current testdata locale Chakma (ccp). We mainly include it to test supplementary code points. The only language in modern/moderate/basic that seems to use supplementary code points is ff-Adlm. In all, here are the locales with supplementary code points in their exemplar sets:

'rhg-Rohg', 'ff-Adlm-GW', 'ff-Adlm-NE', 'ccp-IN', 'ff-Adlm-GM', 'ff-Adlm-GH', 'ccp', 'hnj-Hmnp', 'rhg', 'ff-Adlm-SN', 'ff-Adlm-NG', 'hnj', 'ff-Adlm-MR', 'en-Dsrt', 'ff-Adlm-CM', 'en-Shaw', 'ff-Adlm-BF', 'rhg-Rohg-BD', 'ff-Adlm-LR', 'osa', 'ff-Adlm', 'ff-Adlm-SL'
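A check of this kind can be sketched as follows, assuming the exemplar characters for each locale are already available as a plain string. The `exemplars` mapping is hypothetical sample input, not CLDR data, and the function names are illustrative.

```python
def has_supplementary(text):
    """True if any code point in `text` lies above U+FFFF (outside the BMP)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def locales_with_supplementary(exemplars):
    """Return the sorted locale ids whose exemplar string has a supplementary code point."""
    return sorted(loc for loc, chars in exemplars.items() if has_supplementary(chars))


# Hypothetical sample input: one Adlam letter (U+1E900), one Chakma letter (U+11103),
# and a Latin-only string for contrast.
exemplars = {
    "ff-Adlm": "\U0001E900",
    "ccp": "\U00011103",
    "en": "abc",
}

print(locales_with_supplementary(exemplars))
```

Python strings are sequences of code points, so no surrogate handling is needed here.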

sffc · 2023-06-27T15:37:43Z

Conclusions:

The default set of languages for globaldata is modern, moderate, and basic
We support both region selection modes 1 and 2, defaulting to 1

LGTM: @Manishearth @sffc @eggrobin @robertbastian

Furthermore:

Add --locales recommended and the corresponding datagen API LocaleInclude::Recommended
Make this the default value on the CLI

LGTM: @Manishearth @eggrobin @skius @robertbastian @sffc
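For illustration, the agreed default language set could be derived from a coverageLevels.json-style mapping as sketched below. This is a Python sketch of the selection rule only, not the datagen implementation; `recommended_locales` is a hypothetical name, the keys `xx`, `yy`, and `zz` are placeholder codes, and `"core"` stands in for any level below basic.

```python
RECOMMENDED_LEVELS = {"modern", "moderate", "basic"}


def recommended_locales(coverage_levels):
    """Return the sorted locale ids whose coverage level is modern, moderate, or basic."""
    return sorted(
        loc for loc, level in coverage_levels.items() if level in RECOMMENDED_LEVELS
    )


# Placeholder data: only "en" is taken from the thread; the rest is made up.
sample = {"en": "modern", "xx": "moderate", "yy": "basic", "zz": "core"}

print(recommended_locales(sample))
```

Region expansion would then be applied on top of this language set according to the selected region mode.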

sffc · 2023-06-27T23:53:44Z

@macchiati also approved of the above scheme.

sffc added the "discuss" label (Discuss at a future ICU4X-SC meeting) on Jun 14, 2023

sffc added the "discuss-triaged" label (The stakeholders for this issue have been identified and it can be discussed out-of-band) on Jun 15, 2023

robertbastian mentioned this issue Jun 27, 2023

Add recommended locale set options and expand regions in datagen #3586

Merged

sffc closed this as completed Oct 5, 2023

sffc added this to the 1.3 Blocking ⟨P1⟩ milestone Oct 5, 2023
