Optimizing data for Display Names #3260

Open
Tracked by #3913
sffc opened this issue Apr 4, 2023 · 3 comments
Labels
A-data Area: Data coverage or quality A-design Area: Architecture or design C-dnames Component: Language/Region/... Display Names S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Member

sffc commented Apr 4, 2023

The DisplayNames component comes with a large amount of data. It is the largest locale-specific data in ICU and will also likely be the largest in ICU4X.

There are a few things that make DisplayNames interesting:

  1. The majority of the display names are probably not useful to carry for most clients. For example, users speaking Japanese are more likely to need the translation for the Katakana script than the translation for the Cherokee script. We should explore something like japanext and likelysubtagsext where we have a core set and an extended set.
  2. Regional variants often override only a small number of strings. For example, en-GB and en-US might be equivalent for all region names except for one or two. This doesn't play nicely with the deduplication mechanism we've relied on thus far.

CC @snktd @robertbastian @markusicu

@sffc sffc added A-design Area: Architecture or design discuss Discuss at a future ICU4X-SC meeting A-data Area: Data coverage or quality C-dnames Component: Language/Region/... Display Names labels Apr 4, 2023
@robertbastian
Member

I think 2 is a big issue, and I think it also happens for other data. Instead of loading a single data struct in the formatter constructor, we could load the structs for the whole fallback chain. This could use naive fallback (i.e. chopping off subtags), so no additional data would be needed. We could then remove entries from en-GB and en-001 that are redundant with en (although with naive fallback we would still have duplication between en-GB and en-001).
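To make the fallback-chain idea concrete, here is a minimal sketch in plain Rust. It does not use any ICU4X API: the `Names` type and the `fallback_chain`, `lookup`, and `prune` functions are hypothetical names invented for illustration, and the strings in the usage note are illustrative rather than copied from CLDR. It assumes naive fallback means repeatedly dropping the last subtag of the locale.

```rust
use std::collections::HashMap;

/// Hypothetical per-locale table: region code -> display name.
type Names = HashMap<&'static str, &'static str>;

/// Naive fallback: repeatedly chop off the last subtag, e.g. "en-GB" -> "en".
fn fallback_chain(locale: &str) -> Vec<&str> {
    let mut chain = vec![locale];
    let mut current = locale;
    while let Some(idx) = current.rfind('-') {
        current = &current[..idx];
        chain.push(current);
    }
    chain
}

/// Return the name from the most specific locale in the chain that has one.
fn lookup(
    data: &HashMap<&'static str, Names>,
    locale: &str,
    code: &str,
) -> Option<&'static str> {
    fallback_chain(locale)
        .into_iter()
        .filter_map(|loc| data.get(loc))
        .find_map(|names| names.get(code).copied())
}

/// Remove entries from `child` that its parent chain already resolves
/// to the same string, so only real overrides remain.
fn prune(data: &mut HashMap<&'static str, Names>, child: &'static str) {
    let chain = fallback_chain(child);
    let parent = match chain.get(1) {
        Some(p) => *p,
        None => return, // no parent, nothing to prune against
    };
    // Collect redundant codes first, using only shared borrows.
    let redundant: Vec<&'static str> = {
        let view: &HashMap<&'static str, Names> = data;
        match view.get(child) {
            Some(names) => names
                .iter()
                .filter(|(code, name)| lookup(view, parent, code) == Some(**name))
                .map(|(code, _)| *code)
                .collect(),
            None => return,
        }
    };
    if let Some(names) = data.get_mut(child) {
        for code in redundant {
            names.remove(code);
        }
    }
}
```

With tables for `en` and `en-GB` that differ in a single entry, calling `prune(&mut data, "en-GB")` leaves only that entry in the `en-GB` table, while `lookup(&data, "en-GB", code)` still resolves every code through the chain. Because the naive chain for `en-GB` is just `en-GB`, `en`, an override shared by `en-GB` and `en-001` would still be stored in both tables, which is the remaining duplication mentioned above.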

@sffc
Member Author

sffc commented May 11, 2023

Discuss with:

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label May 25, 2023
@sffc
Member Author

sffc commented Jul 5, 2023

Discussed on 2023-07-04. We will use the auxiliary key model, similar to currency formatter (#1441), which resolves the issues in the OP.
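As a rough illustration of how an auxiliary key layout might address both points in the OP, the sketch below stores one small payload per (locale, auxiliary key) pair and applies fallback per request. This is an assumption about the general shape of the model, not the ICU4X data provider API: `AuxRequest`, `AuxProvider`, and their methods are hypothetical names, and the fallback used here is the naive subtag-chopping kind.

```rust
use std::collections::HashMap;

/// Hypothetical request: a locale plus an auxiliary key naming the item
/// whose display name is wanted (for example a region or script code).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct AuxRequest {
    locale: String,
    aux: String,
}

/// Hypothetical provider holding one payload per (locale, aux) pair.
struct AuxProvider {
    payloads: HashMap<AuxRequest, String>,
}

impl AuxProvider {
    fn new() -> Self {
        AuxProvider { payloads: HashMap::new() }
    }

    /// Store a single display name. A regional locale only needs entries
    /// for the names it actually overrides.
    fn insert(&mut self, locale: &str, aux: &str, name: &str) {
        let req = AuxRequest { locale: locale.to_string(), aux: aux.to_string() };
        self.payloads.insert(req, name.to_string());
    }

    /// Load one display name, falling back per request by chopping subtags.
    fn load(&self, locale: &str, aux: &str) -> Option<&str> {
        let mut current = locale;
        loop {
            let req = AuxRequest { locale: current.to_string(), aux: aux.to_string() };
            if let Some(name) = self.payloads.get(&req) {
                return Some(name.as_str());
            }
            match current.rfind('-') {
                Some(idx) => current = &current[..idx],
                None => return None,
            }
        }
    }
}
```

Under these assumptions, a client can ship only the (locale, aux) pairs it cares about, which covers the core versus extended concern, and `en-GB` stores just its overrides because every other request falls through to `en`.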

@sffc sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 5, 2023
@sffc sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Jul 5, 2023
@sffc sffc added T-core Type: Required functionality S-medium Size: Less than a week (larger bug fix or enhancement) labels Jul 5, 2023