How to handle non-orthogonal data #3022

Open
sffc opened this issue Jan 25, 2023 · 2 comments
Assignees
Labels
S-tiny Size: Less than an hour (trivial fixes) T-docs-tests Type: Code change outside core library

Comments

@sffc
Member

sffc commented Jan 25, 2023

We have three examples that have come up that involve us needing to store the same functional data in different forms tailored to individual components and client needs:

  1. Unicode properties: Bidi_Mirroring_Glyph, Script_Extensions
  2. Segmentation: LSTM or Dictionary
  3. Likely subtags: full set or only those needed for fallback

In all three cases, component A needs a small data set and component B needs a larger one that covers the same functionality. When component A ships by itself, it should use its small data; when component B is also present in the bundle, component A should use component B's data.
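A minimal sketch of that fallback rule, using the likely-subtags case. All names here (`Table`, `Bundle`, `ComponentA`, `Source`) are hypothetical and are not ICU4X APIs; the tables are plain hash maps purely for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical lookup table from language to likely script.
pub type Table = HashMap<&'static str, &'static str>;

/// Hypothetical data bundle: either table may be absent.
#[derive(Default)]
pub struct Bundle {
    /// Reduced table, enough for component A on its own.
    pub small: Option<Table>,
    /// Full table, required by component B.
    pub full: Option<Table>,
}

/// Records which table component A ended up using.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    Small,
    Full,
}

pub struct ComponentA<'a> {
    table: &'a Table,
    pub source: Source,
}

impl<'a> ComponentA<'a> {
    /// Prefer B's full table when the bundle has it; otherwise use the small one.
    pub fn try_new(bundle: &'a Bundle) -> Option<Self> {
        if let Some(full) = bundle.full.as_ref() {
            return Some(Self { table: full, source: Source::Full });
        }
        bundle
            .small
            .as_ref()
            .map(|small| Self { table: small, source: Source::Small })
    }

    /// Look up the likely script for a language in whichever table was chosen.
    pub fn likely_script(&self, language: &str) -> Option<&'static str> {
        self.table.get(language).copied()
    }
}
```

With only the small table in the bundle, `ComponentA::try_new` reports `Source::Small`; adding the full table switches it to `Source::Full` with no change on the caller's side. The cost is that both tables must agree on the entries they share, which is the overlap this issue is about.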

The best solution is to engineer the data structs to be fully orthogonal: bigger components load the data needed for the smaller components, plus some other "supplement" key. This is what @hsivonen has done in Collator/Normalizer. However, this is not always feasible: (1) the data may not split easily into smaller keys, or (2) the split may significantly reduce runtime performance.
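As a rough illustration of the orthogonal layout, the sketch below has a small component that reads only a base payload and a big component that reads the same base payload plus a supplement. The types (`BaseData`, `SupplementData`, `SmallComponent`, `BigComponent`) and the `is_orthogonal` check are invented for this example and do not reflect how Collator/Normalizer data is actually structured.

```rust
use std::collections::BTreeMap;

/// Hypothetical base payload: everything the smaller component needs.
pub struct BaseData {
    pub entries: BTreeMap<char, u8>,
}

/// Hypothetical supplement payload: only the entries the base lacks.
pub struct SupplementData {
    pub entries: BTreeMap<char, u8>,
}

/// "Orthogonal" here means that no key is stored in both payloads.
pub fn is_orthogonal(base: &BaseData, supplement: &SupplementData) -> bool {
    supplement
        .entries
        .keys()
        .all(|key| !base.entries.contains_key(key))
}

/// Uses the base payload only.
pub struct SmallComponent<'a> {
    pub base: &'a BaseData,
}

impl SmallComponent<'_> {
    pub fn get(&self, c: char) -> Option<u8> {
        self.base.entries.get(&c).copied()
    }
}

/// Reuses the same base payload and adds the supplement on top.
pub struct BigComponent<'a> {
    pub base: &'a BaseData,
    pub supplement: &'a SupplementData,
}

impl BigComponent<'_> {
    pub fn get(&self, c: char) -> Option<u8> {
        // Two lookups instead of one: this is the runtime cost of the split.
        self.base
            .entries
            .get(&c)
            .or_else(|| self.supplement.entries.get(&c))
            .copied()
    }
}
```

No entry is stored twice, so a bundle containing both components carries the base payload once. The second lookup in `BigComponent::get` is the kind of overhead that point (2) above refers to.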

For segmentation, I've proposed in #2905 that we do some magic inside datagen. However, this is not a foolproof solution since it requires datagen flags to be kept in sync with the ground truth in code.

For properties, there's discussion in #2833 about how to store the bidi-related properties for two distinct users, unicode_bidi and Harfbuzz.

For likely subtags, see #2903.

Let's discuss this general problem space and establish some recommendations.

@Manishearth @robertbastian @markusicu

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Jan 25, 2023
@Manishearth
Member

> For properties, there's discussion in #2833 about how to store the bidi-related properties for two distinct users, unicode_bidi and Harfbuzz.

FWIW I think the properties needed by unicode_bidi and harfbuzz are actually disjoint. unicode_bidi needs Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type and Bidi_Class, whereas harfbuzz needs Bidi_Mirrored and Bidi_Mirroring_Glyph. If harfbuzz wishes to run the bidi algorithm it also needs the others, of course.

There's a potential optimization that can be done to merge them (#2833 (comment)) but I'm not convinced it's a good idea.


Overall I think this is where datagen config comes in, where we can tell datagen what subset we want to support.
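One way to picture such a config is sketched below for the segmentation case. `DatagenConfig`, `SegmenterModel`, `keys_to_export`, and the key names are all made up for illustration; this is not the real datagen options API, and the actual key identifiers differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical choice of segmentation data to generate.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SegmenterModel {
    Lstm,
    Dictionary,
}

/// Hypothetical config; not the real datagen options type.
pub struct DatagenConfig {
    pub segmenter_models: BTreeSet<SegmenterModel>,
}

/// Maps each requested model to a made-up key name.
/// Keys for models that were not requested are never exported.
pub fn keys_to_export(config: &DatagenConfig) -> Vec<&'static str> {
    config
        .segmenter_models
        .iter()
        .map(|model| match model {
            SegmenterModel::Lstm => "demo/segmenter_lstm",
            SegmenterModel::Dictionary => "demo/segmenter_dictionary",
        })
        .collect()
}
```

A client that only wants dictionary segmentation would put just `SegmenterModel::Dictionary` in the set. The caveat raised earlier in the thread still applies: the config has to match what the code actually loads at runtime, and nothing in this sketch enforces that.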

@sffc
Member Author

sffc commented Feb 23, 2023

Consensus: by default, do not accept overlapping data. Try to avoid it when possible. There may be exceptions, which can be approved by the SC on a case-by-case basis.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Feb 23, 2023
@sffc sffc self-assigned this Feb 23, 2023
@sffc sffc added T-docs-tests Type: Code change outside core library S-tiny Size: Less than an hour (trivial fixes) labels Feb 23, 2023
@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone Feb 23, 2023
2 participants