How to handle non-orthogonal data #3022

sffc · 2023-01-25T00:14:04Z

Three examples have come up where we need to store the same functional data in different forms, tailored to individual components and client needs:

Unicode properties: Bidi_Mirroring_Glyph, Script_Extensions
Segmentation: LSTM or Dictionary
Likely subtags: full set or only those needed for fallback

In all three of these cases, there is a component A that needs small data and a component B that needs bigger data; we want component A to use its small data if it is by itself, but if component B is present in the bundle, component A should use component B's data.
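
The lookup order described above can be sketched roughly as follows. This is only an illustration, not ICU4X code: `Bundle`, the key names, and `load_for_component_a` are all hypothetical, and a real data provider is more involved than a map lookup.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a data bundle: maps a key name to a payload.
// This only models which key component A ends up reading.
struct Bundle {
    payloads: HashMap<&'static str, Vec<u32>>,
}

impl Bundle {
    fn load(&self, key: &str) -> Option<&[u32]> {
        self.payloads.get(key).map(|v| v.as_slice())
    }
}

// Hypothetical key names for the two components.
const SMALL_KEY: &str = "component_a/small";
const BIG_KEY: &str = "component_b/big";

/// Component A prefers component B's bigger data when it is in the bundle,
/// and falls back to its own small data otherwise.
fn load_for_component_a(bundle: &Bundle) -> Option<(&'static str, &[u32])> {
    bundle
        .load(BIG_KEY)
        .map(|data| (BIG_KEY, data))
        .or_else(|| bundle.load(SMALL_KEY).map(|data| (SMALL_KEY, data)))
}
```

The catch in this sketch is that component A has to understand both data shapes, which is exactly the non-orthogonality this issue is about.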

The best solution is to engineer the data structs to be fully orthogonal: bigger components load the data needed for the smaller components, plus some other "supplement" key. This is what @hsivonen has done in Collator/Normalizer. However, this is not always feasible if (1) the data cannot be easily split into smaller keys or (2) doing this split significantly reduces runtime performance.
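
As a rough sketch of what "fully orthogonal" means here (the types and fields below are hypothetical and are not the actual Collator/Normalizer structs): the small component reads only the base data, while the bigger component reads the same base data plus a supplement that does not repeat any of it.

```rust
/// Data shared by both components (hypothetical shape).
struct BaseData {
    entries: Vec<(char, u8)>,
}

/// Extra data only the bigger component needs; nothing from `BaseData` is repeated.
struct SupplementData {
    entries: Vec<(char, u8)>,
}

/// Linear lookup, kept simple for illustration.
fn find(entries: &[(char, u8)], c: char) -> Option<u8> {
    entries.iter().find(|(k, _)| *k == c).map(|(_, v)| *v)
}

struct SmallComponent<'a> {
    base: &'a BaseData,
}

impl SmallComponent<'_> {
    fn get(&self, c: char) -> Option<u8> {
        find(&self.base.entries, c)
    }
}

struct BigComponent<'a> {
    base: &'a BaseData,
    supplement: &'a SupplementData,
}

impl BigComponent<'_> {
    /// Checks the shared base data first, then the supplement.
    fn get(&self, c: char) -> Option<u8> {
        find(&self.base.entries, c).or_else(|| find(&self.supplement.entries, c))
    }
}
```

Each entry is stored exactly once, so a bundle containing both components carries no duplicated data. The cost is the second lookup in `BigComponent::get`, which is the kind of runtime overhead mentioned in (2).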

For segmentation, I've proposed in #2905 that we do some magic inside datagen. However, this is not a foolproof solution since it requires datagen flags to be kept in sync with the ground truth in code.

For properties, there's discussion in #2833 about how to store the bidi-related properties for two distinct users, unicode_bidi and Harfbuzz.

Likely subtags: #2903

Let's discuss this general problem space and establish some recommendations.

@Manishearth @robertbastian @markusicu

Manishearth · 2023-02-21T18:52:45Z

> For properties, there's discussion in #2833 about how to store the bidi-related properties for two distinct users, unicode_bidi and Harfbuzz.

FWIW I think the properties needed by unicode_bidi and harfbuzz are actually disjoint. unicode_bidi needs Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type and Bidi_Class, whereas harfbuzz needs Bidi_Mirrored and Bidi_Mirroring_Glyph. If harfbuzz wishes to run the bidi algorithm it also needs the others, of course.

There's a potential optimization that could merge them (see the comment in #2833), but I'm not convinced it's a good idea.

Overall I think this is where datagen config comes in, where we can tell datagen what subset we want to support.

sffc · 2023-02-23T19:29:24Z

Consensus: by default, do not accept overlapping data. Try to avoid it when possible. There may be exceptions, which can be approved by the SC on a case-by-case basis.

sffc added the discuss-priority (Discuss at the next ICU4X meeting) label Jan 25, 2023

sffc removed the discuss-priority (Discuss at the next ICU4X meeting) label Feb 23, 2023

sffc self-assigned this Feb 23, 2023

sffc added the T-docs-tests (Type: Code change outside core library) and S-tiny (Size: Less than an hour (trivial fixes)) labels Feb 23, 2023

sffc added this to the 1.x Priority ⟨P2⟩ milestone Feb 23, 2023

sffc mentioned this issue Feb 23, 2023

Support slicing likely subtags and adding extended likely subtags #2903

Closed

sffc mentioned this issue Aug 7, 2024

Use LikelySubtagsForLanguageV1 for fallback #5338

Merged
