Provide data for Bidi pairing of brackets #3030

echeran · 2023-01-25T23:20:29Z

Background

A pull request has gone into the Rust unicode-bidi repo to support the pairing of brackets in the Bidi algorithm. When doing so, the small amount of special case data was hardcoded for the relevant code points.

However, APIs were created to allow the plugging in of an external data provider. This external data provider could be used to supplying the most recent version of correct data. It was intended for ICU4X to be a reliable source of the latest greatest property data needed for the algorithm.

Problem

To that extent, we need to add the data for the properties concerned, which are Bidi_Paired_Bracket (bpb) and Bidi_Paired_Bracket_Type (bpt).

Details

Both of these properties can be provided in full via CodePointTrie. That would be useful for regex implementers, at the least. Bidi_Paired_Bracket_Type is enumerated, while Bidi_Paired_Bracket returns code points / <none>s (which can also be represented in CodePointTries).

For the purposes of the bracket pairing purposes of the Bidi algorithm, we can optimize to save space by only carefully examining the values that are actually used by the algorithm. There are currently only 128 code points whose Bidi_Paired_Bracket_Type values are not None. So we could minimize the actual data that we store for Bidi_Paired_Bracket (bpb) and Bidi_Paired_Bracket_Type (bpt) to just that set of code points. Other information derived from normalization is needed as well, but only for the characters whose normalization forms are not unique (and thus form an equivalence class of more than one pair), and the relevant code points are a further subset of that subset. So space can be optimized for users who only need the Bidi paired bracket portion of the algorithm by only holding a small set of data, where that small set is the intersection of selecting bpt != None and the 3 pieces of info (bpb, bpt, equivalence class representative char). Representing that data in a compact form, such as a ZeroVec of a struct that has a field per distinct piece of info, enables avoiding pulling in the whole CodePointTrie's worth of data for the relevant properties.

The text was updated successfully, but these errors were encountered:

sffc · 2023-01-25T23:57:12Z

Related: #2905

I think we may want to add an extra data key optimized for unicode-bidi, even if that duplicates data found elsewhere. We know the data is very small, so we probably don't need to do anything too fancy for it in datagen.

sffc · 2023-01-26T19:42:33Z

@hsivonen - Do we want to enable the bidi algorithm in apps that don't also use harfbuzz? I don't feel it's worthwhile to optimize for separating those two.
@sffc - The sets of properties are disjoint, right?
@hsivonen - Yes, but it's more efficient to bitpack them
@echeran - It's not clear that we can fully bitpack all of the code points; splitting them into 2 buckets could help
@Manishearth - The spec doesn't require that these properties are equal. It also talks about normalization; there's currently 1 code point that is affected by that, but it could possibly grow.
@sffc - What really is the advantage to splitting vs non-splitting?
@hsivonen - I don't know and we can't really find out without doing it.
@hsivonen - It's a derived property.
@Manishearth - Oh okay. So then the main thing is that with Elango's split, we can roll in the normalization with the frankenproperty.
@sffc - I think datagen should be packaging the property, not icuexport.
@hsivonen - icuexport does a lot of work for us in normalization and collator. It seems easier to do it on the icuexport side with relatively few lines of code.
@sffc - I see collator and normalizer as special cases because ICU is very opinionated about how they work. This harfbuzz/bidi thing is novel in ICU4X, so the code to do it should land in ICU4X
@hsivonen - I think we don't need the extra codepoint bytes all the time and we should be able to bitpack efficiently.
@Manishearth - ICU4X's approach is that it makes it easy to have multiple ways to do the same thing.

sffc · 2023-02-23T19:23:37Z

See additional discussion in #2833

echeran self-assigned this Feb 2, 2023

sffc modified the milestones: 1.2 Blocking ⟨P1⟩, 1.x Priority ⟨P2⟩ Feb 16, 2023

sffc added C-unicode Component: Props, sets, tries T-techdebt Type: ICU4X code health and tech debt S-medium Size: Less than a week (larger bug fix or enhancement) labels Feb 16, 2023

echeran linked a pull request Mar 27, 2023 that will close this issue

Support data for Bidi properties with combined data structure #3026

Merged

echeran closed this as completed in #3026 Mar 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide data for Bidi pairing of brackets #3030

Provide data for Bidi pairing of brackets #3030

echeran commented Jan 25, 2023

sffc commented Jan 25, 2023 •

edited

Loading

sffc commented Jan 26, 2023

sffc commented Feb 23, 2023

Provide data for Bidi pairing of brackets #3030

Provide data for Bidi pairing of brackets #3030

Comments

echeran commented Jan 25, 2023

Background

Problem

Details

sffc commented Jan 25, 2023 • edited Loading

sffc commented Jan 26, 2023

sffc commented Feb 23, 2023

sffc commented Jan 25, 2023 •

edited

Loading