Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide data for Bidi pairing of brackets #3030

Closed
echeran opened this issue Jan 25, 2023 · 3 comments · Fixed by #3026
Closed

Provide data for Bidi pairing of brackets #3030

echeran opened this issue Jan 25, 2023 · 3 comments · Fixed by #3026
Assignees
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@echeran
Copy link
Contributor

echeran commented Jan 25, 2023

Background

A pull request has gone into the Rust unicode-bidi repo to support the pairing of brackets in the Bidi algorithm. When doing so, the small amount of special case data was hardcoded for the relevant code points.

However, APIs were created to allow the plugging in of an external data provider. This external data provider could be used to supplying the most recent version of correct data. It was intended for ICU4X to be a reliable source of the latest greatest property data needed for the algorithm.

Problem

To that extent, we need to add the data for the properties concerned, which are Bidi_Paired_Bracket (bpb) and Bidi_Paired_Bracket_Type (bpt).

Details

Both of these properties can be provided in full via CodePointTrie. That would be useful for regex implementers, at the least. Bidi_Paired_Bracket_Type is enumerated, while Bidi_Paired_Bracket returns code points / <none>s (which can also be represented in CodePointTries).

For the purposes of the bracket pairing purposes of the Bidi algorithm, we can optimize to save space by only carefully examining the values that are actually used by the algorithm. There are currently only 128 code points whose Bidi_Paired_Bracket_Type values are not None. So we could minimize the actual data that we store for Bidi_Paired_Bracket (bpb) and Bidi_Paired_Bracket_Type (bpt) to just that set of code points. Other information derived from normalization is needed as well, but only for the characters whose normalization forms are not unique (and thus form an equivalence class of more than one pair), and the relevant code points are a further subset of that subset. So space can be optimized for users who only need the Bidi paired bracket portion of the algorithm by only holding a small set of data, where that small set is the intersection of selecting bpt != None and the 3 pieces of info (bpb, bpt, equivalence class representative char). Representing that data in a compact form, such as a ZeroVec of a struct that has a field per distinct piece of info, enables avoiding pulling in the whole CodePointTrie's worth of data for the relevant properties.

@sffc
Copy link
Member

sffc commented Jan 25, 2023

Related: #2905

I think we may want to add an extra data key optimized for unicode-bidi, even if that duplicates data found elsewhere. We know the data is very small, so we probably don't need to do anything too fancy for it in datagen.

@sffc
Copy link
Member

sffc commented Jan 26, 2023

  • @hsivonen - Do we want to enable the bidi algorithm in apps that don't also use harfbuzz? I don't feel it's worthwhile to optimize for separating those two.
  • @sffc - The sets of properties are disjoint, right?
  • @hsivonen - Yes, but it's more efficient to bitpack them
  • @echeran - It's not clear that we can fully bitpack all of the code points; splitting them into 2 buckets could help
  • @Manishearth - The spec doesn't require that these properties are equal. It also talks about normalization; there's currently 1 code point that is affected by that, but it could possibly grow.
  • @sffc - What really is the advantage to splitting vs non-splitting?
  • @hsivonen - I don't know and we can't really find out without doing it.
  • @hsivonen - It's a derived property.
  • @Manishearth - Oh okay. So then the main thing is that with Elango's split, we can roll in the normalization with the frankenproperty.
  • @sffc - I think datagen should be packaging the property, not icuexport.
  • @hsivonen - icuexport does a lot of work for us in normalization and collator. It seems easier to do it on the icuexport side with relatively few lines of code.
  • @sffc - I see collator and normalizer as special cases because ICU is very opinionated about how they work. This harfbuzz/bidi thing is novel in ICU4X, so the code to do it should land in ICU4X
  • @hsivonen - I think we don't need the extra codepoint bytes all the time and we should be able to bitpack efficiently.
  • @Manishearth - ICU4X's approach is that it makes it easy to have multiple ways to do the same thing.

@echeran echeran self-assigned this Feb 2, 2023
@sffc sffc added C-unicode Component: Props, sets, tries T-techdebt Type: ICU4X code health and tech debt S-medium Size: Less than a week (larger bug fix or enhancement) labels Feb 16, 2023
@sffc
Copy link
Member

sffc commented Feb 23, 2023

See additional discussion in #2833

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants