Provide data for Bidi pairing of brackets #3030
Labels
C-unicode
Component: Props, sets, tries
S-medium
Size: Less than a week (larger bug fix or enhancement)
T-techdebt
Type: ICU4X code health and tech debt
Background
A pull request has gone into the Rust unicode-bidi repo to support the pairing of brackets in the Bidi algorithm. In that change, the small amount of special-case data was hardcoded for the relevant code points. However, APIs were created to allow plugging in an external data provider, which could be used to supply the most recent version of the correct data. The intent is for ICU4X to be a reliable source of the latest property data needed for the algorithm.
Problem
To that end, we need to add the data for the properties concerned, which are Bidi_Paired_Bracket (bpb) and Bidi_Paired_Bracket_Type (bpt).
Details
Both of these properties can be provided in full via CodePointTrie. That would be useful for regex implementers, at the least. Bidi_Paired_Bracket_Type is enumerated, while Bidi_Paired_Bracket returns code points or <none> (which can also be represented in a CodePointTrie).
For the purposes of bracket pairing in the Bidi algorithm, we can save space by storing only the values that the algorithm actually uses. There are currently only 128 code points whose Bidi_Paired_Bracket_Type value is not None, so we could restrict the data that we store for Bidi_Paired_Bracket (bpb) and Bidi_Paired_Bracket_Type (bpt) to just that set of code points. Other information derived from normalization is needed as well, but only for the characters whose normalization forms are not unique (and thus form an equivalence class of more than one pair); the relevant code points are a further subset of that subset.
So users who only need the Bidi paired bracket portion of the algorithm can hold a small set of data: the code points selected by bpt != None, each carrying the 3 pieces of info (bpb, bpt, equivalence class representative char). Representing that data in a compact form, such as a ZeroVec of a struct that has a field per distinct piece of info, avoids pulling in the whole CodePointTrie's worth of data for the relevant properties.