Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement N0 (bracket matching) #85

Merged
merged 16 commits into from
Dec 20, 2022
Merged

Conversation

Manishearth
Copy link
Member

@Manishearth Manishearth commented Dec 18, 2022

Fixes #3

This was a bit tricky to pin down in the details because of some stuff which I plan to file an issue about.

This does require new data, but it isn't a breaking change in the non-hardcoded-data mode because the added trait method falls back to the relatively small amount of hardcoded bidi pair data.

There's a bunch of complicated comments about processed_class iteration, related to #86. They're not quite accurate now that I have done the analysis in #86, but they're still potentially relevant.

This can be reviewed commit by commit.

r? @eggrobin or @sffc or @mbrubeck

@Manishearth Manishearth changed the title Implement N0 Implement N0 (bracket matching) Dec 18, 2022
Copy link

@mdebbar mdebbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me. Thanks for fixing this bug!

tests/conformance_tests.rs Outdated Show resolved Hide resolved
tools/generate.py Show resolved Hide resolved
Comment on lines 47 to 56
pub(crate) fn bidi_matched_bracket(c: char) -> Option<(char, bool)> {
for pair in self::tables::bidi_pairs_table {
if pair.0 == c {
return Some((pair.1, true));
} else if pair.1 == c {
return Some((pair.0, false));
}
}
None
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth making this a hashmap or at least binary search?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that ICU4X will eventually not be using this data.

There's no static hashmap in Rust, and we could use a perfect hashing crate but that would be an additional dependency. We could also implement our own perfect hashing which would probably be more efficient.

The table has very few entries so it wasn't clear to me that the complexity of binary search was worth it. We can benchmark if desired.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertbastian had some thoughts on this

Copy link

@eggrobin eggrobin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good !

src/data_source.rs Outdated Show resolved Hide resolved
src/implicit.rs Show resolved Hide resolved
src/implicit.rs Outdated Show resolved Hide resolved
src/implicit.rs Outdated Show resolved Hide resolved
@@ -138,7 +138,7 @@ fn gen_base_levels_for_base_tests(bitset: u8) -> Vec<Option<Level>> {
}

#[test]
#[should_panic(expected = "3514 test cases failed! (88193 passed)")]
#[should_panic(expected = "69 test cases failed! (91638 passed)")]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯

/// (since this data is small and changes less often), and in part so that this method can be
/// added without needing a breaking version bump.
/// Override this method in your custom data source to prevent the use of hardcoded data.
fn bidi_matched_opening_bracket(&self, c: char) -> Option<(char, bool)> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this API. Perhaps we should split this into two methods, one which handles bidi matching, and one which gets you normalization, ideally expected to only work for properties that also have bidi matched pair values set?

cc @sffc and @echeran since we'll have to deal with this upstream in ICU4X

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this derivable from the Bidi Mirrored Glyph property, or do we need more than that?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.unicode.org/Public/UCD/latest/ucd/BidiBrackets.txt

# This file lists the set of code points with Bidi_Paired_Bracket_Type
# property values Open and Close. The set is derived from the character
# properties General_Category (gc), Bidi_Class (bc), Bidi_Mirrored (Bidi_M),
# and Bidi_Mirroring_Glyph (bmg), as follows: two characters, A and B,
# form a bracket pair if A has gc=Ps and B has gc=Pe, both have bc=ON and
# Bidi_M=Y, and bmg of A is B. Bidi_Paired_Bracket (bpb) maps A to B and
# vice versa, and their Bidi_Paired_Bracket_Type (bpt) property values are
# Open (o) and Close (c), respectively.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sffc in part, but it also needs normalization (currently for a single character pair). Note that the mirrored_glyph property has two values, so the tuple API is still necessary.

As I said it could be split into two to move normalization out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we only need to carry Bidi_Class data. I'm concerned that if we need both General_Category and Bidi_Mirrored and some normalization stuff, it will dramatically increase the size of our BiDi-only ICU4X build, even though the data seems to resolve down to only a small amount at the end of the day.

Perhaps ICU4X should resolve all of this at datagen time and introduce a new small key specifically for this purpose? What would that look like?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, when I made my ICU4X-Bidi wasm build with this PR yesterday (the last commit at that time was Handle canonical equivalence), the size of the wasm build increased from ~20KB to ~23KB (brotli-compressed).

Not a deal breaker for us, but I won't be opposed to reducing the size if there are some low hanging fruits 🙂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so my proposal is we do one of:

  • Make a frankenproperty that maps brackets to their normalized equivalent opening bracket and whether or not they are themselves opening
  • Expose two properties: Bidi_Mirrored (char -> (mirrored char, open/close), and a small hybrid property that is the value of ToNFC for bidi mirrored characters only

I kind of like the latter,.

There are also two corresponding trait models for the unicode-bidi crate:

  • The current one, where it requests data in the form of the frankenproperty proposed above
  • One where I give it two method impls, one which is .normalize(char) -> Option<char> (or just have it return self instead of None), and bidi_mirrored(char) -> Option<(char, bool)>. In the trait docs, .normalize() is not required to know how to normalize anything that is not .bidi_mirrored().

Note that the trait model and the ICU4X data model are two separate questions, each trait model can be implemented with either data model.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the trait model and the ICU4X data model are two separate questions, each trait model can be implemented with either data model.

I think I lean toward the fully resolved property ("frankenproperty") for the API, but it should return a struct instead of a Option<(char, bool)> to be more clear about what it is doing. It's nice that the API enforces that it only needs to be defined for fewer characters, rather than having .normalize(char) -> Option<char> which makes it look very tempting to be hooked up to a full normalizer, which is neither necessary nor efficient.

Copy link
Member Author

@Manishearth Manishearth Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sffc do you mind if I merge this as is and we iterate on the data model elsewhere? My stack of PRs is getting unweildy. I'm fine not doing so as well, but I figured if you don't mind I'd prefer to merge now.

(and just not release while it's in this state)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah of course. I can't approve PRs on this repo anyway.

@@ -138,7 +138,7 @@ fn gen_base_levels_for_base_tests(bitset: u8) -> Vec<Option<Level>> {
}

#[test]
#[should_panic(expected = "3514 test cases failed! (88193 passed)")]
#[should_panic(expected = "69 test cases failed! (91638 passed)")]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@Manishearth
Copy link
Member Author

Manishearth commented Dec 20, 2022

I think as far as review is concerned I will be satisfied with one review (from @eggrobin or otherwise) on the algorithm, and also I'd like @sffc or @echeran to look at the data model. Also fine with landing this with algorithm review and an understanding that we won't cut a release until someone has had a look at the data model.

Mostly because this PR is getting big and I already have another PR built on in (#91) plus a complicated fix (#92) that I don't want to keep rebasing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement N0: Process bracket pairs
4 participants