Implement N0 (bracket matching) #85

Manishearth · 2022-12-18T04:14:44Z

Fixes #3

This was a bit tricky to pin down in the details because of some stuff which I plan to file an issue about.

This does require new data, but it isn't a breaking change in the non-hardcoded-data mode because the added trait method falls back to the relatively small amount of hardcoded bidi pair data.
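A minimal sketch of why a defaulted trait method avoids a breaking change. The trait, method, and table names below are illustrative assumptions, not the crate's actual API, and only three ASCII pairs are listed.

```rust
/// Hypothetical stand-in for the hardcoded pair table (assumption: the real
/// table is generated from Unicode data and is larger).
const HARDCODED_PAIRS: &[(char, char)] = &[('(', ')'), ('[', ']'), ('{', '}')];

/// Linear lookup over the hardcoded pairs: returns the counterpart bracket
/// and whether `c` is the opening one.
fn hardcoded_matched_bracket(c: char) -> Option<(char, bool)> {
    for &(open, close) in HARDCODED_PAIRS {
        if open == c {
            return Some((close, true));
        }
        if close == c {
            return Some((open, false));
        }
    }
    None
}

/// Illustrative trait, not the crate's actual data source trait.
pub trait DataSource {
    /// A required method that existed before the change (placeholder).
    fn is_neutral(&self, c: char) -> bool;

    /// Newly added method. Because it has a default body, existing
    /// implementors keep compiling and silently use the hardcoded data.
    fn matched_bracket(&self, c: char) -> Option<(char, bool)> {
        hardcoded_matched_bracket(c)
    }
}

/// An implementor written before the new method existed.
pub struct OldSource;

impl DataSource for OldSource {
    fn is_neutral(&self, _c: char) -> bool {
        false
    }
}
```

`OldSource` never mentions `matched_bracket`, yet `OldSource.matched_bracket('(')` resolves through the default body. A custom data source that wants to avoid the hardcoded table would override the method instead.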

There's a bunch of complicated comments about processed_class iteration, related to #86. They're not quite accurate now that I have done the analysis in #86, but they're still potentially relevant.

This can be reviewed commit by commit.

r? @eggrobin or @sffc or @mbrubeck

mdebbar

Looks great to me. Thanks for fixing this bug!

tests/conformance_tests.rs

tools/generate.py

mdebbar · 2022-12-19T15:23:16Z

src/char_data/mod.rs

+pub(crate) fn bidi_matched_bracket(c: char) -> Option<(char, bool)> {
+    for pair in self::tables::bidi_pairs_table {
+        if pair.0 == c {
+            return Some((pair.1, true));
+        } else if pair.1 == c {
+            return Some((pair.0, false));
+        }
+    }
+    None
+}


Do you think it's worth making this a hashmap or at least binary search?

Note that ICU4X will eventually not be using this data.

There's no static hashmap in Rust, and we could use a perfect hashing crate but that would be an additional dependency. We could also implement our own perfect hashing which would probably be more efficient.

The table has very few entries so it wasn't clear to me that the complexity of binary search was worth it. We can benchmark if desired.
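For comparison, a binary-search variant might look like the sketch below. It assumes a hypothetical flattened table with one entry per bracket, sorted by code point; the names and the three ASCII pairs are illustrative and are not the generated `bidi_pairs_table`.

```rust
/// Hypothetical flattened table: (bracket, counterpart, is_opening),
/// sorted by the first field so it can be binary searched.
const BRACKETS_SORTED: &[(char, char, bool)] = &[
    ('(', ')', true),
    (')', '(', false),
    ('[', ']', true),
    (']', '[', false),
    ('{', '}', true),
    ('}', '{', false),
];

/// Same return shape as the linear lookup: the counterpart bracket and
/// whether `c` is the opening one.
pub fn matched_bracket_bsearch(c: char) -> Option<(char, bool)> {
    BRACKETS_SORTED
        .binary_search_by_key(&c, |&(key, _, _)| key)
        .ok()
        .map(|i| (BRACKETS_SORTED[i].1, BRACKETS_SORTED[i].2))
}
```

The cost is a table with twice as many rows that must stay sorted, which the generation script would have to guarantee. Whether that beats a linear scan over a small table is something only a benchmark can settle.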

@robertbastian had some thoughts on this

eggrobin

Looks good!

src/data_source.rs

src/implicit.rs

mdebbar · 2022-12-19T18:35:16Z

tests/conformance_tests.rs

@@ -138,7 +138,7 @@ fn gen_base_levels_for_base_tests(bitset: u8) -> Vec<Option<Level>> {
 }

 #[test]
-#[should_panic(expected = "3514 test cases failed! (88193 passed)")]
+#[should_panic(expected = "69 test cases failed! (91638 passed)")]


Manishearth · 2022-12-19T20:00:13Z

src/data_source.rs

+    /// (since this data is small and changes less often), and in part so that this method can be
+    /// added without needing a breaking version bump.
+    /// Override this method in your custom data source to prevent the use of hardcoded data.
+    fn bidi_matched_opening_bracket(&self, c: char) -> Option<(char, bool)> {


I'm not a fan of this API. Perhaps we should split this into two methods, one which handles bidi matching, and one which gets you normalization, ideally expected to only work for properties that also have bidi matched pair values set?

cc @sffc and @echeran since we'll have to deal with this upstream in ICU4X

Is this derivable from the Bidi Mirrored Glyph property, or do we need more than that?

https://www.unicode.org/Public/UCD/latest/ucd/BidiBrackets.txt

# This file lists the set of code points with Bidi_Paired_Bracket_Type
# property values Open and Close. The set is derived from the character
# properties General_Category (gc), Bidi_Class (bc), Bidi_Mirrored (Bidi_M),
# and Bidi_Mirroring_Glyph (bmg), as follows: two characters, A and B,
# form a bracket pair if A has gc=Ps and B has gc=Pe, both have bc=ON and
# Bidi_M=Y, and bmg of A is B. Bidi_Paired_Bracket (bpb) maps A to B and
# vice versa, and their Bidi_Paired_Bracket_Type (bpt) property values are
# Open (o) and Close (c), respectively.
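The derivation rule quoted above can be sketched as a predicate. `Gc`, `Props`, and the field names are hypothetical stand-ins for real UCD lookups; a real implementation would query property data rather than take hand-filled records.

```rust
/// Only the General_Category values the rule cares about.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Gc {
    Ps,    // Open_Punctuation
    Pe,    // Close_Punctuation
    Other, // anything else
}

/// Hypothetical per-character property record (assumption for illustration).
#[derive(Clone, Copy, Debug)]
pub struct Props {
    pub gc: Gc,
    pub bc_is_on: bool,                // Bidi_Class == ON
    pub bidi_mirrored: bool,           // Bidi_M == Y
    pub mirroring_glyph: Option<char>, // bmg
}

/// True if A (with properties `a`) and B (`b_char`, with properties `b`)
/// form a bracket pair under the rule quoted from BidiBrackets.txt.
pub fn is_bracket_pair(a: &Props, b_char: char, b: &Props) -> bool {
    a.gc == Gc::Ps
        && b.gc == Gc::Pe
        && a.bc_is_on
        && b.bc_is_on
        && a.bidi_mirrored
        && b.bidi_mirrored
        && a.mirroring_glyph == Some(b_char)
}
```

With hand-filled records, `(` and `)` satisfy the predicate, while `<` and `>` do not: they are mirrored, but their General_Category is Sm rather than Ps/Pe. The predicate covers only the derivation of the pairs, not the normalization discussed in the replies.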

@sffc in part, but it also needs normalization (currently for a single character pair). Note that the mirrored_glyph property has two values, so the tuple API is still necessary.

As I said it could be split into two to move normalization out.

Currently we only need to carry Bidi_Class data. I'm concerned that if we need both General_Category and Bidi_Mirrored and some normalization stuff, it will dramatically increase the size of our BiDi-only ICU4X build, even though the data seems to resolve down to only a small amount at the end of the day.

Perhaps ICU4X should resolve all of this at datagen time and introduce a new small key specifically for this purpose? What would that look like?

FWIW, when I made my ICU4X-Bidi wasm build with this PR yesterday (the last commit at that time was "Handle canonical equivalence"), the size of the wasm build increased from ~20KB to ~23KB (brotli-compressed).

Not a deal breaker for us, but I wouldn't be opposed to reducing the size if there is some low-hanging fruit 🙂

Yeah so my proposal is we do one of:

Make a frankenproperty that maps brackets to their normalized equivalent opening bracket and whether or not they are themselves opening

Expose two properties: Bidi_Mirrored (char -> (mirrored char, open/close)), and a small hybrid property that is the value of ToNFC for bidi mirrored characters only

I kind of like the latter.

There are also two corresponding trait models for the unicode-bidi crate:

The current one, where it requests data in the form of the frankenproperty proposed above

One where I give it two method impls: one which is .normalize(char) -> Option<char> (or just have it return the character itself instead of None), and the other .bidi_mirrored(char) -> Option<(char, bool)>. In the trait docs, .normalize() is not required to know how to normalize anything that is not .bidi_mirrored().

Note that the trait model and the ICU4X data model are two separate questions, each trait model can be implemented with either data model.
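A rough sketch of the two-method trait model, to make the comparison concrete. All names here are hypothetical, `normalize` uses the "return the character itself" variant, and the toy data covers only parentheses plus U+2329/U+232A, which are canonically equivalent to U+3008/U+3009.

```rust
/// Hypothetical two-method data source (not the crate's actual trait).
pub trait TwoMethodSource {
    /// Canonical normalization; only needs to be correct for characters that
    /// `bidi_mirrored` recognizes. Returns `c` itself when nothing changes.
    fn normalize(&self, c: char) -> char {
        c
    }

    /// Returns the paired bracket and whether `c` is the opening one.
    fn bidi_mirrored(&self, c: char) -> Option<(char, bool)>;
}

/// Combines both methods into the "normalized opening bracket" answer that
/// the single-method model would return directly.
pub fn matched_opening_bracket<D: TwoMethodSource>(d: &D, c: char) -> Option<(char, bool)> {
    let c = d.normalize(c);
    let (other, is_open) = d.bidi_mirrored(c)?;
    Some((if is_open { c } else { other }, is_open))
}

/// Toy source with hand-written data (assumption for illustration).
pub struct Demo;

const PAIRS: &[(char, char)] = &[
    ('(', ')'),
    ('\u{2329}', '\u{232A}'),
    ('\u{3008}', '\u{3009}'),
];

impl TwoMethodSource for Demo {
    fn normalize(&self, c: char) -> char {
        match c {
            '\u{2329}' => '\u{3008}',
            '\u{232A}' => '\u{3009}',
            other => other,
        }
    }

    fn bidi_mirrored(&self, c: char) -> Option<(char, bool)> {
        PAIRS.iter().find_map(|&(open, close)| {
            if c == open {
                Some((close, true))
            } else if c == close {
                Some((open, false))
            } else {
                None
            }
        })
    }
}
```

With this toy source, U+2329 and U+3009 both resolve to the opening bracket U+3008, so they can be matched as a pair even though they come from different rows of the table.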

Note that the trait model and the ICU4X data model are two separate questions, each trait model can be implemented with either data model.

I think I lean toward the fully resolved property ("frankenproperty") for the API, but it should return a struct instead of an Option<(char, bool)> to be clearer about what it is doing. It's nice that the API enforces that it only needs to be defined for a small set of characters, rather than having .normalize(char) -> Option<char>, which makes it look very tempting to hook up to a full normalizer, which is neither necessary nor efficient.
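As an illustration of the struct-returning shape suggested here, a sketch might look like the following. The type name, field names, and the three-pair table are assumptions for the example, not the API that was eventually merged.

```rust
/// Hypothetical named return type replacing `Option<(char, bool)>`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MatchedOpeningBracket {
    /// The opening bracket of the pair, after any normalization.
    pub opening: char,
    /// Whether the queried character was itself the opening bracket.
    pub is_open: bool,
}

/// Illustrative data only (assumption: the real table is larger).
const PAIRS: &[(char, char)] = &[('(', ')'), ('[', ']'), ('{', '}')];

/// Returns the opening bracket for `c` if `c` is a paired bracket.
pub fn matched_opening_bracket(c: char) -> Option<MatchedOpeningBracket> {
    PAIRS.iter().find_map(|&(open, close)| {
        if c == open {
            Some(MatchedOpeningBracket { opening: open, is_open: true })
        } else if c == close {
            Some(MatchedOpeningBracket { opening: open, is_open: false })
        } else {
            None
        }
    })
}
```

Named fields make call sites read as `m.opening` and `m.is_open` rather than `m.0` and `m.1`, which is the clarity argument made in the comment.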

@sffc do you mind if I merge this as is and we iterate on the data model elsewhere? My stack of PRs is getting unwieldy. I'm fine not doing so as well, but I figured if you don't mind I'd prefer to merge now.

(and just not release while it's in this state)

Yeah of course. I can't approve PRs on this repo anyway.

eggrobin · 2022-12-19T22:29:13Z

tests/conformance_tests.rs

@@ -138,7 +138,7 @@ fn gen_base_levels_for_base_tests(bitset: u8) -> Vec<Option<Level>> {
 }

 #[test]
-#[should_panic(expected = "3514 test cases failed! (88193 passed)")]
+#[should_panic(expected = "69 test cases failed! (91638 passed)")]


Manishearth · 2022-12-20T00:25:28Z

I think as far as review is concerned I will be satisfied with one review (from @eggrobin or otherwise) on the algorithm, and also I'd like @sffc or @echeran to look at the data model. Also fine with landing this with algorithm review and an understanding that we won't cut a release until someone has had a look at the data model.

Mostly because this PR is getting big and I already have another PR built on it (#91) plus a complicated fix (#92) that I don't want to keep rebasing.

Manishearth added 8 commits December 17, 2022 19:45

Update generation script for bidipairs

6fb3dd6

Add to BidiDataSource

55569f9

Add identify_bracket_pairs

0153214

Add resolve_neutral step N0

31317c6

Always pull in hardcoded data

ee20c4f

Some cleanups

c7a381c

Try to maintain the invariant of everything being updated

3466a2d

update test

f9e09fe

Manishearth mentioned this pull request Dec 18, 2022

Inconsistency between iterating over byte and char indices #86

Open

Manishearth changed the title ~~Implement N0~~ Implement N0 (bracket matching) Dec 18, 2022

mdebbar reviewed Dec 19, 2022

eggrobin reviewed Dec 19, 2022

Manishearth added 5 commits December 19, 2022 09:19

Fix bug around found_l and found_r

18a7a04

feedback: found_e and found_not_e

7fa0db4

feedback: comments

394fd9c

Clarity on enclosed

e88e049

unidata

1069c3b

mdebbar reviewed Dec 19, 2022

Handle canonical equivalence

ea43d44

Manishearth mentioned this pull request Dec 19, 2022

Analysis of failing character tests (after #85) #90

Closed

Manishearth commented Dec 19, 2022

eggrobin reviewed Dec 19, 2022

Manishearth force-pushed the brace-yourself branch from 3696391 to ea43d44 Compare December 19, 2022 22:56

handle EN/AN in N0

96722a0

Manishearth force-pushed the brace-yourself branch from c513ed9 to 96722a0 Compare December 20, 2022 00:08

Manishearth mentioned this pull request Dec 20, 2022

Fix W2 rule ordering; behavior of multibyte characters in W4 #91

Merged

Manishearth mentioned this pull request Dec 20, 2022

Handle retaining explicit formatting characters + other fixes #92

Merged

eggrobin approved these changes Dec 20, 2022

Introduce BidiMatchedOpeningBracket

02f3ddc

Manishearth merged commit cb61bd5 into servo:master Dec 20, 2022

Manishearth deleted the brace-yourself branch December 20, 2022 18:40

bors-servo mentioned this pull request Dec 20, 2022

Fix behavior of multibyte characters in W4 #87

Closed

echeran mentioned this pull request Jan 25, 2023

Provide data for Bidi pairing of brackets unicode-org/icu4x#3030

Closed
