Implement N0 (bracket matching) #85
Merged
16 commits
6fb3dd6 Update generation script for bidipairs (Manishearth)
55569f9 Add to BidiDataSource (Manishearth)
0153214 Add identify_bracket_pairs (Manishearth)
31317c6 Add resolve_neutral step N0 (Manishearth)
ee20c4f Always pull in hardcoded data (Manishearth)
c7a381c Some cleanups (Manishearth)
3466a2d Try to maintain the invariant of everything being updated (Manishearth)
f9e09fe update test (Manishearth)
18a7a04 Fix bug around found_l and found_r (Manishearth)
7fa0db4 feedback: found_e and found_not_e (Manishearth)
394fd9c feedback: comments (Manishearth)
e88e049 Clarity on enclosed (Manishearth)
1069c3b unidata (Manishearth)
ea43d44 Handle canonical equivalence (Manishearth)
96722a0 handle EN/AN in N0 (Manishearth)
02f3ddc Introduce BidiMatchedOpeningBracket (Manishearth)
I'm not a fan of this API. Perhaps we should split this into two methods, one which handles bidi matching, and one which gets you normalization, ideally expected to only work for properties that also have bidi matched pair values set?
cc @sffc and @echeran since we'll have to deal with this upstream in ICU4X
Is this derivable from the Bidi Mirrored Glyph property, or do we need more than that?
https://www.unicode.org/Public/UCD/latest/ucd/BidiBrackets.txt
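For context, each data row in BidiBrackets.txt has the form `code point; paired bracket; type # name`, where the type is `o` (open) or `c` (close). A minimal sketch of reading one row follows; `BracketEntry` and `parse_bracket_line` are hypothetical names for illustration, not part of unicode-bidi or ICU4X.

```rust
/// One data row of BidiBrackets.txt (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct BracketEntry {
    /// The bracket character itself.
    ch: char,
    /// Its Bidi_Paired_Bracket value.
    paired: char,
    /// True for type `o` (open), false for type `c` (close).
    is_open: bool,
}

/// Parses a hexadecimal code point field such as "0028".
fn parse_code_point(hex: &str) -> Option<char> {
    u32::from_str_radix(hex, 16).ok().and_then(char::from_u32)
}

/// Parses a single line; returns None for comments, blank lines, or rows
/// that do not have the expected three fields.
fn parse_bracket_line(line: &str) -> Option<BracketEntry> {
    // Drop the trailing `# ...` comment, if any.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';').map(str::trim);
    let ch = parse_code_point(fields.next()?)?;
    let paired = parse_code_point(fields.next()?)?;
    let is_open = match fields.next()? {
        "o" => true,
        "c" => false,
        _ => return None, // anything else is not a bracket row
    };
    Some(BracketEntry { ch, paired, is_open })
}
```

Calling `parse_bracket_line("0028; 0029; o # LEFT PARENTHESIS")` yields an entry for `(` paired with `)` and marked as opening.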
@sffc in part, but it also needs normalization (currently for a single character pair). Note that the mirrored_glyph property has two values, so the tuple API is still necessary.
As I said, it could be split into two to move normalization out.
Currently we only need to carry Bidi_Class data. I'm concerned that if we need both General_Category and Bidi_Mirrored and some normalization stuff, it will dramatically increase the size of our BiDi-only ICU4X build, even though the data seems to resolve down to only a small amount at the end of the day.
Perhaps ICU4X should resolve all of this at datagen time and introduce a new small key specifically for this purpose? What would that look like?
FWIW, when I made my ICU4X-Bidi wasm build with this PR yesterday (the last commit at that time was "Handle canonical equivalence"), the size of the wasm build increased from ~20KB to ~23KB (brotli-compressed). Not a deal breaker for us, but I won't be opposed to reducing the size if there is some low-hanging fruit 🙂
Yeah so my proposal is we do one of: `char -> (mirrored char, open/close)`, and a small hybrid property that is the value of ToNFC for bidi mirrored characters only. I kind of like the latter.
There are also two corresponding trait models for the unicode-bidi crate: `.normalize(char) -> Option<char>` (or just have it return `self` instead of `None`), and `bidi_mirrored(char) -> Option<(char, bool)>`. In the trait docs, `.normalize()` is not required to know how to normalize anything that is not `.bidi_mirrored()`.
Note that the trait model and the ICU4X data model are two separate questions; each trait model can be implemented with either data model.
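A rough sketch of the two-method trait model described above, to make the shape concrete. The trait name `BracketData`, the `TinyData` implementation, and the helper `brackets_match` are all hypothetical; the data is a tiny hardcoded subset, and the assumption here is that the canonical-equivalence case is the angle bracket pair U+2329/U+232A, which normalizes to U+3008/U+3009.

```rust
/// Hypothetical trait following the two-method model; not the crate's actual API.
trait BracketData {
    /// Returns the paired bracket and whether `c` is an opening bracket.
    fn bidi_mirrored(&self, c: char) -> Option<(char, bool)>;

    /// Canonical normalization. Only required to be meaningful for
    /// characters that `bidi_mirrored` knows about.
    fn normalize(&self, c: char) -> Option<char>;
}

/// Tiny hardcoded data set, for illustration only.
struct TinyData;

impl BracketData for TinyData {
    fn bidi_mirrored(&self, c: char) -> Option<(char, bool)> {
        match c {
            '(' => Some((')', true)),
            ')' => Some(('(', false)),
            '[' => Some((']', true)),
            ']' => Some(('[', false)),
            '\u{2329}' => Some(('\u{232A}', true)),
            '\u{232A}' => Some(('\u{2329}', false)),
            '\u{3008}' => Some(('\u{3009}', true)),
            '\u{3009}' => Some(('\u{3008}', false)),
            _ => None,
        }
    }

    fn normalize(&self, c: char) -> Option<char> {
        match c {
            // Assumed single pair needing canonical normalization.
            '\u{2329}' => Some('\u{3008}'),
            '\u{232A}' => Some('\u{3009}'),
            _ => None,
        }
    }
}

/// Do `open` and `close` form a bracket pair, up to canonical equivalence?
fn brackets_match<D: BracketData>(data: &D, open: char, close: char) -> bool {
    // Fall back to the character itself when no normalization is defined.
    let norm = |c: char| data.normalize(c).unwrap_or(c);
    match (data.bidi_mirrored(open), data.bidi_mirrored(close)) {
        (Some((expected_close, true)), Some((_, false))) => {
            norm(expected_close) == norm(close)
        }
        _ => false,
    }
}
```

With this split, a caller that only needs pairing never touches `normalize`, and `normalize` can stay a table of a couple of entries rather than a full normalizer.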
I think I lean toward the fully resolved property ("frankenproperty") for the API, but it should return a struct instead of an `Option<(char, bool)>` to be more clear about what it is doing. It's nice that the API enforces that it only needs to be defined for fewer characters, rather than having `.normalize(char) -> Option<char>`, which makes it look very tempting to be hooked up to a full normalizer, which is neither necessary nor efficient.
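As an illustration of the struct-returning shape, here is a minimal sketch. The struct name is borrowed from the last commit title in this PR, but the field names, the free function, and the hardcoded table are assumptions made for this example. The pair finder is a simplified take on the stack-based matching in UAX #9 rule BD16: it ignores bidi classes and isolating run sequences and works on byte offsets.

```rust
/// Sketch of a fully resolved lookup result; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BidiMatchedOpeningBracket {
    /// The opening bracket of the pair, already normalized.
    opening: char,
    /// Whether the queried character is itself an opening bracket.
    is_open: bool,
}

/// Hypothetical lookup over a tiny hardcoded table.
fn bidi_matched_opening_bracket(c: char) -> Option<BidiMatchedOpeningBracket> {
    let (opening, is_open) = match c {
        '(' => ('(', true),
        ')' => ('(', false),
        '[' => ('[', true),
        ']' => ('[', false),
        '{' => ('{', true),
        '}' => ('{', false),
        // Canonically equivalent angle brackets resolve to one opening char.
        '\u{2329}' | '\u{3008}' => ('\u{3008}', true),
        '\u{232A}' | '\u{3009}' => ('\u{3008}', false),
        _ => return None,
    };
    Some(BidiMatchedOpeningBracket { opening, is_open })
}

/// Simplified bracket-pair finder: returns (open, close) byte offsets,
/// sorted by the position of the opening bracket.
fn identify_bracket_pairs_sketch(text: &str) -> Vec<(usize, usize)> {
    const MAX_DEPTH: usize = 63; // stack limit given in BD16
    let mut stack: Vec<(char, usize)> = Vec::new();
    let mut pairs = Vec::new();
    for (i, c) in text.char_indices() {
        let m = match bidi_matched_opening_bracket(c) {
            Some(m) => m,
            None => continue,
        };
        if m.is_open {
            if stack.len() >= MAX_DEPTH {
                break; // no room left: stop processing
            }
            stack.push((m.opening, i));
        } else if let Some(pos) = stack.iter().rposition(|&(open, _)| open == m.opening) {
            pairs.push((stack[pos].1, i));
            // Pop the matched opener and everything above it.
            stack.truncate(pos);
        }
    }
    pairs.sort();
    pairs
}
```

Because the lookup already returns the normalized opening bracket, the matching loop compares openers directly and never needs a separate normalization call.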
@sffc do you mind if I merge this as is and we iterate on the data model elsewhere? My stack of PRs is getting unwieldy. I'm fine not doing so as well, but I figured if you don't mind I'd prefer to merge now.
(and just not release while it's in this state)
Yeah of course. I can't approve PRs on this repo anyway.