Handle retaining explicit formatting characters + other fixes #92
This handles all of the fixups from https://www.unicode.org/reports/tr9/#Retaining_Explicit_Formatting_Characters
This uses a BD16 that does *not* match the spec but does match conformance tests, the reference implementation, and ICU4C. BD16 does not explicitly state this, but it intends to ignore overridden brackets.
whoops, got the indices wrong here
All of the N and W algorithms apply within an isolating run sequence, which may have gaps that contain other meaningful characters. We should skip these.
These helpers make it easy to do lookaround within an isolating run sequence.
The N0 check for enclosed strong characters should only check within the run sequence.
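As a rough illustration of lookaround that stays inside a run sequence, here is a minimal sketch. It assumes a run sequence can be modeled as a sorted slice of non-overlapping byte ranges; the function names echo the helpers discussed in this PR, but these signatures and bodies are hypothetical simplifications, not the crate's actual code.

```rust
use std::ops::Range;

/// Hypothetical simplification: yields every byte index covered by `runs`
/// that is at or after `pos`, in ascending order. Indices that fall in the
/// gaps between runs are never produced.
fn iter_forwards_from(pos: usize, runs: &[Range<usize>]) -> impl Iterator<Item = usize> + '_ {
    // For each run, clamp the start to `pos`; runs that end before `pos`
    // become empty ranges and contribute nothing.
    runs.iter().flat_map(move |r| r.start.max(pos)..r.end)
}

/// Hypothetical simplification: yields every byte index covered by `runs`
/// that is strictly before `pos`, in descending order.
fn iter_backwards_from(pos: usize, runs: &[Range<usize>]) -> impl Iterator<Item = usize> + '_ {
    // Walk the runs last-to-first, clamp each end to `pos`, and reverse
    // the indices inside each run.
    runs.iter()
        .rev()
        .flat_map(move |r| (r.start..r.end.min(pos)).rev())
}
```

With runs `0..3` and `7..10`, iterating forwards from 2 visits 2, 7, 8, 9 and iterating backwards from 8 visits 7, 2, 1, 0; bytes 3 through 6 belong to some other run sequence and are skipped in both directions.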
We also probably should fuzz this crate to catch panics, and maybe also fuzz it against the reference implementation once all tests pass.
seems plausible; what type of review are you looking for?
We use lookaround in a bunch of other places. Use the new helpers there too
My reaction is that iter_forwards_from and iter_backwards_from seem to be good abstractions, but they still give you UTF-8 indices, yes? So if you want to "flip" a whole code point from one level or property to another, you need to overwrite all the bytes for that code point, right?
@@ -47,13 +47,17 @@ pub fn compute(
            RLE | LRE | RLO | LRO | RLI | LRI | FSI => {
                let last_level = stack.last().level;

                // <https://www.unicode.org/reports/tr9/#Retaining_Explicit_Formatting_Characters>
                levels[i] = last_level;
I'm not super happy that all this code uses `[]`, but you have another issue to follow up on that (fuzzing this crate to make sure it doesn't panic).
Yeah the whole crate does this
if *class != BN {
    break;
}
*class = class_to_set;
You're only setting one element; is this intentional, or do you need to set multiple elements? Is `processing_classes` in UTF-8 or UTF-32 indices?
The loop makes this unnecessary; we are maintaining the property of every byte having its classes set, so we don't have to worry about this. The existing code does this in many spots. I do want to clean this up by aggressively commenting or by building better abstractions. (Or potentially storing `Option<BidiClass>` or something.)
Everything is in UTF-8 indices.
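To make the per-byte invariant concrete, here is a minimal sketch. It assumes a simplified stand-in `Class` enum and a hypothetical `overwrite_bn_run` function that mirrors the loop in the hunk above; neither is the crate's actual type or API.

```rust
/// Hypothetical stand-in for the crate's bidi class type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Class {
    L,
    R,
    BN,
}

/// Starting at byte offset `start`, overwrite consecutive `BN` entries with
/// `class_to_set` and stop at the first entry that is not `BN`.
///
/// `classes` is assumed to hold one entry per UTF-8 byte of the text, with
/// every byte of a multi-byte code point carrying that code point's class.
fn overwrite_bn_run(classes: &mut [Class], start: usize, class_to_set: Class) {
    for class in &mut classes[start..] {
        if *class != Class::BN {
            break;
        }
        *class = class_to_set;
    }
}
```

For the text `"a\u{200B}b"`, U+200B (ZERO WIDTH SPACE, class BN) occupies three bytes, so the per-byte classes are `[L, BN, BN, BN, L]`. Calling `overwrite_bn_run(&mut classes, 1, Class::R)` rewrites all three bytes of that code point and stops at the trailing `L`, which is why the loop does not need separate code-point handling as long as every byte carries its class.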
And all tests pass now. Bonus fix for #8.
kick CI
Forgot to include variable renaming
Built on top of #91. PR starts at "Add all fixups for retaining explicits".
Fixes #89 by doing the implementation hacks specified in section 5.2 Retaining BNs and Explicit Formatting Characters.
This fixes all of the character tests! (fixes #90).
This also fixes the "basic" tests (fixes #8), which are unfortunately not as nicely labeled as the character tests.
This got mixed in with a bunch of other fixes since a lot of the code is nearby; a bunch of them fix things from N0. The main big fix in here, besides the one around retaining explicit formatting, is that the code now always handles lookahead/lookbehind by iterating within the run sequence only.
All in all, this PR contains: