Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle retaining explicit formatting characters + other fixes #92

Merged
merged 21 commits into from
Jan 18, 2023

Conversation

Manishearth
Copy link
Member

@Manishearth Manishearth commented Dec 20, 2022

Built on top of #91. PR starts at "Add all fixups for retaining explicits".

Fixes #89 by doing the implementation hacks specified in section 5.2 Retaining BNs and Explicit Formatting Characters.

This fixes all of the character tests! (fixes #90). We still have to fix some of the "basic" tests, which are unfortunately not as nicely labeled as the character tests. This also fixes the basic tests (fixes #8 )

This got mixed in with a bunch of other fixes since a lot of the code is nearby, a bunch of them fix things from N0. The main Big Fix in here besides the one around retaining explicit formatting is that the code now always handles lookahead/lookbehind by iterating within the run sequence only.

All in all, this PR contains:

@Manishearth Manishearth force-pushed the retain-explicit branch 5 times, most recently from 52a4aac to 98c05a8 Compare December 20, 2022 06:22
@Manishearth Manishearth force-pushed the retain-explicit branch 4 times, most recently from 756acdf to 7fdfacb Compare December 21, 2022 04:23
@Manishearth Manishearth marked this pull request as ready for review December 21, 2022 04:24
This uses a BD16 that does *not* match the spec but does match
conformance tests, the reference implementation, and ICU4C. BD16 does not explicitly state this, but it intends
to ignore overridden brackets.
whoops, got the indices wrong here
All of the N and W algorithms apply within an isolating run sequence,
which may have gaps that contain other meaningful characters. We shoudl
skip these.
These helpers make it easy to do lookaround within an isolating run
sequence.
The N0 check for enclosed strong characters should only check within the
run sequence.
@Manishearth
Copy link
Member Author

Manishearth commented Dec 21, 2022

r? @eggrobin or @sffc

Note for reviewers: Should be reviewed commit-by-commit. If you review the first two commits, you need not review #91, just let me know and I can merge that.

Looking for spec review from @eggrobin, Rust code review from @sffc , and ideally "is this the best way to implement this" from both.

@Manishearth Manishearth changed the title Handle retaining explicit formatting characters Handle retaining explicit formatting characters + other fixes Dec 21, 2022
@Manishearth
Copy link
Member Author

We also probably should fuzz this crate to catch panics, and maybe also fuzz it against the reference implementation once all tests pass.

Copy link
Contributor

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems plausible; what type of review are you looking for?

@Manishearth
Copy link
Member Author

Manishearth commented Dec 21, 2022

@sffc I'm hoping @eggrobin can do spec review, and you can do general Rusty code review as well as "is this the best way to implement this step" review.

FWIW If you review everything here including the first two commits I don't need to get review on #91.

(expanded my r? comment to say so)

Copy link
Contributor

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reaction is that iter_forwards_from and iter_backwards_from seem to be good abstractions, but they still give you UTF-8 indices, yes? So if you want to "flip" a whole code point from one level or property to another, you need to overwrite all the bytes for that code point, right?

@@ -47,13 +47,17 @@ pub fn compute(
RLE | LRE | RLO | LRO | RLI | LRI | FSI => {
let last_level = stack.last().level;

// <https://www.unicode.org/reports/tr9/#Retaining_Explicit_Formatting_Characters>
levels[i] = last_level;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy that all this code uses [] but you have another issue to follow up on that (fuzzing this crate to make sure it doesn't panic).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the whole crate does this

if *class != BN {
break;
}
*class = class_to_set;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're only setting one element; is this intentional, or do you need to set multiple elements? Is processing_classes in UTF-8 or UTF-32 indices?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop makes this unnecessary; we are maintaining the property of every byte having its classes set so we don't have to worry about this. The existing code does this in many spots. I do want to clean this up by aggressively commenting or by building better abstractions. (Or potentially storing Option<BidiClass> or something)

Everything is in UTF-8 indices.

@Manishearth
Copy link
Member Author

And all tests pass, now. Bonus fix for #8

@Manishearth Manishearth reopened this Dec 29, 2022
@Manishearth
Copy link
Member Author

kick CI

src/explicit.rs Outdated Show resolved Hide resolved
src/implicit.rs Outdated Show resolved Hide resolved
src/implicit.rs Show resolved Hide resolved
src/implicit.rs Outdated Show resolved Hide resolved
@Manishearth Manishearth merged commit 64611a5 into servo:master Jan 18, 2023
@Manishearth Manishearth deleted the retain-explicit branch January 18, 2023 20:10
Manishearth added a commit that referenced this pull request Jan 18, 2023
Forgot to include variable renaming
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants