Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing for invalid control characters in class atoms #85

Merged
merged 1 commit into from
Jan 28, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
124 changes: 80 additions & 44 deletions src/parse.rs
Original file line number Diff line number Diff line change
Expand Up @@ -491,8 +491,13 @@ where
}

// Parse a code point or character class.
let Some(first) = self.try_consume_bracket_class_atom()? else {
continue;
let first = match self.try_consume_bracket_class_atom()? {
Some((first, None)) => first,
Some((first, Some(second))) => {
add_class_atom(&mut result, first);
second
}
_ => continue,
};

// Check for a dash; we may have a range.
Expand All @@ -501,7 +506,7 @@ where
continue;
}

let Some(second) = self.try_consume_bracket_class_atom()? else {
let Some((second, third)) = self.try_consume_bracket_class_atom()? else {
// No second atom. For example: [a-].
add_class_atom(&mut result, first);
add_class_atom(&mut result, ClassAtom::CodePoint(u32::from('-')));
Expand All @@ -520,6 +525,11 @@ where
first: *c1,
last: *c2,
});

if let Some(third) = third {
add_class_atom(&mut result, third);
}

continue;
}

Expand All @@ -534,7 +544,9 @@ where
}
}

fn try_consume_bracket_class_atom(&mut self) -> Result<Option<ClassAtom>, Error> {
fn try_consume_bracket_class_atom(
&mut self,
) -> Result<Option<(ClassAtom, Option<ClassAtom>)>, Error> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a complicated return type. Here's two alternatives for simplifying it:

  1. When encountering "\c" in a bracket, instead of returning both, unread the "c" and only return the backslash. The next iteration through the loop will then pick up the "c".
  2. Instead return a Result<Vec<ClassAtom>, Error>

I think the first will end up quite clean assuming it works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good, I will try to change it to the first if it works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works well. I implemented it in #88.

let c = self.peek();
if c.is_none() {
return Ok(None);
Expand All @@ -556,98 +568,122 @@ where
// ClassEscape :: b
'b' => {
self.consume('b');
Ok(Some(ClassAtom::CodePoint(u32::from('\x08'))))
Ok(Some((ClassAtom::CodePoint(u32::from('\x08')), None)))
}
// ClassEscape :: [+UnicodeMode] -
'-' if self.flags.unicode => {
self.consume('-');
Ok(Some(ClassAtom::CodePoint(u32::from('-'))))
Ok(Some((ClassAtom::CodePoint(u32::from('-')), None)))
}
// ClassEscape :: [~UnicodeMode] c ClassControlLetter
'c' if !self.flags.unicode => {
let input = self.input.clone();
self.consume('c');
match self.peek().map(to_char_sat) {
// ClassEscape :: [~UnicodeMode] c ClassControlLetter
Some('0'..='9' | '_') => {
let next = self.next().expect("char was not next");
Ok(Some(ClassAtom::CodePoint(next & 0x1F)))
Ok(Some((ClassAtom::CodePoint(next & 0x1F), None)))
}
// ClassEscape :: CharacterEscape
_ => {
self.input = input;
let cc = self.consume_character_escape()?;
Ok(Some(ClassAtom::CodePoint(cc)))
// CharacterEscape :: c AsciiLetter
Some('a'..='z' | 'A'..='Z') => {
let next = self.next().expect("char was not next");
Ok(Some((ClassAtom::CodePoint(next % 32), None)))
}
// ClassAtomNoDash :: \ [lookahead = c]
_ => Ok(Some((
ClassAtom::CodePoint(u32::from('\\')),
Some(ClassAtom::CodePoint(u32::from('c'))),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to follow this - from what I can tell in the spec it says:

Return the numeric value of U+005C (REVERSE SOLIDUS).

but the actual behavior in Chrome agrees with your code - do you know where this is documented?

Copy link
Collaborator Author

@raskad raskad Jan 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can read the spec like you have suggested to write to code in your other comment. It just says that if we encounter a \ and the next char is a c we return U+005C. Then the spec "returns" and we read the c as a single char next.

))),
}
}
// ClassEscape :: CharacterClassEscape :: d
'd' => {
self.consume('d');
Ok(Some(ClassAtom::CharacterClass {
class_type: CharacterClassType::Digits,
positive: true,
}))
Ok(Some((
ClassAtom::CharacterClass {
class_type: CharacterClassType::Digits,
positive: true,
},
None,
)))
}
// ClassEscape :: CharacterClassEscape :: D
'D' => {
self.consume('D');
Ok(Some(ClassAtom::CharacterClass {
class_type: CharacterClassType::Digits,
positive: false,
}))
Ok(Some((
ClassAtom::CharacterClass {
class_type: CharacterClassType::Digits,
positive: false,
},
None,
)))
}
// ClassEscape :: CharacterClassEscape :: s
's' => {
self.consume('s');
Ok(Some(ClassAtom::CharacterClass {
class_type: CharacterClassType::Spaces,
positive: true,
}))
Ok(Some((
ClassAtom::CharacterClass {
class_type: CharacterClassType::Spaces,
positive: true,
},
None,
)))
}
// ClassEscape :: CharacterClassEscape :: S
'S' => {
self.consume('S');
Ok(Some(ClassAtom::CharacterClass {
class_type: CharacterClassType::Spaces,
positive: false,
}))
Ok(Some((
ClassAtom::CharacterClass {
class_type: CharacterClassType::Spaces,
positive: false,
},
None,
)))
}
// ClassEscape :: CharacterClassEscape :: w
'w' => {
self.consume('w');
Ok(Some(ClassAtom::CharacterClass {
class_type: CharacterClassType::Words,
positive: true,
}))
Ok(Some((
ClassAtom::CharacterClass {
class_type: CharacterClassType::Words,
positive: true,
},
None,
)))
}
// ClassEscape :: CharacterClassEscape :: W
'W' => {
self.consume('W');
Ok(Some(ClassAtom::CharacterClass {
class_type: CharacterClassType::Words,
positive: false,
}))
Ok(Some((
ClassAtom::CharacterClass {
class_type: CharacterClassType::Words,
positive: false,
},
None,
)))
}
// ClassEscape :: CharacterClassEscape :: [+UnicodeMode] p{ UnicodePropertyValueExpression }
// ClassEscape :: CharacterClassEscape :: [+UnicodeMode] P{ UnicodePropertyValueExpression }
'p' | 'P' if self.flags.unicode => {
self.consume(ec);
let property_escape = self.try_consume_unicode_property_escape()?;
let negate = ec == 'P' as u32;
Ok(Some(ClassAtom::UnicodePropertyEscape {
property_escape,
negate,
}))
Ok(Some((
ClassAtom::UnicodePropertyEscape {
property_escape,
negate,
},
None,
)))
}
// ClassEscape :: CharacterEscape
_ => {
let cc = self.consume_character_escape()?;
Ok(Some(ClassAtom::CodePoint(cc)))
Ok(Some((ClassAtom::CodePoint(cc), None)))
}
}
}

_ => Ok(Some(ClassAtom::CodePoint(self.consume(c)))),
_ => Ok(Some((ClassAtom::CodePoint(self.consume(c)), None))),
}
}

Expand Down
11 changes: 11 additions & 0 deletions tests/tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1639,3 +1639,14 @@ fn test_unicode_folding_tc(tc: TestConfig) {
tc.test_match_succeeds(r"\u{212A}", "iu", "k");
tc.test_match_succeeds(r"\u{212A}", "iu", "K");
}

#[test]
fn test_class_invalid_control_character() {
test_with_configs(test_class_invalid_control_character_tc)
}

fn test_class_invalid_control_character_tc(tc: TestConfig) {
tc.test_match_succeeds("[\\c\u{0}]", "", "\\");
tc.test_match_succeeds("[\\c\u{0}]", "", "c");
tc.test_match_succeeds("[\\c\u{0}]", "", "\u{0}");
}
Loading