Negation of full byte-range class causes panic #303

scooter-dangle · 2016-12-07T23:44:14Z

For a project I'm working on, we wanted to have a simple match-none regular expression. My original attempt was to use an empty capture group. The one I constructed, r#"[^\x00-\xff]"#, results in a panic when used with regex::bytes::Regex (rather than an error as in the case of the more elegant r#"[^\d\D]"#, now that #257 has been fixed).

In https://github.com/scooter-dangle/regex/tree/explicitly-empty-byte-range, I add a test for this particular case to regex-syntax/src/parser.rs. (https://github.com/scooter-dangle/regex/blob/explicitly-empty-byte-range/regex-syntax/src/parser.rs#L2750) (Per notes below, this original construction was incorrect. See https://github.com/scooter-dangle/regex/blob/panic-on-empty-byte-range-for-bytes-regex/tests/bytes.rs#L57-L61 instead.)

Note: For the original goal of a match-none regex, we currently use r#"\z.\A"#.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-12-07T23:47:40Z

Ah yup, looks like a bug! Nice find. :-)

scooter-dangle · 2016-12-08T00:18:17Z

To avoid confusion for future readers, I thought I should point out that part of my description above is technically incorrect: We're using a version of regex prior to the #257 fix, so to me this currently looks like a panic. I've tried just using the expression on master, though, and it appears to compile and work correctly at first glance.

There is still technically a panic in the additional test I referenced, but only because the particular test macro expects an error—the regex itself isn't causing the panic.

(By the way...really appreciate the quick response/labeling of the issue submission! Thanks!)

scooter-dangle · 2016-12-08T00:37:58Z

Further note...in our project, we're using the regex::bytes::* variants for user-supplied regular expressions. That could be influencing whether the character class is (and/or should) be considered empty.

BurntSushi · 2016-12-08T01:05:32Z

Ah I see, yeah I think I did misunderstand the issue at first glance. When I get a chance I'll dig into this and see if I can straighten everything out. Thanks for bringing it to my attention!

To help me prioritize: can you clarify whether you're blocked on this particular issue or not? If I read you correctly, I think you have a work-around for now, right?

scooter-dangle · 2016-12-08T01:30:34Z

The particular issue of having a match-none regex is solved by \z.\A.

I think we're still getting a panic on the negated byte range, though. If I'm doing it correctly, I think I'm getting a full-on panic when I insert

mat!(negated_full_byte_range_bytes, r#"[^\x00-\xff]"#, "abcde", None);

into https://github.com/rust-lang-nursery/regex/blob/master/tests/bytes.rs

If I'm reading things correctly, I think it might be due to the code at the end of Parser::parse_class (roughly https://github.com/rust-lang-nursery/regex/blob/master/regex-syntax/src/parser.rs#L592-L600):

    fn parse_class(&mut self) -> Result<Build> {
        /* ... */
        class = self.class_transform(negated, class).canonicalize();
        if class.is_empty() {
            return Err(self.err(ErrorKind::EmptyClass));
        }
        Ok(Build::Expr(if self.flags.unicode {
            Expr::Class(class)
        } else {
            Expr::ClassBytes(class.to_byte_class())
        }))
    }

When the character class is for all code points, it doesn't stop at \xff, so class.is_empty() should return false. But then class.to_byte_class() is essentially reducing the size of the set in which class is operating, and removes all of the entities previously allowed by[^\x00-\xff].

In that case, my initially posted test is incorrect, and the test I added to tests/bytes.rs is the (nearly) correct one. (Except for the fact that that test macro calls unwrap, and empty classes are currently an Err.)

Regarding whether it's a blocker: It's very inconvenient. We already have to be able to catch panics when processing user-supplied data (like regexes) so that we can still provide some kind of feedback to the user. This will go into the list of one or two known panic-causing conditions.

Thanks again for looking into this. Hopefully I'm no longer mixing up my bytes-vs-no-bytes discrepancies with my version discrepancies.

scooter-dangle changed the title ~~Negation of full byte range class causes panic~~ Negation of full byte-range class causes panic Dec 7, 2016

BurntSushi added the bug label Dec 7, 2016

scooter-dangle mentioned this issue Dec 8, 2016

Verify character class still non-empty after converting to byte class #304

Merged

BurntSushi closed this as completed in #304 Dec 29, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Negation of full byte-range class causes panic #303

Negation of full byte-range class causes panic #303

scooter-dangle commented Dec 7, 2016 •

edited

Loading

BurntSushi commented Dec 7, 2016

scooter-dangle commented Dec 8, 2016 •

edited

Loading

scooter-dangle commented Dec 8, 2016

BurntSushi commented Dec 8, 2016

scooter-dangle commented Dec 8, 2016 •

edited

Loading

Negation of full byte-range class causes panic #303

Negation of full byte-range class causes panic #303

Comments

scooter-dangle commented Dec 7, 2016 • edited Loading

BurntSushi commented Dec 7, 2016

scooter-dangle commented Dec 8, 2016 • edited Loading

scooter-dangle commented Dec 8, 2016

BurntSushi commented Dec 8, 2016

scooter-dangle commented Dec 8, 2016 • edited Loading

scooter-dangle commented Dec 7, 2016 •

edited

Loading

scooter-dangle commented Dec 8, 2016 •

edited

Loading

scooter-dangle commented Dec 8, 2016 •

edited

Loading