Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Negation of full byte-range class causes panic #303

Closed
scooter-dangle opened this issue Dec 7, 2016 · 5 comments
Closed

Negation of full byte-range class causes panic #303

scooter-dangle opened this issue Dec 7, 2016 · 5 comments
Labels

Comments

@scooter-dangle
Copy link
Contributor

scooter-dangle commented Dec 7, 2016

For a project I'm working on, we wanted to have a simple match-none regular expression. My original attempt was to use an empty capture group. The one I constructed, r#"[^\x00-\xff]"#, results in a panic when used with regex::bytes::Regex (rather than an error as in the case of the more elegant r#"[^\d\D]"#, now that #257 has been fixed).

In https://github.com/scooter-dangle/regex/tree/explicitly-empty-byte-range, I add a test for this particular case to regex-syntax/src/parser.rs. (https://github.com/scooter-dangle/regex/blob/explicitly-empty-byte-range/regex-syntax/src/parser.rs#L2750) (Per notes below, this original construction was incorrect. See https://github.com/scooter-dangle/regex/blob/panic-on-empty-byte-range-for-bytes-regex/tests/bytes.rs#L57-L61 instead.)

Note: For the original goal of a match-none regex, we currently use r#"\z.\A"#.

@scooter-dangle scooter-dangle changed the title Negation of full byte range class causes panic Negation of full byte-range class causes panic Dec 7, 2016
@BurntSushi BurntSushi added the bug label Dec 7, 2016
@BurntSushi
Copy link
Member

Ah yup, looks like a bug! Nice find. :-)

@scooter-dangle
Copy link
Contributor Author

scooter-dangle commented Dec 8, 2016

To avoid confusion for future readers, I thought I should point out that part of my description above is technically incorrect: We're using a version of regex prior to the #257 fix, so to me this currently looks like a panic. I've tried just using the expression on master, though, and it appears to compile and work correctly at first glance.

There is still technically a panic in the additional test I referenced, but only because the particular test macro expects an error—the regex itself isn't causing the panic.

(By the way...really appreciate the quick response/labeling of the issue submission! Thanks!)

@scooter-dangle
Copy link
Contributor Author

Further note...in our project, we're using the regex::bytes::* variants for user-supplied regular expressions. That could be influencing whether the character class is (and/or should) be considered empty.

@BurntSushi
Copy link
Member

Ah I see, yeah I think I did misunderstand the issue at first glance. When I get a chance I'll dig into this and see if I can straighten everything out. Thanks for bringing it to my attention!

To help me prioritize: can you clarify whether you're blocked on this particular issue or not? If I read you correctly, I think you have a work-around for now, right?

@scooter-dangle
Copy link
Contributor Author

scooter-dangle commented Dec 8, 2016

The particular issue of having a match-none regex is solved by \z.\A.

I think we're still getting a panic on the negated byte range, though. If I'm doing it correctly, I think I'm getting a full-on panic when I insert

mat!(negated_full_byte_range_bytes, r#"[^\x00-\xff]"#, "abcde", None);

into https://github.com/rust-lang-nursery/regex/blob/master/tests/bytes.rs

If I'm reading things correctly, I think it might be due to the code at the end of Parser::parse_class (roughly https://github.com/rust-lang-nursery/regex/blob/master/regex-syntax/src/parser.rs#L592-L600):

    fn parse_class(&mut self) -> Result<Build> {
        /* ... */
        class = self.class_transform(negated, class).canonicalize();
        if class.is_empty() {
            return Err(self.err(ErrorKind::EmptyClass));
        }
        Ok(Build::Expr(if self.flags.unicode {
            Expr::Class(class)
        } else {
            Expr::ClassBytes(class.to_byte_class())
        }))
    }

When the character class is for all code points, it doesn't stop at \xff, so class.is_empty() should return false. But then class.to_byte_class() is essentially reducing the size of the set in which class is operating, and removes all of the entities previously allowed by[^\x00-\xff].

In that case, my initially posted test is incorrect, and the test I added to tests/bytes.rs is the (nearly) correct one. (Except for the fact that that test macro calls unwrap, and empty classes are currently an Err.)

Regarding whether it's a blocker: It's very inconvenient. We already have to be able to catch panics when processing user-supplied data (like regexes) so that we can still provide some kind of feedback to the user. This will go into the list of one or two known panic-causing conditions.

Thanks again for looking into this. Hopefully I'm no longer mixing up my bytes-vs-no-bytes discrepancies with my version discrepancies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants