Pattern panics non-Unicode bytes::Regex
#367
Comments
The code that computes the byte classes had a bug that was only tripped when the maximum number of equivalence classes was necessary (256). This commit fixes that by rewriting the loop such that we never uselessly increment outside the bounds of `u8`. Fixes rust-lang#367
Nice find. A "minimal" (or perhaps more straightforward) reproduction is the following:

    extern crate regex;

    fn main() {
        // Build a non-Unicode pattern containing every byte value as a literal.
        let mut pat = "(?-u)".to_string();
        for i in 0..256 {
            let hexbyte = format!(r"\x{:02x}", i);
            pat.push_str(&hexbyte);
        }
        // Compiling the pattern is enough to trigger the panic.
        let _ = regex::bytes::Regex::new(&pat);
    }

The underlying cause of this was the "byte class optimization." In particular, the optimization computes equivalence classes of bytes that are used to transition the DFA. There are typically far fewer than 256 equivalence classes, which means the transition table in memory is much smaller. However, the part of the compiler that computed the equivalence classes tripped over a bug when the number of equivalence classes reached its maximum (256):

    let mut byte_classes = vec![0; 256];
    let mut class = 0u8;
    for i in 0..256 {
        byte_classes[i] = class as u8;
        if self.0[i] {
            // Panics when `class` is already 255.
            class = class.checked_add(1).unwrap();
        }
    }
    byte_classes

In this case, on the last iteration of the loop, the `checked_add(1)` call returns `None` because `class` is already 255, so `unwrap` panics, even though the incremented value would never be used.

I have a fix for this in #368. Thanks for the report!
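To make the failure mode concrete, here is a minimal standalone sketch of a loop that assigns classes without ever incrementing past 255. It is not the code from #368: the free function `byte_classes` and its `boundaries` parameter are hypothetical stand-ins for the compiler's method and `self.0`, and it assumes `boundaries[b]` is `true` when byte `b` is the last byte of its class.

```rust
/// Maps each of the 256 byte values to an equivalence class number.
///
/// Assumption: `boundaries[b]` is `true` when byte `b` is the last byte of
/// its class, mirroring the role of `self.0` in the snippet above.
fn byte_classes(boundaries: &[bool; 256]) -> Vec<u8> {
    let mut classes = vec![0u8; 256];
    let mut class = 0u8;
    for i in 0..256 {
        classes[i] = class;
        // Only advance when another byte follows, so `class` never has to
        // exceed 255 even if every byte is its own class.
        if i < 255 && boundaries[i] {
            class += 1;
        }
    }
    classes
}

/// Number of distinct classes, i.e. the width of one row of a transition
/// table indexed by class instead of by raw byte.
fn num_classes(classes: &[u8]) -> usize {
    classes[255] as usize + 1
}
```

Calling `byte_classes(&[true; 256])` gives every byte its own class (0 through 255) without overflowing `u8`, which is the case the reproduction above exercises.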
compiler: fix a byte class bug
I am using rustc version 1.14.0 and regex version 0.2.1. I found a pattern that panics a non-Unicode `bytes::Regex`:

This program demonstrates the panic:

The pattern consists of 136 unique literal byte values, plus the sub-patterns `.`, `[.]`, `\S`, and `\w`. Here is what I have been able to find out:

- I derived the pattern from a `bytes::RegexSetBuilder` that panicked, removing as much as I could while still having it panic, and sorting.
- You can replace `\x44` with a literal `D` and it still panics.
- Changing `\x00` to `\x01` still panics, but changing `\xde` to `\xdf` does not. Changing `\w` to `\b` still panics, but changing `\w` to `\d` does not.
- If you remove `(?-u)` (or `unicode(false)` when using a `RegexBuilder` or `RegexSetBuilder`), then it does not panic.

The same thing happens with `bytes::RegexBuilder`, `bytes::RegexSet`, and `bytes::RegexSetBuilder`. The 140 necessary elements can be distributed across multiple patterns when using a builder. Here is an example of a `bytes::RegexSetBuilder` that panics:

Here are all 140 elements of the pattern in order: