Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

find/captures disagree about match when using $ or \z #334

Closed
CmdrMoozy opened this issue Jan 29, 2017 · 1 comment · Fixed by #343
Closed

find/captures disagree about match when using $ or \z #334

CmdrMoozy opened this issue Jan 29, 2017 · 1 comment · Fixed by #343
Labels

Comments

@CmdrMoozy
Copy link

I believe I have found a bug in this library. I've tried to condense it down into the simplest possible test case which reproduces the problem:

#[test]
fn test_end_of_text() {
    let re = regex!(r"a(b*(X|\z))?");
    let text = "abcbX";
    let ma = re.find(text).unwrap().as_str();
    let mb = re.captures(text).unwrap().get(0).unwrap().as_str();
    assert_eq!("a", ma);
    assert_eq!(ma, mb);
}

I would expect this test case to pass. The first assert_eq! does, but the second assert_eq! reports that re.captures( actually returned "ab" instead of just "a". It seems like (b*(X|\z))? matches the b following the a and returns it as part of the full capture group, even though the following character c doesn't match X|\z. Interestingly, this only seems to happen when using captures, and not find.

@CmdrMoozy CmdrMoozy changed the title Discrepancy in Regex::find / Regex::captures behavior when using '\z' find/captures disagree about match when using $ or \z Feb 12, 2017
@CmdrMoozy
Copy link
Author

I turned this into a real regression test according to this crate's existing style:

// See: https://github.com/rust-lang/regex/issues/334
#[test]
fn regression_end_of_text_mismatch() {
    let re = regex!(r"a(b*(X|$))?");
    let text = text!("abcbX");
    let ma = re.find(text).unwrap();
    let mb = re.captures(text).unwrap().get(0).unwrap();
    assert_eq!(text!("a"), match_text!(ma));
    assert_eq!(match_text!(ma), match_text!(mb));
}

Another thing I noted while doing this is that, while not in multi-line mode, the same test fails regardless of whether I'm using $ or \z.

@BurntSushi BurntSushi added the bug label Feb 18, 2017
BurntSushi added a commit to BurntSushi/regex that referenced this issue Feb 18, 2017
When searching for captures, we first use the DFA to find the start and
end of the match. We then pass just the matched region of text to the
NFA engine to find sub-capture locations. This is a key optimization
that prevents the NFA engine from searching a lot more text than what is
necessary in some cases.

One problem with this is that some instructions determine their match
state based on whether the engine is at the boundary of the search text.
For example, `$` matches if and only if the engine is at EOF. If we only
provide the matched text region, then assertions like `\b` might not
work, since it needs to examine at least one character past the end of
the match. If we provide the matched text region plus one character,
then `$` may match when it shouldn't. Therefore, we provide the matched
text plus (at most) two characters.

Fixes rust-lang#334
@BurntSushi BurntSushi mentioned this issue Feb 18, 2017
bors added a commit that referenced this issue Feb 18, 2017
Fixes

This PR contains a series of commits that fixes several minor bugs.

Fixes #321, Fixes #334, Fixes #326, Fixes #333, Fixes #338
@bors bors closed this as completed in #343 Feb 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants