Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ripgrep fails to match pattern including digit character class #1203

Closed
ravron opened this issue Feb 27, 2019 · 2 comments
Closed

ripgrep fails to match pattern including digit character class #1203

ravron opened this issue Feb 27, 2019 · 2 comments
Labels
bug A bug.

Comments

@ravron
Copy link

ravron commented Feb 27, 2019

What version of ripgrep are you using?

ripgrep 0.10.0 (rev 8a7db1a918)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

$ brew tap burntsushi/ripgrep https://github.com/BurntSushi/ripgrep.git
$ brew install ripgrep-bin

What operating system are you using ripgrep on?

macOS 10.14.3 (18D109)

Describe your question, feature request, or bug.

rg appears to fail to find a certain pattern in a one-line file that definitely contains that pattern.

I must be missing something — this seems very unlikely to be a legitimate bug — but I can't figure out what.

If this is a bug, what are the steps to reproduce the behavior?

  1. echo 153.230000 >| test.txt
  2. rg '\d\d\d00' test.txt. This successfully finds a match of 23000.
  3. rg '\d\d\d000' test.txt. This fails to find any match, when it should match 230000

Note that grep '\d\d\d000' test.txt correctly matches 230000. (grep --version grep (BSD grep) 2.5.1-FreeBSD)

If this is a bug, what is the actual behavior?

$ echo 153.230000 >| test.txt
$ rg --debug '\d\d\d000' test.txt
DEBUG|grep_regex::literal|grep-regex/src/literal.rs:110: required literal found: "000"
DEBUG|globset|globset/src/lib.rs:429: built glob set; 0 literals, 0 basenames, 8 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset/src/lib.rs:424: glob converted to regex: Glob { glob: "**/.*.s[a-w][a-z]", re: "(?-u)^(?:/?|.*/)\\..*\\.s[a-w][a-z]$", opts: GlobOptions { case_insensitive: false, literal_separator: false, backslash_escape: true }, tokens: Tokens([RecursivePrefix, Literal('.'), ZeroOrMore, Literal('.'), Literal('s'), Class { negated: false, ranges: [('a', 'w')] }, Class { negated: false, ranges: [('a', 'z')] }]) }
DEBUG|globset|globset/src/lib.rs:429: built glob set; 0 literals, 3 basenames, 0 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 1 regexes
$

If this is a bug, what is the expected behavior?

rg '\d\d\d000' test.txt should identify the single match in the file, as grep does. Specifically:

$ rg '\d\d\d000' test.txt
1:153.230000

Other

Note that changing the corpus in seemingly irrelevant ways can cause the bug to change or disappear. For example, the \d\d\d000 pattern matches if three 0 characters are prepended to the contents of the file (that is, the file contains 000153.230000).

@BurntSushi BurntSushi added the bug A bug. label Feb 27, 2019
BurntSushi added a commit to rust-lang/regex that referenced this issue Feb 27, 2019
This commit fixes a bug where the reverse suffix literal optimization
wasn't quite right. It was too eagerly skipping past parts of the input
without verifying that there was no match. We fix this by being a bit more
careful with what we're searching by keeping track of the starting position
of the last literal matched. Subsequent literal searches then start
immediately after the last one.

This is necessary in particular when the suffix literal can have
overlapping matches. e.g., searching `000` in `0000` can match at either
positions 0 or 1, but searching `abc` in `abcd` can only match as position
0.

This was initially reported as a bug against ripgrep:
BurntSushi/ripgrep#1203
@BurntSushi
Copy link
Owner

Thanks so much for reporting this! It is indeed a bug in the regex engine. You can this and this for the specific changes to the regex engine to fix this. The short story is that you were tripping in a reverse suffix literal optimization that wasn't quite correct. The reason why it seemed sensitive to different regexes and inputs is because it is! :-) The reverse suffix literal optimization only runs in very specific circumstances related to the size and structure of the regex, and this particular bug is only tripped when a suffix literal (such as 000) can result in potentially overlapping matches. In this case, both your pattern and your haystack did this.

This should now be fixed on master, since I've bumped ripgrep's regex dependency to 1.1.2.

@ravron
Copy link
Author

ravron commented Feb 27, 2019

No kidding. Thanks for the explanation and the fixes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

No branches or pull requests

2 participants