ripgrep fails to match pattern including digit character class #1203

ravron · 2019-02-27T20:30:11Z

What version of ripgrep are you using?

ripgrep 0.10.0 (rev 8a7db1a918)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

$ brew tap burntsushi/ripgrep https://github.com/BurntSushi/ripgrep.git
$ brew install ripgrep-bin

What operating system are you using ripgrep on?

macOS 10.14.3 (18D109)

Describe your question, feature request, or bug.

rg appears to fail to find a certain pattern in a one-line file that definitely contains that pattern.

I must be missing something — this seems very unlikely to be a legitimate bug — but I can't figure out what.

If this is a bug, what are the steps to reproduce the behavior?

echo 153.230000 >| test.txt
rg '\d\d\d00' test.txt. This successfully finds a match of 23000.
rg '\d\d\d000' test.txt. This fails to find any match, when it should match 230000

Note that grep '\d\d\d000' test.txt correctly matches 230000. (grep --version grep (BSD grep) 2.5.1-FreeBSD)

If this is a bug, what is the actual behavior?

$ echo 153.230000 >| test.txt
$ rg --debug '\d\d\d000' test.txt
DEBUG|grep_regex::literal|grep-regex/src/literal.rs:110: required literal found: "000"
DEBUG|globset|globset/src/lib.rs:429: built glob set; 0 literals, 0 basenames, 8 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|globset/src/lib.rs:424: glob converted to regex: Glob { glob: "**/.*.s[a-w][a-z]", re: "(?-u)^(?:/?|.*/)\\..*\\.s[a-w][a-z]$", opts: GlobOptions { case_insensitive: false, literal_separator: false, backslash_escape: true }, tokens: Tokens([RecursivePrefix, Literal('.'), ZeroOrMore, Literal('.'), Literal('s'), Class { negated: false, ranges: [('a', 'w')] }, Class { negated: false, ranges: [('a', 'z')] }]) }
DEBUG|globset|globset/src/lib.rs:429: built glob set; 0 literals, 3 basenames, 0 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 1 regexes
$

If this is a bug, what is the expected behavior?

rg '\d\d\d000' test.txt should identify the single match in the file, as grep does. Specifically:

$ rg '\d\d\d000' test.txt
1:153.230000

Other

Note that changing the corpus in seemingly irrelevant ways can cause the bug to change or disappear. For example, the \d\d\d000 pattern matches if three 0 characters are prepended to the contents of the file (that is, the file contains 000153.230000).

The text was updated successfully, but these errors were encountered:

This commit fixes a bug where the reverse suffix literal optimization wasn't quite right. It was too eagerly skipping past parts of the input without verifying that there was no match. We fix this by being a bit more careful with what we're searching by keeping track of the starting position of the last literal matched. Subsequent literal searches then start immediately after the last one. This is necessary in particular when the suffix literal can have overlapping matches. e.g., searching `000` in `0000` can match at either positions 0 or 1, but searching `abc` in `abcd` can only match as position 0. This was initially reported as a bug against ripgrep: BurntSushi/ripgrep#1203

BurntSushi · 2019-02-27T22:46:06Z

Thanks so much for reporting this! It is indeed a bug in the regex engine. You can this and this for the specific changes to the regex engine to fix this. The short story is that you were tripping in a reverse suffix literal optimization that wasn't quite correct. The reason why it seemed sensitive to different regexes and inputs is because it is! :-) The reverse suffix literal optimization only runs in very specific circumstances related to the size and structure of the regex, and this particular bug is only tripped when a suffix literal (such as 000) can result in potentially overlapping matches. In this case, both your pattern and your haystack did this.

This should now be fixed on master, since I've bumped ripgrep's regex dependency to 1.1.2.

ravron · 2019-02-27T23:18:19Z

No kidding. Thanks for the explanation and the fixes!

BurntSushi added the bug A bug. label Feb 27, 2019

BurntSushi closed this as completed in f19b84f Feb 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ripgrep fails to match pattern including digit character class #1203

ripgrep fails to match pattern including digit character class #1203

ravron commented Feb 27, 2019 •

edited

Loading

BurntSushi commented Feb 27, 2019

ravron commented Feb 27, 2019

ripgrep fails to match pattern including digit character class #1203

ripgrep fails to match pattern including digit character class #1203

Comments

ravron commented Feb 27, 2019 • edited Loading

What version of ripgrep are you using?

How did you install ripgrep?

What operating system are you using ripgrep on?

Describe your question, feature request, or bug.

If this is a bug, what are the steps to reproduce the behavior?

If this is a bug, what is the actual behavior?

If this is a bug, what is the expected behavior?

Other

BurntSushi commented Feb 27, 2019

ravron commented Feb 27, 2019

ravron commented Feb 27, 2019 •

edited

Loading