Filtering on the presence or absence of captures #606

Closed
chocolateboy opened this issue Sep 17, 2017 · 5 comments
Labels
enhancement: An enhancement to the functionality of the software.
help wanted: Others are encouraged to work on this issue.
icebox: A feature that is recognized as possibly desirable, but is unlikely to be implemented any time soon.

Comments

@chocolateboy
Contributor

chocolateboy commented Sep 17, 2017

💡 [This trick] relies on your ability to inspect Group 1 captures (at least in the generic flavor), so it will not work in a non-programming environment, such as a text editor's search-and-replace function or a grep command -- The Best Regex Trick

TL;DR

Select all lines which match \bTarzan\b but not "Tarzan":

$ rg -w '"Tarzan"|(Tarzan)' --defined '$1'

AKA

$ rg -w '"Tarzan"|(Tarzan)' -d '$1'

Suppose I want to select all lines which contain the unquoted word Tarzan i.e. \bTarzan\b but not "Tarzan" e.g. the first 4 lines of:

test.txt

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
This line doesn't mention him
He's moved to Tarzania
He's no "Tarzan"!

It can be done with a pipeline e.g.:

$ rg -w 'Tarzan' test.txt | rg -v '"Tarzan"'

But that pipeline rejects lines which contain both the unquoted and the quoted form (e.g. Tarzan vs "Tarzan"), which is not what we want in this case. The same would be true if ripgrep added e.g. an -E (--no-regexp) option to complement -e/--regexp:

$ rg -we 'Tarzan' -E '"Tarzan"' test.txt
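
To make the limitation concrete, here is a rough JavaScript emulation of the two-pass pipeline over the lines of test.txt. It is only a sketch: the `\b` anchors approximate ripgrep's -w, and the `lines` array and the `twoPass` name are illustrative, not part of any tool.

```javascript
// The lines of test.txt from above.
const lines = [
    '"Tarzan and Jane"',
    '"Jane and Tarzan"',
    'Me Tarzan, you Jane',
    'Tarzan vs "Tarzan"',
    "This line doesn't mention him",
    "He's moved to Tarzania",
    'He\'s no "Tarzan"!',
]

// Roughly emulates: rg -w 'Tarzan' test.txt | rg -v '"Tarzan"'
const twoPass = lines
    .filter(line => /\bTarzan\b/.test(line)) // first pass: keep lines with the word Tarzan
    .filter(line => !/"Tarzan"/.test(line))  // second pass: drop every line containing "Tarzan"

console.log(twoPass)
```

Only the first three lines survive: Tarzan vs "Tarzan" is dropped by the second pass even though it also contains the unquoted word.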

It can be done in one pass with PCRE-flavored greps such as GNU grep and ack, with varying degrees of difficulty/unreadability, by using negative lookahead/look-behind assertions e.g.:

$ grep -P '^(?:(?!"Tarzan"|Tarzan\w+)(Tarzan|.))+$' test.txt

That's already pretty gnarly for a single exclusion, and quickly becomes impractical/incomprehensible for multiple exclusions. It also matches lines which don't contain Tarzan and, again, excludes lines which contain both patterns.

In programming languages, there's a common pattern for performing exclusions in a simple, readable way without multiple passes:

  1. match and discard the exclusions
  2. match and capture the inclusion
  3. test for its existence

e.g.:

JavaScript

[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].filter(it => {
    const m = it.match(/"Tarzan"|\b(Tarzan)\b/)
    return m && m[1]
}) // [ 'Tarzan', 'Tarzan vs "Tarzan"' ]

ES.next[1]

[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].filter(it => {
    return it.match(/"Tarzan"|\b(Tarzan)\b/)?.[1]
}) // [ 'Tarzan', 'Tarzan vs "Tarzan"' ]

Ruby

[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].select { |it|
    it[/"Tarzan"|\b(Tarzan)\b/, 1]
} # => [ 'Tarzan', 'Tarzan vs "Tarzan"' ]
$ ruby -ne 'print if $_[/"Tarzan"|\b(Tarzan)\b/, 1]' test.txt

etc.

This isn't available in any greps I'm aware of, but since the machinery is already there to capture and reference subexpressions by index and name, it seems like a small step to use them in predicates to reproduce the flexibility and simplicity of this pattern on the command line e.g.:

$ rg -w '"Tarzan"|(Tarzan)' -d '$1' test.txt

output

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"

Notes

1) I assume that the predicate can be inverted e.g.:

$ rg --not-defined '$1'

AKA

$ rg -D '$1'

There aren't many single-letter options left. The last remaining pairs are -d/-D, -y/-Y and -z/-Z. The latter are commonly used to denote null/zero values, so they could be used instead, with the meaning of -d and -D inverted e.g.:

$ rg -z '$1' # AKA rg --not-defined '$1'
$ rg -Z '$1' # AKA rg --defined '$1'

2) I assume that indices increment across multiple patterns, and that multiple -d and -D options can be combined e.g.:

$ rg -e 'Foo|(Bar)' -e '(Baz|(Quux))' -d '$1' -D '$3'
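
One way to picture the numbering in this example is a sketch in JavaScript rather than anything ripgrep does today. It assumes the -e patterns are joined into a single alternation, so that groups are numbered left to right across patterns, and that a capture counts as defined if it participates in any match on the line. Both are assumptions about the proposed semantics, and `definedGroups` and `keep` are hypothetical helper names.

```javascript
// Assumption: multiple -e patterns are joined into one alternation,
// so 'Foo|(Bar)' owns group 1 and '(Baz|(Quux))' owns groups 2 and 3.
const patterns = ['Foo|(Bar)', '(Baz|(Quux))']
const combined = new RegExp(patterns.map(p => `(?:${p})`).join('|'), 'g')

// Collect the index of every group that participated in at least one match on the line.
function definedGroups(line) {
    const defined = new Set()
    for (const m of line.matchAll(combined)) {
        m.forEach((value, index) => {
            if (index > 0 && value !== undefined) defined.add(index)
        })
    }
    return defined
}

// Emulates the proposed predicates: -d '$1' -D '$3'
function keep(line) {
    const defined = definedGroups(line)
    return defined.has(1) && !defined.has(3)
}

console.log(['Bar', 'Bar Baz', 'Bar Quux', 'Foo', 'Quux'].filter(keep))
// [ 'Bar', 'Bar Baz' ]
```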

3) I also assume that numbered and named captures can be mixed e.g.:

$ rg -e 'Foo|(Bar)' -e 'Baz|(?P<name>Quux)' -d '$1' -D '$name'

4) The full version of the matching command would currently be:

rg '^.*?(?:"Tarzan"|\b(Tarzan)\b).*$' -d '$1' test.txt

Hopefully some of that boilerplate can be removed e.g. via #389 or #593.

5) For clarity, "Tarzan" vs Tarzan is omitted from the examples. Handling it only slightly complicates the regex:

$ rg '^(?:"Tarzan"|\b(Tarzan)\b|.)*$' -d '$1' test.txt
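
For comparison, the single-match JavaScript filter shown earlier stops at the first match, so it would reject "Tarzan" vs Tarzan. A minimal sketch that checks every match on the line instead (the `hasUnquotedTarzan` name and the `lines` array are illustrative):

```javascript
// Same alternation as before, with the g flag so every match on the line is visited.
const pattern = /"Tarzan"|\b(Tarzan)\b/g

// A line qualifies if any match on it filled the capture group.
const hasUnquotedTarzan = line =>
    [...line.matchAll(pattern)].some(m => m[1] !== undefined)

const lines = [
    '"Tarzan and Jane"',
    '"Jane and Tarzan"',
    'Me Tarzan, you Jane',
    'Tarzan vs "Tarzan"',
    '"Tarzan" vs Tarzan',
    "This line doesn't mention him",
    "He's moved to Tarzania",
    'He\'s no "Tarzan"!',
]

console.log(lines.filter(hasUnquotedTarzan))
```

This keeps the first five lines, including "Tarzan" vs Tarzan, and rejects the last three.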
@BurntSushi
Owner

Thanks for this very thorough write up!

I kind of feel like the semantics of this are too complex, which will probably lead to a feature that almost nobody uses. By that, I don't mean that the flags --not-defined and --defined are themselves complex, but using them effectively---as you've demonstrated here---requires some ingenuity in crafting the regex.

With that said, I'd be willing to adopt a feature like this because I do agree that it could be useful, but I'd have to strongly insist on the following:

  1. It should not begin life with short flags. I use short flags when the flags are common, or when there is a precedent for their existence in other tools. For a feature like this, which is neither common nor familiar, I would like to hold off on adding short flags. If I'm wrong and it becomes popular, then we can revisit it.
  2. The maintenance burden of the feature needs to be low. That means adding the feature shouldn't require any significant complications and it should be reasonably well tested.
  3. Since the use case motivating the existence of these flags is somewhat complicated, I would like the documentation to be clear. It should be concise, but contain an example usage. (Perhaps a condensed version of the example in this ticket.)

BurntSushi added the enhancement and help wanted labels on Sep 24, 2017
BurntSushi added the icebox label on Oct 21, 2017
@BatmanAoD

Now that PCRE is (optionally) supported, can either of you think of a use case for this that isn't handled by lookahead and lookbehind? I think this would be strictly more powerful than negative lookbehind, since lookbehind can't contain variable-length patterns, but that's the only advantage I can see. (Granted, that's an advantage I think I would occasionally find useful.)

@BurntSushi
Owner

I think it could be possible to define a simpler UX than needing to resort to look-around.

With that said, it's a good point and I was never a big fan of adding this feature anyway. So I'm going to close this.

@chocolateboy
Contributor Author

chocolateboy commented Sep 20, 2020

Lookaround assertions still have the issues mentioned above. For anyone looking for a clean solution to this with the PCRE engine, the backtracking-control verbs [1][2][3] are your friends:

Input

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
"Tarzan" vs Tarzan
This line doesn't mention him
He's moved to Tarzania
He's no "Tarzan"!

Command

$ rg --pcre2 '(?:"Tarzan")(*SKIP)(*FAIL)|\bTarzan\b' test.txt

Output

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
"Tarzan" vs Tarzan

Or, to exclude lines which contain "Tarzan":

Command

$ rg --pcre2 '(?:.*?"Tarzan".*)(*SKIP)(*FAIL)|\bTarzan\b' test.txt
$ rg --pcre2 '(?:.*?"Tarzan".*)(*COMMIT)(*FAIL)|\bTarzan\b' test.txt

Output

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane

Footnotes

  1. https://www.rexegg.com/regex-best-trick.html#pcrevariation

  2. https://www.rexegg.com/backtracking-control-verbs.html

  3. https://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs

@BatmanAoD

@chocolateboy Wow, I had never heard of those before. Thanks for sharing.
