permit empty capture groups? #106

BurntSushi · 2015-08-01T21:10:44Z

It seems like both Ruby and Python support empty capture groups. Their use is a little niche, but it probably isn't hard to permit them.

>>> import re
>>> m = re.match('()', 'hi')
>>> m.group(0)
''
>>> m.group(1)
''

BurntSushi · 2016-01-31T16:06:46Z

Today, a regex with an empty capture group is forbidden, so it would be backwards compatible to add this after a 1.0 release.

casey · 2016-10-15T06:04:59Z

What are some use cases of empty capture groups?

I'm writing some code that tokenizes an input string with regexes, and an empty capture group would have been useful. I needed my token regexes to have the same number of capture groups so that they can all be unpacked uniformly, and this was only possible in one case with an empty capture group.

I was able to work around it however, so it's kind of a weak use case.
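A minimal sketch of that uniform-unpacking idea, written in Python because its `re` module accepts empty groups (as shown at the top of this issue). The token kinds, patterns, and the `tokenize_one` helper are hypothetical and not taken from the tokenizer described above:

```python
import re

# Hypothetical token patterns. Every pattern exposes exactly two capture
# groups so that callers can always unpack a (key, value) pair.
TOKEN_PATTERNS = {
    "assign": r"(\w+)=(\w+)",
    # A bare flag has no value; the empty second group keeps the shape uniform.
    "flag": r"(\w+)()",
}


def tokenize_one(kind, text):
    """Match `text` against the pattern for `kind` and unpack both groups."""
    m = re.fullmatch(TOKEN_PATTERNS[kind], text)
    if m is None:
        return None
    key, value = m.groups()
    return kind, key, value
```

With an engine that rejects `()`, the `flag` pattern would need a different placeholder or a special case in the unpacking code.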

For `[^\x00-\xff]`, while it is still treated as a full Unicode character class, it is not empty. For instance `≥` would still be matched. However, when `CharClass::to_byte_class` is called on it (as is done when using `regex::bytes::Regex::new` rather than `regex::Regex::new`), it _is_ now empty, since it excludes all possible bytes. This commit adds a test asserting that `regex::bytes::Regex::new` returns `Err` for this case (in accordance with rust-lang#106) and adds an `is_empty` check to the result of calling `CharClass::to_byte_class`, which allows the test to pass.

BurntSushi · 2016-12-29T00:04:30Z

This issue has been open for a long time, and people seem to have made do without it, so I'd like to close this issue.

Empty capture groups are still forbidden, so this is something that can be revisited if we so choose.

TimNN · 2017-02-10T22:39:55Z

I just ran into this issue / wanted to have this feature for auto-generated regexes.

This issue can of course be easily worked around by using non-empty capture groups which match zero-length strings (e.g. `(\\b|\\B)` or `(a{0})`), so this feels like a very arbitrary limitation.
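As a rough illustration, the two workaround patterns can be compared with a true empty group using Python's `re` module. This only shows that all three capture a zero-length string; it says nothing about how the Rust crate parses them. The helper name `first_group` is made up for this sketch, and the doubled backslashes in the comment above are assumed to be string-literal escaping for `\b` and `\B`:

```python
import re


def first_group(pattern, text):
    """Return what capture group 1 matched at the start of `text`."""
    m = re.match(pattern, text)
    return None if m is None else m.group(1)


# "()" is the empty group; the other two are the workarounds quoted above.
PATTERNS = (r"()", r"(\b|\B)", r"(a{0})")

# Each pattern matches at position 0 and captures an empty string.
results = [first_group(p, "hi") for p in PATTERNS]
```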

casey · 2017-02-10T22:45:11Z

I actually also ran into this issue a second time. I wrote a program which colorizes stdin according to user-supplied regexes, and there was a crash if the user supplied an empty regex. Of course, I could have checked for this; however, it seems like an unnecessary restriction.

BurntSushi · 2017-02-10T22:55:48Z

@casey Why does it feel like an unnecessary restriction? You need to check for errors anyway.

casey · 2017-02-10T23:00:19Z

Certainly, but it's a restriction that I don't think that users will necessarily expect.

If their knowledge of regexes is from Python, Ruby, or JavaScript, then this is an incompatibility.

Admittedly, there are other differences, but it seems like relaxing this would be nice.

BurntSushi · 2017-02-10T23:03:44Z

There are many many incompatibilities and this still seems like a very strange thing to explicitly support.

casey · 2017-02-10T23:08:42Z

Also, anecdotally, I ran into this another time when programmatically generating a regex.

I was doing something like:

s = ''
for name, pattern in parts:
  s += '(?P<{}>{})'.format(name, pattern)

One of the patterns was empty, and thus inadvertently created an empty capture group. Of course, I was able to fix this, but it introduced more special cases and checks in my code. Also, it was a case where it wasn't easy to visually inspect the pattern for empty groups, since it wasn't a simple string literal.

I think that the point isn't really that this is vital to support, or that it's useful (it's definitely a very strange thing!) but that people will run into this problem, so why not just make it do the right thing?

I've run into it on three separate occasions. Of course, this is anecdotal evidence, but I suspect that others will also run into it.
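One way the extra special case might look, as a sketch only: substitute a zero-length placeholder whenever a sub-pattern is empty. The `build_pattern` helper and the choice of `a{0}` as the placeholder (borrowed from the workaround mentioned earlier in this thread) are assumptions, and the example is checked against Python's `re` rather than the Rust crate:

```python
import re

# Assumed placeholder: a non-empty expression that matches the empty string.
EMPTY_PLACEHOLDER = "a{0}"


def build_pattern(parts):
    """Join (name, pattern) pairs into one regex of named capture groups.

    Empty sub-patterns are replaced with EMPTY_PLACEHOLDER so that no
    group in the generated regex is literally empty.
    """
    pieces = []
    for name, pattern in parts:
        body = pattern if pattern else EMPTY_PLACEHOLDER
        pieces.append("(?P<{}>{})".format(name, body))
    return "".join(pieces)


combined = build_pattern([("word", r"\w+"), ("gap", "")])
match = re.match(combined, "hi")
```

The group count and names stay the same whether or not a sub-pattern is empty, so downstream code does not need to know which groups were padded.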

BurntSushi · 2017-02-10T23:29:48Z

FWIW, I do think autogenerating regexes is the most compelling use case for this.


BurntSushi added the question label Aug 1, 2015

scooter-dangle mentioned this issue Dec 7, 2016

Negation of full byte-range class causes panic #303

Closed

scooter-dangle mentioned this issue Dec 8, 2016

Verify character class still non-empty after converting to byte class #304

Merged

BurntSushi closed this as completed Dec 29, 2016
