permit empty capture groups? #106

Closed
BurntSushi opened this issue Aug 1, 2015 · 10 comments
@BurntSushi
Member

It seems like both Ruby and Python support empty capture groups. Their use is a little niche, but it probably isn't hard to permit them.

>>> import re
>>> m = re.match('()', 'hi')
>>> m.group(0)
''
>>> m.group(1)
''
@BurntSushi
Member Author

Today, a regex with an empty capture group is forbidden, so it would be backwards compatible to add this after a 1.0 release.

@casey

casey commented Oct 15, 2016

What are some use cases of empty capture groups?

I'm writing some code that tokenizes an input string with regexes, and an empty capture group would have been useful. I needed my token regexes to have the same number of capture groups so that they can all be unpacked uniformly, and this was only possible in one case with an empty capture group.

I was able to work around it however, so it's kind of a weak use case.
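The tokenizer use case can be sketched in Python, whose `re` module accepts empty groups. This is only an illustration under assumptions: the token table, the group layout (prefix, body), and the `tokenize` helper are hypothetical and are not taken from casey's code.

```python
import re

# Hypothetical token table: every pattern exposes exactly two groups
# (prefix, body) so that callers can unpack every match the same way.
TOKEN_PATTERNS = [
    ("string", re.compile(r'(r?)"([^"]*)"')),    # optional prefix, quoted body
    ("number", re.compile(r"([+-]?)(\d+)")),      # optional sign, digits
    ("name", re.compile(r"()([A-Za-z_]\w*)")),    # no prefix: the empty group pads the count
]


def tokenize(text):
    """Return a list of (kind, prefix, body) tuples for `text`."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m:
                prefix, body = m.groups()  # identical unpacking for every token kind
                tokens.append((kind, prefix, body))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
    return tokens
```

Without the empty group, the `name` pattern would expose one group instead of two, and the unpacking step would need a special case for it.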

scooter-dangle added a commit to scooter-dangle/regex that referenced this issue Dec 8, 2016
For `[^\x00-\xff]`, while it is still treated as a full Unicode
character class, it is not empty. For instance `≥` would still be
matched.

However, when `CharClass::to_byte_class` is called on it (as is done
when using `regex::bytes::Regex::new` rather than `regex::Regex::new`),
it _is_ now empty, since it excludes all possible bytes.

This commit adds a test asserting that `regex::bytes::Regex::new`
returns `Err` for this case (in accordance with
rust-lang#106) and adds an
`is_empty` check to the result of calling `CharClass::to_byte_class`,
which allows the test to pass.
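The difference between the Unicode and byte interpretations of `[^\x00-\xff]` can be illustrated with Python's `re` module, used here only as an analogy. Python does not reject the byte-mode class the way the commit above makes `regex::bytes::Regex::new` do; it compiles it into a pattern that can never match. The variable names are illustrative.

```python
import re

CLASS = r"[^\x00-\xff]"

# str pattern: the class is interpreted over code points, so anything
# above U+00FF (such as "≥", U+2265) is still matched.
text_re = re.compile(CLASS)

# bytes pattern: the class is interpreted over byte values 0..255,
# all of which are excluded, so the class is effectively empty.
bytes_re = re.compile(CLASS.encode("ascii"))

text_hit = text_re.search("a≥b")
bytes_hit = bytes_re.search(bytes(range(256)))
```

Here `text_hit` is a match on `≥`, while `bytes_hit` is `None` even though the haystack contains every possible byte value.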
@BurntSushi
Member Author

This issue has been open for a long time, and people seem to have made do without it, so I'd like to close this issue.

Empty capture groups are still forbidden, so this is something that can be revisited if we so choose.

@TimNN

TimNN commented Feb 10, 2017

I just ran into this issue / wanted to have this feature for auto generated regexes.

This issue can of course be easily worked around by using non-empty capture groups that match zero-length strings (e.g. (\\b|\\B) or (a{0})), so this feels like a very arbitrary limitation.
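As a quick check of that equivalence, here is a small sketch using Python's `re` module, which accepts all three spellings. It says nothing about how any particular version of the Rust crate treats them; the `first_group` helper is hypothetical.

```python
import re

# Three ways to write a capture group that matches the empty string:
# a literally empty group, plus the two workarounds mentioned above.
patterns = [r"()", r"(\b|\B)", r"(a{0})"]


def first_group(pattern, text):
    """Return what group 1 captured at the start of `text`, or None."""
    m = re.match(pattern, text)
    return None if m is None else m.group(1)


results = {p: first_group(p, "hi") for p in patterns}
```

Each entry in `results` is the empty string, so the workarounds capture exactly what `()` captures.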

@casey

casey commented Feb 10, 2017

I actually also ran into this issue a second time. I wrote a program which colorizes stdin according to user-supplied regexes, and there was a crash if the user supplied an empty regex. Of course, I could have checked for this; however, it seems like an unnecessary restriction.

@BurntSushi
Member Author

@casey Why does it feel like an unnecessary restriction? You need to check for errors anyway.

@casey

casey commented Feb 10, 2017

Certainly, but it's a restriction that I don't think that users will necessarily expect.

If their knowledge of regexes is from Python, Ruby, or JavaScript, then this is an incompatibility.

Admittedly, there are other differences, but it seems like relaxing this would be nice.

@BurntSushi
Member Author

There are many, many incompatibilities, and this still seems like a very strange thing to explicitly support.

@casey

casey commented Feb 10, 2017

Also, anecdotally, I ran into this another time when programmatically generating a regex.

I was doing something like:

s = ''
for name, pattern in parts:
  s += '(?P<{}>{})'.format(name, pattern)

One of the patterns was empty, and thus inadvertently created an empty capture group. Of course, I was able to fix this, but it introduced more special cases and checks in my code. Also, it was a case where it wasn't easy to visually inspect the pattern for empty groups, since it wasn't a simple string literal.

I think that the point isn't really that this is vital to support, or that it's useful (it's definitely a very strange thing!), but that people will run into this problem, so why not just make it do the right thing?

I've run into it on three separate occasions. Of course, this is anecdotal evidence, but I suspect that others will also run into it.
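A self-contained sketch of that generation pattern, run against Python's `re` module (which accepts the resulting empty named group), shows how an empty pattern slips in unnoticed. The `build_pattern` helper and the part names are hypothetical, and the `(?P<name>...)` named-group syntax is assumed.

```python
import re


def build_pattern(parts):
    """Concatenate (name, pattern) pairs into one regex of named groups."""
    return "".join("(?P<{}>{})".format(name, pattern) for name, pattern in parts)


# The middle part has an empty pattern, so the generated regex
# contains the empty named group (?P<pad>).
parts = [("sign", "[+-]?"), ("pad", ""), ("digits", "[0-9]+")]
combined = build_pattern(parts)
m = re.match(combined, "-42")
```

Here `m.groupdict()` contains an empty string for `pad`, and the generator needs no special case. An engine that rejects empty groups would instead force a check for empty patterns before formatting.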

@BurntSushi
Member Author

BurntSushi commented Feb 10, 2017 via email

peterhj pushed a commit to peterhj/regex-syntax that referenced this issue Feb 23, 2020