permit empty capture groups? #106
Comments
Today, a regex with an empty capture group is forbidden, so it would be backwards compatible to add this after a 1.0 release.
What are some use cases of empty capture groups? I'm writing some code that tokenizes an input string with regexes, and an empty capture group would have been useful. I needed my token regexes to have the same number of capture groups so that they can all be unpacked uniformly, and this was only possible in one case with an empty capture group. I was able to work around it, however, so it's kind of a weak use case.
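The tokenizer situation can be sketched in Python, whose `re` module accepts empty groups. This is only an illustration: the token names, patterns, and the `tokenize` helper are hypothetical and not taken from the commenter's code. The point is that an empty group `()` pads a pattern so that every token regex exposes the same number of groups.

```python
import re

# Hypothetical token table: every pattern exposes exactly two groups
# so that match results can be unpacked uniformly.
TOKEN_PATTERNS = [
    ("assign", re.compile(r"(\w+)=(\w+)")),  # two real groups
    ("word", re.compile(r"(\w+)()")),        # empty group pads to two
]


def tokenize(text):
    """Return (kind, first, second) tuples for whitespace-separated chunks."""
    tokens = []
    for chunk in text.split():
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.fullmatch(chunk)
            if match:
                first, second = match.groups()  # same shape for every kind
                tokens.append((kind, first, second))
                break
    return tokens
```

Without the empty group, the `word` pattern would yield a one-element tuple and the unpacking line would need a special case.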
For `[^\x00-\xff]`, while it is still treated as a full Unicode character class, it is not empty. For instance `≥` would still be matched. However, when `CharClass::to_byte_class` is called on it (as is done when using `regex::bytes::Regex::new` rather than `regex::Regex::new`), it _is_ now empty, since it excludes all possible bytes. This commit adds a test asserting that `regex::bytes::Regex::new` returns `Err` for this case (in accordance with rust-lang#106) and adds an `is_empty` check to the result of calling `CharClass::to_byte_class`, which allows the test to pass.
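The difference between the Unicode class and the byte class can be illustrated in Python. This is a sketch of the concept only, not of the Rust crate's API: Python's `re` accepts the empty byte class and simply never matches, whereas the commit above describes returning `Err`. The variable names are illustrative.

```python
import re

# As a Unicode (str) pattern, the class still matches code points above U+00FF.
unicode_class = re.compile(r"[^\x00-\xff]")
matches_unicode = unicode_class.search("\u2265") is not None  # U+2265 is "≥"

# As a bytes pattern, the same class excludes every possible byte value,
# so no single byte can ever match it.
byte_class = re.compile(rb"[^\x00-\xff]")
matching_bytes = [b for b in range(256) if byte_class.search(bytes([b]))]
```

After running this, `matches_unicode` is true and `matching_bytes` is empty, which is the sense in which the byte version of the class is "empty".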
This issue has been open for a long time, and people seem to have made do without it, so I'd like to close this issue. Empty capture groups are still forbidden, so this is something that can be revisited if we so choose.
I just ran into this issue / wanted to have this feature for auto-generated regexes. This issue can of course be easily worked around by using non-empty capture groups which match zero-length strings (eg.
I actually also ran into this issue a second time. I wrote a program which colorizes stdin according to user-supplied regexes, and there was a crash if the user supplied an empty regex. Of course, I could have checked for this; however, it seems like an unnecessary restriction.
@casey Why does it feel like an unnecessary restriction? You need to check for errors anyway.
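The "check for errors anyway" point can be sketched as follows, assuming a tool that accepts patterns from the user. The helper name `compile_user_pattern` is hypothetical, and Python's `re` is used only for illustration; it accepts an empty pattern, so the failing input here is an unbalanced parenthesis.

```python
import re


def compile_user_pattern(source):
    """Return (compiled, None) on success or (None, message) on failure."""
    try:
        return re.compile(source), None
    except re.error as exc:
        # Report the problem to the caller instead of crashing.
        return None, str(exc)


ok_pattern, ok_error = compile_user_pattern(r"\d+")
bad_pattern, bad_error = compile_user_pattern("(")
```

A colorizer built this way reports `bad_error` to the user rather than aborting, regardless of which particular patterns the engine rejects.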
Certainly, but it's a restriction that I don't think that users will necessarily expect. If their knowledge of regexes is from Python, Ruby, or JavaScript, then this is an incompatibility. Admittedly, there are other differences, but it seems like relaxing this would be nice.
There are many many incompatibilities and this still seems like a very strange thing to explicitly support.
Also, anecdotally, I ran into this another time when programmatically generating a regex. I was doing something like:

    s = ''
    for name, pattern in parts:
        s += '(?P<{}>{})'.format(name, pattern)

One of the patterns was empty, and thus inadvertently created an empty capture group. Of course, I was able to fix this, but it introduced more special cases and checks in my code. Also, it was a case where it wasn't easy to visually inspect the pattern for empty groups, since it wasn't a simple string literal.

I think that the point isn't really that this is vital to support, or that it's useful (it's definitely a very strange thing!) but that people will run into this problem, so why not just make it do the right thing.

I've run into it on three separate occasions. Of course, this is anecdotal evidence, but I suspect that others will also run into it.
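A self-contained sketch of that generation scenario, using Python's named-group syntax `(?P<name>...)`. The `parts` data and the helper names `build_pattern` and `empty_parts` are hypothetical. It shows the empty group appearing in the generated pattern, and the kind of extra check a generator needs when the target engine rejects such groups.

```python
import re


def build_pattern(parts):
    """Concatenate (name, pattern) pairs into one regex of named groups."""
    return "".join("(?P<{}>{})".format(name, pattern) for name, pattern in parts)


def empty_parts(parts):
    """Return the names whose pattern is empty, i.e. would yield an empty group."""
    return [name for name, pattern in parts if pattern == ""]


# Hypothetical input in which one pattern happens to be empty.
parts = [("key", r"[a-z]+"), ("sep", ""), ("value", r"\d+")]
generated = build_pattern(parts)  # contains the empty group (?P<sep>)
match = re.fullmatch(generated, "abc123")
```

Python compiles the generated pattern and reports `match.group("sep")` as the empty string. For an engine that forbids empty groups, `empty_parts(parts)` is the sort of pre-check that has to be added.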
FWIW, I do think autogenerating regexes is the most compelling use case for this. (In reply to Casey Rodarmor's comment of Feb 10, 2017.)
It seems like both Ruby and Python support empty capture groups. Their use is a little niche, but it probably isn't hard to permit them.
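For the Python half of that claim, a minimal check (Ruby is not shown here; the variable names are illustrative):

```python
import re

match = re.search(r"a()b", "xaby")
empty_text = match.group(1)  # the empty group captures the empty string
empty_span = match.span(1)   # zero-width span between the "a" and the "b"
```

The group participates in the match but has zero width, so it reports an empty string and a span whose start and end are equal.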