Group references are not taken into account when group is reporting the last match #296

Closed
mrabarnett opened this issue Sep 13, 2018 · 4 comments
Labels
bug Something isn't working minor

Comments

@mrabarnett
Owner

Original report by Anonymous.


m.group() is supposed to yield the last match of a group in a regular expression, while m.captures() yields all of its matches. When a group is also called through a group reference, m.group() returns only the last match made by the named group itself and ignores the matches made through the reference, even though m.captures() includes them:

>>> import regex
>>> m = regex.fullmatch('(?P<x>.)*(?&x)', 'abc')
>>> m.captures('x')
['a', 'b', 'c']      # all matches, both of x and &x
>>> m.group('x')
'b'                  # last match only of matches of x, ignoring &x

Without a reference, just copying the named group, it works as expected:

>>> m = regex.fullmatch('(?P<x>.)(?P<x>.)(?P<x>.)', 'abc')
>>> m.captures('x')
['a', 'b', 'c']
>>> m.group('x')
'c'                  # last match of all three matches

I assume this is not intended behavior but a bug, right?

I used regex 2018.8.29, Python 3.6.5 and opensuse 15.0.

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Is it a bug? Probably, though it's not obvious what it should be.

Would you expect this:

>>> regex.match(r'(.)(?1)', 'ab').groups()
('a',)

?

If you wanted this to work:

>>> regex.fullmatch(r'(?i)(.)(?R)?\1', 'Abba').groups()
('A',)

you'd need group 1 to be saved and restored. (Perl and PCRE do this, which is why this module does too.)

In your first example, group 'x' matches twice, capturing 'a' and 'b', and the subroutine consumes the 'c', so, if anything, m.captures('x') should be ['a', 'b'].

Your 2 examples can't be equivalent without breaking something else.

That's regex for you!

@mrabarnett
Owner Author

Original comment by Lars Schmidt-Thieme (Bitbucket: lst2015).


Currently it looks inconsistent: captures reports both matches,
but groups, which should report the last of all matches,
reports only the last of the matches made by the group itself,
not those made through the reference:

>>> regex.fullmatch(r'(.)(?1)', 'ab').captures(1)
['a', 'b']                                        # current behavior
>>> regex.fullmatch(r'(.)(?1)', 'ab').groups()
('a',)                                            # current behavior

Either way of making it consistent would make sense to me:

a) captures reports only the matches of the group itself,
but not those of references, i.e.,

>>> regex.fullmatch(r'(.)(?1)', 'ab').captures(1)
['a']                                             # desired behavior case a)
>>> regex.fullmatch(r'(.)(?1)', 'ab').groups()
('a',)                                            # current behavior

or
b) groups actually reports the last of all matches,
whether made by the group itself or through a reference, i.e.,

>>> regex.fullmatch(r'(.)(?1)', 'ab').captures(1)
['a', 'b']                                         # current behavior
>>> regex.fullmatch(r'(.)(?1)', 'ab').groups()
('b',)                                             # desired behavior case b)

For your second example, in case a) I would expect the outcome
to be

>>> regex.fullmatch(r'(?i)(.)(?R)?\1', 'Abba').groups()
('A',)

as references are ignored, while in case b) to be

>>> regex.fullmatch(r'(?i)(.)(?R)?\1', 'Abba').groups()
('b',)

as group 1 is matched twice, once directly and once inside the
whole-pattern recursion (?R), so the later match is 'b'.

Does this make sense?

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Consider the final example. If group 1 finishes with the value 'b', then the final \1 would need to match a 'b', but the target string ends with 'a', so that's not going to work.

In order to use recursion to match like this, which is what Perl and PCRE do, the groups have to be saved when a subroutine is called and restored when it returns.

Group 1 captures 'a', calls the subroutine, captures 'b', matches 'b' with the backreference, returns from the subroutine, and then matches 'a' with the backreference. Group 1 has been restored to its previous match.
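
The save-and-restore sequence can be modelled with a small hand-written matcher. This is only a toy sketch in plain Python for the single pattern (?i)(.)(?R)?\1, not a description of how this module is implemented. The names match_at, toy_fullmatch and the state dictionary are invented for illustration, and the backtracking is simplified (the recursion is tried once and then skipped).

```python
def match_at(text, pos, state):
    # Toy matcher for the pattern (?i)(.)(?R)?\1, starting at pos.
    # Returns the end position on success, or None on failure.
    if pos >= len(text):
        return None

    # (.) : capture one character into group 1 and log it.
    state["group1"] = text[pos]
    state["captures"].append(text[pos])
    mark = len(state["captures"])
    saved = state["group1"]          # save group 1 before the call
    after_group = pos + 1

    # (?R)? : try the recursive call first.
    end = match_at(text, after_group, state)
    state["group1"] = saved          # restore group 1 on return
    if end is not None and end < len(text) and text[end].lower() == saved.lower():
        return end + 1               # \1 matched after the recursion

    # Backtrack: drop captures logged by the abandoned recursion,
    # then try \1 directly, skipping the optional recursion.
    del state["captures"][mark:]
    if after_group < len(text) and text[after_group].lower() == saved.lower():
        return after_group + 1
    return None


def toy_fullmatch(text):
    # Hypothetical helper: returns (final group 1, capture log) or None.
    state = {"group1": None, "captures": []}
    if match_at(text, 0, state) != len(text):
        return None
    return state["group1"], state["captures"]


print(toy_fullmatch("Abba"))  # prints ('A', ['A', 'b'])
```

In this model the restored value 'A' is what group 1 holds at the end, while the log still contains the 'b' captured inside the call. Without the restore step, the final \1 would be compared against 'b' and the match on 'Abba' would fail.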

The question is what .captures() should return. I copied that from C#, but that implementation doesn't support recursion.

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


I've come to a conclusion/decision here: it might be unexpected, but it's not a bug.

.groups() returns the last match at the top level; it must, because the groups have to be saved and restored across subroutine calls as described above. This is consistent with other regex implementations.

.captures() returns all of the captures of the groups, irrespective of whether they occurred at the top level or in a subroutine call, because this lets you capture more and makes recursion more useful.
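
A short way to see the two views side by side, assuming the third-party regex module is installed and behaves as reported earlier in this thread. The helper name last_and_all is hypothetical; it only wraps the existing fullmatch, group and captures calls.

```python
import regex  # third-party module discussed in this thread


def last_and_all(pattern, text, group=1):
    # Return (last top-level match, all captures) for one group,
    # or None if the pattern does not match the whole text.
    m = regex.fullmatch(pattern, text)
    if m is None:
        return None
    return m.group(group), m.captures(group)


# Group 1 matches 'a' at the top level; the subroutine call (?1) captures 'b'.
print(last_and_all(r'(.)(?1)', 'ab'))

# Three separate groups sharing one name: every capture is at the top level.
print(last_and_all(r'(?P<x>.)(?P<x>.)(?P<x>.)', 'abc', 'x'))
```

With the outputs quoted above in this thread, the first call gives 'a' with captures ['a', 'b'], and the second gives 'c' with captures ['a', 'b', 'c'].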
