Allow duplicate names of groups #87

Closed
mrabarnett opened this issue Jan 23, 2013 · 8 comments
Labels
enhancement New feature or request trivial

Comments

@mrabarnett
Owner

Original report by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).


Hi,

Currently, duplicate group names are not allowed. For example, this code raises an exception because group "a" is defined twice:

>>> regex.match(r'(?<a>here)? or (?<a>here)?', "here or here")
error: duplicate group

I suspect this design is a legacy of the standard 're' module, which doesn't capture multiple values per group, so it was natural to reject duplicate group names too. But the 'regex' module can capture repeated values, so it would be natural to also accept duplicate group names and merge the values extracted from all same-named groups into one list.

This enhancement would allow parsing loose formats, where a given value may appear in any of several places in the text and the regex needs a group at each of them. Usually we expect only one place to match (the groups are optional, as in the regex above), but we can't say in advance which one, and for convenience we'd like to use the same name for all of them, to avoid manually merging several groups afterwards. In other use cases, more than one group may match, and we want to extract all the matched values as a single list.

I think this enhancement would fit very well with the concept of repeated captures that's already present in 'regex'.

Do any other regex implementations have something like this?

I don't know.

@mrabarnett
Owner Author

Original comment by Anonymous.


Wouldn't the formats be alternatives, e.g. "(?<found>this)|(?<found>that)"?

The possibility is already covered; the groups are mutually exclusive.

@mrabarnett
Owner Author

Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).


Alternation works well for different value patterns, but not for different locations. Example: web scraping a complex page where the same value (say, the price of a product) can appear in 3 different places, depending on the type of product:

"(?<price1>\d+)? some-stuff (?<price2>\d+)? other-stuff (?<price3>\d+)?"

Because these are different *locations* in the text, not different patterns, and the static parts ("some-stuff") must be present in the middle to position the groups correctly within the whole text, alternation can't be used here (or would be very difficult, with the static parts copy-pasted several times). Besides, we want to extract other properties too, not only the price, and want to use a single regex for all of this, without making 3 variants of the entire regex and without manually labelling the fields 'price1', 'price2', 'price3' and then merging them.
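
For reference, the manual labelling and merging described above might look roughly like this with the standard 're' module. This is only a sketch: the helper name `extract_prices` and the sample inputs are made up for illustration, and the pattern is the one from this comment rewritten with `(?P<name>...)` syntax.

```python
import re

# The pattern from the comment above, with a distinct name per location.
PATTERN = re.compile(
    r"(?P<price1>\d+)? some-stuff (?P<price2>\d+)? other-stuff (?P<price3>\d+)?"
)

def extract_prices(text):
    """Return every price captured by price1..price3, in positional order."""
    m = PATTERN.match(text)
    if m is None:
        return []
    # Manual merge: collect whichever of the three groups took part in the match.
    names = ("price1", "price2", "price3")
    return [m.group(name) for name in names if m.group(name) is not None]

print(extract_prices(" some-stuff 42 other-stuff "))     # only the middle location
print(extract_prices("10 some-stuff  other-stuff 30"))   # first and last locations
```

With duplicate group names, the three names and the merge step would collapse into a single group name.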

@mrabarnett
Owner Author

Original comment by Anonymous.


The regex module tries to be compatible with the re module, whose documentation says: """Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression""".

The regex module relaxes that a little by allowing them multiple times if they're mutually exclusive, but I'm not sure whether they should be allowed in the version 0 ('compatible') behaviour.

Perhaps only in version 1 ('enhanced') behaviour?

I'll need to think about it and see whether it would have any adverse side-effects.

For the record, Perl allows it.
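
The 're' restriction quoted above can be checked with a short sketch. The helper name `accepts_pattern` is hypothetical; the second pattern is the one from the original report, written in the `(?P<name>...)` syntax that 're' understands.

```python
import re

def accepts_pattern(pattern):
    """Return True if the standard 're' module compiles the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(accepts_pattern(r"(?P<a>here)? or (?P<b>here)?"))  # distinct group names
print(accepts_pattern(r"(?P<a>here)? or (?P<a>here)?"))  # group name 'a' reused
```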

@mrabarnett
Owner Author

Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).


OK, thanks, for my needs V1 would be fine.

In case you consider adding it in V0, note that although this change is not strictly compatible with 're', it does NOT break any existing code, because it only relaxes the constraints on valid patterns: any pattern valid in 're' would still be valid in 'regex' and behave *exactly* the same, with no change in results; only some additional patterns would now be accepted.

@mrabarnett
Owner Author

Original comment by Anonymous.


It's true that it wouldn't break any existing code, so there'd be no harm in having it work in V0 too.

@mrabarnett
Owner Author

Original comment by Anonymous.


Duplicate group names are allowed in regex 0.1.20130124.
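
As a rough illustration of what this release allows, here is a minimal sketch using the pattern from the original report. It assumes a version of the third-party 'regex' package with this change is installed; `captures_for` is a hypothetical helper, and `captures()` is the module's existing method for listing every capture recorded for a group.

```python
try:
    import regex  # third-party package (pip install regex), not the stdlib 're'
except ImportError:
    regex = None

# Pattern from the original report: group "a" is defined twice.
PATTERN = r"(?<a>here)? or (?<a>here)?"

def captures_for(text, name="a"):
    """Return all captures recorded under one group name, or None if no match."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    m = regex.match(PATTERN, text)
    return None if m is None else m.captures(name)

if regex is not None:
    print(captures_for("here or here"))  # both same-named groups take part
    print(captures_for(" or here"))      # only the second group takes part
```

Values from all same-named groups should come back merged into one list, which is the behaviour requested at the top of this issue.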

@mrabarnett
Owner Author

Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).


There is a minor issue when a group is nested inside another group of the same name: the inner group overrides the value matched by the outer group, so the result contains two copies of the inner value. For example:

>>> match = regex.match(r'(?<x>a(?<x>b))', "ab")
>>> match.capturesdict()
{'x': ['b', 'b']}

@mrabarnett
Owner Author

Original comment by Anonymous.


Fixed in regex 0.1.20130126.
