Allow duplicate names of groups #87

mrabarnett · 2013-01-23T10:41:53Z

Original report by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).

Hi,

Currently, duplicate names are not allowed, for example this code raises an exception because group "a" is defined twice:

>>> regex.match(r'(?<a>here)? or (?<a>here)?', "here or here")
error: duplicate group

I suspect this design is a legacy of the standard 're' module, which didn't allow multiple values per group, so it was natural to reject duplicate group names too. But in the 'regex' module, which can capture repeated values, it would be natural to also accept duplicate group names and merge the values extracted from all same-named groups into one list.

This enhancement would allow parsing loose formats, where a given value may appear in any of several different places in the text and the regex must have a group in each of these places. Usually, we would expect only one place to match (the groups are optional, as in the regex above), but we can't say in advance which one and, for convenience, we'd like to use the same name for all these places, to avoid manually merging several groups afterwards. In other use cases, more than one group may match and we want to extract all the matched values as a single list.

I think this enhancement would fit very well with the concept of repeated captures that's already present in 'regex'.

Do any other regex implementations have something like this?

I don't know.

mrabarnett · 2013-01-23T11:02:40Z

Original comment by Anonymous.

Wouldn't the formats be alternatives, e.g. "(?<found>this)|(?<found>that)"?

The possibility is already covered; the groups are mutually exclusive.

mrabarnett · 2013-01-23T16:11:14Z

Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).

Alternation works well for different value patterns, but not for different locations. Example: web scraping a complex page where the same value (say, the price of a product) can appear in 3 different places, depending on the type of product:

"(?<price1>\d+)? some-stuff (?<price2>\d+)? other-stuff (?<price3>\d+)?"

Because these are different *locations* in the text, not different patterns, and the static parts ("some-stuff") must be present in the middle to correctly position the groups in the entire text, alternation can't be used here (or would be very difficult, with the static parts copy-pasted several times). Besides, we want to extract other properties too, not only the price, and want to use a single regex for all of this, without making 3 variants of the entire regex and without manually labelling the fields 'price1', 'price2', 'price3' and then merging them.
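For illustration, the manual merging described above can be sketched with the standard 're' module as follows. This is only a sketch: the function name `extract_prices` is hypothetical, and the pattern is the one from this comment rewritten with the `(?P<name>...)` syntax that 're' requires.

```python
import re

# The pattern from the comment above, with one uniquely named group per
# location (the standard 're' module needs the (?P<name>...) spelling).
PATTERN = re.compile(
    r"(?P<price1>\d+)? some-stuff (?P<price2>\d+)? other-stuff (?P<price3>\d+)?"
)

def extract_prices(text):
    """Hypothetical helper: return the prices found in any of the three locations."""
    match = PATTERN.match(text)
    if match is None:
        return []
    # Manual merge: read each numbered group in order and skip the ones
    # that did not take part in the match (their value is None).
    values = (match.group(name) for name in ("price1", "price2", "price3"))
    return [value for value in values if value is not None]
```

With duplicate group names, the three labels and the merge step would collapse into a single group name whose captures are read as one list.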

mrabarnett · 2013-01-23T18:36:03Z

Original comment by Anonymous.

The regex module tries to be compatible with the re module, whose documentation says: """Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression""".
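As a quick check of the quoted rule, the following sketch tries the pattern from the original report against the standard 're' module. The helper `compiles_in_re` is a hypothetical name, and the pattern is rewritten with `(?P<name>...)` because 're' does not accept the `(?<name>...)` spelling.

```python
import re

def compiles_in_re(pattern):
    """Hypothetical helper: report whether the standard 're' module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Two distinct names: accepted.
unique_names = compiles_in_re(r"(?P<a>here)? or (?P<b>here)?")

# The same name defined twice: rejected by 're' at compile time.
duplicate_names = compiles_in_re(r"(?P<a>here)? or (?P<a>here)?")
```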

The regex module relaxes that a little by allowing them multiple times if they're mutually exclusive, but I'm not sure whether they should be allowed in the version 0 ('compatible') behaviour.

Perhaps only in version 1 ('enhanced') behaviour?

I'll need to think about it and see whether it would have any adverse side-effects.

For the record, Perl allows it.

mrabarnett · 2013-01-24T03:35:19Z

Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).

OK, thanks, for my needs V1 would be fine.

In case you consider adding it in V0, note that, although this change is not strictly compatible with 're', it does NOT break any existing code, because it only relaxes the constraints on valid patterns: any pattern valid in 're' would still be valid in 'regex' and behave *exactly* the same, with no change in results; only some additional patterns would now be accepted.

mrabarnett · 2013-01-24T06:05:12Z

Original comment by Anonymous.

It's true that it wouldn't break any existing code, so there'd be no harm in having it work in V0 too.

mrabarnett · 2013-01-24T12:31:36Z

Original comment by Anonymous.

Duplicate group names are allowed in regex 0.1.20130124.
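A minimal sketch of how the pattern from the original report could be tried on a release that includes this change. It assumes the third-party 'regex' package is installed (the import is guarded in case it is not), and `captures_for` is a hypothetical helper name.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # the sketch degrades gracefully when 'regex' is missing

def captures_for(pattern, text, name):
    """Hypothetical helper: list every capture recorded under a group name."""
    if regex is None:
        return None
    match = regex.match(pattern, text)
    if match is None:
        return []
    # captures() returns all values captured by the named group, so
    # same-named groups are expected to contribute to a single list.
    return match.captures(name)

result = captures_for(r"(?<a>here)? or (?<a>here)?", "here or here", "a")
```

On a version that accepts duplicate names, `result` is expected to hold the values from both occurrences of group "a", with no manual merging.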

mrabarnett · 2013-01-26T04:09:03Z

Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).

There is a minor issue when the same group is nested - the inner group overrides the value matched by the outer group and both are present in the result (2 copies of the same inner value). For example:

>>> match = regex.match(r'(?<x>a(?<x>b))', "ab")
>>> match.capturesdict()
{'x': ['b', 'b']}

mrabarnett · 2013-01-26T15:39:18Z

Original comment by Anonymous.

Fixed in regex 0.1.20130126.

mrabarnett closed this as completed Jan 26, 2013
