Allow duplicate names of groups #87
Comments
Original comment by Anonymous. Wouldn't the formats be alternatives, e.g. The possibility is already covered; the groups are mutually exclusive.
Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars). Alternative is very good for different value patterns, but not for different locations. Example: web scraping, complex page where the same value (say, price of a product) can appear in 3 different places, depending on the type of product:
Because these are different *locations* in the text, not different patterns, and the static parts ("some-stuff") must be present in the middle to correctly position the groups in the entire text, an alternative can't be used here (or would be very difficult, with static parts copy-pasted several times). Besides, we want to extract other properties too, not only the price, and we want to use a single regex for all this, without making 3 variants of the entire regex and without manually labelling fields 'price1', 'price2', 'price3' and then merging them.
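For illustration, the manual workaround described above (numbered groups merged after matching) might look like the following sketch with the standard `re` module. The input format, the field names `name`, `cost`, `stock`, `sale`, and the helper `extract` are all hypothetical, chosen only to show a price that can sit in one of three locations with static parts in between.

```python
import re

# Hypothetical record format: the price may appear in one of three places,
# separated by static parts that anchor the other fields.
PATTERN = re.compile(
    r"(?:price=(?P<price1>\d+);)?"   # location 1
    r"name=(?P<name>\w+);"
    r"(?:cost=(?P<price2>\d+);)?"    # location 2
    r"stock=\d+;"
    r"(?:sale=(?P<price3>\d+);)?"    # location 3
)

def extract(text):
    """Return name and price, merging the three numbered price groups by hand."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Manual merge: take the first numbered group that actually matched.
    candidates = (m.group(g) for g in ("price1", "price2", "price3"))
    price = next((value for value in candidates if value is not None), None)
    return {"name": m.group("name"), "price": price}

print(extract("name=lamp;cost=25;stock=3;"))
```

Every additional location needs another numbered group and another entry in the merge step, which is the bookkeeping the request aims to avoid.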
Original comment by Anonymous. The regex module tries to be compatible with the re module, whose documentation says: """Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression""". The regex module relaxes that a little by allowing them multiple times if they're mutually exclusive, but I'm not sure whether they should be allowed in the version 0 ('compatible') behaviour. Perhaps only in version 1 ('enhanced') behaviour? I'll need to think about it and see whether it would have any adverse side-effects. For the record, Perl allows it.
Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars). OK, thanks, for my needs V1 would be fine. In case you consider adding it in V0, note that, although this change is not strictly compatible with 're', it does NOT break any existing code, because it only relaxes the constraints on correct patterns: any pattern correct in 're' would still be correct in 'regex' and behave *exactly* the same, with no changes in the result; only some more patterns would be considered correct now.
Original comment by Anonymous. It's true that it wouldn't break any existing code, so there'd be no harm in having it work in V0 too.
Original comment by Anonymous. Duplicate group names are allowed in regex 0.1.20130124.
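As a rough sketch of what this enables, the third-party `regex` module accepts the same group name in more than one place, and `captures()` returns every value captured under that name. The pattern and inputs below are made up for illustration (the "some-stuff" static part echoes the discussion above), and the helper `prices` is hypothetical.

```python
import regex  # third-party module (pip install regex), not the standard re

# The same name "price" is used at two different locations in the pattern.
PATTERN = regex.compile(r"(?P<price>\d+)? some-stuff (?P<price>\d+)?")

def prices(text):
    """Return all values captured by the groups named 'price', or None."""
    m = PATTERN.match(text)
    return None if m is None else m.captures("price")

print(prices("10 some-stuff "))    # only the first location matched
print(prices(" some-stuff 20"))    # only the second location matched
print(prices("10 some-stuff 20"))  # both locations matched
```

When both locations match, `group("price")` gives only the last captured value, so `captures()` is the call to use when all values are wanted.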
Original comment by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars). There is a minor issue when the same group is nested - the inner group overrides the value matched by the outer group and both are present in the result (2 copies of the same inner value). For example:
Original comment by Anonymous. Fixed in regex 0.1.20130126.
Original report by Marcin Wojnarski (Bitbucket: mwojnars, GitHub: mwojnars).
Hi,
Currently, duplicate names are not allowed, for example this code raises an exception because group "a" is defined twice:
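The snippet from the report is not preserved in this copy. As a stand-in, the standard `re` module still rejects duplicate group names, so it can illustrate the kind of failure meant; the pattern and the helper `compile_error` below are assumptions, not the reporter's code.

```python
import re

def compile_error(pattern):
    """Return the compile error message for a pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Group "a" is defined twice, which the standard re module refuses to compile.
message = compile_error(r"(?P<a>x)?(?P<a>y)?")
print(message)
```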
I suspect this design is a legacy of the standard 're' module, which didn't allow multiple values per group, so it was natural to reject duplicate group names, too. But in the 'regex' module, which can capture repeated values, it would be natural to also accept duplicate group names and merge the values extracted from all same-named groups into one list.
This enhancement would allow parsing loose formats, where a given value may appear in any of several different places in the text and we must prepare a regex that has groups in all these places. Usually, we would expect that only one place is matched (groups are optional like in regex above), but we can't say in advance which one and - for convenience - we'd like to use the same name for all these places, to avoid manual merging of several groups afterwards. In other use cases, it may be possible that more than 1 group matches and we want to extract all the matched values as a single list.
I think this enhancement would fit very well with the concept of repeated captures that's already present in 'regex'.
Do any other regex implementations have something like this?
I don't know.