DEFINE in pattern creates unwanted empty capture group #452

Closed
p-i- opened this issue Feb 9, 2022 · 2 comments

p-i- commented Feb 9, 2022

In the example below, the DEFINE adds an extra, unwanted '' entry to the results:

> ipython
Python 3.10.0 (default, Oct 23 2021, 18:32:23) [Clang 13.0.0 (clang-1300.0.29.3)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.0.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %pip show regex
Name: regex
Version: 2022.1.18
Summary: Alternative regular expression module, to replace re.
Home-page: https://github.com/mrabarnett/mrab-regex
Author: Matthew Barnett
Author-email: regex@mrabarnett.plus.com
License: Apache Software License
Location: /Users/pi/Desktop/src/2022-01-24--pi-ccxt/.venv/lib/python3.10/site-packages
Requires:
Required-by:
Note: you may need to restart the kernel to use updated packages.

In [2]: import regex

In [3]: regex.findall(
   ...:     r"foo \s (\d+) \s bar \s (\d+)  (?(DEFINE)(?<integer>\d+))",
   ...:     "foo 1 bar 2",
   ...:     regex.VERBOSE,
   ...: )
Out[3]: [('1', '2', '')]

In [4]: regex.findall(
   ...:     r"(?(DEFINE)(?<integer>\d+)) foo \s ((?&integer)) \s bar \s (\d+)",
   ...:     "foo 1 bar 2",
   ...:     regex.VERBOSE,
   ...: )
Out[4]: [('', '1', '2')]

In [5]: pattern = regex.compile(
   ...:     r"(?(DEFINE)(?<integer>\d+)) foo \s ((?&integer)) \s bar \s (\d+)", regex.VERBOSE
   ...: )

In [6]: match = pattern.match("foo 1 bar 2")

In [7]: match.groupdict()
Out[7]: {'integer': None}

In [8]: match.groups()
Out[8]: (None, '1', '2')
  • It doesn't matter whether the rest of the pattern actually calls the DEFINEd group or not.

  • When the DEFINE is moved to the start of the pattern, the empty entry comes first instead.

I'm working with a more complex pattern which uses 3 DEFINEs:

(?(DEFINE)
    (?<decimal>
        [ ]*? \d+ (?:[.,] \d+)? [ ]*?
    )
    (?<TARGETS>
        ^[ ]*? Target ((?&decimal)) - (?&decimal) ( \( .*? \) )? [ ]*? \n
    )
    (?<range>
        (?&decimal) - (?&decimal) | (?&decimal)  # You first try the longer one. If that fails, fall back and do the shorter one.
        #(?&d) (?: - (?&d))?
    )
)

... and this creates 5 additional empty matches.

So it's awkward to work with. I have to run the expression manually, count how many empties to remove, and hardcode that number.

Maybe I can add a terminator token at the end of my pattern before the DEFINEs, and do it that way, but it's a hack.

I'm basically extracting numbers from a text-string, like:

ID: 42
some waffle
ENTRY: 33.3-33.4
TARGET: 44.5
TARGET: 44.6
TARGET: 44.7
Tolerance: 5%

What's my best play here?
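
One way to sidestep the extra groups entirely, offered here as an alternative rather than something proposed in this thread, is to keep reusable fragments as plain Python strings and interpolate them into the pattern. The sketch below runs the standard-library re module on the sample text above. The names DECIMAL, LINE and extract are made up for illustration, and it assumes every line of interest has the shape "KEY: number" or "KEY: number-number".

```python
import re

# Reusable fragment kept as a plain string instead of a DEFINE group.
# It contains only a non-capturing group, so it adds nothing to groups().
DECIMAL = r"\d+(?:[.,]\d+)?"

# Hypothetical line format: a known key, a colon, then a number or a range.
LINE = re.compile(
    r"^(?P<key>ID|ENTRY|TARGET|Tolerance):[ ]*"
    rf"(?P<low>{DECIMAL})(?:-(?P<high>{DECIMAL}))?",
    re.MULTILINE,
)

SAMPLE = """\
ID: 42
some waffle
ENTRY: 33.3-33.4
TARGET: 44.5
TARGET: 44.6
TARGET: 44.7
Tolerance: 5%
"""


def extract(text):
    """Return (key, low, high) per recognised line; high is None without a range."""
    return [m.group("key", "low", "high") for m in LINE.finditer(text)]


for row in extract(SAMPLE):
    print(row)
```

Because the fragment is substituted as text, the compiled pattern has exactly the three named groups and nothing needs filtering afterwards. The trade-off is that string interpolation cannot express recursion, which DEFINE and (?&name) can.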

p-i- commented Feb 9, 2022

This should elucidate what's going on:

In [1]: pattern_string = r"FOO (?P<foo>(?&Integer))(?(DEFINE)(?P<Integer>\d+))"

In [2]: pattern = regex.compile(pattern_string)

In [3]: match = pattern.match("FOO 42")

In [4]: match.groupdict()
Out[4]: {'foo': '42', 'Integer': None}

So my recommendation to myself is to name my capture-groups lowercase with (?P<myname>...) and name my DEFINEs (which I think of as functions / macros) capitalized, e.g. (?(DEFINE)(?P<Integer>\d+))

Then I can filter the groupdict that comes back:

G = {k: v for k, v in match.groupdict().items() if k[0].islower()}

If anybody has a suggestion for a cleaner solution, I'm all ears!

But I'm pretty happy with this one.

I'll leave it to an admin to close out the issue in case they have anything to add.
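
The naming convention described above can also stand in for findall, which returns every group positionally. The following is a minimal sketch, assuming the regex module is installed. The helper names wanted_groups and find_all and the group names first and second are invented for this example.

```python
import regex

# Convention from the comment above: lowercase names are captures to keep,
# capitalized names are DEFINE "macros" to drop.
PATTERN = regex.compile(
    r"""
    (?(DEFINE) (?P<Integer> \d+ ) )
    foo \s (?P<first> (?&Integer) ) \s bar \s (?P<second> (?&Integer) )
    """,
    regex.VERBOSE,
)


def wanted_groups(match):
    """Keep only the groups whose name starts with a lowercase letter."""
    return {k: v for k, v in match.groupdict().items() if k[0].islower()}


def find_all(pattern, text):
    """Like findall, but yields one filtered dict per match via finditer."""
    return [wanted_groups(m) for m in pattern.finditer(text)]


print(find_all(PATTERN, "foo 1 bar 2, foo 30 bar 40"))
```

Each match should come back as a dict holding only first and second. When the wanted names are known in advance, m.group("first", "second") selects them directly and no filter is needed.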

@mrabarnett
Owner

There's no difference between a capture group and a subroutine; you can call any group as a subroutine.

You need to have a way to refer to a group/subroutine, whether as backreference or so that you can call it, and a subroutine itself can contain capture groups and backreferences to them.
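
A small sketch of that point, assuming the regex module is installed: the same named group is written once as an ordinary capture group and once inside DEFINE, and both are called with (?&num). The pattern names plain and defined are illustrative only.

```python
import regex

# An ordinary named capture group, later called as a subroutine with (?&num).
plain = regex.compile(r"(?P<num>\d+)-(?&num)")

# The same group wrapped in DEFINE, so it is only ever reached through calls.
defined = regex.compile(r"(?(DEFINE)(?P<num>\d+))(?&num)-(?&num)")

for name, pattern in (("plain", plain), ("defined", defined)):
    match = pattern.fullmatch("12-34")
    # pattern.groups counts "num" in both cases, DEFINE or not.
    print(name, pattern.groups, match.groups())
```

Both patterns report one group. In the DEFINE version that group is never matched in place, so it appears as None in groups(), which matches the None and '' entries reported earlier in this issue.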

Yes, it's messy, but regex wasn't invented from scratch; it evolved from true "regular expressions" over many years.
