Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have an option for POSIX-compatible longest match of alternates #150

Closed
mrabarnett opened this issue Sep 2, 2015 · 9 comments
Closed

Have an option for POSIX-compatible longest match of alternates #150

mrabarnett opened this issue Sep 2, 2015 · 9 comments

Comments

@mrabarnett
Copy link
Owner

Original report by Anonymous.


Hello there,

Currently both re and regexp short-circuit the first match for alternate matches. For example, (A|AA)$ matches only the last character in AA.

On the other hand, POSIX regex (C, C++, Boost, Ruby) would demand that the longest leftmost match is returned, i.e AA. Most modern engines seem to reject this on the basis that it makes the engine terribly slow (because it cannot match alternates eagerly).

However, the leftmost longest overall match behavior can be quite useful in some situations, where otherwise workarounds are needed and it looks like there is currently no engine for Python which supports this behaviour.

It would be nice to have the POSIX behaviour of the longest submatch as an option when compiling a regular expression.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


You'll need to give me some examples to convince me.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


It is kind of hard to come up with examples that seem really convincing. There is always a workaround. Leftmost longest can just be convenient, but also very, very slow and memory consuming. However, it is the behavior that I usually would expect. I actually was surprised that there was short-circuit instead.

Nonetheless: One example might be identifier matching, where the identifiers have a common prefix but are otherwise configurable: 'getItem|getItemValue|setItem|setItemValue'. Here, Python won't ever match 'getItemValue' and 'setItemValue'. The workaround is to sort the alternate elements by length, but that only works reliable if the subexpressions are simple strings.

r'\d+(\w(\d*)?|[eE]([+-]\d+))': Matches either a number with an exponent or an integer with a base suffix (e.g. 'b', 'x', 'b12', where 'b12' means to base 12. This is simplified, because \d is too strict, here). Short-circuit treats the exponent as an 'e' suffix and stops. Workaround is to explicitly name the possible suffixes, introduce explicit boundaries (r'\b'), or exclude 'e' or --again-- re-ordering. Another one: some (German) umlauts can decomposed into character pairs. These may sometimes need to be count as a single character. A very naive version of umlaut-aware character counting could use r'(\w|ae|oe|ue|ss)', which does not work in Python. The workaround again is to have the r'\w' last. Also: (Mr|Mrs) (from the perl manual).

Also related to this (but different): Matching 'one(self)?(selfsufficient)?' against 'oneselfsufficient' regex (and re, and perl) will match 'oneself' instead of the full string (longest match). POSIX would require the full match.

Glenn Fowler has some examples where he analyzes the POSIX behavior.

Okui and Suzuki claim to have an algorithm which avoids the worst case exponential explosion. That might be of interest.

For reference:
POSIX, ruby, and PCRE (in my tests) do leftmost longest for alternates.

Python, Perl, Java, and JavaScript short-circuit for alternates. I was not able to find an alternative engine for Python that implements POSIX behavior.

Go has both (Compile and CompilePOSIX).

Engines also differ with regard to optional subexpressions (see above). This is actually what worries me a little, but I would need to do a survey to table up the various engines' behavior in this case.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


I've used some online regex testers.

As far as I can see, PCRE (PHP), JavaScript and Ruby all use first-match, not longest match.

I'm not surpised that PCRE uses first-match because the "PC" part stands for "Perl-Compatible", and Perl uses first-match.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Hmm, that might have been an error in my examples, I think I tried (a|aa)$ which matches aa on all engines because of the leftmost. With ruby (I have not re-tested that, yet) and PCRE gone, as it looks like, there isn't any proper POSIX engine with unicode support around anymore. It's probably not too bad if Python doesn't have one, too (it also didn't have one before regex).

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


It appears that there's a bit more to it than simply the "leftmost longest match"; there's also the question of the capture groups.

I found the rules a little hard to understand the way they were written, so I've rephrased them as follows:

Have we already found a match?
Yes:
    How long is it compared to the existing one?
    Shorter:
        Reject the new match.
    Longer:
        Accept the new match.
    Same length:
        For each group:
            Where did it match?
            Earlier:
                Accept the new match.
            Later:
                Reject the new match.
            Same position:
                How long is it compared to the existing one?
                Shorter:
                    Reject the new match.
                Longer:
                    Accept the new match.
                Same length:
                    Do nothing.
        Reject the new match.
No:
    Accept the new match.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Hmm. SingleUnix literally says

  • The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where first is defined to mean "begins earliest in the string".
  • Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, matches the longest possible string
  • as well as It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to each subexpression, but only with the proviso that the overall match is leftmost longest. For example, matching \(ac*\)c*d[ac]*\1 against acdacaaa matches acdacaaa (with \1=a); Note the syntax difference for BREs which need escapes for the controls.

If I understand it correctly, the proviso part is the one that seems to get dropped for efficiency.

In your algorithm sketch, does accept mean accept and terminate or just mark this as the currently accepted match?

Because, If I understand you correctly, searching a(a)?(ab)? in aab should return the full aab (with groups None and ab) not aa (with groups a and None), because that is the longer overall match. Instead all but the POSIX-compliant engines and none of those that support Unicode seem to apply recursive subgroups match left-to-right, longest local match, ignoring external length

I think we might have some really nice examples for the documentation sections at the very least ;-).

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


The most important part is that it should be the longest overall match; the rest of it is about how to choose between possible matches that contain capture groups.

In the algorithm code, "accept" and "reject" are about whether to accept or reject the new match as being the best one found so far at this position.

Incidentally, the POSIX standard doesn't have backreferences (or so I've read), although some implementations have added them.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


I'm not sure, but I think that depends on the version of POSIX you want to refer to.

SingleUnix (which currently is IEEE Std. 1003.1-2013) has backrefs:

The back-reference expression \n matches the same (possibly empty) string of characters as was matched by a subexpression enclosed between ( and ) preceding the \n.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Added in regex 2015.09.15.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant