Have an option for POSIX-compatible longest match of alternates #150
Comments
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). You'll need to give me some examples to convince me.
Original comment by Anonymous. It is kind of hard to come up with examples that seem really convincing. There is always a workaround. Leftmost longest can just be convenient, but also very, very slow and memory consuming. However, it is the behavior that I usually would expect. I actually was surprised that there was short-circuit instead. Nonetheless: One example might be identifier matching, where the identifiers have a common prefix but are otherwise configurable:
Also related to this (but different): Matching Glenn Fowler has some examples where he analyzes the POSIX behavior. Okui and Suzuki claim to have an algorithm which avoids the worst case exponential explosion. That might be of interest. For reference: Python, Perl, Java, and JavaScript short-circuit for alternates. I was not able to find an alternative engine for Python that implements POSIX behavior. Go has both (Compile and CompilePOSIX). Engines also differ with regard to optional subexpressions (see above). This is actually what worries me a little, but I would need to do a survey to table up the various engines' behavior in this case. |
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). I've used some online regex testers. As far as I can see, PCRE (PHP), JavaScript and Ruby all use first-match, not longest match. I'm not surprised that PCRE uses first-match, because the "PC" part stands for "Perl-Compatible", and Perl uses first-match.
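For comparison, the first-match behaviour is easy to reproduce with Python's standard `re` module. This is only a small sketch; the patterns and input strings are illustrative and were not taken from the online testers mentioned above.

```python
import re

# Alternatives are tried left to right; the first one that lets the
# overall match succeed wins, even if a later one would match more text.
unanchored = re.search(r'A|AA', 'AA')
print(unanchored.group())  # A

# With an end anchor, the first alternative fails at position 0 (the
# string does not end after one character), so the engine backtracks
# into the second alternative and the whole string is matched.
anchored = re.search(r'(A|AA)$', 'AA')
print(anchored.group())  # AA
```

So the difference between first-match and longest-match only shows up when the first alternative already allows the rest of the pattern to succeed.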
Original comment by Anonymous. Hmm, that might have been an error in my examples, I think I tried |
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). It appears that there's a bit more to it than simply the "leftmost longest match"; there's also the question of the capture groups. I found the rules a little hard to understand the way they were written, so I've rephrased them as follows:
Original comment by Anonymous. Hmm. SingleUnix literally says
If I understand it correctly, the proviso part is the one that seems to get dropped for efficiency. In your algorithm sketch, does accept mean accept and terminate, or just mark this as the currently accepted match? Because, if I understand you correctly, searching I think we might have some really nice examples for the documentation sections at the very least ;-).
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). The most important part is that it should be the longest overall match; the rest of it is about how to choose between possible matches that contain capture groups. In the algorithm code, "accept" and "reject" are about whether to accept or reject the new match as being the best one found so far at this position. Incidentally, the POSIX standard doesn't have backreferences (or so I've read), although some implementations have added them.
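The "best match found so far at this position" idea can be sketched in plain Python. This is not the algorithm code referred to in the comment and not how the regex module implements it; `leftmost_longest` is a hypothetical helper that assumes the pattern is just a list of top-level alternatives, each simple enough (for example a literal) that `re` matches it unambiguously. It also ignores the capture-group rules entirely.

```python
import re

def leftmost_longest(alternatives, text):
    """Return the (start, end) span of the leftmost-longest match, or None.

    Hypothetical helper: `alternatives` is a list of simple sub-patterns
    that together stand for alt1|alt2|... at the top level.
    """
    compiled = [re.compile(alt) for alt in alternatives]
    for pos in range(len(text) + 1):      # leftmost: try positions in order
        best = None                       # best match so far at this position
        for pattern in compiled:
            candidate = pattern.match(text, pos)
            if candidate is None:
                continue
            # "Accept" the candidate only if it is longer than the current
            # best; otherwise "reject" it and keep what we have.
            if best is None or candidate.end() > best.end():
                best = candidate
        if best is not None:
            return best.span()            # stop at the first position that matches
    return None

print(leftmost_longest(['Mr', 'Mrs'], 'Dear Mrs Smith'))  # (5, 8)
```

Note that every alternative is tried at each position before a result is returned, which is where the extra cost compared with first-match comes from.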
Original comment by Anonymous. I'm not sure, but I think that depends on the version of POSIX you want to refer to. SingleUnix (which currently is IEEE Std. 1003.1-2013) has backrefs: "The back-reference expression \n matches the same (possibly empty) string of characters as was matched by a subexpression enclosed between ( and ) preceding the \n."
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). Added in regex 2015.09.15.
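In the released regex module, POSIX matching is switched on with the `POSIX` flag (inline form `(?p)`). The sketch below compares it with the standard library; `first_and_posix` is a hypothetical helper, and the import is guarded only so that the snippet still runs where the third-party package is not installed.

```python
import re

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # the POSIX comparison is skipped in that case

def first_and_posix(pattern, text):
    """Return (first-match result, POSIX result or None if regex is missing)."""
    first = re.search(pattern, text).group()
    posix = None
    if regex is not None:
        posix = regex.search(pattern, text, flags=regex.POSIX).group()
    return first, posix

print(first_and_posix(r'Mr|Mrs', 'Mrs'))  # ('Mr', 'Mrs') when regex is installed
```

Without the flag, regex keeps the same first-match behaviour as `re`, so existing patterns are unaffected.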
Original report by Anonymous.
Hello there,
Currently both `re` and `regex` short-circuit on the first matching alternative. For example, `(A|AA)$` matches only the last character in `AA`.

On the other hand, POSIX regex (C, C++, Boost, Ruby) would demand that the leftmost longest match is returned, i.e. `AA`. Most modern engines seem to reject this on the basis that it makes the engine terribly slow (because it cannot match alternates eagerly).

However, the leftmost longest overall match behaviour can be quite useful in some situations where workarounds are otherwise needed, and it looks like there is currently no engine for Python which supports it.
It would be nice to have the POSIX behaviour of the longest submatch as an option when compiling a regular expression.