Request: (*SKIP) #153
Comments
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). I've been experimenting with Perl, trying to see whether I can skip to before the starting position of the match by using (*SKIP) in a lookbehind, something like:
If you could skip to before the start position, you might be able to stop it progressing through the text. Is it guaranteed that you can never do that?
Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag). Hi Matthew, What an interesting idea you came up with! That would be weird indeed. To reassure you, You can think of
You can see how adding a fail token such as To further illustrate the answer, I have made this progression in PHP (PCRE), with the two last examples showing what happens when the Also, I find the PCRE documentation on backtracking control clearer than Perl's. There are some refinements mentioned in the documentation, but since the verbs are experimental and not consistent between PCRE and Perl, it seems to me that it would be safe to ignore them for the time being.
The above can be tweaked in this sandbox.
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). In the last 2 examples, it doesn't get to the (*SKIP) because it looks back for the '3' and sees a '4' instead. (Lookbehinds should be read in reverse order.)
Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag). Hi Matthew, what you say is correct for engines that have infinite lookbehind (mrab, .NET, JGSoft), but I don't think it's the case for PCRE (used in the example). Remember that PCRE lookbehind is fixed-width. The engine sees that the pattern in the lookbehind has a fixed width of two, and jumps to a starting position for the submatch that is two characters earlier in the string. No reversal occurs. I don't have a reference for this right now, but this is what I remember from studying how lookbehind works in various engines some years ago.
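The "jump back by the fixed width, then match forward" strategy described in this comment can be sketched in Python as follows. This is only a model of the description, not PCRE's actual implementation; the helper name `lookbehind_fixed` and the sample string are hypothetical.

```python
import re

def lookbehind_fixed(s, pos, sub_pattern, width):
    """Emulate a fixed-width lookbehind at `pos`: jump back `width`
    characters and match the sub-pattern forward from there."""
    start = pos - width
    if start < 0:
        return False  # not enough text before pos
    # Match forward, never reading past the current position.
    m = re.compile(sub_pattern).match(s, start, pos)
    return m is not None and m.end() == pos

s = "12345"
print(lookbehind_fixed(s, 4, r"34", 2))  # True: "34" sits right before index 4
print(lookbehind_fixed(s, 3, r"34", 2))  # False: "23" sits right before index 3
```

Under this model the sub-pattern is read left to right, so the first character checked is the one furthest from the current position.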
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). It looks like the (*SKIP) is ignored in lookarounds. I've tried this in Perl:
and it shows:
Notice how the (*SKIP) would make the regex skip characters if it had an effect, but, apparently, it doesn't.
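For readers unfamiliar with the verb, here is a toy model in Python of what "skipping characters" means when (*SKIP) does take effect, that is, outside a lookaround. It hand-simulates the pattern a+b against a+(*SKIP)b and records which start positions get tried. The function name `starts_tried` is hypothetical, and the model only covers this one pattern; it is a sketch of the behaviour PCRE documents, not an engine.

```python
def starts_tried(text, use_skip):
    """Toy model of matching a+b (or a+(*SKIP)b when use_skip is True).
    Returns the list of start positions attempted before success or giving up."""
    tried = []
    pos = 0
    while pos < len(text):
        tried.append(pos)
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1  # greedy a+
        if end > pos and end < len(text) and text[end] == "b":
            return tried  # overall match found at this start position
        # The attempt failed.
        if use_skip and end > pos:
            pos = end   # (*SKIP) was reached after a+: resume from there
        else:
            pos += 1    # ordinary bump-along by one character
    return tried

print(starts_tried("aaaac", use_skip=False))  # [0, 1, 2, 3, 4]
print(starts_tried("aaaac", use_skip=True))   # [0, 4]
```

With the verb, the failed attempt at position 0 is not followed by attempts at 1, 2 and 3, because the next attempt starts where (*SKIP) was encountered.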
Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag). Hi Matthew, first off it looks like our messages got crossed (I edited the message about PCRE lookbehind while you were posting). It looks like you are right! Supposing
Testing it with two characters confirms your idea:
If You're right! These verbs are not greatly documented, it seems like it takes a bit of reverse engineering. For my taste, I would have thought that I preferred the version where EDIT: for the record, same behavior in PCRE. |
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). If (*SKIP) worked everywhere, then you'd have to decide what should happen in the lookbehind case. Perhaps just ignore such attempts to skip backwards?
Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag). In a context where Within an assertion, what behavior makes the most sense to you? Would you allow And do you prefer to stay close to Perl and PCRE, in case someone finds an edge case and "complains?" :) You've made me wonder how these languages intend Man, once again I can see how writing a regex engine is not for the faint of heart, and especially one as generous in features. You're a real hero, I'm sure many people wouldn't enjoy regex in Python if they couldn't do something like |
Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag). Hi again Matthew, The missing piece is that the engine is never allowed to backtrack across a This can be seen here. (I put comments in some PHP code to illustrate the behavior in PCRE, but here is a sandbox with all of the same patterns in Perl.)
In a lookaround,
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). Here's a summary of what I think the behaviour is/should be: A (*SKIP) in a pattern sets a limit on how far back the regex is A (*SKIP) in a lookaround won't affect the enclosing pattern, but one Interestingly, in the following test with Perl, only the first matches. It looks to me like a bug. PCRE says that both match, as I'd expect.
Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett). Added in regex 2015.09.14.
Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag). Wow, Matthew... On behalf of myself and all avid regex users, thank you so much! I see you've added For anyone who is curious, here is a classic example of
Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag). A quick FYI: I've documented this feature on this page, where the behavior of backtracking control verbs is shown in the three engines that now support them (Perl, PCRE, and now Python). Interesting differences sometimes. PCRE and Python are well-behaved but Perl has a few known bugs.
Original report by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).
Hi again Matthew,
This is the third in a series of posts to present a case for three features. In this post, I'll focus on (*SKIP). This is probably the lowest priority one as I use it the least, but when I do use it, it is just wonderful.

The (*SKIP)(*FAIL) syntax shows its worth when you want to match something except in certain contexts. Instead of trying to avoid the "bad context", you deliberately match it, add (*SKIP)(*FAIL), then an OR |, then match what you actually want.

Some time ago I showed how this works on here.

I realize that Perl and PCRE have other control verbs such as (*PRUNE), but I am yet to see a convincing use case for those, whereas (*SKIP)(*FAIL) can be used quite often. So it seems to me that it would be quite alright to implement (*SKIP) without bothering about the others.

Thanks in advance for considering it.
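To make the "match the bad context, then discard it" idea concrete, here is a minimal sketch in Python. It does not use (*SKIP)(*FAIL) itself: it emulates the idiom with the standard library `re` module by matching the unwanted context in one alternative and capturing the wanted text in another. The task (numbers outside double quotes), the sample text, and the function name `numbers_outside_quotes` are all hypothetical.

```python
import re

# Hypothetical task: find every number that is NOT inside double quotes.
text = 'order 12 shipped, note "call 555 now", total 40'

# Perl/PCRE form of the idiom (shown for reference only, not run here):
#     "[^"]*"(*SKIP)(*FAIL)|\d+
# Stdlib emulation: consume the bad context in the first alternative,
# and capture only what we actually want in the second.
pattern = re.compile(r'"[^"]*"|(\d+)')

def numbers_outside_quotes(s):
    """Return the numbers in s that are not inside a double-quoted span."""
    return [m.group(1) for m in pattern.finditer(s) if m.group(1) is not None]

print(numbers_outside_quotes(text))  # ['12', '40']
```

The emulation needs a post-filter on the capture group, whereas with the verbs the unwanted matches never surface at all, which is what makes the (*SKIP)(*FAIL) form convenient for substitution and splitting.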