Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: (*SKIP) #153

Closed
mrabarnett opened this issue Sep 10, 2015 · 13 comments
Closed

Request: (*SKIP) #153

mrabarnett opened this issue Sep 10, 2015 · 13 comments
Labels
enhancement New feature or request minor

Comments

@mrabarnett
Copy link
Owner

Original report by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Hi again Matthew,

This is the third in a series of posts to present a case for three features. In this post, I'll focus on (*SKIP). This is probably the lowest priority one as I use it the least, but when I do use it, it is just wonderful.

The (*SKIP)(*FAIL) syntax shows its worth when you want to match something except in certain contexts. Instead of trying to avoid the "bad context", you deliberately match it, add (*SKIP)(*FAIL), then an OR |, then match what you actually want.

Some time ago I showed how this works on here.

I realize that Perl and PCRE have other control verbs such as (*PRUNE), but I am yet to see a convincing use case for those, whereas (*SKIP)(*FAIL) can be used quite often. So it seems to me that it would be quite alright to implement (*SKIP) without bothering about the others.

Thanks in advance for considering it.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


I've been experimenting with Perl, trying to see whether I can skip to before the starting position of the match by using (*SKIP) in a lookbehind, something like:

#!python

    ..(?<=(*SKIP)...)(*FAIL)

If you could skip to before the start position, you might be able to stop it progressing through the text.

Is it guaranteed that you can never do that?

@mrabarnett
Copy link
Owner Author

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Hi Matthew,

What an interesting idea you came up with! That would be weird indeed.

To reassure you, (*SKIP) in a lookbehind or other assertion has no meaning in relation to the position of the engine in the sub-match. (*SKIP) always creates a reference to a string position in the outermost match.

You can think of (*SKIP) is a bookmark in the subject:

  • The engine starts a match attempt in the string at position i
  • At position j in the string, the engine encounters the (*SKIP) token. It continues the match attempt, but records position j.
  • At position k, the match fails.
  • Normally, the engine would start its next match attempt at position i+1. However, because of the bookmark, the next attempt starts at position j.

You can see how adding a fail token such as (?!) or the syntactic sugar (*F) right after a (*SKIP) works. The point of failure matches the bookmark, so the match attempt resumes right where we failed in the subject, so that everything prior is skipped.

To further illustrate the answer, I have made this progression in PHP (PCRE), with the two last examples showing what happens when the (*SKIP) is inside a lookbehind.

Also, I find the PCRE documentation on backtracking control clearer than Perl's.

There are some refinements mentioned in the documentation, but since the verbs are experimental and not consistent between PCRE and Perl, it seems to me that it would be safe to ignore them for the time being.

#!php

$pattern = '~\d+(*SKIP)bcd|[3d]~';
if (preg_match($pattern, '123bcd', $m)) {echo "$m[0]\n";}
// matches 123bcd. The bookmark is discarded, `(*SKIP)` has no effect.
if (preg_match($pattern, '123zzd', $m)) {echo "$m[0]\n";}
// [3d] matches d, not the 3 because we have skipped it

$pattern = '~\d++(?<=3(*SKIP))zzd|[4d]$~';
if (preg_match($pattern, '123zzd', $m)) {echo "$m[0]\n";}
// matches 123zzd: the first branch succeeds
if (preg_match($pattern, '124zzd', $m)) {echo "$m[0]\n";}
// [4d] matches d: when the lookbehind fails, the bookmark in the outer match is beyond the 4
// NOTE THE POSITION OF THE (*SKIP) IN THE LOOKBEHIND
// This would do the same:

$pattern = '~\d++(?<=(*SKIP)3)zzd|[4d]$~';
if (preg_match($pattern, '124zzd', $m)) {echo "$m[0]\n";}
// [4d] matches d

// Playing some more with the position of the (*SKIP) in the lookbehind to convince you:
$pattern = '~\d++(?<=2(*SKIP)3)zzd|[4d]$~';
if (preg_match($pattern, '124zzd', $m)) {echo "$m[0]\n";}
// [4d] matches d, not 4:
// the (*SKIP) is between 2 and 3, but the bookmark is not in the sub-match (after the 2), it is
// in the outermost match, beyond the 4.

The above can be tweaked in this sandbox.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


In the last 2 examples, it doesn't get to the (*SKIP) because it looks back for the '3' and sees a '4' instead. (Lookbehinds should be read in reverse order.)

@mrabarnett
Copy link
Owner Author

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Hi Matthew, what you say is correct for engines that have infinite lookbehind (mrab, .NET, JGSoft), but I don't think it's the case for PCRE (used in the example). Remember that PCRE lookbehind is fixed-width. The engine sees that the pattern in the lookbehind has a fixed width of two, and jumps to a starting position for the submatch that is two places before in the string. No reversal occurs.

I don't have a reference for this right now but this is in my head from studying how lookbehind works in various engines some years ago.
Edit: looking at my notes here and here, I see that I had given your engine prominent advertising as always. :)

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


It looks the (*SKIP) is ignored in lookarounds.

I've tried this in Perl:

#!perl

while ('abcd123' =~ /(.)(?{print "$1\n";})(?=.(*SKIP))\d/g) {
    print "\$&=\"$&\"\n";
}

and it shows:

#!python

a
b
c
d
$&="d1"
2
$&="23"

Notice how the (*SKIP) would make the regex skip characters if it had an effect, but, apparently, it doesn't.

@mrabarnett
Copy link
Owner Author

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Hi Matthew, first off it looks like our messages got crossed (I edited the message about PCRE lookbehind while you were posting).

It looks like you are right!
But to reach that conclusion I believe your example needs to be tweaked.

Supposing (*SKIP) worked inside the lookahead, it seems to me that your scenario is compatible with that:

  • engine starts match attempt before the a. First token matches a. In the lookahead, b is matched. (*SKIP) causes the current string position in the outermost match (see my first post from today) to be bookmarked, i.e. the position just before b. The \d token fails.
  • engine starts match attempt at the bookmarked position, which is between a and b. We get the same results as your output, because we only skip one character at a time, which is what the engine does anyway.

Testing it with two characters confirms your idea:

perl -e 'if ('abc12' =~ /(..)(?=.(*SKIP))\d/) { print "match: $&\n"; }'

If (*SKIP) had the effect I thought, we would skip ab and the match would be c12. However, the match is bc1.

You're right!

These verbs are not greatly documented, it seems like it takes a bit of reverse engineering. For my taste, I would have thought that I preferred the version where (*SKIP) works everywhere, but since I've never noticed, it can't be an issue... And the current behavior is probably the fruit of higher wisdom.

EDIT: for the record, same behavior in PCRE.

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


If (*SKIP) worked everywhere, then you'd have to decide what should happen in the lookbehind case.

Perhaps just ignore such attempts to skip backwards?

@mrabarnett
Copy link
Owner Author

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


In a context where (*SKIP) bookmarks the engine's string position in the outermost match, it can never cause a backwards skip even when encountered in a lookbehind: the bookmarked position to be skipped to is the current position in the global match attempt.

Within an assertion, what behavior makes the most sense to you? Would you allow (*SKIP) to act, or would you rather ignore it?

And do you prefer to stay close to Perl and PCRE, in case someone finds an edge case and "complains?" :)

You've made me wonder how these languages intend (*SKIP) to work within an assertion. I wondered if it might have some effect on the local matching process, within the assertion. But since it makes no sense to "restart an assertion's match further down the string" (the point of the assertion being that it asserts something at a specific position), maybe (*SKIP) is just ignored.

Man, once again I can see how writing a regex engine is not for the faint of heart, and especially one as generous in features. You're a real hero, I'm sure many people wouldn't enjoy regex in Python if they couldn't do something like import regex as mrab

@mrabarnett
Copy link
Owner Author

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Hi again Matthew,
After digging deeper into the documentation, I think I've finally figured out what happens to (*SKIP) within a lookaround.

The missing piece is that the engine is never allowed to backtrack across a (*SKIP).

This can be seen here. (I put comments in some PHP code to illustrate the behavior in PCRE, but here is a sandbox with all of the same patterns in Perl.)

#!php

echo "This illustrates that you cannot backtrack across (*SKIP)\n";

// A+ gives up the final A to allow \w to match. The match is AAA
echo preg_match('~A+\w~', 'AAA', $m) ? "$m[0]\n" : "no match\n";

// With a (*SKIP), the engine cannot backtrack into A+. The match fails.
echo preg_match('~A+(*SKIP)\w~', 'AAA', $m) ? "$m[0]\n" : "no match\n";

In a lookaround, (*SKIP) affects whether or not it is possible to backtrack within the lookaround. This causes some interesting logic for a negative lookaround. Here is an illustration.

#!php


echo "(*SKIP) in lookahead\n";

// The .+ in the lookahead first consumes bc, then the c fails. The .+ gives up the c. The assertion succeeds with bc, and the match is ab
echo preg_match('~a(?=.+c)\w~', 'abc', $m) ? "$m[0]\n" : "no match\n";

// Let's add a (*SKIP). The .+ in the lookahead consumes bc, then the c fails. The engine is not allowed to cross (*SKIP) to backtrack, so the looakead fails, as does the match.
echo preg_match('~a(?=.+(*SKIP)c)\w~', 'abc', $m) ? "$m[0]\n" : "no match\n"; // no match

// With a negative lookahead, the situation is inverted.
echo "(*SKIP) in negative lookahead\n";

// The .+ in the negative lookahead first consumes bc, then the c fails. The .+ gives up the c. The sub-pattern succeeds with bc, the negative lookahead fails, there is no match.
echo preg_match('~a(?!.+c)\w~', 'abc', $m) ? "$m[0]\n" : "no match\n";

// Let's add a (*SKIP). The .+ in the negative lookahead consumes bc, then the c fails. The engine is not allowed to cross (*SKIP) to backtrack, so the sub-pattern fails, the negative lookahead succeeds, and the match is ab.
echo preg_match('~a(?!.+(*SKIP)c)\w~', 'abc', $m) ? "$m[0]\n" : "no match\n"; // no match

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Here's a summary of what I think the behaviour is/should be:

A (*SKIP) in a pattern sets a limit on how far back the regex is
allowed to backtrack in terms of the position within the text. Any
attempt to backtrack past that limit will result in further
backtracking. The limit also affects any nested pattern in a
lookaround.

A (*SKIP) in a lookaround won't affect the enclosing pattern, but one
in the outermost pattern will affect where the regex will attempt to
match next if the entire pattern fails to match and it needs to advance
for the next attempt (it'll advance to the limit, or just past it if it
started there).

Interestingly, in the following test with Perl, only the first matches. It looks to me like a bug.

PCRE says that both match, as I'd expect.

#!perl

print "Trying /.*?(*SKIP)..d/\n";

if ('abcdef' =~ /.*?(*SKIP)..d/) {
    print "\$&=\"$&\"\n";
    print "\$1=\"$1\"\n";
} else {
    print "No match\n";
}

print "\n";

print "Trying /.*?(*SKIP)..(?=d)/\n";

if ('abcdef' =~ /.*?(*SKIP)..(?=d)/) {
    print "\$&=\"$&\"\n";
    print "\$1=\"$1\"\n";
} else {
    print "No match\n";
}

@mrabarnett
Copy link
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Added in regex 2015.09.14.

@mrabarnett
Copy link
Owner Author

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


Wow, Matthew...
You're an absolute hero for adding these three features!

On behalf of myself and all avid regex users, thank you so much!
Starting to play with them.

I see you've added (*F) and (*FAIL) as well!

For anyone who is curious, here is a classic example of (*SKIP)(*FAIL). The goal is to match strings like c22 (a c and some digits), except if the string is between double quotes. Therefore, the only correct match in the subject is c15.

#!python

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
>>> import regex as mrab
>>> pattern = mrab.compile('"c\d+"(*SKIP)(*FAIL)|c\d+')
>>> print(list(pattern.finditer('"c12" and c15 and "c14"')))
[<regex.Match object; span=(10, 13), match='c15'>]

@mrabarnett
Copy link
Owner Author

Original comment by boolbag NA (Bitbucket: boolbag, GitHub: boolbag).


A quick FIY: I've documented this feature on this page, where the behavior of backtracking control verbs are shown in the three engines that now support them (Perl, PCRE, and now Python).

Interesting differences sometimes. PCRE and Python are well-behaved but Perl has a few known bugs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request minor
Projects
None yet
Development

No branches or pull requests

1 participant