Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two-phase matching algorithm for NonBacktracking #68199

Merged
merged 4 commits into from
Apr 19, 2022

Conversation

olsaarik
Copy link
Contributor

@olsaarik olsaarik commented Apr 19, 2022

This PR changes the RegexOptions.NonBacktracking matching algorithm to a mostly 2-phase one, fixing #65607. Currently the first phase only walks to the first final state position, relying on a third phase to extend the match as far as possible. Now that NonBacktracking is aiming to fully match the semantics of the backtracking engines, this is problematic for some patterns. For example, .{5}Foo|Bar on FooBarFoo should match on ooBarFoo, but with the current algorithm it produces Bar, the problem being that the first phase finds the first final state at the end of Bar and then fails to walk backwards form that to the starting point of the preferred match.

The new algorithm has the first phase walk to the end of the preferred match, which is now possible with the backtracking simulating derivatives that we added recently. Then the second phase is the same as before, walking back to the matching starting point. The third phase is gone for all cases except for patterns with subcaptures. That third phase is a bit simpler than before too, since it doesn't have to find the end of a match (which is already known at that point).

Critically, the "dot starred" version of a regex R, which the first phase operates on is now using a lazy loop: .*?R. With the backtracking simulation this causes a partial match T to appear on the left side of the core pattern: T|.*?R. If T ever becomes nullable, then any lower priority parts of the derivative will disappear, allowing the matching procedure to reach a deadend state when all earlier-starting match candidates have been exhausted or extended as far as possible.

One optimization that we lose is the counter-subsumption one. This was already invalid before, but only hit the unit tests with these changes. The problem is that with the new ordered alternation combining .{0,1}|.{0,2} into .{0,2} loses the information that the shorter match is preferred. This PR removes the optimization.

I also performed a general pass of cleaning up the matching code. To summarize the changes in the PR:

  • Move to a 2-phase matching algorithm that fixes semantic differences.
  • Add new lazy dot star into the builder and use it for the dot-starred version of the pattern.
  • Add a test (the example above) that passes with the new algorithm, but didn't pass before.
  • The phase functions in SymbolicRegexMatcher.cs are now in the order they are called.
  • The matching functions had several places where nullability was checked. The matching loops are now written to first check for nullability/deadends/more input before transitioning, which allows some consolidation of the logic. This cleanup also fixes RegexOptions.NonBacktracking needs to handle capture groups when startat == input.Length #65606 by removing the special case mentioned in the bug.
  • Found and fixed a bug, where the "starts with line anchor" information in SymbolicRegexInfo didn't match for which anchors the special handling of an \n at the very end of the input was triggered. This was probably introduced before, but surfaced in the new matching algorithm. Removed unused bits from SymbolicRegexInfo at the same time.
  • Remove the invalid counter optimization (details above).

First phase now finds the true match end position.
The implicit .* is now a lazy .*? to prioritize the earliest match.
Third phase is now only run for subcaptures, which no longer needs to
find match end position.
Remove counter optimization that no longer applies with OrderedOr.
Fix a problem in SymbolicRegexInfo where begin/end anchors were
marked as line anchors.
Also remove dead fields from SymbolicRegexInfo.
Fix captures not being handled for empty matches at start of input.
Especially fix comments for the new 2-phase match generation algorithm.
@ghost
Copy link

ghost commented Apr 19, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR changes the RegexOptions.NonBacktracking matching algorithm to a mostly 2-phase one. Currently the first phase only walks to the first final state position, relying on a third phase to extend the match as far as possible. Now that NonBacktracking is aiming to fully match the semantics of the backtracking engines, this is problematic for some patterns. For example, .{5}Foo|Bar on FooBarFoo should match on ooBarFoo, but with the current algorithm it produces Bar, the problem being that the first phase finds the first final state at the end of Bar and then fails to walk backwards form that to the starting point of the preferred match.

The new algorithm has the first phase walk to the end of the preferred match, which is now possible with the backtracking simulating derivatives that we added recently. Then the second phase is the same as before, walking back to the matching starting point. The third phase is gone for all cases except for patterns with subcaptures. That third phase is a bit simpler than before too, since it doesn't have to find the end of a match (which is already known at that point).

Critically, the "dot starred" version of a regex R, which the first phase operates on is now using a lazy loop: .*?R. With the backtracking simulation this causes a partial match T to appear on the left side of the core pattern: T|.*?R. If T ever becomes nullable, then any lower priority parts of the derivative will disappear, allowing the matching procedure to reach a deadend state when all earlier-starting match candidates have been exhausted or extended as far as possible.

One optimization that we lose is the counter-subsumption one. This was already invalid before, but only hit the unit tests with these changes. The problem is that with the new ordered alternation combining .{0,1}|.{0,2} into .{0,2} loses the information that the shorter match is preferred. This PR removes the optimization.

I also performed a general pass of cleaning up the matching code. To summarize the changes in the PR:

  • Move to a 2-phase matching algorithm that fixes semantic differences.
  • Add new lazy dot star into the builder and use it for the dot-starred version of the pattern.
  • Add a test (the example above) that passes with the new algorithm, but didn't pass before.
  • The phase functions in SymbolicRegexMatcher.cs are now in the order they are called.
  • The matching functions had several places where nullability was checked. The matching loops are now written to first check for nullability/deadends/more input before transitioning, which allows some consolidation of the logic.
  • Found and fixed a bug, where the "starts with line anchor" information in SymbolicRegexInfo didn't match for which anchors the special handling of an \n at the very end of the input was triggered. This was probably introduced before, but surfaced in the new matching algorithm. Removed unused bits from SymbolicRegexInfo at the same time.
  • Remove the invalid counter optimization (details above).
Author: olsaarik
Assignees: olsaarik
Labels:

area-System.Text.RegularExpressions

Milestone: -

Comment on lines +39 to 47
if (startsWithLineAnchor)
{
i |= ContainsLineAnchorMask;

if (startsWithLineAnchor)
{
i |= StartsWithLineAnchorMask;
}
i |= StartsWithLineAnchorMask;
}

if (startsWithBoundaryAnchor)
if (startsWithLineAnchor || startsWithSomeAnchor)
{
i |= StartsWithBoundaryAnchorMask;
i |= StartsWithSomeAnchorMask;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:
To retain the style of the rest of the checks, this could be:

if (startsWithLineAnchor || startsWithSomeAnchor)
{
    i |= StartsWithSomeAnchorMask;

    if (startsWithLineAnchor)
    {
        i |= StartsWithLineAnchorMask
    }
}

@@ -210,7 +210,7 @@ private SymbolicRegexMatcher(SymbolicRegexNode<TSet> rootNode, int captureCount,

// Create the dot-star pattern (a concatenation of any* with the original pattern)
// and all of its initial states.
_dotStarredPattern = _builder.CreateConcat(_builder._anyStar, _pattern);
_dotStarredPattern = _builder.CreateConcat(_builder._anyStarLazy, _pattern);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we rename _dotStarredPattern to _lazyDotStarredPattern?

int mintermId = c == '\n' && i == input.Length - 1 && TStateHandler.StartsWithLineAnchor(ref state) ?
builder._minterms!.Length : // mintermId = minterms.Length represents \Z (last \n)
builder._minterms!.Length : // mintermId = minterms.Length represents an \n at the very end of input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
builder._minterms!.Length : // mintermId = minterms.Length represents an \n at the very end of input
builder._minterms!.Length : // mintermId = minterms.Length represents a \n at the very end of input

@@ -374,260 +363,207 @@ public SymbolicMatch FindMatch(bool isMatch, ReadOnlySpan<char> input, int start
// start position. That tells us the actual starting position of the match. We can skip this phase if we
// recorded a fixed-length marker for the portion of the pattern that matched, as we can then jump that
// exact number of positions backwards. Continuing the previous example, phase 2 will walk backwards from
// that first b until it finds the 6th a: aaaaaaaaaab.
// that last b until it finds the 4th a: aaabbbc.
int matchStart;
if (matchStartLengthMarker >= 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are markers working again? They weren't previously:
#65532

currentState = new CurrentState(nfaState);
if (i < inputForInnerLoop.Length)
{
// We failed to transition. Upgrade to DFA mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We failed to transition. Upgrade to DFA mode.
// We failed to transition. Upgrade to NFA mode.

Comment on lines +537 to +539
// If this is an isMatch call we are done, since a match is now known to exist.
if (isMatch)
return 1;
Copy link
Member

@stephentoub stephentoub Apr 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
// If this is an isMatch call we are done, since a match is now known to exist.
if (isMatch)
return 1;
// If this is an IsMatch call, we are done, since a match is now known to exist.
if (isMatch)
return 1;

i = pos;
finalStatePosition = 0;
return -1;
Debug.Fail("No nullable state found in the set of end states");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
Debug.Fail("No nullable state found in the set of end states");
Debug.Fail("No nullable state found in the set of end states");

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I only had nits, so since CI is green, I'm going to merge, and we can address any desired nits subsequently.

@stephentoub stephentoub merged commit 11f02ad into dotnet:main Apr 19, 2022
directhex pushed a commit to directhex/runtime that referenced this pull request Apr 21, 2022
* Switch to 2-phase matching in NonBacktracking

First phase now finds the true match end position.
The implicit .* is now a lazy .*? to prioritize the earliest match.
Third phase is now only run for subcaptures, which no longer needs to
find match end position.
Remove counter optimization that no longer applies with OrderedOr.
Fix a problem in SymbolicRegexInfo where begin/end anchors were
marked as line anchors.
Also remove dead fields from SymbolicRegexInfo.
Fix captures not being handled for empty matches at start of input.

* Improve comments for NonBacktracking

Especially fix comments for the new 2-phase match generation algorithm.

* Add a failing test for the earlier NonBacktracking

* Avoid transitions to deadends for capuring NFA
@dakersnar
Copy link
Contributor

There were regex related regressions: dotnet/perf-autofiling-issues#4731

As far as I can tell this commit seems to be the cause, but I'm not certain.

@stephentoub
Copy link
Member

As far as I can tell this commit seems to be the cause, but I'm not certain.

It almost certainly is.

@olsaarik, this was necessary for correctness, but did we expect this much of a hit?

@AndyAyersMS
Copy link
Member

Possibly another regression here: dotnet/perf-autofiling-issues#4756

@olsaarik
Copy link
Contributor Author

@stephentoub, the biggest known cause of regressions is that while making this PR I realized that the counter subsumption optimization no longer holds under the backtracking simulating semantics and thus had to remove it. Losing that can for sure have significant impact on some patterns with large counters, which will have a larger DFA state space.

Should investigate still though, I did reorganize some of the innermost matching loop logic, but not in a way that I'd expect to cause any measurable regressions.

@stephentoub
Copy link
Member

Thanks, @olsaarik. That's good to know, but I don't think it explains this, as some of the worst regressions here don't involve any counters.

@AndyAyersMS
Copy link
Member

Possible improvement (windows x64): dotnet/perf-autofiling-issues#4960

@stephentoub
Copy link
Member

Possible improvement (windows x64): dotnet/perf-autofiling-issues#4960

Hmm, it's unlikely to be this. That links shows improvements for both RegexOptions.None and RegexOptions.NonBacktracking, and this PR would have only impacted the latter.

@AndyAyersMS
Copy link
Member

Possible improvement (windows x64): dotnet/perf-autofiling-issues#4960

Hmm, it's unlikely to be this. That links shows improvements for both RegexOptions.None and RegexOptions.NonBacktracking, and this PR would have only impacted the latter.

newplot - 2022-04-27T151523 267
newplot - 2022-04-27T151548 195

There is also dotnet/perf-autofiling-issues#4939.
newplot - 2022-04-27T151820 163
newplot - 2022-04-27T151826 175

Range of diffs for both is 4881a63...eb9dd67 so perhaps #67941 is responsible?

@stephentoub
Copy link
Member

stephentoub commented Apr 27, 2022

There is also dotnet/perf-autofiling-issues#4939.

Those are both NonBacktracking and so could be this PR. It's that Options None one that's confusing. (#67941 isn't used by regex yet.)

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Apr 27, 2022

More regressions (all non-backtracking):

(note these may be dups, we seem to be double-filing some cases)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RegexOptions.NonBacktracking needs to handle capture groups when startat == input.Length
4 participants