-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Two-phase matching algorithm for NonBacktracking #68199
Conversation
First phase now finds the true match end position. The implicit .* is now a lazy .*? to prioritize the earliest match. Third phase is now only run for subcaptures, which no longer needs to find match end position. Remove counter optimization that no longer applies with OrderedOr. Fix a problem in SymbolicRegexInfo where begin/end anchors were marked as line anchors. Also remove dead fields from SymbolicRegexInfo. Fix captures not being handled for empty matches at start of input.
Especially fix comments for the new 2-phase match generation algorithm.
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsThis PR changes the The new algorithm has the first phase walk to the end of the preferred match, which is now possible with the backtracking simulating derivatives that we added recently. Then the second phase is the same as before, walking back to the matching starting point. The third phase is gone for all cases except for patterns with subcaptures. That third phase is a bit simpler than before too, since it doesn't have to find the end of a match (which is already known at that point). Critically, the "dot starred" version of a regex One optimization that we lose is the counter-subsumption one. This was already invalid before, but only hit the unit tests with these changes. The problem is that with the new ordered alternation combining I also performed a general pass of cleaning up the matching code. To summarize the changes in the PR:
|
if (startsWithLineAnchor) | ||
{ | ||
i |= ContainsLineAnchorMask; | ||
|
||
if (startsWithLineAnchor) | ||
{ | ||
i |= StartsWithLineAnchorMask; | ||
} | ||
i |= StartsWithLineAnchorMask; | ||
} | ||
|
||
if (startsWithBoundaryAnchor) | ||
if (startsWithLineAnchor || startsWithSomeAnchor) | ||
{ | ||
i |= StartsWithBoundaryAnchorMask; | ||
i |= StartsWithSomeAnchorMask; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
To retain the style of the rest of the checks, this could be:
if (startsWithLineAnchor || startsWithSomeAnchor)
{
i |= StartsWithSomeAnchorMask;
if (startsWithLineAnchor)
{
i |= StartsWithLineAnchorMask
}
}
@@ -210,7 +210,7 @@ private SymbolicRegexMatcher(SymbolicRegexNode<TSet> rootNode, int captureCount, | |||
|
|||
// Create the dot-star pattern (a concatenation of any* with the original pattern) | |||
// and all of its initial states. | |||
_dotStarredPattern = _builder.CreateConcat(_builder._anyStar, _pattern); | |||
_dotStarredPattern = _builder.CreateConcat(_builder._anyStarLazy, _pattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should we rename _dotStarredPattern to _lazyDotStarredPattern?
int mintermId = c == '\n' && i == input.Length - 1 && TStateHandler.StartsWithLineAnchor(ref state) ? | ||
builder._minterms!.Length : // mintermId = minterms.Length represents \Z (last \n) | ||
builder._minterms!.Length : // mintermId = minterms.Length represents an \n at the very end of input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
builder._minterms!.Length : // mintermId = minterms.Length represents an \n at the very end of input | |
builder._minterms!.Length : // mintermId = minterms.Length represents a \n at the very end of input |
@@ -374,260 +363,207 @@ public SymbolicMatch FindMatch(bool isMatch, ReadOnlySpan<char> input, int start | |||
// start position. That tells us the actual starting position of the match. We can skip this phase if we | |||
// recorded a fixed-length marker for the portion of the pattern that matched, as we can then jump that | |||
// exact number of positions backwards. Continuing the previous example, phase 2 will walk backwards from | |||
// that first b until it finds the 6th a: aaaaaaaaaab. | |||
// that last b until it finds the 4th a: aaabbbc. | |||
int matchStart; | |||
if (matchStartLengthMarker >= 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are markers working again? They weren't previously:
#65532
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Show resolved
Hide resolved
currentState = new CurrentState(nfaState); | ||
if (i < inputForInnerLoop.Length) | ||
{ | ||
// We failed to transition. Upgrade to DFA mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// We failed to transition. Upgrade to DFA mode. | |
// We failed to transition. Upgrade to NFA mode. |
// If this is an isMatch call we are done, since a match is now known to exist. | ||
if (isMatch) | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
// If this is an isMatch call we are done, since a match is now known to exist. | |
if (isMatch) | |
return 1; | |
// If this is an IsMatch call, we are done, since a match is now known to exist. | |
if (isMatch) | |
return 1; |
i = pos; | ||
finalStatePosition = 0; | ||
return -1; | ||
Debug.Fail("No nullable state found in the set of end states"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
Debug.Fail("No nullable state found in the set of end states"); | |
Debug.Fail("No nullable state found in the set of end states"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I only had nits, so since CI is green, I'm going to merge, and we can address any desired nits subsequently.
* Switch to 2-phase matching in NonBacktracking First phase now finds the true match end position. The implicit .* is now a lazy .*? to prioritize the earliest match. Third phase is now only run for subcaptures, which no longer needs to find match end position. Remove counter optimization that no longer applies with OrderedOr. Fix a problem in SymbolicRegexInfo where begin/end anchors were marked as line anchors. Also remove dead fields from SymbolicRegexInfo. Fix captures not being handled for empty matches at start of input. * Improve comments for NonBacktracking Especially fix comments for the new 2-phase match generation algorithm. * Add a failing test for the earlier NonBacktracking * Avoid transitions to deadends for capuring NFA
There were regex related regressions: dotnet/perf-autofiling-issues#4731 As far as I can tell this commit seems to be the cause, but I'm not certain. |
It almost certainly is. @olsaarik, this was necessary for correctness, but did we expect this much of a hit? |
Possibly another regression here: dotnet/perf-autofiling-issues#4756 |
@stephentoub, the biggest known cause of regressions is that while making this PR I realized that the counter subsumption optimization no longer holds under the backtracking simulating semantics and thus had to remove it. Losing that can for sure have significant impact on some patterns with large counters, which will have a larger DFA state space. Should investigate still though, I did reorganize some of the innermost matching loop logic, but not in a way that I'd expect to cause any measurable regressions. |
Thanks, @olsaarik. That's good to know, but I don't think it explains this, as some of the worst regressions here don't involve any counters. |
Possible improvement (windows x64): dotnet/perf-autofiling-issues#4960 |
Hmm, it's unlikely to be this. That links shows improvements for both RegexOptions.None and RegexOptions.NonBacktracking, and this PR would have only impacted the latter. |
There is also dotnet/perf-autofiling-issues#4939. Range of diffs for both is 4881a63...eb9dd67 so perhaps #67941 is responsible? |
Those are both NonBacktracking and so could be this PR. It's that Options None one that's confusing. (#67941 isn't used by regex yet.) |
More regressions (all non-backtracking):
(note these may be dups, we seem to be double-filing some cases) |
This PR changes the
RegexOptions.NonBacktracking
matching algorithm to a mostly 2-phase one, fixing #65607. Currently the first phase only walks to the first final state position, relying on a third phase to extend the match as far as possible. Now that NonBacktracking is aiming to fully match the semantics of the backtracking engines, this is problematic for some patterns. For example,.{5}Foo|Bar
onFooBarFoo
should match onooBarFoo
, but with the current algorithm it producesBar
, the problem being that the first phase finds the first final state at the end ofBar
and then fails to walk backwards form that to the starting point of the preferred match.The new algorithm has the first phase walk to the end of the preferred match, which is now possible with the backtracking simulating derivatives that we added recently. Then the second phase is the same as before, walking back to the matching starting point. The third phase is gone for all cases except for patterns with subcaptures. That third phase is a bit simpler than before too, since it doesn't have to find the end of a match (which is already known at that point).
Critically, the "dot starred" version of a regex
R
, which the first phase operates on is now using a lazy loop:.*?R
. With the backtracking simulation this causes a partial matchT
to appear on the left side of the core pattern:T|.*?R
. IfT
ever becomes nullable, then any lower priority parts of the derivative will disappear, allowing the matching procedure to reach a deadend state when all earlier-starting match candidates have been exhausted or extended as far as possible.One optimization that we lose is the counter-subsumption one. This was already invalid before, but only hit the unit tests with these changes. The problem is that with the new ordered alternation combining
.{0,1}|.{0,2}
into.{0,2}
loses the information that the shorter match is preferred. This PR removes the optimization.I also performed a general pass of cleaning up the matching code. To summarize the changes in the PR:
SymbolicRegexMatcher.cs
are now in the order they are called.SymbolicRegexInfo
didn't match for which anchors the special handling of an\n
at the very end of the input was triggered. This was probably introduced before, but surfaced in the new matching algorithm. Removed unused bits fromSymbolicRegexInfo
at the same time.