Two-phase matching algorithm for NonBacktracking #68199

olsaarik · 2022-04-19T02:46:49Z

This PR changes the RegexOptions.NonBacktracking matching algorithm to a mostly 2-phase one, fixing #65607. Currently the first phase only walks to the position of the first final state, relying on a third phase to extend the match as far as possible. Now that NonBacktracking aims to fully match the semantics of the backtracking engines, this is problematic for some patterns. For example, .{5}Foo|Bar on FooBarFoo should match ooBarFoo, but the current algorithm produces Bar: the first phase finds the first final state at the end of Bar and then fails to walk backwards from there to the starting point of the preferred match.
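
The example above can be checked against the public Regex API. The following is a minimal sketch, not part of the PR: the class and helper names are made up for illustration, and it assumes a runtime where RegexOptions.NonBacktracking is available (.NET 7 or later). The backtracking result is the reference behaviour; the NonBacktracking result is expected to agree only on builds that include this change.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreferredMatchDemo
{
    // Hypothetical helper (not from the PR): runs the pattern from the description
    // with the given options and reports where the match starts and what it covers.
    public static (int Index, string Value) MatchWith(RegexOptions options)
    {
        Match m = Regex.Match("FooBarFoo", @".{5}Foo|Bar", options);
        return (m.Index, m.Value);
    }

    public static void Run()
    {
        // Backtracking engine: nothing matches at index 0, .{5}Foo matches at index 1.
        Console.WriteLine(MatchWith(RegexOptions.None)); // (1, ooBarFoo)

        // Expected to print the same tuple once the 2-phase algorithm is in place.
        Console.WriteLine(MatchWith(RegexOptions.NonBacktracking));
    }
}
```

Call PreferredMatchDemo.Run() from any entry point to compare the two engines.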

The new algorithm has the first phase walk to the end of the preferred match, which is now possible with the backtracking-simulating derivatives that we added recently. The second phase is the same as before, walking back to the start of the match. The third phase is gone for all cases except patterns with subcaptures. It is also a bit simpler than before, since it no longer has to find the end of the match (which is already known at that point).

Critically, the "dot-starred" version of a regex R, which the first phase operates on, now uses a lazy loop: .*?R. With the backtracking simulation this causes a partial match T to appear on the left side of the core pattern: T|.*?R. If T ever becomes nullable, any lower-priority parts of the derivative disappear, which lets the matching procedure reach a dead-end state once all earlier-starting match candidates have been exhausted or extended as far as possible.
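
The effect of the lazy prefix can be illustrated with the backtracking engine acting as an oracle. This is only a sketch under stated assumptions: it does not use the symbolic derivatives at all, the class and method names are hypothetical, and the pattern is assumed to be a plain pattern that can be wrapped in a non-capturing group. It shows that a match of .*?R anchored at the start of the input ends where the preferred match of R ends, which is the position the first phase is meant to find.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDotStarDemo
{
    // End position of the match the backtracking engine prefers for the pattern, or -1.
    public static int PreferredEnd(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Index + m.Length : -1;
    }

    // End position of the anchored, lazily dot-starred pattern, or -1.
    // (?s:.*?) lets the prefix skip any character, including \n, as few times as possible.
    public static int DotStarredEnd(string pattern, string input)
    {
        Match m = Regex.Match(input, @"\A(?s:.*?)(?:" + pattern + ")");
        return m.Success ? m.Index + m.Length : -1;
    }

    public static void Run()
    {
        Console.WriteLine(PreferredEnd(@".{5}Foo|Bar", "FooBarFoo"));  // 9
        Console.WriteLine(DotStarredEnd(@".{5}Foo|Bar", "FooBarFoo")); // 9
    }
}
```

The lazy prefix grows one character at a time, so the first success comes from the earliest start position at which R matches, with R's own priorities deciding the end. An approach that stops at the first final state would instead report 6 here, the end of Bar.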

One optimization that we lose is counter subsumption. It was already invalid before, but only started failing unit tests with these changes. The problem is that, with the new ordered alternation, combining .{0,1}|.{0,2} into .{0,2} loses the information that the shorter match is preferred. This PR removes the optimization.
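
The difference is easy to observe with the backtracking engine, whose semantics NonBacktracking is meant to reproduce. A small sketch (the class and helper names are illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CounterSubsumptionDemo
{
    // Hypothetical helper: text of the first match of the pattern in the input,
    // using the default (backtracking) engine.
    public static string FirstMatch(string pattern, string input) =>
        Regex.Match(input, pattern).Value;

    public static void Run()
    {
        // The first alternative wins as soon as it succeeds, so only one character is taken.
        Console.WriteLine(FirstMatch(@".{0,1}|.{0,2}", "ab")); // a

        // The "subsumed" form is greedy up to two characters and gives a different answer.
        Console.WriteLine(FirstMatch(@".{0,2}", "ab"));        // ab
    }
}
```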

I also performed a general pass of cleaning up the matching code. To summarize the changes in the PR:

Move to a 2-phase matching algorithm that fixes semantic differences.
Add new lazy dot star into the builder and use it for the dot-starred version of the pattern.
Add a test (the example above) that passes with the new algorithm, but didn't pass before.
The phase functions in SymbolicRegexMatcher.cs are now in the order they are called.
The matching functions had several places where nullability was checked. The matching loops are now written to first check for nullability, dead ends, and remaining input before transitioning, which allows some consolidation of the logic. This cleanup also fixes #65606 ("RegexOptions.NonBacktracking needs to handle capture groups when startat == input.Length") by removing the special case mentioned in the bug.
Found and fixed a bug where the "starts with line anchor" information in SymbolicRegexInfo did not agree with the set of anchors that trigger the special handling of a \n at the very end of the input. This was probably introduced earlier, but only surfaced with the new matching algorithm. Removed unused bits from SymbolicRegexInfo at the same time.
Remove the invalid counter optimization (details above).
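
The check-before-transition ordering mentioned in the list can be sketched as a generic loop. This is an assumption-laden illustration, not the code in SymbolicRegexMatcher.cs: the integer states, the delegates, and the toy automaton for ab+ are all invented for the example.

```csharp
using System;

public static class MatchLoopSketch
{
    // Returns the last position at which the current state was nullable, or -1 if none.
    public static int FindEnd(string input, int start, int initialState,
        Func<int, bool> isNullable, Func<int, bool> isDeadEnd, Func<int, char, int> next)
    {
        int state = initialState;
        int end = -1;
        for (int pos = start; ; pos++)
        {
            if (isNullable(state)) end = pos;                   // 1. nullability
            if (isDeadEnd(state) || pos == input.Length) break; // 2. dead end, 3. more input
            state = next(state, input[pos]);                    // only then take a transition
        }
        return end;
    }

    // Toy automaton for ab+ anchored at the start position:
    // 0 = start, 1 = saw 'a', 2 = saw 'a' plus at least one 'b' (nullable), 3 = dead end.
    public static int ToyNext(int state, char c) => (state, c) switch
    {
        (0, 'a') => 1,
        (1, 'b') => 2,
        (2, 'b') => 2,
        _ => 3,
    };

    public static int ToyEnd(string input, int start) =>
        FindEnd(input, start, 0, s => s == 2, s => s == 3, ToyNext);
}
```

With this ordering, ToyEnd("abbc", 0) returns 3 and ToyEnd("xb", 0) returns -1. In the sketch, a nullable initial state at start == input.Length still yields an empty match, because nullability is checked before the end-of-input test, so no separate special case is needed for that position.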

First phase now finds the true match end position. The implicit .* is now a lazy .*? to prioritize the earliest match. Third phase is now only run for subcaptures, which no longer needs to find match end position. Remove counter optimization that no longer applies with OrderedOr. Fix a problem in SymbolicRegexInfo where begin/end anchors were marked as line anchors. Also remove dead fields from SymbolicRegexInfo. Fix captures not being handled for empty matches at start of input.

Especially fix comments for the new 2-phase match generation algorithm.

ghost · 2022-04-19T02:46:57Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author:	olsaarik
Assignees:	olsaarik
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

stephentoub · 2022-04-19T17:54:03Z

...tem.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexInfo.cs

+                if (startsWithLineAnchor)
                {
-                    i |= ContainsLineAnchorMask;
-
-                    if (startsWithLineAnchor)
-                    {
-                        i |= StartsWithLineAnchorMask;
-                    }
+                    i |= StartsWithLineAnchorMask;
                }

-                if (startsWithBoundaryAnchor)
+                if (startsWithLineAnchor || startsWithSomeAnchor)
                {
-                    i |= StartsWithBoundaryAnchorMask;
+                    i |= StartsWithSomeAnchorMask;
                }


Nit:
To retain the style of the rest of the checks, this could be:

if (startsWithLineAnchor || startsWithSomeAnchor)
{
    i |= StartsWithSomeAnchorMask;
    if (startsWithLineAnchor)
    {
        i |= StartsWithLineAnchorMask;
    }
}

stephentoub · 2022-04-19T17:56:02Z