
Some more cleanup to regex NonBacktracking #104766

Merged: 15 commits merged into dotnet:main from the moreregexfixes branch on Jul 12, 2024

Conversation

stephentoub (Member)

Follow-up to #102655.

Some of this is cleanup of things introduced in that PR, some is cleanup to things pre-existing.

@veanes, @ieviev, I'd appreciate help reviewing. Probably easiest to go commit by commit.

- It's only ever written and not actually used for anything.
- Multiple XxDfa / XxNfa methods took a TStateHandler, but it was only ever DfaStateHandler for XxDfa or NfaStateHandler for XxNfa. We can just use the types directly in those methods rather than generically parameterizing. Doing that revealed that all but one of the members of IStateHandler weren't needed on the interface, and removing those in turn exposed a bunch of dead code on DfaStateHandler/NfaStateHandler, which was removed, along with unused arguments to some methods.
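
For illustration, here is a minimal, self-contained sketch of the de-generification described above. All names below are invented and do not mirror the real signatures (the actual IStateHandler, DfaStateHandler, and NfaStateHandler live in the NonBacktracking engine); the point is only the shape of the change: when a generic method is only ever instantiated with one handler type, the generic parameter can be dropped and the concrete type used directly, which in turn reveals which interface members are still needed.

```csharp
// Sketch only: invented names, not the real NonBacktracking types.
public interface IStateHandler
{
    static abstract bool TryTransition(ref int stateId, int mintermId);
}

public struct DfaStateHandler : IStateHandler
{
    public static bool TryTransition(ref int stateId, int mintermId) { stateId++; return true; }
}

public static class Matcher
{
    // Before: generic over TStateHandler, even though this DFA path only ever
    // received DfaStateHandler (and the NFA path only ever received NfaStateHandler).
    public static bool StepDfaGeneric<TStateHandler>(ref int stateId, int mintermId)
        where TStateHandler : struct, IStateHandler
        => TStateHandler.TryTransition(ref stateId, mintermId);

    // After: use the concrete handler directly. Once all such call sites are
    // de-genericized, members no longer called through the interface can be
    // dropped from IStateHandler, exposing dead code on the handlers themselves.
    public static bool StepDfa(ref int stateId, int mintermId)
        => DfaStateHandler.TryTransition(ref stateId, mintermId);
}
```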

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@veanes (Contributor) left a comment:

Using object instead of an int pair is a reasonable change; having more than 255 minterms is super rare to begin with.

@veanes (Contributor) left a comment:

I went through all the commits, one by one. All of the changes made total sense to me, both the name changes and the code edits: eliminating dead code, removing unused parameters, and reordering parameters to place out parameters last. All edits looked functionally correct to me and are good improvements.

veanes (Contributor) commented Jul 12, 2024:

I went through all the commits, one by one. All of the changes made total sense to me, both the name changes and the code edits: eliminating dead code, removing unused parameters, and reordering parameters to place out parameters last. All edits looked functionally correct to me and are good improvements. I accidentally pressed approve too early, before writing the whole review, as I thought the approval applied to the particular commit I was looking at rather than to the whole PR; but that aside, I am confident in all of the edits in the whole PR.

ieviev (Contributor) commented Jul 12, 2024:

Everything looks good to me. In the initial PR I was originally looking at intermediate-state vectorization, which I later figured is not really that beneficial right now, so it left a lot of dead code behind, e.g. the Initial state / Accelerated state parts. The initial state candidates in the original loop were only ever used to limit the distance the reversal can go, but I'm fairly sure that was a sanity check, since it's not really required by the theory.

stephentoub (Member, Author) commented:

Thank you, both.

@ieviev, thanks again for the initial round of updates. Are there other meaningful improvements you think should be made? I know there are remaining concerns about when we choose to use the find optimizations for the starting state, but beyond that? You mention internal vectorization not being worthwhile right now... is that a general statement, or is there something you see blocking it from being useful that could be improved first?

stephentoub merged commit b54bfdd into dotnet:main Jul 12, 2024
83 checks passed
stephentoub deleted the moreregexfixes branch July 12, 2024 10:38
ieviev (Contributor) commented Jul 12, 2024:

@stephentoub

With internal vectorization, my main concern is that people don't really use regular expressions where it would be beneficial. It'd be hard to decide on a subset where internal vectorization doesn't cause noticeable overhead and provides real value in the patterns people actually use. In the intersection/complement engine, I saw matching paragraphs or large chunks of text as an intended use case, which I optimized for. So my concern here was that it would often be an expensive no-op that doesn't always result in a state transition: even the initial FindOptimizations sometimes has larger overhead than a DFA walk, and if it's also done internally, where it runs even more often, that could result in bad performance.

What I think would be a very worthwhile direction is using derivatives for prefix optimizations. That allows computing reliable prefixes like the ones below for any pattern. I have not yet worked out the best way to use these prefixes once they're computed, but the fact that something like this is possible is extremely interesting. Perhaps something like pre-constructing 2-3 vector operations for the whole pattern and using each for some bitwise operation or greater/less-than comparison. In some cases the entire match could be confirmed from these vector operations alone, e.g. [0-9]{8} (or even [0-9]{2}-[0-9]{2}-[0-9]{4}) could be confirmed with two vector operations for greater/less than. And if not the whole match, then the same applies to a prefix.
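
As a rough illustration of that last point (a hand-written sketch, not anything the engine emits; the type and method names are invented), eight consecutive ASCII digits can be confirmed with two vector range comparisons:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class DigitBlockSketch
{
    // Confirms that the first 8 chars of 'span' are all ASCII digits ('0'-'9'),
    // i.e. a full match for [0-9]{8}, using two vector range comparisons.
    public static bool IsEightAsciiDigits(ReadOnlySpan<char> span)
    {
        if (span.Length < 8)
        {
            return false;
        }

        // Load 8 UTF-16 code units (8 x 16 bits = 128 bits).
        Vector128<ushort> v = Vector128.LoadUnsafe(
            ref MemoryMarshal.GetReference(MemoryMarshal.Cast<char, ushort>(span)));

        // Comparison 1: every element >= '0'. Comparison 2: every element <= '9'.
        return Vector128.GreaterThanOrEqualAll(v, Vector128.Create((ushort)'0')) &&
               Vector128.LessThanOrEqualAll(v, Vector128.Create((ushort)'9'));
    }
}
```

Presumably the date-like example would work the same way by using per-element lower/upper bound vectors, with '-' as both bounds in the dash positions.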

Some examples of patterns and their reverse prefixes:

[screenshots of example patterns and their computed reverse prefixes]

github-actions bot locked and limited conversation to collaborators Aug 12, 2024