
Some more cleanup to regex NonBacktracking #104766

Merged: 15 commits merged into dotnet:main from the moreregexfixes branch on Jul 12, 2024

Conversation

stephentoub (Member)

Follow-up to #102655.

Some of this is cleanup of things introduced in that PR, some is cleanup to things pre-existing.

@veanes, @ieviev, I'd appreciate help reviewing. Probably easiest to go commit by commit.

- It's only ever written and not actually used for anything.
- Multiple XxDfa / XxNfa methods took a TStateHandler, but it was only ever DfaStateHandler for XxDfa or NfaStateHandler for XxNfa. We can just use the types directly in those methods rather than generically parameterizing. Doing that revealed that all but one of the members of IStateHandler weren't needed on the interface, and removing those in turn exposed a bunch of dead code on DfaStateHandler/NfaStateHandler, which was removed, along with unused arguments to some methods.
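
For illustration, here is a minimal, self-contained sketch of the de-generification described above. All names below are invented and do not mirror the real signatures (the actual IStateHandler, DfaStateHandler, and NfaStateHandler live in the NonBacktracking engine); the point is only the shape of the change: when a generic method is only ever instantiated with one handler type, the generic parameter can be dropped and the concrete type used directly, which in turn reveals which interface members are still needed.

```csharp
// Sketch only: invented names, not the real NonBacktracking types.
public interface IStateHandler
{
    static abstract bool TryTransition(ref int stateId, int mintermId);
}

public struct DfaStateHandler : IStateHandler
{
    public static bool TryTransition(ref int stateId, int mintermId) { stateId++; return true; }
}

public static class Matcher
{
    // Before: generic over TStateHandler, even though this DFA path only ever
    // received DfaStateHandler (and the NFA path only ever received NfaStateHandler).
    public static bool StepDfaGeneric<TStateHandler>(ref int stateId, int mintermId)
        where TStateHandler : struct, IStateHandler
        => TStateHandler.TryTransition(ref stateId, mintermId);

    // After: use the concrete handler directly. Once all such call sites are
    // de-genericized, members no longer called through the interface can be
    // dropped from IStateHandler, exposing dead code on the handlers themselves.
    public static bool StepDfa(ref int stateId, int mintermId)
        => DfaStateHandler.TryTransition(ref stateId, mintermId);
}
```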

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@veanes (Contributor) left a comment:

Using object instead of an int pair is a reasonable change; having more than 255 minterms is super rare to begin with.

@veanes (Contributor) left a comment:

I went through all the commits, one by one. All of the changes made total sense to me, both the name changes and the code edits: eliminating dead code, removing unused parameters, and reordering parameters to place out parameters last. All edits looked functionally correct to me and are good improvements.

veanes (Contributor) commented Jul 12, 2024:

I went through all the commits, one by one. All of the changes made total sense to me, both the name changes and the code edits: eliminating dead code, removing unused parameters, and reordering parameters to place out parameters last. All edits looked functionally correct to me and are good improvements. I accidentally pressed approve too early, before writing the whole review, as I thought the approval applied to the particular commit I was looking at rather than to the whole PR; but that aside, I am confident in all of the edits in the whole PR.

ieviev (Contributor) commented Jul 12, 2024:

Everything looks good to me. In the initial PR I was originally looking at intermediate-state vectorization, which I later figured is not really that beneficial right now, so it left a lot of dead code behind, e.g. the Initial state / Accelerated state parts. The initial state candidates in the original loop were only ever used to limit the distance the reversal can go, but I'm fairly sure that was a sanity check, since it's not really required by the theory.

stephentoub (Member, Author) commented:

Thank you, both.

@ieviev, thanks again for the initial round of updates. Are there other meaningful improvements you think should be made? I know there are remaining concerns about when we choose to use the find optimizations for the starting state, but beyond that? You mention internal vectorization not being worthwhile right now... is that a general statement, or is there something you see blocking it from being useful that could be improved first?

stephentoub merged commit b54bfdd into dotnet:main Jul 12, 2024
83 checks passed
stephentoub deleted the moreregexfixes branch July 12, 2024 10:38
ieviev (Contributor) commented Jul 12, 2024:

@stephentoub

With internal vectorization, my main concern is that people don't really use regular expressions where it would be beneficial. It'd be hard to decide on a subset where internal vectorization doesn't cause noticeable overhead and provides real value in the patterns people actually use. In the intersection/complement engine, I saw matching paragraphs or large chunks of text as an intended use case, which I optimized for. So my concern here was that it would often be an expensive no-op that doesn't always result in a state transition: even the initial FindOptimizations sometimes has larger overhead than a DFA walk, and if it's also done internally, where it runs even more often, that could result in bad performance.

What I think would be a very worthwhile direction is using derivatives for prefix optimizations. That allows computing reliable prefixes like the ones below for any pattern. I have not yet worked out the best way to use these prefixes once they're computed, but the fact that something like this is possible is extremely interesting. Perhaps something like pre-constructing 2-3 vector operations for the whole pattern and using each for some bitwise operation or greater/less-than comparison. In some cases the entire match could be confirmed from these vector operations alone, e.g. [0-9]{8} (or even [0-9]{2}-[0-9]{2}-[0-9]{4}) could be confirmed with two vector operations for greater/less than. And if not the whole match, then the same applies to a prefix.
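
As a rough illustration of that last point (a hand-written sketch, not anything the engine emits; the type and method names are invented), eight consecutive ASCII digits can be confirmed with two vector range comparisons:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class DigitBlockSketch
{
    // Confirms that the first 8 chars of 'span' are all ASCII digits ('0'-'9'),
    // i.e. a full match for [0-9]{8}, using two vector range comparisons.
    public static bool IsEightAsciiDigits(ReadOnlySpan<char> span)
    {
        if (span.Length < 8)
        {
            return false;
        }

        // Load 8 UTF-16 code units (8 x 16 bits = 128 bits).
        Vector128<ushort> v = Vector128.LoadUnsafe(
            ref MemoryMarshal.GetReference(MemoryMarshal.Cast<char, ushort>(span)));

        // Comparison 1: every element >= '0'. Comparison 2: every element <= '9'.
        return Vector128.GreaterThanOrEqualAll(v, Vector128.Create((ushort)'0')) &&
               Vector128.LessThanOrEqualAll(v, Vector128.Create((ushort)'9'));
    }
}
```

Presumably the date-like example would work the same way by using per-element lower/upper bound vectors, with '-' as both bounds in the dash positions.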

Some examples of patterns and their reverse prefixes:

[screenshots of example patterns and their computed reverse prefixes]

github-actions bot locked and limited conversation to collaborators Aug 12, 2024