Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lock protect nullability cache of symbolic regex node #60942

Merged
merged 5 commits into from
Nov 3, 2021

Conversation

veanes
Copy link
Contributor

@veanes veanes commented Oct 27, 2021

Added lock to protect SymbolicRegexNode._nullabilityCache that stores conditional nullability of a node for a given context, for thread-safety. This computation is not the common case as it only applies when the node (regex) starts with an anchor and can potentially be nullable (accept the empty string). Initial thought was to special case nullability for context 0 using a field but this is already covered in most common cases when the regex is neither nullable nor can be nullable that is checked before. The cache could potentially be moved to the builder as a shared cache to avoid the caches in the nodes but would then create bigger probability of thread contention at the builder level.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 27, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.RegularExpressions and removed community-contribution Indicates that the PR has been added by a community member labels Oct 27, 2021
@ghost
Copy link

ghost commented Oct 27, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Added lock to protect SymbolicRegexNode._nullabilityCache that stores conditional nullability of a node for a given context, for thread-safety. This computation is not the common case as it only applies when the node (regex) starts with an anchor and can potentially be nullable (accept the empty string). Initial thought was to special case nullability for context 0 using a field but this is already covered in most common cases when the regex is neither nullable nor can be nullable that is checked before. The cache could potentially be moved to the builder as a shared cache to avoid the caches in the nodes but would then create bigger probability of thread contention at the builder level.

Author: veanes
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@veanes veanes requested a review from eerhardt October 27, 2021 23:23
@veanes veanes force-pushed the fixThreadSafetyInSymbolicRegex branch from d681d61 to 6e4e4d8 Compare October 27, 2021 23:39
@stephentoub
Copy link
Member

@veanes, I tried running this with our perf tests in dotnet/performance, just with NonBacktracking subbed in for the options. There are a few notable regressions. Is that expected?

Method Toolchain Options Mean Error StdDev Median Min Max Ratio
Email_IsMatch D:\coreclrtest\pr\corerun.exe 1024 277.66 ns 1.974 ns 1.750 ns 277.93 ns 274.85 ns 280.76 ns 1.16
Email_IsMatch d:\coreclrtest\main\corerun.exe 1024 239.41 ns 2.505 ns 2.092 ns 239.24 ns 235.91 ns 243.58 ns 1.00
Email_IsNotMatch D:\coreclrtest\pr\corerun.exe 1024 282.41 ns 2.858 ns 2.533 ns 282.48 ns 277.18 ns 285.97 ns 1.16
Email_IsNotMatch d:\coreclrtest\main\corerun.exe 1024 243.20 ns 2.135 ns 1.783 ns 242.66 ns 240.78 ns 247.10 ns 1.00
Date_IsMatch D:\coreclrtest\pr\corerun.exe 1024 213.16 ns 1.070 ns 0.949 ns 213.15 ns 211.64 ns 215.33 ns 1.22
Date_IsMatch d:\coreclrtest\main\corerun.exe 1024 175.06 ns 1.932 ns 1.713 ns 174.52 ns 172.81 ns 178.86 ns 1.00
Date_IsNotMatch D:\coreclrtest\pr\corerun.exe 1024 191.14 ns 3.625 ns 3.560 ns 189.36 ns 185.88 ns 198.14 ns 1.20
Date_IsNotMatch d:\coreclrtest\main\corerun.exe 1024 159.08 ns 0.988 ns 0.924 ns 159.22 ns 157.20 ns 160.42 ns 1.00
MatchesBoundary D:\coreclrtest\pr\corerun.exe 1024 45,933.54 ns 449.868 ns 375.660 ns 46,031.74 ns 45,154.15 ns 46,395.06 ns 1.10
MatchesBoundary d:\coreclrtest\main\corerun.exe 1024 41,639.27 ns 398.737 ns 372.979 ns 41,735.20 ns 40,851.51 ns 42,095.63 ns 1.00

@veanes
Copy link
Contributor Author

veanes commented Oct 28, 2021

I would not expect any noticeable regressions, because I was expecting the change (locking) not to really affect the hot-path. If I understand correctly, 22% slower for example for Date_IsMatch? What does the regex look like in this case -- I would expect it to make heavy use anchors (\b ?) that then show up in checking nullability that causes locking. If so, we could still create fields for the most common context dependent nullability checks before using the cache and thus avoid locks.

@veanes
Copy link
Contributor Author

veanes commented Oct 28, 2021

OK, indeed they do indeed use \b a lot at least in two cases, such as @"\b\w{10,}\b", so locking will introduce overhead there and is apparently expensive enough. I'll add specialized fields to cover those cases to avoid locking, I'll push a commit into this PR. Then you could rerun the perf evaluation. I think will be 4 special cases, roughly corresponding to contexts (\w,\w), (\w,\W), (\W,\w) , (\W,\W). Perhaps also for other cases (email example uses ^ and $, as \A and \Z -- I think -- not as line anchors that would require multiline option too)

@veanes
Copy link
Contributor Author

veanes commented Oct 28, 2021

Alternative would be to use a nested array nullability[][] of size 5x5 in each node.
nullability[prev][next] is 0 initially meaning unknown.
value 1 means true and value 2 means false. I'm assuming the operation

if (nullability[prev][next] == 0) nullability[prev][next] = ComputeNullability();
return nullability[prev][next] == 1;

is thread-safe without any locks.

I believe that would be the more efficient solution that is also uniform for all cases and avoids locking if I'm right about thread safety above.

@veanes
Copy link
Contributor Author

veanes commented Oct 28, 2021

Also, I forgot to add above, the nullability array would be null in those cases where it is irrelevant and never used, namely, when the node can never be nullable or is always nullable, these are by far the most common cases.

@veanes
Copy link
Contributor Author

veanes commented Oct 28, 2021

nullability array could also be flat, then it would need size 64 (3 bits per kind) then the lookup would directly use context (that is exactly (next << 3 | prev)).

@veanes veanes force-pushed the fixThreadSafetyInSymbolicRegex branch from 6e4e4d8 to 4564b04 Compare October 28, 2021 22:26
@veanes
Copy link
Contributor Author

veanes commented Oct 28, 2021

@stephentoub, I updated the fix according to my last comment. I'm therefore asking for a re-review. It would be interesting to know the performance comparison after this change, assuming it is still thread-safe -- which it should be, as it replaces the earlier nullability dictionary with an array and gets rid of locking. I'm using byte[64] to represent the values (to avoid bool?[64]) where I use 0 as UndefByte because it is the default value.

@veanes veanes force-pushed the fixThreadSafetyInSymbolicRegex branch from 4564b04 to 3052ed0 Compare October 29, 2021 06:30
@veanes
Copy link
Contributor Author

veanes commented Oct 29, 2021

@stephentoub : I took care of the comments and also simplified the initial test in IsNullableFor in the last commit. Once the checks pass the code should be merged into main. Just wanted to note that I don't have merge permission here yet, which I don't really mind (but I also cannot choose reviewers).

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@veanes veanes force-pushed the fixThreadSafetyInSymbolicRegex branch 2 times, most recently from 94e6f8d to 0c1aa20 Compare November 2, 2021 22:15
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants