Enable regex to use IndexOf(..., OrdinalIgnoreCase) for prefix searching #85438

stephentoub · 2023-04-27T02:19:19Z

As one of its many ways of finding the next possible match starting location, Regex recognizes a string known to start the expression and uses IndexOf to find it. With this change, it can also do so for OrdinalIgnoreCase. With improvements to IndexOf(..., OrdinalIgnoreCase), this now yields significantly faster searches through longer inputs, in addition to leading to simpler code in source generated regexes.
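To make the idea concrete, here is a minimal sketch of prefix-based candidate finding with a case-insensitive `IndexOf`. It is not the code Regex emits: the helper name `FindNextCandidate`, the sample text, and the prefix `"sherlock"` are all hypothetical, chosen only to illustrate the technique. The only library call it relies on is the `ReadOnlySpan<char>` overload of `IndexOf` that accepts a `StringComparison`.

```csharp
using System;

static class PrefixSearchSketch
{
    // Hypothetical helper (not part of Regex): returns the index of the next
    // position at or after `start` where `prefix` occurs, ignoring case,
    // or -1 if there is no further occurrence.
    public static int FindNextCandidate(ReadOnlySpan<char> input, int start, ReadOnlySpan<char> prefix)
    {
        int i = input.Slice(start).IndexOf(prefix, StringComparison.OrdinalIgnoreCase);
        return i < 0 ? -1 : start + i;
    }

    public static void Main()
    {
        // Illustrative input; a real engine would run the rest of the pattern
        // at each candidate position instead of just counting.
        string text = "The quick brown fox met Sherlock and SHERLOCK Holmes.";
        string prefix = "sherlock";

        int pos = 0, count = 0;
        while ((pos = FindNextCandidate(text, pos, prefix)) >= 0)
        {
            count++;
            pos += prefix.Length; // resume searching after this candidate
        }

        Console.WriteLine(count); // prints 2
    }
}
```

The point of the sketch is that a single vectorizable call skips everything between candidates, so the matcher only does per-position work where the known prefix actually appears.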

With #85437, here's the benchmark https://github.com/dotnet/performance/blob/6dccc9979e9a99ebabee2a9b8b9e657c08c3f4a0/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L86 on my machine:

| Method | Toolchain | Options | Mean | Error | StdDev | Median | Min | Max | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Count | \main\corerun.exe | IgnoreCase, Compiled | 1,708.2 ms | 17.58 ms | 2.72 ms | 1,708.6 ms | 1,704.7 ms | 1,711.2 ms | 1.00 |
| Count | \pr\corerun.exe | IgnoreCase, Compiled | 769.1 ms | 14.94 ms | 3.88 ms | 766.7 ms | 765.8 ms | 773.8 ms | 0.45 |

Note that without #85437, this PR will make some usage slower. The compiler / source generator already takes the same approach that IndexOf(..., OrdinalIgnoreCase) takes today, searching for a set of characters with IndexOfAny, but it frequently picks a better set of characters to search for based on frequency analysis. So we shouldn't merge this without the other PR (though this does have other benefits, like simpler codegen).
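The trade-off described in that note can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual source generator output or the real `IndexOf(..., OrdinalIgnoreCase)` implementation: the helper `IndexOfIgnoreCase` and its `anchor` parameter are hypothetical, and the "frequency analysis" is reduced to the caller choosing which character of the prefix to scan for.

```csharp
using System;

static class IndexOfAnySketch
{
    // Hypothetical helper: finds `prefix` ignoring case by scanning with
    // IndexOfAny for either casing of one chosen character (prefix[anchor]),
    // then verifying the whole prefix around each hit.
    public static int IndexOfIgnoreCase(ReadOnlySpan<char> input, string prefix, int anchor)
    {
        char lower = char.ToLowerInvariant(prefix[anchor]);
        char upper = char.ToUpperInvariant(prefix[anchor]);

        int pos = anchor; // the anchor character cannot occur before this index in a match
        while (pos < input.Length)
        {
            int i = input.Slice(pos).IndexOfAny(lower, upper);
            if (i < 0) return -1;

            int start = pos + i - anchor; // where the prefix would have to begin
            if (start + prefix.Length <= input.Length &&
                input.Slice(start, prefix.Length).Equals(prefix, StringComparison.OrdinalIgnoreCase))
            {
                return start;
            }

            pos += i + 1; // false positive: keep scanning after this hit
        }
        return -1;
    }

    public static void Main()
    {
        string text = "She sells seashells; Sherlock solves cases.";

        // Anchoring on 's' (index 0) stops at every 's' or 'S' in the text,
        // while anchoring on 'k' (index 7) stops only once here. Both calls
        // find the same match; they differ in how many candidates get verified.
        Console.WriteLine(IndexOfIgnoreCase(text, "sherlock", 0));
        Console.WriteLine(IndexOfIgnoreCase(text, "sherlock", 7));
    }
}
```

Choosing a rarer character to scan for means fewer false positives to verify, which is the kind of advantage the note attributes to the generator's frequency-based choice.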

ghost · 2023-04-27T02:19:29Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	8.0.0

joperezr

LGTM, sorry for the delay.

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

stephentoub added the area-System.Text.RegularExpressions and tenet-performance (Performance related issue) labels Apr 27, 2023

stephentoub added this to the 8.0.0 milestone Apr 27, 2023

stephentoub requested review from joperezr and MihaZupan April 27, 2023 02:19

ghost assigned stephentoub Apr 27, 2023

stephentoub added 2 commits April 27, 2023 06:20

Merge branch 'dotnet:main' into regexcaseinsensitiveprefix

e1facae

Merge branch 'dotnet:main' into regexcaseinsensitiveprefix

f0b7e27

MihaZupan approved these changes May 1, 2023

stephentoub merged commit b4ecf10 into dotnet:main May 1, 2023

stephentoub deleted the regexcaseinsensitiveprefix branch May 1, 2023 16:56

joperezr reviewed May 2, 2023

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs (resolved)

kotlarmilos mentioned this pull request May 4, 2023

[Perf] Linux/x64: 6 Improvements on 5/1/2023 6:56:14 PM dotnet/perf-autofiling-issues#17249

Closed

AndyAyersMS mentioned this pull request May 4, 2023

[Perf] Windows/x64: 25 Improvements on 5/1/2023 6:56:14 PM dotnet/perf-autofiling-issues#17243

Closed

EgorBo mentioned this pull request May 4, 2023

[Perf] Windows/arm64: 12 Improvements on 5/1/2023 10:26:58 PM dotnet/perf-autofiling-issues#17394

Closed

kunalspathak mentioned this pull request May 16, 2023

[Perf] Windows/x64: 37 Improvements on 5/1/2023 6:56:14 PM dotnet/perf-autofiling-issues#17619

Closed

ghost locked as resolved and limited conversation to collaborators Jun 11, 2023
