-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve regex reductions and code gen for some alternations #59903
Conversation
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsThis does a few things:
In a data set of 94,465 patterns from real-world use, the reductions in (1) and (2) above find improvements in 3,527 of them, for ~3.5% (~21% of the total have alternations). For (3), there were 729 expressions that benefit from the new switch-based code gen prior to the improvements from (1) and (2), and after (1) and (2), that number doubles to 1446. Example ( Beforeprotected override void Go()
{
string runtext = base.runtext!;
int runtextpos = base.runtextpos;
int runtextend = base.runtextend;
int originalruntextpos = runtextpos;
ref byte byteStr = ref global::System.Runtime.CompilerServices.Unsafe.NullRef<byte>();
char ch;
global::System.ReadOnlySpan<char> textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
// Alternate
{
int startingRunTextPos0 = runtextpos;
// Branch 0
{
// Multi "hola"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 4 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x61006C006F0068ul : 0x68006F006C006100ul))
{
goto L1;
}
}
runtextpos += 4;
textSpan = textSpan.Slice(4);
goto L0;
L1:
runtextpos = startingRunTextPos0;
textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
}
// Branch 1
{
// Multi "ciao"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 4 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6F006100690063ul : 0x6300690061006F00ul))
{
goto L2;
}
}
runtextpos += 4;
textSpan = textSpan.Slice(4);
goto L0;
L2:
runtextpos = startingRunTextPos0;
textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
}
// Branch 2
{
// Multi "bonjour"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 7 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6A006E006F0062ul : 0x62006F006E006A00ul) ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 8)) != (global::System.BitConverter.IsLittleEndian ? 0x75006Fu : 0x6F007500u) ||
textSpan[6] != 'r')
{
goto L3;
}
}
runtextpos += 7;
textSpan = textSpan.Slice(7);
goto L0;
L3:
runtextpos = startingRunTextPos0;
textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
}
// Branch 3
{
// Multi "hello"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 5 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6C006C00650068ul : 0x680065006C006C00ul) ||
textSpan[4] != 'o')
{
goto NoMatch;
}
}
runtextpos += 5;
textSpan = textSpan.Slice(5);
}
L0:
;
}
// Match
base.runtextpos = runtextpos;
base.Capture(0, originalruntextpos, runtextpos);
return;
// No match
NoMatch:
return;
} Afterprotected override void Go()
{
string runtext = base.runtext!;
int runtextpos = base.runtextpos;
int runtextend = base.runtextend;
int originalruntextpos = runtextpos;
ref byte byteStr = ref global::System.Runtime.CompilerServices.Unsafe.NullRef<byte>();
char ch;
global::System.ReadOnlySpan<char> textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
// Alternate
{
if ((uint)textSpan.Length < 1)
{
goto NoMatch;
}
switch (textSpan[0])
{
case 'h':
// Alternate
{
if ((uint)textSpan.Length < 2)
{
goto NoMatch;
}
switch (textSpan[1])
{
case 'o':
// Multi "la"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 4 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 4)) != (global::System.BitConverter.IsLittleEndian ? 0x61006Cu : 0x6C006100u))
{
goto NoMatch;
}
}
runtextpos += 4;
textSpan = textSpan.Slice(4);
break;
case 'e':
// Multi "llo"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 5 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 4)) != (global::System.BitConverter.IsLittleEndian ? 0x6C006Cu : 0x6C006C00u) ||
textSpan[4] != 'o')
{
goto NoMatch;
}
}
runtextpos += 5;
textSpan = textSpan.Slice(5);
break;
default:
goto NoMatch;
}
}
break;
case 'c':
// Multi "iao"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 4 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 2)) != (global::System.BitConverter.IsLittleEndian ? 0x610069u : 0x69006100u) ||
textSpan[3] != 'o')
{
goto NoMatch;
}
}
runtextpos += 4;
textSpan = textSpan.Slice(4);
break;
case 'b':
// Multi "onjour"
{
byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
if ((uint)textSpan.Length < 7 ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 2)) != (global::System.BitConverter.IsLittleEndian ? 0x6F006A006E006Ful : 0x6F006E006A006F00ul) ||
global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 10)) != (global::System.BitConverter.IsLittleEndian ? 0x720075u : 0x75007200u))
{
goto NoMatch;
}
}
runtextpos += 7;
textSpan = textSpan.Slice(7);
break;
default:
goto NoMatch;
}
}
// Match
base.runtextpos = runtextpos;
base.Capture(0, originalruntextpos, runtextpos);
return;
// No match
NoMatch:
return;
}
|
src/libraries/System.Text.RegularExpressions/tests/RegexReductionTests.cs
Show resolved
Hide resolved
@safern, do you know what's going on with this failure?
|
case 'h':
// Alternate
{
if ((uint)textSpan.Length < 2)
{
goto NoMatch;
} Why does this check for |
It's not 0-based at this point. The generator keeps track of whether the characters it's looking at are guaranteed to be a fixed position from the start of the relevant portion of the pattern, and it avoids slicing the span if it doesn't have to. So in this example, it previously processed textSpan[0] in order to get to this point, and now it needs to read textSpan[1], so it validates the span is at least two characters long. There's then a subsequent length check inside of each branch of the alternation before the subsequent comparison happens. As of #59660, we now combine length checks within concatenations, but we don't do so into alternations. The purpose of the length checks is both a) safety and b) elimination of bounds checks, and in some cases removing the length checks can actually make subsequent operations more expensive because each gets a bounds check that the Length check could actually obviate. There's room here to tweak things further, of course. I'm also about to put up a PR that removes all this Unsafe usage, keeping things in the span world rather than going to refs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I just had some nits/questions.
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
If all branches of an alternation begin with a fixed character, we can emit a switch over just that character and save on potentially lots of failed match attempts, especially if the C# compiler can lower the switch into an IL switch or otherwise optimize the search based on first character.
Given an expression like "ab|ac|ade", we already reduce this to "a(?:b|c|de)" in order to factor out the starting "a". But given an expression like "ab|acd|ef|egh", we don't currently extract the individual prefixes, e.g. "a(?:b|cd)|e(?:f|gh)", which would be valuable for a few reasons. Primarily, it enables more efficient processing of the alternation, as a failed match in one branch then has to explore fewer branches, and we can potentially make that initial branch selection even faster if all the branches start with unique, fixed characters.
The primary change here is an additional reduction for atomic alternations that enables subsequent optimizations to do more. We previously added an optimization for alternation reduction that enables extraction of a common prefix from a contiguous sequence of branches in an alternation. Such extraction isn't possible if there's a branch in between two that could otherwise be combined, and in the general case, we can't reorder the branches as that breaks the semantics of ordering being visible. However, for atomic alternations, where no backtracking back into the node is possible, if we can prove that the intermediate branch can't match the same things as the other branches, reordering is fine. Thus, for atomic alternations, we can reorder branches that begin with the same character as long as we can prove that the intermediate branches may never match that same character. For now, we stick to fixed characters, though in the future this could be extended to sets/notones as well. This also adds a minor optimization for atomic alternations that trims away all branches after an empty branch. And it tweaks the pass that finds nodes to mark as atomic, ensuring that a top-level alternation is marked atomic so that the aforementioned optimizations kick in.
5789704
to
b270261
Compare
b270261
to
dccb61e
Compare
This does a few things:
In a data set of 94,465 patterns from real-world use, the reductions in (1) and (2) above find improvements in 3,527 of them, for ~3.5% (~21% of the total have alternations). For (3), there were 729 expressions that benefit from the new switch-based code gen prior to the improvements from (1) and (2), and after (1) and (2), that number doubles to 1,446.
Example (
"hola|ciao|bonjour|hello"
):Before
After