Improve regex reductions and code gen for some alternations #59903

stephentoub · 2021-10-03T01:03:46Z

This does a few things:

1. Given an expression like "ab|ac|ade", we already reduce this to "a(?:b|c|de)" in order to factor out the starting "a". But given an expression like "ab|acd|ef|egh", we don't currently extract the individual prefixes, e.g. "a(?:b|cd)|e(?:f|gh)", which would be valuable for a few reasons. Primarily, it enables more efficient processing of the alternation, as a failed match in one branch then has to explore fewer branches, and we can potentially make that initial branch selection even faster if all the branches start with unique, fixed characters.
2. The aforementioned improved prefix extraction isn't possible if there's a branch in between two that could otherwise be combined, and in the general case, we can't reorder the branches as that breaks the visible ordering semantics. However, for atomic alternations, where no backtracking back into the node is possible, if we can prove that the intermediate branch can't match the same things as the other branches, reordering is fine. Thus, for atomic alternations, we can reorder branches that begin with the same character as long as we can prove that the intermediate branches may never match that same character. For now, we stick to fixed characters, though in the future this could be extended to sets/notones as well.
3. These optimizations lead to it being much more common that all branches in an alternation begin with a unique character. Now, if all branches of an alternation begin with a fixed character, we can emit a switch over just that character and save on potentially lots of failed match attempts, especially if the C# compiler can lower the switch into an IL switch or otherwise optimize the search based on first character. We do this only in our simplified generator that supports limited backtracking constructs, and we only do it in the source generator, as we'd otherwise need to implement all the same lowering logic in the ref emit implementation that Roslyn has for switches.
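
To make the first two reductions concrete, here is a minimal sketch that works on plain strings rather than on the actual RegexNode tree. `AlternationSketch`, `Reduce`, and `Factor` are hypothetical names for illustration only, and the sketch assumes the alternation is atomic and every branch is a non-empty literal made of letters or digits (so nothing needs escaping).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not the RegexNode implementation. Assumes an atomic
// alternation whose branches are all non-empty alphanumeric literals.
public static class AlternationSketch
{
    public static string Reduce(IReadOnlyList<string> branches)
    {
        // Reordering step: gather branches that share a first character into one
        // group, keeping first-appearance order of groups and the relative order
        // inside each group. This is only safe under the stated assumptions: a
        // branch that starts with a different fixed character can never match
        // input that the grouped branches could match.
        var groups = new List<List<string>>();
        var groupIndexByFirstChar = new Dictionary<char, int>();
        foreach (string branch in branches)
        {
            if (branch.Length == 0)
            {
                throw new ArgumentException("This sketch assumes non-empty literal branches.");
            }

            if (!groupIndexByFirstChar.TryGetValue(branch[0], out int index))
            {
                index = groups.Count;
                groupIndexByFirstChar.Add(branch[0], index);
                groups.Add(new List<string>());
            }

            groups[index].Add(branch);
        }

        // Prefix-extraction step: factor the longest common prefix out of each group.
        return string.Join("|", groups.Select(Factor));
    }

    private static string Factor(List<string> group)
    {
        if (group.Count == 1)
        {
            return group[0];
        }

        string prefix = group[0];
        foreach (string s in group)
        {
            int n = 0;
            while (n < prefix.Length && n < s.Length && prefix[n] == s[n])
            {
                n++;
            }

            prefix = prefix.Substring(0, n);
        }

        return prefix + "(?:" + string.Join("|", group.Select(s => s.Substring(prefix.Length))) + ")";
    }
}
```

With these assumptions, `AlternationSketch.Reduce(new[] { "ab", "acd", "ef", "egh" })` returns `a(?:b|cd)|e(?:f|gh)`, and `AlternationSketch.Reduce(new[] { "hola", "ciao", "bonjour", "hello" })` returns `h(?:ola|ello)|ciao|bonjour`, which has the same shape as the nested switch in the generated code for that pattern.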

In a data set of 94,465 patterns from real-world use, the reductions in (1) and (2) above find improvements in 3,527 of them, for ~3.7% (~21% of the total have alternations). For (3), 729 expressions benefit from the new switch-based code gen prior to the improvements from (1) and (2); after (1) and (2), that number roughly doubles to 1,446.

Example ("hola|ciao|bonjour|hello"):

Before

protected override void Go()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int originalruntextpos = runtextpos;
    ref byte byteStr = ref global::System.Runtime.CompilerServices.Unsafe.NullRef<byte>();
    char ch;
    global::System.ReadOnlySpan<char> textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
                    
    // Alternate
    {
        int startingRunTextPos0 = runtextpos;
                        
        // Branch 0
        {
            // Multi "hola"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 4 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x61006C006F0068ul : 0x68006F006C006100ul))
                {
                    goto L1;
                }
            }
                            
            runtextpos += 4;
            textSpan = textSpan.Slice(4);
            goto L0;
            L1:
            runtextpos = startingRunTextPos0;
            textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
        }
                        
        // Branch 1
        {
            // Multi "ciao"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 4 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6F006100690063ul : 0x6300690061006F00ul))
                {
                    goto L2;
                }
            }
                            
            runtextpos += 4;
            textSpan = textSpan.Slice(4);
            goto L0;
            L2:
            runtextpos = startingRunTextPos0;
            textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
        }
                        
        // Branch 2
        {
            // Multi "bonjour"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 7 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6A006E006F0062ul : 0x62006F006E006A00ul) ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 8)) != (global::System.BitConverter.IsLittleEndian ? 0x75006Fu : 0x6F007500u) ||
                    textSpan[6] != 'r')
                {
                    goto L3;
                }
            }
                            
            runtextpos += 7;
            textSpan = textSpan.Slice(7);
            goto L0;
            L3:
            runtextpos = startingRunTextPos0;
            textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
        }
                        
        // Branch 3
        {
            // Multi "hello"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 5 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6C006C00650068ul : 0x680065006C006C00ul) ||
                    textSpan[4] != 'o')
                {
                    goto NoMatch;
                }
            }
                            
            runtextpos += 5;
            textSpan = textSpan.Slice(5);
        }
                        
        L0:
        ;
    }
                    
    // Match
    base.runtextpos = runtextpos;
    base.Capture(0, originalruntextpos, runtextpos);
    return;
                    
    // No match
    NoMatch:
    return;
}

After

protected override void Go()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int originalruntextpos = runtextpos;
    ref byte byteStr = ref global::System.Runtime.CompilerServices.Unsafe.NullRef<byte>();
    char ch;
    global::System.ReadOnlySpan<char> textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
                    
    // Alternate
    {
        if ((uint)textSpan.Length < 1)
        {
            goto NoMatch;
        }
                        
        switch (textSpan[0])
        {
            case 'h':
                // Alternate
                {
                    if ((uint)textSpan.Length < 2)
                    {
                        goto NoMatch;
                    }
                                    
                    switch (textSpan[1])
                    {
                        case 'o':
                            // Multi "la"
                            {
                                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                                if ((uint)textSpan.Length < 4 ||
                                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 4)) != (global::System.BitConverter.IsLittleEndian ? 0x61006Cu : 0x6C006100u))
                                {
                                    goto NoMatch;
                                }
                            }
                                            
                            runtextpos += 4;
                            textSpan = textSpan.Slice(4);
                            break;
                                            
                        case 'e':
                            // Multi "llo"
                            {
                                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                                if ((uint)textSpan.Length < 5 ||
                                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 4)) != (global::System.BitConverter.IsLittleEndian ? 0x6C006Cu : 0x6C006C00u) ||
                                    textSpan[4] != 'o')
                                {
                                    goto NoMatch;
                                }
                            }
                                            
                            runtextpos += 5;
                            textSpan = textSpan.Slice(5);
                            break;
                                            
                        default:
                            goto NoMatch;
                    }
                }
                                
                break;
                                
            case 'c':
                // Multi "iao"
                {
                    byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                    if ((uint)textSpan.Length < 4 ||
                        global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 2)) != (global::System.BitConverter.IsLittleEndian ? 0x610069u : 0x69006100u) ||
                        textSpan[3] != 'o')
                    {
                        goto NoMatch;
                    }
                }
                                
                runtextpos += 4;
                textSpan = textSpan.Slice(4);
                break;
                                
            case 'b':
                // Multi "onjour"
                {
                    byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                    if ((uint)textSpan.Length < 7 ||
                        global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 2)) != (global::System.BitConverter.IsLittleEndian ? 0x6F006A006E006Ful : 0x6F006E006A006F00ul) ||
                        global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 10)) != (global::System.BitConverter.IsLittleEndian ? 0x720075u : 0x75007200u))
                    {
                        goto NoMatch;
                    }
                }
                                
                runtextpos += 7;
                textSpan = textSpan.Slice(7);
                break;
                                
            default:
                goto NoMatch;
        }
    }
                    
    // Match
    base.runtextpos = runtextpos;
    base.Capture(0, originalruntextpos, runtextpos);
    return;
                    
    // No match
    NoMatch:
    return;
}

ghost · 2021-10-03T01:03:55Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

stephentoub · 2021-10-03T12:08:06Z

@safern, do you know what's going on with this failure?

The "Microsoft.DotNet.Compatibility.ValidatePackage" task failed unexpectedly.
System.TypeLoadException: Method 'CompareSourceLocations' in type 'Microsoft.CodeAnalysis.CSharp.CSharpCompilation' from assembly 'Microsoft.CodeAnalysis.CSharp, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' does not have an implementation.
   at Microsoft.DotNet.ApiCompatibility.AssemblySymbolLoader..ctor(Boolean resolveAssemblyReferences)
   at Microsoft.DotNet.PackageValidation.ApiCompatRunner.GetAssemblySymbolFromStream(Stream assemblyStream, MetadataInformation assemblyInformation, Boolean& resolvedReferences) in /_/src/Compatibility/Microsoft.DotNet.PackageValidation/ApiCompatRunner.cs:line 111
   at Microsoft.DotNet.PackageValidation.ApiCompatRunner.RunApiCompat() in /_/src/Compatibility/Microsoft.DotNet.PackageValidation/ApiCompatRunner.cs:line 49
   at Microsoft.DotNet.PackageValidation.CompatibleFrameworkInPackageValidator.Validate(Package package) in /_/src/Compatibility/Microsoft.DotNet.PackageValidation/CompatibleFrameworkInPackageValidator.cs:line 62
   at Microsoft.DotNet.Compatibility.ValidatePackage.ExecuteCore() in /_/src/Compatibility/Microsoft.DotNet.Compatibility/ValidatePackage.cs:line 92
   at Microsoft.NET.Build.Tasks.TaskBase.Execute() in /_/src/Tasks/Common/TaskBase.cs:line 38
   at Microsoft.DotNet.Compatibility.ValidatePackage.Execute() in /_/src/Compatibility/Microsoft.DotNet.Compatibility/ValidatePackage.cs:line 49
   at Microsoft.Build.BackEnd.TaskExecutionHost.Microsoft.Build.BackEnd.ITaskExecutionHost.Execute()
   at Microsoft.Build.BackEnd.TaskBuilder.ExecuteInstantiatedTask(ITaskExecutionHost taskExecutionHost, TaskLoggingContext taskLoggingContext, TaskHost taskHost, ItemBucket bucket, TaskExecutionMode howToExecuteTask)

eerhardt · 2021-10-05T16:05:23Z

            case 'h':
                // Alternate
                {
                    if ((uint)textSpan.Length < 2)
                    {
                        goto NoMatch;
                    }

Why does this check for textSpan.Length < 2 in this example? The two "h" words have lengths 4 and 5, so shouldn't it check textSpan.Length < 4 and then goto NoMatch?

stephentoub · 2021-10-05T16:12:37Z

> Why does this check for textSpan.Length < 2 in this example?

It's not 0-based at this point. The generator keeps track of whether the characters it's looking at are guaranteed to be a fixed position from the start of the relevant portion of the pattern, and it avoids slicing the span if it doesn't have to. So in this example, it previously processed textSpan[0] in order to get to this point, and now it needs to read textSpan[1], so it validates the span is at least two characters long. There's then a subsequent length check inside of each branch of the alternation before the subsequent comparison happens.

As of #59660, we now combine length checks within concatenations, but we don't do so into alternations. The purpose of the length checks is both a) safety and b) elimination of bounds checks, and in some cases removing the length checks can actually make subsequent operations more expensive because each gets a bounds check that the Length check could actually obviate. There's room here to tweak things further, of course. I'm also about to put up a PR that removes all this Unsafe usage, keeping things in the span world rather than going to refs.
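
The interplay between fixed offsets and length checks can be illustrated with a small hand-written matcher. This is a sketch for the alternation "hola|hello" only, not output of the source generator; `LengthCheckSketch` and `MatchLength` are hypothetical names, and it compares characters one at a time instead of using the `Unsafe` reads shown in the generated code.

```csharp
using System;

// Hypothetical hand-written matcher for the atomic alternation "hola|hello".
public static class LengthCheckSketch
{
    // Returns the number of characters matched at the start of the span, or -1.
    public static int MatchLength(ReadOnlySpan<char> span)
    {
        // A single length check guards the reads of span[0] and span[1]. The span
        // is not sliced after reading span[0], so later indices remain relative to
        // the same starting position (which is why the check is "< 2", not "< 1").
        if ((uint)span.Length < 2 || span[0] != 'h')
        {
            return -1;
        }

        switch (span[1])
        {
            case 'o':
                // Each branch then checks the full length it needs before its own
                // comparisons, which guards the indexed reads that follow.
                if ((uint)span.Length < 4 || span[2] != 'l' || span[3] != 'a')
                {
                    return -1;
                }

                return 4;

            case 'e':
                if ((uint)span.Length < 5 || span[2] != 'l' || span[3] != 'l' || span[4] != 'o')
                {
                    return -1;
                }

                return 5;

            default:
                return -1;
        }
    }
}
```

In this sketch, `MatchLength("hola")` returns 4, `MatchLength("hello world")` returns 5, and `MatchLength("hey")` returns -1 because the 'e' branch requires at least five characters. Hoisting a combined `Length < 4` check above the inner switch would also be valid for this pattern, which is the kind of further tweak the comment above leaves open.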

safern · 2021-10-05T17:21:28Z

> @safern, do you know what's going on with this failure?

I missed this ping, sorry. I'm currently investigating what is going on: #59908

eerhardt

LGTM. I just had some nits/questions.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

If all branches of an alternation begin with a fixed character, we can emit a switch over just that character and save on potentially lots of failed match attempts, especially if the C# compiler can lower the switch into an IL switch or otherwise optimize the search based on first character.

Given an expression like "ab|ac|ade", we already reduce this to "a(?:b|c|de)" in order to factor out the starting "a". But given an expression like "ab|acd|ef|egh", we don't currently extract the individual prefixes, e.g. "a(?:b|cd)|e(?:f|gh)", which would be valuable for a few reasons. Primarily, it enables more efficient processing of the alternation, as a failed match in one branch then has to explore fewer branches, and we can potentially make that initial branch selection even faster if all the branches start with unique, fixed characters.

The primary change here is an additional reduction for atomic alternations that enables subsequent optimizations to do more. We previously added an optimization for alternation reduction that enables extraction of a common prefix from a contiguous sequence of branches in an alternation. Such extraction isn't possible if there's a branch in between two that could otherwise be combined, and in the general case, we can't reorder the branches as that breaks the semantics of ordering being visible. However, for atomic alternations, where no backtracking back into the node is possible, if we can prove that the intermediate branch can't match the same things as the other branches, reordering is fine. Thus, for atomic alternations, we can reorder branches that begin with the same character as long as we can prove that the intermediate branches may never match that same character. For now, we stick to fixed characters, though in the future this could be extended to sets/notones as well. This also adds a minor optimization for atomic alternations that trims away all branches after an empty branch. And it tweaks the pass that finds nodes to mark as atomic, ensuring that a top-level alternation is marked atomic so that the aforementioned optimizations kick in.
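
The empty-branch trimming mentioned above can be sketched on a list of literal branches, where an empty string stands for an empty branch. `EmptyBranchSketch` and `TrimAfterEmpty` are hypothetical names, and the sketch assumes the alternation is atomic; it does not reflect how RegexNode represents branches.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: branches are modeled as literal strings, "" is an empty branch.
public static class EmptyBranchSketch
{
    public static List<string> TrimAfterEmpty(IReadOnlyList<string> branches)
    {
        var kept = new List<string>();
        foreach (string branch in branches)
        {
            kept.Add(branch);

            // An empty branch always matches. In an atomic alternation nothing can
            // backtrack into the node to try a later branch, so every branch after
            // the empty one is unreachable and can be dropped.
            if (branch.Length == 0)
            {
                break;
            }
        }

        return kept;
    }
}
```

For example, the branches of `(?>ab||cd)` would be trimmed to those of `(?>ab|)`, since "cd" can never be selected once the empty branch has matched.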

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Oct 3, 2021

am11 reviewed Oct 3, 2021

src/libraries/System.Text.RegularExpressions/tests/RegexReductionTests.cs

eerhardt approved these changes Oct 5, 2021

stephentoub added 3 commits October 5, 2021 23:22

stephentoub force-pushed the regexalternations branch from 5789704 to b270261 Compare October 6, 2021 03:30

Address PR feedback

dccb61e

stephentoub force-pushed the regexalternations branch from b270261 to dccb61e Compare October 6, 2021 05:13

stephentoub merged commit 44f8982 into dotnet:main Oct 6, 2021

stephentoub deleted the regexalternations branch October 6, 2021 15:16

kunalspathak mentioned this pull request Oct 14, 2021

[Perf] Changes at 10/8/2021 10:07:06 PM dotnet/perf-autofiling-issues#1818

Closed

ghost locked as resolved and limited conversation to collaborators Nov 5, 2021
