Improve regex reductions and code gen for some alternations #59903

Merged: 4 commits, Oct 6, 2021

stephentoub (Member) commented Oct 3, 2021

This does a few things:

  1. Given an expression like "ab|ac|ade", we already reduce this to "a(?:b|c|de)" in order to factor out the starting "a". But given an expression like "ab|acd|ef|egh", we don't currently extract the individual prefixes, e.g. "a(?:b|cd)|e(?:f|gh)", which would be valuable for a few reasons. Primarily, it enables more efficient processing of the alternation, as a failed match in one branch then has to explore fewer branches, and we can potentially make that initial branch selection even faster if all the branches start with unique, fixed characters.
  2. The aforementioned improved prefix extraction isn't possible if there's a branch in between two that could otherwise be combined, and in the general case, we can't reorder the branches as that breaks the visible ordering semantics. However, for atomic alternations, where no backtracking back into the node is possible, if we can prove that the intermediate branch can't match the same things as the other branches, reordering is fine. Thus, for atomic alternations, we can reorder branches that begin with the same character as long as we can prove that the intermediate branches may never match that same character. For now, we stick to fixed characters, though in the future this could be extended to sets/notones as well.
  3. These optimizations make it much more common for all branches in an alternation to begin with a unique character. Now, if all branches of an alternation begin with a fixed character, we can emit a switch over just that character and potentially save many failed match attempts, especially if the C# compiler can lower the switch into an IL switch or otherwise optimize the search based on the first character. We do this only in our simplified generator that supports limited backtracking constructs, and only in the source generator, as we'd otherwise need to implement in the ref emit implementation all the same lowering logic that Roslyn has for switches.
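The reduction in (1) can be illustrated on literal-only branches. The following is a minimal sketch, not the runtime's reduction code: `FactorFirstChars` is a hypothetical helper that assumes every branch is a non-empty literal and merges only contiguous branches sharing a first character, so branch order is preserved. It then uses `Regex.Match` to spot-check that the factored pattern behaves like the original on a few inputs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (an assumption for illustration, not runtime code):
// factors the shared first character out of each contiguous run of
// non-empty literal branches.
static string FactorFirstChars(IReadOnlyList<string> branches)
{
    var parts = new List<string>();
    int i = 0;
    while (i < branches.Count)
    {
        char first = branches[i][0];

        // Extend the run while the following branches start with the same character.
        int j = i + 1;
        while (j < branches.Count && branches[j][0] == first)
        {
            j++;
        }

        if (j - i == 1)
        {
            // A run of one: nothing to factor, keep the branch as-is.
            parts.Add(Regex.Escape(branches[i]));
        }
        else
        {
            // Emit first(?:tail1|tail2|...), keeping the tails in their original order.
            var tails = branches.Skip(i).Take(j - i).Select(b => Regex.Escape(b.Substring(1)));
            parts.Add(Regex.Escape(first.ToString()) + "(?:" + string.Join("|", tails) + ")");
        }

        i = j;
    }

    return string.Join("|", parts);
}

string original = "ab|acd|ef|egh";
string factored = FactorFirstChars(original.Split('|'));
Console.WriteLine(factored); // a(?:b|cd)|e(?:f|gh)

// Spot-check that both patterns agree on a handful of inputs.
foreach (string input in new[] { "ab", "acd", "ef", "egh", "ac", "eg", "xab" })
{
    Match m1 = Regex.Match(input, original);
    Match m2 = Regex.Match(input, factored);
    Console.WriteLine($"{input}: {m1.Success == m2.Success && m1.Value == m2.Value}");
}
```

Because only contiguous runs are merged, a pattern such as "ab|cd|ae" is left unchanged by this sketch; merging its two "a" branches is the reordering case that (2) restricts to atomic alternations.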

In a data set of 94,465 patterns from real-world use, the reductions in (1) and (2) above find improvements in 3,527 of them, or ~3.7% (~21% of the total have alternations). For (3), 729 expressions benefited from the new switch-based code gen before the improvements from (1) and (2); after (1) and (2), that number roughly doubles to 1,446.

Example ("hola|ciao|bonjour|hello"):

Before
protected override void Go()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int originalruntextpos = runtextpos;
    ref byte byteStr = ref global::System.Runtime.CompilerServices.Unsafe.NullRef<byte>();
    char ch;
    global::System.ReadOnlySpan<char> textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
                    
    // Alternate
    {
        int startingRunTextPos0 = runtextpos;
                        
        // Branch 0
        {
            // Multi "hola"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 4 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x61006C006F0068ul : 0x68006F006C006100ul))
                {
                    goto L1;
                }
            }
                            
            runtextpos += 4;
            textSpan = textSpan.Slice(4);
            goto L0;
            L1:
            runtextpos = startingRunTextPos0;
            textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
        }
                        
        // Branch 1
        {
            // Multi "ciao"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 4 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6F006100690063ul : 0x6300690061006F00ul))
                {
                    goto L2;
                }
            }
                            
            runtextpos += 4;
            textSpan = textSpan.Slice(4);
            goto L0;
            L2:
            runtextpos = startingRunTextPos0;
            textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
        }
                        
        // Branch 2
        {
            // Multi "bonjour"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 7 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6A006E006F0062ul : 0x62006F006E006A00ul) ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 8)) != (global::System.BitConverter.IsLittleEndian ? 0x75006Fu : 0x6F007500u) ||
                    textSpan[6] != 'r')
                {
                    goto L3;
                }
            }
                            
            runtextpos += 7;
            textSpan = textSpan.Slice(7);
            goto L0;
            L3:
            runtextpos = startingRunTextPos0;
            textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
        }
                        
        // Branch 3
        {
            // Multi "hello"
            {
                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                if ((uint)textSpan.Length < 5 ||
                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 0)) != (global::System.BitConverter.IsLittleEndian ? 0x6C006C00650068ul : 0x680065006C006C00ul) ||
                    textSpan[4] != 'o')
                {
                    goto NoMatch;
                }
            }
                            
            runtextpos += 5;
            textSpan = textSpan.Slice(5);
        }
                        
        L0:
        ;
    }
                    
    // Match
    base.runtextpos = runtextpos;
    base.Capture(0, originalruntextpos, runtextpos);
    return;
                    
    // No match
    NoMatch:
    return;
}

After
protected override void Go()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int originalruntextpos = runtextpos;
    ref byte byteStr = ref global::System.Runtime.CompilerServices.Unsafe.NullRef<byte>();
    char ch;
    global::System.ReadOnlySpan<char> textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
                    
    // Alternate
    {
        if ((uint)textSpan.Length < 1)
        {
            goto NoMatch;
        }
                        
        switch (textSpan[0])
        {
            case 'h':
                // Alternate
                {
                    if ((uint)textSpan.Length < 2)
                    {
                        goto NoMatch;
                    }
                                    
                    switch (textSpan[1])
                    {
                        case 'o':
                            // Multi "la"
                            {
                                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                                if ((uint)textSpan.Length < 4 ||
                                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 4)) != (global::System.BitConverter.IsLittleEndian ? 0x61006Cu : 0x6C006100u))
                                {
                                    goto NoMatch;
                                }
                            }
                                            
                            runtextpos += 4;
                            textSpan = textSpan.Slice(4);
                            break;
                                            
                        case 'e':
                            // Multi "llo"
                            {
                                byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                                if ((uint)textSpan.Length < 5 ||
                                    global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 4)) != (global::System.BitConverter.IsLittleEndian ? 0x6C006Cu : 0x6C006C00u) ||
                                    textSpan[4] != 'o')
                                {
                                    goto NoMatch;
                                }
                            }
                                            
                            runtextpos += 5;
                            textSpan = textSpan.Slice(5);
                            break;
                                            
                        default:
                            goto NoMatch;
                    }
                }
                                
                break;
                                
            case 'c':
                // Multi "iao"
                {
                    byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                    if ((uint)textSpan.Length < 4 ||
                        global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 2)) != (global::System.BitConverter.IsLittleEndian ? 0x610069u : 0x69006100u) ||
                        textSpan[3] != 'o')
                    {
                        goto NoMatch;
                    }
                }
                                
                runtextpos += 4;
                textSpan = textSpan.Slice(4);
                break;
                                
            case 'b':
                // Multi "onjour"
                {
                    byteStr = ref global::System.Runtime.InteropServices.MemoryMarshal.GetReference(global::System.Runtime.InteropServices.MemoryMarshal.AsBytes(textSpan));
                    if ((uint)textSpan.Length < 7 ||
                        global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<ulong>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 2)) != (global::System.BitConverter.IsLittleEndian ? 0x6F006A006E006Ful : 0x6F006E006A006F00ul) ||
                        global::System.Runtime.CompilerServices.Unsafe.ReadUnaligned<uint>(ref global::System.Runtime.CompilerServices.Unsafe.Add(ref byteStr, 10)) != (global::System.BitConverter.IsLittleEndian ? 0x720075u : 0x75007200u))
                    {
                        goto NoMatch;
                    }
                }
                                
                runtextpos += 7;
                textSpan = textSpan.Slice(7);
                break;
                                
            default:
                goto NoMatch;
        }
    }
                    
    // Match
    base.runtextpos = runtextpos;
    base.Capture(0, originalruntextpos, runtextpos);
    return;
                    
    // No match
    NoMatch:
    return;
}

ghost commented Oct 3, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: stephentoub
Assignees: -
Labels: area-System.Text.RegularExpressions

Milestone: -

stephentoub (Member, Author) commented

@safern, do you know what's going on with this failure?

The "Microsoft.DotNet.Compatibility.ValidatePackage" task failed unexpectedly.
System.TypeLoadException: Method 'CompareSourceLocations' in type 'Microsoft.CodeAnalysis.CSharp.CSharpCompilation' from assembly 'Microsoft.CodeAnalysis.CSharp, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' does not have an implementation.
   at Microsoft.DotNet.ApiCompatibility.AssemblySymbolLoader..ctor(Boolean resolveAssemblyReferences)
   at Microsoft.DotNet.PackageValidation.ApiCompatRunner.GetAssemblySymbolFromStream(Stream assemblyStream, MetadataInformation assemblyInformation, Boolean& resolvedReferences) in /_/src/Compatibility/Microsoft.DotNet.PackageValidation/ApiCompatRunner.cs:line 111
   at Microsoft.DotNet.PackageValidation.ApiCompatRunner.RunApiCompat() in /_/src/Compatibility/Microsoft.DotNet.PackageValidation/ApiCompatRunner.cs:line 49
   at Microsoft.DotNet.PackageValidation.CompatibleFrameworkInPackageValidator.Validate(Package package) in /_/src/Compatibility/Microsoft.DotNet.PackageValidation/CompatibleFrameworkInPackageValidator.cs:line 62
   at Microsoft.DotNet.Compatibility.ValidatePackage.ExecuteCore() in /_/src/Compatibility/Microsoft.DotNet.Compatibility/ValidatePackage.cs:line 92
   at Microsoft.NET.Build.Tasks.TaskBase.Execute() in /_/src/Tasks/Common/TaskBase.cs:line 38
   at Microsoft.DotNet.Compatibility.ValidatePackage.Execute() in /_/src/Compatibility/Microsoft.DotNet.Compatibility/ValidatePackage.cs:line 49
   at Microsoft.Build.BackEnd.TaskExecutionHost.Microsoft.Build.BackEnd.ITaskExecutionHost.Execute()
   at Microsoft.Build.BackEnd.TaskBuilder.ExecuteInstantiatedTask(ITaskExecutionHost taskExecutionHost, TaskLoggingContext taskLoggingContext, TaskHost taskHost, ItemBucket bucket, TaskExecutionMode howToExecuteTask)

eerhardt (Member) commented Oct 5, 2021

case 'h':
    // Alternate
    {
        if ((uint)textSpan.Length < 2)
        {
            goto NoMatch;
        }

Why does this check for textSpan.Length < 2 in this example? The two words starting with 'h' have lengths 4 and 5, so shouldn't it check for textSpan.Length < 4 and then goto NoMatch?

stephentoub (Member, Author) commented Oct 5, 2021

> Why does this check for textSpan.Length < 2 in this example?

It's not 0-based at this point. The generator keeps track of whether the characters it's looking at are guaranteed to be at a fixed offset from the start of the relevant portion of the pattern, and it avoids slicing the span if it doesn't have to. So in this example, it previously processed textSpan[0] in order to get to this point, and now it needs to read textSpan[1], so it validates that the span is at least two characters long. There's then a further length check inside each branch of the alternation before the subsequent comparison happens.

As of #59660, we now combine length checks within concatenations, but we don't do so into alternations. The purpose of the length checks is both a) safety and b) elimination of bounds checks, and in some cases removing the length checks can actually make subsequent operations more expensive because each gets a bounds check that the Length check could actually obviate. There's room here to tweak things further, of course. I'm also about to put up a PR that removes all this Unsafe usage, keeping things in the span world rather than going to refs.
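To make the answer above concrete, here is a hand-written sketch (not the source generator's output) of the same nested-switch shape for "hola|ciao|bonjour|hello", kept entirely in span operations. `MatchGreeting` is a hypothetical name, and it only matches at the start of the span. Each `Length` check guards just the next index being read; the rest of each word is validated by `StartsWith`, which performs its own length comparison.

```csharp
using System;

// Hypothetical hand-written matcher (an assumption for illustration):
// returns the length of the match at the start of `text`, or -1 for no match.
static int MatchGreeting(ReadOnlySpan<char> text)
{
    // Guard the read of text[0].
    if ((uint)text.Length < 1)
    {
        return -1;
    }

    switch (text[0])
    {
        case 'h':
            // Guard only the read of text[1]; the two 'h' words diverge here.
            if ((uint)text.Length < 2)
            {
                return -1;
            }

            switch (text[1])
            {
                case 'o':
                    // "hola": the remaining "la" is length-checked inside StartsWith.
                    return text.Slice(2).StartsWith("la".AsSpan()) ? 4 : -1;
                case 'e':
                    // "hello": the remaining "llo" is length-checked inside StartsWith.
                    return text.Slice(2).StartsWith("llo".AsSpan()) ? 5 : -1;
                default:
                    return -1;
            }

        case 'c':
            return text.Slice(1).StartsWith("iao".AsSpan()) ? 4 : -1;

        case 'b':
            return text.Slice(1).StartsWith("onjour".AsSpan()) ? 7 : -1;

        default:
            return -1;
    }
}

Console.WriteLine(MatchGreeting("hello world".AsSpan())); // 5
Console.WriteLine(MatchGreeting("he".AsSpan()));          // -1
```

The `Length < 2` check in the 'h' case exists only to make `text[1]` safe to read. Whether the remaining characters are present is decided inside each inner case, mirroring the per-branch checks in the generated code.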

safern (Member) commented Oct 5, 2021

> @safern, do you know what's going on with this failure?

I missed this ping, sorry. I'm currently investigating what is going on: #59908

eerhardt (Member) left a comment


LGTM. I just had some nits/questions.

If all branches of an alternation begin with a fixed character, we can emit a switch over just that character and save on potentially lots of failed match attempts, especially if the C# compiler can lower the switch into an IL switch or otherwise optimize the search based on first character.

Given an expression like "ab|ac|ade", we already reduce this to "a(?:b|c|de)" in order to factor out the starting "a". But given an expression like "ab|acd|ef|egh", we don't currently extract the individual prefixes, e.g. "a(?:b|cd)|e(?:f|gh)", which would be valuable for a few reasons. Primarily, it enables more efficient processing of the alternation, as a failed match in one branch then has to explore fewer branches, and we can potentially make that initial branch selection even faster if all the branches start with unique, fixed characters.

The primary change here is an additional reduction for atomic alternations that enables subsequent optimizations to do more. We previously added an optimization for alternation reduction that enables extraction of a common prefix from a contiguous sequence of branches in an alternation. Such extraction isn't possible if there's a branch in between two that could otherwise be combined, and in the general case, we can't reorder the branches as that breaks the semantics of ordering being visible. However, for atomic alternations, where no backtracking back into the node is possible, if we can prove that the intermediate branch can't match the same things as the other branches, reordering is fine. Thus, for atomic alternations, we can reorder branches that begin with the same character as long as we can prove that the intermediate branches may never match that same character. For now, we stick to fixed characters, though in the future this could be extended to sets/notones as well.

This also adds a minor optimization for atomic alternations that trims away all branches after an empty branch.

And it tweaks the pass that finds nodes to mark as atomic, ensuring that a top-level alternation is marked atomic so that the aforementioned optimizations kick in.
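
The empty-branch trimming mentioned above can be checked from the outside with the public `Regex` API. This is only a behavioral sketch (the helper `Describe` is hypothetical, and it does not show the internal reduction). In an atomic group, once the empty branch matches there is no way to backtrack into the later `b` branch, so `(?>a||b)` and `(?>a|)` behave the same, whereas the non-atomic `(?:a||b)` can still reach `b`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: summarizes the first match as "index:'value'".
static string Describe(string pattern, string input)
{
    Match m = Regex.Match(input, pattern);
    return m.Success ? $"{m.Index}:'{m.Value}'" : "no match";
}

// Atomic: at index 0 the empty branch matches, 'c' then fails, and the group
// cannot be re-entered to try 'b', so the match is found at index 1 instead.
Console.WriteLine(Describe("(?>a||b)c", "bc")); // 1:'c'

// Same alternation with the branch after the empty one trimmed away.
Console.WriteLine(Describe("(?>a|)c", "bc"));   // 1:'c'

// Non-atomic: backtracking reaches the 'b' branch, so trimming would change the result.
Console.WriteLine(Describe("(?:a||b)c", "bc")); // 0:'bc'
```
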
stephentoub merged commit 44f8982 into dotnet:main on Oct 6, 2021
stephentoub deleted the regexalternations branch on October 6, 2021 15:16
ghost locked as resolved and limited conversation to collaborators on Nov 5, 2021