Extend RegexCharClass.Canonicalize range inversion optimization #61562

stephentoub · 2021-11-14T03:19:10Z

There's a simple optimization in RegexCharClass.Canonicalize that was added in .NET 5: it takes a set made up of exactly two ranges and checks whether those ranges leave out exactly one character. If they do, the set can instead be rewritten as that character negated, which is a normalized form that downstream code recognizes and optimizes. We can extend this normalization slightly so that it applies to two ranges separated not just by a single character but by more than one as well. That in turn lights up existing optimizations in both RegexCompiler and the source generator that look for a single range. For example, for the pattern @"\P{IsGreek}", which matches every character other than one that's Greek, FindFirstChar would previously have been emitted as:

for (int i = 0; i < span.Length; i++)
{
    if (((ch = span[i]) < 128 ? ("\uffff\uffff\uffff\uffff\uffff\uffff\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0003\0\0ͰЀ")))
    {
        base.runtextpos = runtextpos + i;
        return true;
    }
}

and is now emitted as:

for (int i = 0; i < span.Length; i++)
{
    if ((((uint)span[i]) - 'Ͱ' > (uint)('Ͽ' - 'Ͱ')))
    {
        base.runtextpos = runtextpos + i;
        return true;
    }
}
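
For readers who want the idea in isolation, here is a minimal sketch of the two pieces involved: detecting that two ranges cover everything except one contiguous gap, and the single unsigned comparison that tests whether a character falls outside a range. The type and method names (RangeInversionSketch, TryGetExcludedRange, IsOutsideRange) are hypothetical and chosen for illustration; this is not the actual RegexCharClass or emitter code.

```csharp
using System;

static class RangeInversionSketch
{
    // Hypothetical helper for illustration; not the actual RegexCharClass code.
    // Given two sorted, non-overlapping ranges, reports the gap between them
    // when together they cover '\0' through '\uFFFF' except for that gap.
    public static bool TryGetExcludedRange(
        (char First, char Last) lower,
        (char First, char Last) upper,
        out (char First, char Last) excluded)
    {
        if (lower.First == '\0' &&
            upper.Last == '\uFFFF' &&
            lower.Last + 1 < upper.First) // at least one character is left out
        {
            excluded = ((char)(lower.Last + 1), (char)(upper.First - 1));
            return true;
        }

        excluded = default;
        return false;
    }

    // Single-comparison test for "ch is outside [lo, hi]". When ch < lo the
    // subtraction is negative and wraps to a large unsigned value, so one
    // comparison covers both the "below lo" and "above hi" cases.
    public static bool IsOutsideRange(char ch, char lo, char hi) =>
        unchecked((uint)(ch - lo) > (uint)(hi - lo));
}
```

With the Greek example, the ranges ('\0', '\u036F') and ('\u0400', '\uFFFF') would yield the excluded range '\u0370' through '\u03FF', which is the 'Ͱ' to 'Ͽ' range that appears in the emitted comparison.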

cc: @joperezr

P.S. Separately, the above example highlights that, after enumerating all of ASCII to compute the bitmap lookup for a set being emitted, we could use the resulting knowledge to, e.g., not emit the lookup at all if we find that every ASCII character matches.
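
A rough sketch of what that follow-up could look like, assuming the set is available as a predicate: build the 128-bit ASCII lookup (eight 16-bit chars, indexed the same way as the emitted `[ch >> 4] & (1 << (ch & 0xF))` check) and count matches along the way. The names AsciiLookupSketch, AsciiResult, and Classify are hypothetical; this is not what the emitter does today.

```csharp
using System;

static class AsciiLookupSketch
{
    public enum AsciiResult { None, All, Mixed }

    // Hypothetical helper: enumerates all 128 ASCII characters once, producing
    // both the bitmap string and a summary of how many characters matched.
    public static AsciiResult Classify(Func<char, bool> inSet, out string lookup)
    {
        var bits = new char[8]; // 8 chars x 16 bits = 128 ASCII characters
        int matches = 0;

        for (int ch = 0; ch < 128; ch++)
        {
            if (inSet((char)ch))
            {
                bits[ch >> 4] |= (char)(1 << (ch & 0xF));
                matches++;
            }
        }

        lookup = new string(bits);
        return matches == 128 ? AsciiResult.All
             : matches == 0 ? AsciiResult.None
             : AsciiResult.Mixed;
    }
}
```

An emitter could then skip the bitmap entirely for the All and None cases and emit a constant result for the ASCII branch instead.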

ghost · 2021-11-14T03:19:18Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

joperezr

Code changes look good to me. I'm not yet familiar with the source generator tests, but would it be possible to add some regression coverage here by ensuring that the regex [\0-AC-\uFFFF] is translated into [^B], and especially that [\0-AE-\uFFFF] will now be [^B-D]?
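
As a purely behavioral cross-check of those two equivalences, something like the following sketch could be used. It only confirms that two sets match the same characters through the public Regex API; it does not verify the internal translation, which the repo's own reduction tests would need to cover. The names SetEquivalenceSketch and MatchSameChars are hypothetical, and `\u0000` is written out in place of `\0` for clarity.

```csharp
using System;
using System.Text.RegularExpressions;

static class SetEquivalenceSketch
{
    // Hypothetical helper: returns true if both patterns (each a single
    // character class) accept exactly the same set of UTF-16 code units.
    public static bool MatchSameChars(string setA, string setB)
    {
        var a = new Regex(setA);
        var b = new Regex(setB);

        for (int c = 0; c <= char.MaxValue; c++)
        {
            string s = ((char)c).ToString();
            if (a.IsMatch(s) != b.IsMatch(s))
            {
                return false;
            }
        }

        return true;
    }
}
```

For example, `MatchSameChars(@"[\u0000-AC-\uFFFF]", "[^B]")` and `MatchSameChars(@"[\u0000-AE-\uFFFF]", "[^B-D]")` are the two comparisons corresponding to the sets mentioned above.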

stephentoub · 2021-11-15T21:35:09Z

> would it be possible to add some regression coverage

Done, thanks.

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Nov 14, 2021

stephentoub force-pushed the regexrangeopt branch from 890cc84 to 616ed1a Compare November 15, 2021 01:21

runfoapp bot mentioned this pull request Nov 15, 2021

System.Net.Sockets.Tests.SendPacketsAsync.SendPacketsElement_FileZeroCount_Success sometimes fails #60017

Closed

joperezr approved these changes Nov 15, 2021

stephentoub added 3 commits November 15, 2021 16:13

Update TODO comment

585a7a0

Add some more reduction tests

0199b5a

stephentoub force-pushed the regexrangeopt branch from 616ed1a to 0199b5a Compare November 15, 2021 21:34

stephentoub merged commit 44d28bf into dotnet:main Nov 17, 2021

stephentoub deleted the regexrangeopt branch November 17, 2021 12:52

AndyAyersMS mentioned this pull request Nov 22, 2021

arm64 jitstress failing with encoding_found assert #61944

Closed

ghost locked as resolved and limited conversation to collaborators Dec 17, 2021
