-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend RegexCharClass.Canonicalize range inversion optimization #61562
Conversation
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsThere's a simple optimization in RegexCharClass.Canonicalize that was added in .NET 5, with the goal of taking a set that's made up of exactly two ranges and seeing whether those ranges were leaving out exactly one character. If they were, the set can instead be rewritten as that character negated, which is a normalized form used downstream and optimized. We can extend this normalization ever so slightly to be for two ranges separated not just be a single character but by more than that as well. That in turn lights up existing optimizations in both RegexCompiler and the source generator that are looking for a single range, e.g. for the pattern for (int i = 0; i < span.Length; i++)
{
if (((ch = span[i]) < 128 ? ("\uffff\uffff\uffff\uffff\uffff\uffff\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0003\0\0ͰЀ")))
{
base.runtextpos = runtextpos + i;
return true;
}
} and is now emitted as: for (int i = 0; i < span.Length; i++)
{
if ((((uint)span[i]) - 'Ͱ' > (uint)('Ͽ' - 'Ͱ')))
{
base.runtextpos = runtextpos + i;
return true;
}
} cc: @joperezr ps Separately the above example highlights how after enumerating all of ASCII to compute the bit map lookup for a set being emitted, we could use the resulting knowledge to e.g. not emit the lookup at all if we find that every ASCII character matches.
|
890cc84
to
616ed1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Changes look good to me. I'm not yet familiar with the source generator tests, but would it be possible to add some regression coverage here by ensuring that the Regex expression [\0-AC-\uFFFF]
is translated into [^B]
and specially to ensure that [\0-AE-\uFFFF]
will now be [^B-D]
?
There's a simple optimization in RegexCharClass.Canonicalize that was added in .NET 5, with the goal of taking a set that's made up of exactly two ranges and seeing whether those ranges were leaving out exactly one character. If they were, the set can instead be rewritten as that character negated, which is a normalized form used downstream and optimized. We can extend this normalization ever so slightly to be for two ranges separated not just be a single character but by more than that as well.
616ed1a
to
0199b5a
Compare
Done, thanks. |
There's a simple optimization in RegexCharClass.Canonicalize that was added in .NET 5, with the goal of taking a set that's made up of exactly two ranges and seeing whether those ranges were leaving out exactly one character. If they were, the set can instead be rewritten as that character negated, which is a normalized form used downstream and optimized. We can extend this normalization ever so slightly to be for two ranges separated not just be a single character but by more than that as well. That in turn lights up existing optimizations in both RegexCompiler and the source generator that are looking for a single range, e.g. for the pattern
@"\P{IsGreek}"
, which is trying to match every character other than one that's Greek, FindFirstChar would previously have been emitted as:and is now emitted as:
cc: @joperezr
ps Separately the above example highlights how after enumerating all of ASCII to compute the bit map lookup for a set being emitted, we could use the resulting knowledge to e.g. not emit the lookup at all if we find that every ASCII character matches.