-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Base64Decoder to ignore space chars, add IsValid methods and tests #79334
Conversation
Chars now ignored: 9: Line feed 10: Horizontal tab 13: Carriage return 32: Space -- Vertical tab omitted
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsGoals regarding the changes to decoding:
|
This test: runtime/src/libraries/System.Security.Cryptography/tests/Base64TransformsTests.cs Line 304 in 0c4ee9e
is no longer throwing an exception: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-79334-merge-b56bfafc0ea743d08d/System.Security.Cryptography.Tests/1/console.b696fe8a.log?helixlogtype=result This seems reasonable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs
Outdated
Show resolved
Hide resolved
Thank you for the feedback @MihaZupan I believe I have addressed those comments in my latest commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a quick look over the code. I'd like to have one point addressed, then will do a more thorough review.
When vectorized code detects an invalid char (could be whitespace), then it falls back to pure scalar processing, thus leaving perf on the table.
Often whitespace is inserted after 76 chars, so we should try to optimize that case.
Take that example:
Span<byte> bytes = new byte[1000];
Random.Shared.NextBytes(bytes);
string base64Text = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
ReadOnlySpan<byte> base64 = Encoding.ASCII.GetBytes(base64Text);
OperationStatus status = Base64.DecodeFromUtf8(base64, bytes, out int consumed, out int written);
Here line-breaks are inserted every 76-chars. So after decoding the first "set", all the remainder is done scalar, whilst it could be done vectorized too. This is what I meant in #78951 (comment).
Do to this you can use something like:
static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
{
OperationStatus status;
int consumed = 0;
int written = 0;
int totalConsumed = 0;
int totalWritten = 0;
while (true)
{
// DecodeFromUtf8Core is the current implementation from main, just renamed and made private
status = DecodeFromUtf8Core(utf8, bytes, out consumed, out written, isFinalBlock);
totalConsumed += consumed;
totalWritten += written;
if (status != OperationStatus.InvalidData)
{
break;
}
// Found invalid data, check if it's whitespace and can be skipped
utf8 = utf8.Slice(consumed);
bytes = bytes.Slice(written);
if (!TrySkipWhitespace(utf8, out consumed))
{
break;
}
utf8 = utf8.Slice(consumed);
}
bytesConsumed = totalConsumed;
bytesWritten = totalWritten;
return status;
}
// This scans potentially to the end. There could be a more robust approach.
// Maybe try IndexOfAnyValues here?
static bool TrySkipWhitespace(ReadOnlySpan<byte> utf8, out int consumed)
{
for (int i = 0; i < utf8.Length; ++i)
{
if (!IsByteToBeIgnored(utf8[i]))
{
consumed = i;
return true;
}
}
consumed = 0;
return false;
}
I hope you get the idea.
Note: the shown code doesn't handle edge-cases, etc. so that should be considered, also if true invalid data is present that it's not stuck in an endless loop.
|
||
while (src + 4 <= srcEnd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (src + 4 <= srcEnd) | |
while (src <= srcEnd - 4) |
src
will be incremented, so src + 4
needs to be evaluated in each iteration. With srcEnd - 4
it's the same condition, but that value can be kept w/o re-evaluation.
@@ -46,7 +46,7 @@ public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Spa | |||
fixed (byte* srcBytes = &MemoryMarshal.GetReference(utf8)) | |||
fixed (byte* destBytes = &MemoryMarshal.GetReference(bytes)) | |||
{ | |||
int srcLength = utf8.Length & ~0x3; // only decode input up to the closest multiple of 4. | |||
int srcLength = utf8.Length; // only decode input up to the closest multiple of 4. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the & ~0x3
be kept?
It's needed for invalid input, i.e. if it's not a multiple of 4. Or is this handled elsewhere?
goto InvalidDataExit; | ||
{ | ||
int firstInvalidIndex = GetIndexOfFirstByteToBeIgnored(src); | ||
if (firstInvalidIndex != -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (firstInvalidIndex != -1) | |
if (firstInvalidIndex >= 0) |
A comparison wtih 0
is a tiny little bit faster than other comparisons.
|
||
for (int currentBlockIndex = firstInvalidIndex; currentBlockIndex < 4; currentBlockIndex++) | ||
{ | ||
while (src + validBytesSearchIndex < srcEnd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (src + validBytesSearchIndex < srcEnd | |
while (src < srcEnd - validBytesSearchIndex |
totalBytesIgnored++; | ||
} | ||
|
||
if (src + validBytesSearchIndex >= srcEnd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (src + validBytesSearchIndex >= srcEnd) | |
if (src >= srcEnd - validBytesSearchIndex) |
} | ||
|
||
// Remove padding to get exact length | ||
decodedLength = (length / 4 * 3) - paddingCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decodedLength = (length / 4 * 3) - paddingCount; | |
decodedLength = (int)((uint)length / 4 * 3) - paddingCount; |
Ugly, but better codegen.
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static bool IsCharToBeIgnored(char aChar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
Or will the compilers be able to optimize this (in the near future)?
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static bool IsCharToBeIgnored(char aChar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this method takes an int
as argument, then it could be used here and in the decoder, so moved to a shared place, thus avoiding the duplication.
} | ||
} | ||
|
||
if (length % 4 != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL: here is no need for the uint-cast, the JIT will emit good code 👍🏻.
} | ||
|
||
// Remove padding to get exact length | ||
decodedLength = (length / 4 * 3) - paddingCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decodedLength = (length / 4 * 3) - paddingCount; | |
decodedLength = (int)((uint)length / 4 * 3) - paddingCount; |
@gfoidl Thank you very much for the example. I had this idea with the However, with the decode method, one of the goals is "The Base64 decoding methods should handle whitespace chars as the So your suggestion will not work:
As this would fail for example: " Y Q = = ". Yet Convert would be able to parse it. I have implemented or addressed your other comments also. I hope my approach is not too 'bloaty'. "The Base64 decoding methods should handle whitespace chars as the |
Well it works...at least under the presupposition that this is an suggestion to get you on the right road 😉
Just to specify: only whitespace is skipped, not any invalid bytes.
Yep, but that's an artificial example which (I assume) won't be quite often in real-world base64 encoded bytes.
So I'd be happy when we can optimize for these cases. All other cases, like the one given in the quote, it should work, but perf-wise that can be penalized (as being very uncommon). In my suggestion
That's not done in the suggestion (for simplicity, and just to outline how it could work). Thus -- at least I hope so -- the actual code change is quite minimal. Have the workhorse-method as is and just renamed to Core-suffix. The call that workhorse method from the "driver"-method, that handles the checks as given in the two bullets above. So for the case "no whitespace at all" there's only one method call more than the actual code has. So perf should be on-par. The example " Y Q = = " should work too, but you're rigth that this is a case I didn't consider in my suggestion so far. So lets extend that suggestion (point 4, last ->)
The "slow method" creates a buffer of block-size (e.g. This also means that once such a sequence -- invalid data, where the first byte is actual a valid base64 -- is detected it will be pure scalar processing, w/o going back to vectorized again. I'd take this, as
I don't follow. Let's bring your / this PR to a successful end. |
Thank you for the details, I'm 100% onboard with those 3 scenarios. I will update the PR accordingly. |
Side note: please don't do the full-quotes -- that content is in the comment history anyway. |
@gfoidl Regarding:
and
Suggested code for reference: ...
if (!TrySkipWhitespace(utf8, out consumed))
{
break;
}
utf8 = utf8.Slice(consumed);
...
...
static bool TrySkipWhitespace(ReadOnlySpan<byte> utf8, out int consumed)
{
for (int i = 0; i < utf8.Length; ++i)
{
if (!IsByteToBeIgnored(utf8[i]))
{
consumed = i;
return true;
}
}
consumed = 0;
return false;
}
...
We can avoid scanning, since we should be able to check for the 2 whitespace chars (\r\n) after every 76 bytes. I currently do this in: runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs Line 559 in f389289
I'd also like to annotate the vectorized code path: runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs Line 72 in f389289
if (Avx2.IsSupported)
{
while (end >= src)
{
// ---------- Scenario 1: no whitespace, will complete and exit
bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
if (isComplete)
{
break;
}
// ---------- Scenario 2: whitespace after every 76 chars, method above will eventually fail, here whitespace is omitted in the next vectorized decode, then the path jumps to the start of the loop to repeat the process.
isComplete = TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(utf8, ref src, ref dest, ref end, destLength, srcBytes, destBytes, 32);
if (isComplete)
{
continue;
}
// ---------- Scenario 3: whitespace is somewhere else or invalid bytes are present, as soon as whitespace has been skipped, this path jumps to the start of the loop to reattempt vectorized decoding.
// Process 4 bytes, until first set of invalid bytes are skipped.
lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
if (lastBlockStatus != OperationStatus.Done)
{
break;
}
}
} |
It's not and was not meant as "suggested code" which can be used 1:1. It's a piece of code to lay you the road on how it could be done. Thus I don't understand
as we had this already before. In the code fragment that I've shown above some parts were missing, which should be worked out. For me your code with In my Base64-project I did a quick stab based on the idea outlined above. The code for the actual decoding isn't touched -- similar to the outline (not suggestion) in #79334 (review). All work is done on top of the core decoding method. If that doesn't work it's fallen back to a slow method, the decodes by blocks of four. Perf isn't great here, but when we land here then the input is degenerated (in my opinion), so it's not worth to have super fast perf as long as correct decoding can be done. Note: at the moment there are some basic tests and a fuzz-run. I don't know if my PR for the change in that repo is good enough to merge or not -- anyway that's independent from this PR here. |
I just checked out the code in your project. I stepped through the code and it appears that my previous commit essentially does the same as yours:
if (Avx2.IsSupported)
{
while (end >= src)
{
// ---------- Step 1: Decode fast, until whitespace is hit
bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
if (isComplete)
{
break;
}
// ---------- Step 2: Process 4 bytes, until first set of invalid bytes are skipped. Then go back to fast decoding.
lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
if (lastBlockStatus != OperationStatus.Done)
{
break;
}
}
} You mentioned this needed to be optimized which is why I added Step 2 bellow: if (Avx2.IsSupported)
{
while (end >= src)
{
// ---------- Step 1: no whitespace, will complete and exit
bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
if (isComplete)
{
break;
}
// ---------- Step 2: whitespace after every 76 chars, method above will eventually fail, here whitespace is omitted in the next vectorized decode, then the path jumps to the start of the loop to repeat the process.
isComplete = TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(utf8, ref src, ref dest, ref end, destLength, srcBytes, destBytes, 32);
if (isComplete)
{
continue;
}
// ---------- Step 3: whitespace is somewhere else or invalid bytes are present, as soon as whitespace has been skipped, this path jumps to the start of the loop to reattempt vectorized decoding.
// Process 4 bytes, until first set of invalid bytes are skipped.
lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
if (lastBlockStatus != OperationStatus.Done)
{
break;
}
}
} Step 2 here attempts to get rid of the whitespace (without scanning as we should be able to quickly detect if this is in fact the whitespace every 76 chars scenario) and go straight back to vector decoding instead of 4 byte blocks. Is this optimization not necessary? |
To add a little more context as to why my PR appears quite large, a lot of my approach was also influenced by the requirement: "The Base64 decoding methods should handle whitespace chars as the Convert.ToBase64 decoding methods". e.g. If Base64.IsValid == true, then Base64.Decode should return OperationStatus.Done. |
This fixes a failing test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits about naming.
My only open question is about #79334 (comment)
A test with AQ==
should have consumed = 5
. If this test (and all others) pass, then I'm happy with this PR.
When existing tests with invalid input start to fail, then these tests should be validated for the correct behavior.
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static bool IsValid<T, T2>(ReadOnlySpan<T> base64Text, out int decodedLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private static bool IsValid<T, T2>(ReadOnlySpan<T> base64Text, out int decodedLength) | |
private static bool IsValid<T, TBase64Validatable>(ReadOnlySpan<T> base64Text, out int decodedLength) |
to give it a more handy name?
static abstract bool IsEncodingPad(T value); | ||
} | ||
|
||
internal readonly struct Base64CharValidationHandler : IBase64Validatable<char> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal readonly struct Base64CharValidationHandler : IBase64Validatable<char> | |
internal readonly struct Base64CharValidatable : IBase64Validatable<char> |
Same for the Byte-type.
static abstract int IndexOfAnyExcept(ReadOnlySpan<T> span); | ||
static abstract bool IsWhitespace(T value); | ||
static abstract bool IsEncodingPad(T value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really ❤️ this possibility in the language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one question regarding tests left + 2 nits, otherwise LGTM.
if (length < 0) | ||
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (length < 0) | |
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.length); | |
ArgumentOutOfRangeException.ThrowIfNegative(length); |
I'm not sure if the throw-helper should be used here or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method just got moved around, it is untouched from the original: https://github.com/dotnet/runtime/blob/f52e277f54a9413d2f0bd42b8b957c8e4fd263ad/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs#LL238C24-L238C24
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I know. It's just as this PR touches that code already that piece can be re-visited to take the new throw-helpers.
I'm not sure what's the general guidance here in runtime -- so whether keep the current implementation or the new possibilities. (for the latter it should be validated if it this method still inlines)
I'll leave that to the maintainers though.
{ | ||
int bufferIdx = 0; | ||
|
||
while (sourceIndex < length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (sourceIndex < length | |
while ((uint)sourceIndex < (uint)length |
To elide the bound-check in L544?
[InlineData("AQ==", 4, 1)] | ||
[InlineData("AQ== ", 5, 1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you mix in some variants like
[InlineData("AQ==", 4, 1)] | |
[InlineData("AQ== ", 5, 1)] | |
[InlineData("AQ==", 4, 1)] | |
[InlineData("AQ== ", 5, 1)] | |
[InlineData("AQ ==", 5, 1)] | |
[InlineData("AQ= =", 5, 1)] |
or something like that covered by the test ValidBase64Strings_WithCharsThatMustBeIgnored
below?
(Sorry, that I don't remember the generated testcases, it's too long ago 😉)
Also in https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ec you had some cool tests that generated inputs with whitespace at various places. In my lib that gist got into unit tests. Is everything covered here or should these tests be added too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I covered some of these cases here: https://github.com/dotnet/runtime/pull/79334/files#diff-6b4fec85572fdbcc9b03c81fe66cf8bf31cbb5620235abce7dab3c6dc034d862R11
It's a lighter version of the gist, as I was not sure if a test with that many loops/asserts would be acceptable in the pipeline.
src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs
Outdated
Show resolved
Hide resolved
@heathbm could you please resolve the merge conflict? |
1 similar comment
@heathbm could you please resolve the merge conflict? |
@dotnet/area-system-memory can one of you review this change to unblock the PR and merge it? |
@heathbm could you please resolve the merge conflict? |
@heathbm any news? |
@tarekgh I would love to work on this, however I have no free cycles. I did check, and it looks like Base64Transforms.cs was refactored. The methods I modified no longer exist. This would require a different approach from what I initially proposed. |
Thanks @heathbm for your feedback. I am closing this PR as it became stale and needs a fully different fix. |
@tarekgh Please be aware, only Base64Transforms.cs requires rework, this is very small in the context of this PR. All other work is not stale. |
Implementation of api-approved issue: #76020
Goals regarding the changes to decoding:
Chars now ignored by Decoding methods:
Performance:
The
Base64
decoding methods should handle whitespace chars as theConvert.ToBase64
decoding methods. A large number of tests have been carried out to ensure this is the case: https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ecExisting tests should not need to be altered. At the moment, a slight modification has been made to
Base64DecoderUnitTests.cs
due to the fact that invalid ranges cannot as easily be inferred with the length of the input span. If this is a deal-breaker, a change can be made, that would involve the decoding loop to look at the next block before writing the current one.