-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Base64Decoder to ignore space chars, add IsValid methods and tests #78951
Conversation
Chars now ignored: 9: Line feed 10: Horizontal tab 13: Carriage return 32: Space -- Vertical tab omitted
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsImplementation of api-approved issue: #76020
|
@@ -35,6 +35,215 @@ public static partial class Base64 | |||
/// or if the input is incomplete (i.e. not a multiple of 4) and <paramref name="isFinalBlock"/> is <see langword="true"/>. | |||
/// </returns> | |||
public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) | |||
{ | |||
// Validation must occur prior to decoding as the actual length will impact future calculations | |||
bool containsIgnoredBytes = utf8.IndexOfAny(BytesToIgnore) != -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect this is going to regress performance, which we want to avoid. It should be possible to structure this such that we only pay additional cost in the existing implementation when it encounters an invalid character according to its current definition, which includes whitespace, at which point it can fall back to an implementation that allows for whitespace but is slower.
if (containsIgnoredBytes) | ||
{ | ||
// Create a new span without bytes to be ignored | ||
Span<byte> utf8WithIgnoredBytesRemoved = stackalloc byte[utf8.Length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not safe and could easily stack overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your interest in working on this. I only took a cursory look, but I think the whole approach taken here needs to be revisited.
if (containsIgnoredBytes) | ||
{ | ||
// Create a new span without bytes to be ignored | ||
Span<byte> utf8WithIgnoredBytesRemoved = stackalloc byte[utf8.Length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if utf8.Length
is huge? Stackoverflow. Stack allocation must be guarded by some reasonable threshold (e.g. 256 byte), otherwise fall back to renting an array from the ArrayPool or if considered as rare-path just allocate (ToArray).
@@ -35,6 +35,215 @@ public static partial class Base64 | |||
/// or if the input is incomplete (i.e. not a multiple of 4) and <paramref name="isFinalBlock"/> is <see langword="true"/>. | |||
/// </returns> | |||
public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true) | |||
{ | |||
// Validation must occur prior to decoding as the actual length will impact future calculations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will most likely cause a perf-regression, as decoding changed from
Maybe it's better to just start decoding w/o any upfront check or copy, and if an invalid input is encountered fall back and decide what to do. If it's truly invalid then exit. If it's space, then ignore that position and continue. Thus the
To implement this you can split the decoding into two methods. One "driver" method, and one "worker" method. This is only relevant for the invalid-case, as then the driver can evaluate the state and call the worker again. Look at the buffer-chain examples / tests to see how someting like this can be done.
} | ||
|
||
// Check for invalid chars | ||
int indexOfFirstNonBase64 = base64Text.IndexOfAnyExcept(ValidBase64CharsSortedAsc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use #68328 (comment)
(see linked PRs for examples).
|
||
// Check for invalid chars | ||
int indexOfFirstNonBase64 = base64Text.IndexOfAnyExcept(ValidBase64CharsSortedAsc); | ||
if (indexOfFirstNonBase64 > -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (indexOfFirstNonBase64 > -1) | |
if (indexOfFirstNonBase64 >= 0) |
Is a tiny little bit better at CPU-level as comparisons with 0 followed by a jump (branch) can be optimized by CPUs.
int paddingCount = 0; | ||
|
||
// Check if there are chars that need to be ignored while determining the length | ||
if (base64Text.IndexOfAny(CharsToIgnore) > -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (base64Text.IndexOfAny(CharsToIgnore) > -1) | |
if (base64Text.IndexOfAny(CharsToIgnore) >= 0) |
And use IndexOfAnyValues
(see above).
/// <param name="base64TextUtf8">The input span which contains UTF-8 encoded text in base64 that needs to be validated.</param> | ||
/// <param name="decodedLength">The maximum length (in bytes) if you were to decode the base 64 encoded text <paramref name="base64TextUtf8"/> within a byte span.</param> | ||
/// <returns>true if <paramref name="base64TextUtf8"/> is decodable; otherwise, false.</returns> | ||
public static unsafe bool IsValid(ReadOnlySpan<byte> base64TextUtf8, out int decodedLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as above apply.
@stephentoub race condition -> you won 😄 |
Thanks for the feedback @stephentoub and @gfoidl , I will use it to improve to make better contributions in the future. |
Thanks, @heathbm. I do appreciate your efforts here. Are you planning to take another look at this one, or you're going to leave it for someone else and look at other things? |
@stephentoub I'm currently working on a new PR that does retain the O(1n) performance characteristic using the existing implementation as outlined in the feedback. Since I am not assigned to the issue, if anyone submits a complete PR before I do, I will gladly step aside however. |
Implementation of api-approved issue: #76020