Allow Base64Decoder to ignore space chars, add IsValid methods and tests #79334

Closed
wants to merge 19 commits into from


heathbm
Contributor

@heathbm heathbm commented Dec 7, 2022

Implementation of api-approved issue: #76020

Goals regarding the changes to decoding:

  1. Chars now ignored by Decoding methods:

    • 9: Horizontal tab
    • 10: Line feed
    • 13: Carriage return
    • 32: Space
    • Vertical tab omitted
  2. Performance:

    • The best-case path (when no chars to be ignored are encountered) should remain as untouched as possible.
    • The worst-case path should still run in O(n).
  3. The Base64 decoding methods should handle whitespace chars the same way the Convert.FromBase64 decoding methods do. A large number of tests have been carried out to ensure this is the case: https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ec

  4. Existing tests should not need to be altered. At the moment, a slight modification has been made to Base64DecoderUnitTests.cs because invalid ranges can no longer be inferred as easily from the length of the input span. If this is a deal-breaker, a change can be made that would have the decoding loop look at the next block before writing the current one.
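
For reference, the behavior that goal 3 aims to match can be illustrated with a small sketch. The class and method names below are hypothetical and only wrap the existing Convert.FromBase64String call; the sketch assumes nothing beyond the whitespace handling described in this PR (space, tab, CR and LF are skipped).

```csharp
using System;

static class ConvertWhitespaceDemo
{
    // Hypothetical helper: true when both strings decode to the same bytes
    // through Convert.FromBase64String.
    public static bool DecodesSame(string compact, string spaced)
    {
        byte[] expected = Convert.FromBase64String(compact);
        byte[] actual = Convert.FromBase64String(spaced);
        return expected.AsSpan().SequenceEqual(actual);
    }
}
```

With this helper, `ConvertWhitespaceDemo.DecodesSame("YQ==", " Y Q\t=\r\n= ")` should be true, which is the behavior the span-based decoder is meant to reproduce.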

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Dec 7, 2022
@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost

ghost commented Dec 7, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author: heathbm
Assignees: -
Labels:

area-System.Memory, new-api-needs-documentation, community-contribution

Milestone: -

@heathbm
Contributor Author

heathbm commented Dec 7, 2022

Member

@MihaZupan MihaZupan left a comment

Thanks for looking into this again!

I only looked through the IsValid part.

cc: @gfoidl

@heathbm
Contributor Author

heathbm commented Dec 8, 2022

Thank you for the feedback @MihaZupan I believe I have addressed those comments in my latest commit.

Member

@gfoidl gfoidl left a comment

Had a quick look over the code. I'd like to have one point addressed, then will do a more thorough review.


When vectorized code detects an invalid char (could be whitespace), then it falls back to pure scalar processing, thus leaving perf on the table.
Often whitespace is inserted after 76 chars, so we should try to optimize that case.

Take that example:

Span<byte> bytes = new byte[1000];
Random.Shared.NextBytes(bytes);

string base64Text = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
ReadOnlySpan<byte> base64 = Encoding.ASCII.GetBytes(base64Text);

OperationStatus status = Base64.DecodeFromUtf8(base64, bytes, out int consumed, out int written);

Here line-breaks are inserted every 76-chars. So after decoding the first "set", all the remainder is done scalar, whilst it could be done vectorized too. This is what I meant in #78951 (comment).

To do this you can use something like:

static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
{
    OperationStatus status;
    int consumed = 0;
    int written = 0;
    int totalConsumed = 0;
    int totalWritten = 0;

    while (true)
    {
        // DecodeFromUtf8Core is the current implementation from main, just renamed and made private
        status = DecodeFromUtf8Core(utf8, bytes, out consumed, out written, isFinalBlock);
        totalConsumed += consumed;
        totalWritten += written;

        if (status != OperationStatus.InvalidData)
        {
            break;
        }

        // Found invalid data, check if it's whitespace and can be skipped
        utf8 = utf8.Slice(consumed);
        bytes = bytes.Slice(written);

        if (!TrySkipWhitespace(utf8, out consumed))
        {
            break;
        }

        utf8 = utf8.Slice(consumed);
    }

    bytesConsumed = totalConsumed;
    bytesWritten = totalWritten;
    return status;
}

// This scans potentially to the end. There could be a more robust approach.
// Maybe try IndexOfAnyValues here?
static bool TrySkipWhitespace(ReadOnlySpan<byte> utf8, out int consumed)
{
    for (int i = 0; i < utf8.Length; ++i)
    {
        if (!IsByteToBeIgnored(utf8[i]))
        {
            consumed = i;
            return true;
        }
    }

    consumed = 0;
    return false;
}

I hope you get the idea.
Note: the code shown doesn't handle edge cases, so those should be considered; also make sure that truly invalid data doesn't leave it stuck in an endless loop.
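
One possible way to address the endless-loop concern is sketched below. This is not code from the PR: the class name is hypothetical, and it assumes .NET 7 or later for MemoryExtensions.IndexOfAnyExcept and C# 11 for the u8 literal. The helper only reports success when at least one whitespace byte was actually skipped, so a retry loop around the core decoder either makes progress or stops.

```csharp
using System;

static class WhitespaceSkipSketch
{
    // The four bytes the PR treats as ignorable: space, tab, CR, LF.
    private static ReadOnlySpan<byte> Whitespace => " \t\r\n"u8;

    // Reports how many leading whitespace bytes can be skipped.
    // Returns false when the input is empty or its first byte is not
    // whitespace (truly invalid data), so the caller never loops without progress.
    public static bool TrySkipWhitespace(ReadOnlySpan<byte> utf8, out int consumed)
    {
        int firstOther = utf8.IndexOfAnyExcept(Whitespace);
        consumed = firstOther < 0 ? utf8.Length : firstOther;
        return consumed > 0;
    }
}
```

A driver built on this would presumably also add the skipped count to its running total of consumed bytes, which the outline above leaves out.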


while (src + 4 <= srcEnd)
Member

Suggested change
- while (src + 4 <= srcEnd)
+ while (src <= srcEnd - 4)

src will be incremented, so src + 4 needs to be evaluated in each iteration. With srcEnd - 4 it's the same condition, but that value can be kept w/o re-evaluation.

@@ -46,7 +46,7 @@ public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Spa
 fixed (byte* srcBytes = &MemoryMarshal.GetReference(utf8))
 fixed (byte* destBytes = &MemoryMarshal.GetReference(bytes))
 {
-    int srcLength = utf8.Length & ~0x3; // only decode input up to the closest multiple of 4.
+    int srcLength = utf8.Length; // only decode input up to the closest multiple of 4.
Member

Shouldn't the & ~0x3 be kept?
It's needed for invalid input, i.e. if it's not a multiple of 4. Or is this handled elsewhere?

goto InvalidDataExit;
{
int firstInvalidIndex = GetIndexOfFirstByteToBeIgnored(src);
if (firstInvalidIndex != -1)
Member

Suggested change
- if (firstInvalidIndex != -1)
+ if (firstInvalidIndex >= 0)

A comparison with 0 is a tiny bit faster than other comparisons.


for (int currentBlockIndex = firstInvalidIndex; currentBlockIndex < 4; currentBlockIndex++)
{
while (src + validBytesSearchIndex < srcEnd
Member

Suggested change
- while (src + validBytesSearchIndex < srcEnd
+ while (src < srcEnd - validBytesSearchIndex

totalBytesIgnored++;
}

if (src + validBytesSearchIndex >= srcEnd)
Member

Suggested change
- if (src + validBytesSearchIndex >= srcEnd)
+ if (src >= srcEnd - validBytesSearchIndex)

}

// Remove padding to get exact length
decodedLength = (length / 4 * 3) - paddingCount;
Member

Suggested change
- decodedLength = (length / 4 * 3) - paddingCount;
+ decodedLength = (int)((uint)length / 4 * 3) - paddingCount;

Ugly, but better codegen.
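
One way to read the "better codegen" remark: signed division by 4 has to round toward zero for negative inputs, so it generally cannot be compiled to a bare shift, whereas unsigned division can. The sketch below (hypothetical class and method names, not part of the PR) puts the two forms side by side. It assumes length is non-negative, which holds for a span length, so both give the same result.

```csharp
using System;

static class DecodedLengthSketch
{
    // Form as originally written: signed arithmetic throughout.
    public static int Signed(int length, int paddingCount)
        => (length / 4 * 3) - paddingCount;

    // Suggested form: the division is done on uint, then cast back to int.
    // Only equivalent to Signed when length >= 0.
    public static int Unsigned(int length, int paddingCount)
        => (int)((uint)length / 4 * 3) - paddingCount;
}
```

For example, a length of 8 with 2 padding chars gives 4 decoded bytes in both forms.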

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool IsCharToBeIgnored(char aChar)
Member

See above.
Or will the compilers be able to optimize this (in the near future)?

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool IsCharToBeIgnored(char aChar)
Member

If this method took an int as its argument, it could be used both here and in the decoder and be moved to a shared place, thus avoiding the duplication.

}
}

if (length % 4 != 0)
Member

@gfoidl gfoidl Dec 8, 2022

TIL: there is no need for the uint cast here, the JIT will emit good code 👍🏻.

}

// Remove padding to get exact length
decodedLength = (length / 4 * 3) - paddingCount;
Member

Suggested change
- decodedLength = (length / 4 * 3) - paddingCount;
+ decodedLength = (int)((uint)length / 4 * 3) - paddingCount;

@heathbm
Contributor Author

heathbm commented Dec 14, 2022

@gfoidl Thank you very much for the example. I had this idea with the IsValid method too, where I sliced until I hit padding, then proceeded to loop over the remaining bytes.

However, with the decode method, one of the goals is "The Base64 decoding methods should handle whitespace chars as the Convert.ToBase64". To elaborate, Convert can handle any number of whitespace in any position.

So your suggestion will not work:

...
if (!TrySkipWhitespace(utf8, out consumed))
{
    break;
}

utf8 = utf8.Slice(consumed);
...

As this would fail for example: " Y Q = = ". Yet Convert would be able to parse it.
As a compromise, if invalid bytes are hit while decoding with the vectorized paths, block by block decoding will take over, until invalid bytes have been skipped. Then, vectorized decoding will try to resume if possible.

I have implemented or addressed your other comments also.

I hope my approach is not too 'bloaty'. "The Base64 decoding methods should handle whitespace chars as the Convert.ToBase64" + "Existing tests should not need to be altered" resulted in more code than I initially expected.
I would also like to reiterate, if a complete PR is submitted while this one is still open, I would be happy to close this one.

@gfoidl
Member

gfoidl commented Dec 14, 2022

So your suggestion will not work:

Well it works...at least under the presupposition that this is a suggestion to get you on the right road 😉

invalid bytes are hit while decoding with the vectorized paths, block by block decoding will take over, until invalid bytes have been skipped. Then, vectorized decoding will try to resume if possible.

Just to specify: only whitespace is skipped, not any invalid bytes.
(I know you know this, but as written it can be mis-read)

Convert can handle any number of whitespace in any position.
...
" Y Q = = ". Yet Convert would be able to parse it.

Yep, but that's an artificial example which (I assume) won't occur often in real-world base64 encoded bytes.
Basically there will be two major sets of base64 encoded bytes:

  • no whitespace at all
  • line-breaks after 76 chars

So I'd be happy when we can optimize for these cases. All other cases, like the one given in the quote, should work, but can be penalized perf-wise (as they are very uncommon).

In my suggestion TrySkipWhitespace needs to be replaced with a correct / better implementation that

  • skips whitespace and moves the pointer forward in the input-data
  • signals "end" when truly invalid data is detected

That's not done in the suggestion (for simplicity, and just to outline how it could work).

Thus -- at least I hope so -- the actual code change is quite minimal. Keep the workhorse method as is, just renamed with a Core suffix. Then call that workhorse method from the "driver" method, which handles the checks given in the two bullets above.

So for the case "no whitespace at all" there's only one method call more than the actual code has. So perf should be on-par.
For the case "line-breaks after 76 chars" 76 chars can be processed fast, then the line-break is skipped, another 76 chars are processed, and so on.

The example " Y Q = = " should work too, but you're right that this is a case I didn't consider in my suggestion so far. So let's extend that suggestion (point 4, last ->)

  1. Core called with " Y Q = = " -> returns OperationStatus.InvalidData
  2. check if truly invalid or whitespace -> whitespace, so skip it -> it remains "Y Q = = "
  3. Core called with "Y Q = = " -> returns OperationStatus.InvalidData
  4. check if truly invalid or whitespace -> first char "Y" is valid -> fallback to slow method

The "slow method" creates a buffer of block size (e.g. stackalloc byte[4]) and fills it with the valid non-whitespace bytes, stopping when an invalid byte is found. For the example, that buffer would be filled with YQ==.
After filling that block buffer, Core is called to do the actual decoding. The buffer is valid and has padding, so it should be the final "block" to process. After that only one byte from the input remains, so check whether it is whitespace (-> OK) or invalid data, where in this case invalid data is anything other than whitespace (-> KO).
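
A minimal sketch of such a slow path follows, under stated assumptions: the class and method names are hypothetical, the existing Base64.DecodeFromUtf8 stands in for Core and is only ever handed a clean 4-byte block, and the input is treated as the final block. It is an illustration of the idea, not the implementation from this PR.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

static class SlowPathSketch
{
    private static bool IsWhitespace(byte b) => b == ' ' || b == '\t' || b == '\r' || b == '\n';

    // Gathers non-whitespace bytes into a 4-byte block buffer and decodes each full block.
    public static OperationStatus Decode(ReadOnlySpan<byte> utf8, Span<byte> bytes,
        out int bytesConsumed, out int bytesWritten)
    {
        Span<byte> block = stackalloc byte[4];
        int blockLength = 0;
        bool sawPadding = false;
        bytesConsumed = 0;
        bytesWritten = 0;

        for (int i = 0; i < utf8.Length; i++)
        {
            byte b = utf8[i];
            if (IsWhitespace(b))
            {
                continue;
            }

            if (sawPadding)
            {
                return OperationStatus.InvalidData; // only whitespace may follow a padded block
            }

            block[blockLength++] = b;
            if (blockLength < block.Length)
            {
                continue;
            }

            // The block holds exactly four non-whitespace bytes; the strict decoder validates them.
            OperationStatus status = Base64.DecodeFromUtf8(block, bytes.Slice(bytesWritten), out _, out int written);
            if (status != OperationStatus.Done)
            {
                return status;
            }

            bytesWritten += written;
            bytesConsumed = i + 1;
            sawPadding = written < 3; // fewer than 3 bytes means the block ended in padding
            blockLength = 0;
        }

        if (blockLength != 0)
        {
            return OperationStatus.InvalidData; // incomplete trailing block
        }

        bytesConsumed = utf8.Length; // trailing whitespace counts as consumed
        return OperationStatus.Done;
    }
}
```

For the input " Y Q = = " this gathers YQ== into the block, decodes one byte, and then accepts the trailing space, so all nine input bytes are reported as consumed.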

This also means that once such a sequence -- invalid data, where the first byte is actually a valid base64 byte -- is detected it will be pure scalar processing, w/o going back to vectorized again. I'd take this, as

  • it's an unlikely input -- constructed artificially for tests
  • it keeps the code streamlined, so the common inputs aren't penalized
  • the code change is also streamlined
  • as whitespace wasn't allowed before, it's not a perf regression; there is simply nothing to compare with
  • bring in a PR that enables whitespace, and optimize it later with another PR if needed

I would also like to reiterate, if a complete PR is submitted while this one is still open, I would be happy to close this one.

I don't follow. Let's bring your / this PR to a successful end.

@heathbm
Contributor Author

heathbm commented Dec 15, 2022

Thank you for the details, I'm 100% onboard with those 3 scenarios. I will update the PR accordingly.

@gfoidl
Member

gfoidl commented Dec 15, 2022

Side note: please don't do the full-quotes -- that content is in the comment history anyway.

@heathbm
Contributor Author

heathbm commented Dec 20, 2022

@gfoidl Regarding:

For the case "line-breaks after 76 chars" 76 chars can be processed fast, then the line-break is skipped, another 76 chars are processed, and so on.

and

In my suggestion TrySkipWhitespace needs to be replaced with a correct / better implementation that

  • skips whitespace and moves the pointer forward in the input-data
  • signals "end" when truly invalid data is detected

Suggested code for reference:

...
if (!TrySkipWhitespace(utf8, out consumed))
{
    break;
}

utf8 = utf8.Slice(consumed);
...

...
static bool TrySkipWhitespace(ReadOnlySpan<byte> utf8, out int consumed)
{
    for (int i = 0; i < utf8.Length; ++i)
    {
        if (!IsByteToBeIgnored(utf8[i]))
        {
            consumed = i;
            return true;
        }
    }

    consumed = 0;
    return false;
}
...

TrySkipWhitespace skips valid bytes until whitespace is found. Those valid bytes need to be decoded also. I did not see a straightforward way to fill in the gaps with this suggestion, since those skipped bytes would be lost after the slice. However, I would like to raise another similar approach:

We can avoid scanning, since we should be able to check for the 2 whitespace chars (\r\n) after every 76 bytes. I currently do this in:

private static unsafe bool TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(ReadOnlySpan<byte> utf8, ref byte* src, ref byte* dest, ref byte* end, int destLength, byte* srcBytes, byte* destBytes, byte groupSize)

I'd also like to annotate the vectorized code path:

if (Avx2.IsSupported)
{
    while (end >= src)
    {
        // ---------- Scenario 1: no whitespace, will complete and exit
        bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
        if (isComplete)
        {
            break;
        }

        // ---------- Scenario 2: whitespace after every 76 chars, method above will eventually fail, here whitespace is omitted in the next vectorized decode, then the path jumps to the start of the loop to repeat the process.
        isComplete = TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(utf8, ref src, ref dest, ref end, destLength, srcBytes, destBytes, 32);
        if (isComplete)
        {
            continue;
        }

        // ---------- Scenario 3: whitespace is somewhere else or invalid bytes are present, as soon as whitespace has been skipped, this path jumps to the start of the loop to reattempt vectorized decoding.
        // Process 4 bytes, until first set of invalid bytes are skipped.
        lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
        if (lastBlockStatus != OperationStatus.Done)
        {
            break;
        }
    }
}

@gfoidl
Member

gfoidl commented Dec 20, 2022

Suggested code for reference:

It is not, and was not meant as, "suggested code" that can be used 1:1. It's a piece of code to show you the road to how it could be done. Thus I don't understand

TrySkipWhitespace skips valid bytes until whitespace is found. Those valid bytes need to be decoded also.

as we had this already before. In the code fragment that I've shown above some parts were missing, which should be worked out.

For me your code with TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation and the % 76 looks too complicated.

In my Base64-project I did a quick stab based on the idea outlined above. The code for the actual decoding isn't touched -- similar to the outline (not suggestion) in #79334 (review). All work is done on top of the core decoding method.
The basic part is this loop, which tries to decode as much as possible. If invalid data is found -- which at that point includes whitespace -- it tries to skip possible whitespace, then loops again. So cases with regular whitespace patterns (e.g. a line break after 76 chars) are handled fast and vectorized.

If that doesn't work, it falls back to a slow method that decodes in blocks of four. Perf isn't great here, but when we land here the input is degenerate (in my opinion), so it's not worth having super fast perf as long as correct decoding can be done.

Note: at the moment there are some basic tests and a fuzz-run. I don't know if my PR for the change in that repo is good enough to merge or not -- anyway that's independent from this PR here.

@heathbm
Contributor Author

heathbm commented Dec 20, 2022

I just checked out the code in your project. I stepped through the code and it appears that my previous commit essentially does the same as yours:

  • Vectorized decode, until whitespace is hit
  • Decode 4-byte blocks until whitespace is hit, then try to go back to vectorized decoding. My method advances the pointer, yours slices from a worker method.

if (Avx2.IsSupported)
{
    while (end >= src)
    {
        // ---------- Step 1: Decode fast, until whitespace is hit
        bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
        if (isComplete)
        {
            break;
        }

        // ---------- Step 2: Process 4 bytes, until first set of invalid bytes are skipped. Then go back to fast decoding.
        lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
        if (lastBlockStatus != OperationStatus.Done)
        {
            break;
        }
    }
}

You mentioned this needed to be optimized, which is why I added Step 2 below:

if (Avx2.IsSupported)
{
    while (end >= src)
    {
        // ---------- Step 1: no whitespace, will complete and exit
        bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
        if (isComplete)
        {
            break;
        }

        // ---------- Step 2: whitespace after every 76 chars, method above will eventually fail, here whitespace is omitted in the next vectorized decode, then the path jumps to the start of the loop to repeat the process.
        isComplete = TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(utf8, ref src, ref dest, ref end, destLength, srcBytes, destBytes, 32);
        if (isComplete)
        {
            continue;
        }

        // ---------- Step 3: whitespace is somewhere else or invalid bytes are present, as soon as whitespace has been skipped, this path jumps to the start of the loop to reattempt vectorized decoding.
        // Process 4 bytes, until first set of invalid bytes are skipped.
        lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
        if (lastBlockStatus != OperationStatus.Done)
        {
            break;
        }
    }
}

Step 2 here attempts to get rid of the whitespace (without scanning as we should be able to quickly detect if this is in fact the whitespace every 76 chars scenario) and go straight back to vector decoding instead of 4 byte blocks. Is this optimization not necessary?

@heathbm
Contributor Author

heathbm commented Dec 20, 2022

To add a little more context as to why my PR appears quite large: a lot of my approach was also influenced by the requirement "The Base64 decoding methods should handle whitespace chars as the Convert.ToBase64 decoding methods", e.g. if Base64.IsValid == true, then Base64.Decode should return OperationStatus.Done.
Currently, the code in your project returns OperationStatus.InvalidData for " ", whereas Convert.FromBase64String just returns an empty array. I have been using https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ec to test many scenarios. Currently, 397,488 should be passing. However, with your code, 133,538 pass and 263,902 fail. I believe the intention should be to have all of these pass, such that anything that Convert.FromBase64String can decode, Base64.Decode should also be able to decode.

@build-analysis build-analysis bot mentioned this pull request Dec 21, 2022
Member

@gfoidl gfoidl left a comment

Some nits about naming.

My only open question is about #79334 (comment)
A test with AQ== should have consumed = 5. If this test (and all others) pass, then I'm happy with this PR.

When existing tests with invalid input start to fail, then these tests should be validated for the correct behavior.

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool IsValid<T, T2>(ReadOnlySpan<T> base64Text, out int decodedLength)
Member

Suggested change
- private static bool IsValid<T, T2>(ReadOnlySpan<T> base64Text, out int decodedLength)
+ private static bool IsValid<T, TBase64Validatable>(ReadOnlySpan<T> base64Text, out int decodedLength)

to give it a more handy name?

static abstract bool IsEncodingPad(T value);
}

internal readonly struct Base64CharValidationHandler : IBase64Validatable<char>
Member

Suggested change
- internal readonly struct Base64CharValidationHandler : IBase64Validatable<char>
+ internal readonly struct Base64CharValidatable : IBase64Validatable<char>

Same for the Byte-type.

Comment on lines +130 to +132
static abstract int IndexOfAnyExcept(ReadOnlySpan<T> span);
static abstract bool IsWhitespace(T value);
static abstract bool IsEncodingPad(T value);
Member

I really ❤️ this possibility in the language.

@heathbm
Contributor Author

heathbm commented Mar 10, 2023

A test with AQ== should have consumed = 5
@gfoidl Here is a test that covers some scenarios with extra whitespace at the end: 5848058

Member

@gfoidl gfoidl left a comment

Only one question regarding tests left + 2 nits, otherwise LGTM.

Comment on lines +92 to +93
if (length < 0)
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.length);
Member

Suggested change
if (length < 0)
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.length);
ArgumentOutOfRangeException.ThrowIfNegative(length);

I'm not sure if the throw-helper should be used here or not.
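For comparison, a small sketch of the two guard styles being discussed. ThrowHelper is internal to the runtime, so the first method below uses a plain throw as a stand-in (an assumption for illustration); the second uses ArgumentOutOfRangeException.ThrowIfNegative, which is available from .NET 8. The class and method names are hypothetical.

```csharp
using System;

public static class LengthGuards
{
    // Hand-written guard, standing in for the internal ThrowHelper call.
    public static void CheckManual(int length)
    {
        if (length < 0)
            throw new ArgumentOutOfRangeException(nameof(length));
    }

    // The same check expressed with the static helper (.NET 8 and later).
    public static void CheckWithHelper(int length)
    {
        ArgumentOutOfRangeException.ThrowIfNegative(length);
    }
}
```

Both throw ArgumentOutOfRangeException for a negative length; whether one affects inlining of the calling method differently would need to be measured.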

Contributor Author

Member

Yep, I know. It's just that, since this PR already touches that code, that piece could be revisited to use the new throw helpers.
I'm not sure what the general guidance is here in runtime, i.e. whether to keep the current implementation or use the new possibilities. (For the latter, it should be validated that this method still inlines.)

I'll leave that to the maintainers though.

{
int bufferIdx = 0;

while (sourceIndex < length
Member

Suggested change
while (sourceIndex < length
while ((uint)sourceIndex < (uint)length

To elide the bound-check in L544?
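As a rough illustration of the idiom in this suggestion: casting both operands to uint folds the "not negative" and "less than length" tests into one comparison, because a negative int becomes a very large uint. The names below are hypothetical, and the sketch assumes length itself is non-negative and the default unchecked arithmetic context. Whether the JIT actually drops the bounds check for a given loop is something to confirm by inspecting the generated code.

```csharp
public static class UnsignedCompare
{
    // Two explicit comparisons.
    public static bool InRangeTwoChecks(int index, int length)
        => index >= 0 && index < length;

    // One unsigned comparison; equivalent to the above when length >= 0,
    // since a negative index wraps to a value larger than any valid length.
    public static bool InRangeOneCheck(int index, int length)
        => (uint)index < (uint)length;
}
```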

Comment on lines +735 to +736
[InlineData("AQ==", 4, 1)]
[InlineData("AQ== ", 5, 1)]
Member

Can you mix in some variants like

Suggested change
[InlineData("AQ==", 4, 1)]
[InlineData("AQ== ", 5, 1)]
[InlineData("AQ==", 4, 1)]
[InlineData("AQ== ", 5, 1)]
[InlineData("AQ ==", 5, 1)]
[InlineData("AQ= =", 5, 1)]

or is something like that covered by the test ValidBase64Strings_WithCharsThatMustBeIgnored below?
(Sorry that I don't remember the generated test cases; it was too long ago 😉)

Also in https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ec you had some cool tests that generated inputs with whitespace at various places. In my library, that gist was turned into unit tests. Is everything covered here, or should these tests be added too?
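A much smaller sketch of that kind of generated input, for illustration only. It inserts a single ignored character at every position of a valid string and checks the result against Convert.FromBase64String as the reference decoder. The names (WhitespaceVariants, Generate, AllDecodeTo) are hypothetical and this is not the gist's code.

```csharp
using System;
using System.Collections.Generic;

public static class WhitespaceVariants
{
    // The four characters the PR description lists as ignored.
    private static readonly char[] Ignored = { ' ', '\t', '\r', '\n' };

    // Yields every string obtained by inserting one ignored character
    // at each position of the input, including the start and the end.
    public static IEnumerable<string> Generate(string base64)
    {
        for (int i = 0; i <= base64.Length; i++)
            foreach (char ws in Ignored)
                yield return base64.Insert(i, ws.ToString());
    }

    // Returns true if every variant decodes to the expected bytes
    // when passed to the reference decoder.
    public static bool AllDecodeTo(string base64, byte[] expected)
    {
        foreach (string variant in Generate(base64))
        {
            byte[] actual = Convert.FromBase64String(variant);
            if (!actual.AsSpan().SequenceEqual(expected))
                return false;
        }
        return true;
    }
}
```

The same variants could then be fed to the span-based decoder, asserting that the consumed count equals the variant's full length.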

Contributor Author

I covered some of these cases here: https://github.com/dotnet/runtime/pull/79334/files#diff-6b4fec85572fdbcc9b03c81fe66cf8bf31cbb5620235abce7dab3c6dc034d862R11
It's a lighter version of the gist, as I was not sure if a test with that many loops/asserts would be acceptable in the pipeline.

@heathbm heathbm requested a review from MihaZupan March 21, 2023 15:38
@danmoseley danmoseley added the partner-impact This issue impacts a partner who needs to be kept updated label Mar 21, 2023
@tarekgh
Member

tarekgh commented Apr 23, 2023

@heathbm could you please resolve the merge conflict?

@tarekgh
Member

tarekgh commented Apr 23, 2023

@dotnet/area-system-memory can one of you review this change to unblock the PR and merge it?

@tarekgh
Member

tarekgh commented Apr 26, 2023

@heathbm could you please resolve the merge conflict?

@tarekgh
Member

tarekgh commented May 3, 2023

@heathbm any news?

@heathbm
Contributor Author

heathbm commented May 4, 2023

@tarekgh I would love to work on this; however, I have no free cycles. I did check, and it looks like Base64Transforms.cs was refactored. The methods I modified no longer exist. This would require a different approach from what I initially proposed.

@tarekgh
Member

tarekgh commented May 4, 2023

Thanks @heathbm for your feedback. I am closing this PR as it became stale and needs a fully different fix.

@tarekgh tarekgh closed this May 4, 2023
@heathbm
Contributor Author

heathbm commented May 4, 2023

@tarekgh Please be aware that only Base64Transforms.cs requires rework; this is a very small part of this PR. All other work is not stale.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 3, 2023
Labels
area-System.Memory community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation partner-impact This issue impacts a partner who needs to be kept updated
6 participants