Allow Base64Decoder to ignore space chars, add IsValid methods and tests #79334

heathbm · 2022-12-07T07:59:45Z

Implementation of api-approved issue: #76020

Goals regarding the changes to decoding:

Chars now ignored by Decoding methods:
- 9: Horizontal tab
- 10: Line feed
- 13: Carriage return
- 32: Space
- Vertical tab omitted
Performance:
- The best-case path (when no chars to be ignored are encountered) should remain as untouched as possible.
- The worst-case path should still perform at O(n)
The Base64 decoding methods should handle whitespace chars the same way the Convert decoding methods (for example Convert.FromBase64String) do. A large number of tests have been carried out to ensure this is the case: https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ec
Existing tests should not need to be altered. At the moment, a slight modification has been made to Base64DecoderUnitTests.cs because invalid ranges can no longer be inferred as easily from the length of the input span. If this is a deal-breaker, a change can be made in which the decoding loop looks at the next block before writing the current one.
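
For reference, the stated goal can be expressed as a small oracle: strip the four ignored bytes, decode strictly, and compare the result against Convert.FromBase64String. The sketch below only illustrates that idea. DecodeAfterStrippingWhitespace is a hypothetical helper, not code from this PR, and it allocates freely rather than aiming for performance.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

// Hypothetical reference helper (not part of the PR): drop the ignored bytes
// first, then hand the remaining bytes to the existing decoder.
static byte[] DecodeAfterStrippingWhitespace(string text)
{
    byte[] input = Encoding.ASCII.GetBytes(text);
    int kept = 0;
    for (int i = 0; i < input.Length; i++)
    {
        byte b = input[i];
        // 9 = horizontal tab, 10 = line feed, 13 = carriage return, 32 = space
        if (b is 9 or 10 or 13 or 32)
        {
            continue;
        }
        input[kept++] = b; // compact in place; kept never exceeds i
    }

    byte[] output = new byte[Base64.GetMaxDecodedFromUtf8Length(kept)];
    OperationStatus status = Base64.DecodeFromUtf8(input.AsSpan(0, kept), output, out _, out int written);
    if (status != OperationStatus.Done)
    {
        throw new FormatException($"Decoding failed: {status}");
    }
    return output.AsSpan(0, written).ToArray();
}

// Compare against Convert, which already ignores these whitespace chars.
string sample = " Y Q = =\r\n";
byte[] expected = Convert.FromBase64String(sample);
byte[] actual = DecodeAfterStrippingWhitespace(sample);
Console.WriteLine(expected.AsSpan().SequenceEqual(actual));
```

Running such a comparison over many generated inputs is one way to check that decoding agrees with Convert without depending on how the fast path is implemented.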


dotnet-issue-labeler · 2022-12-07T07:59:53Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2022-12-07T07:59:58Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author:	heathbm
Assignees:	-
Labels:	`area-System.Memory`, `new-api-needs-documentation`, `community-contribution`
Milestone:	-

heathbm · 2022-12-07T16:44:49Z

This test:

runtime/src/libraries/System.Security.Cryptography/tests/Base64TransformsTests.cs

Line 304 in 0c4ee9e

    
           Assert.Throws<FormatException>(() => cs.Read(outputBytes, 0, outputBytes.Length));

is no longer throwing an exception: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-79334-merge-b56bfafc0ea743d08d/System.Security.Cryptography.Tests/1/console.b696fe8a.log?helixlogtype=result

This seems reasonable.

MihaZupan

Thanks for looking into this again!

I only looked through the IsValid part.

cc: @gfoidl

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

heathbm · 2022-12-08T09:03:35Z

Thank you for the feedback @MihaZupan I believe I have addressed those comments in my latest commit.

gfoidl

Had a quick look over the code. I'd like to have one point addressed, then will do a more thorough review.

When vectorized code detects an invalid char (could be whitespace), then it falls back to pure scalar processing, thus leaving perf on the table.
Often whitespace is inserted after 76 chars, so we should try to optimize that case.

Take that example:

Span<byte> bytes = new byte[1000];
Random.Shared.NextBytes(bytes);

string base64Text = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
ReadOnlySpan<byte> base64 = Encoding.ASCII.GetBytes(base64Text);

OperationStatus status = Base64.DecodeFromUtf8(base64, bytes, out int consumed, out int written);

Here line-breaks are inserted every 76-chars. So after decoding the first "set", all the remainder is done scalar, whilst it could be done vectorized too. This is what I meant in #78951 (comment).

To do this you can use something like:

static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
{
    OperationStatus status;
    int consumed = 0;
    int written = 0;
    int totalConsumed = 0;
    int totalWritten = 0;

    while (true)
    {
        // DecodeFromUtf8Core is the current implementation from main, just renamed and made private
        status = DecodeFromUtf8Core(utf8, bytes, out consumed, out written, isFinalBlock);
        totalConsumed += consumed;
        totalWritten += written;

        if (status != OperationStatus.InvalidData)
        {
            break;
        }

        // Found invalid data, check if it's whitespace and can be skipped
        utf8 = utf8.Slice(consumed);
        bytes = bytes.Slice(written);

        if (!TrySkipWhitespace(utf8, out consumed))
        {
            break;
        }

        utf8 = utf8.Slice(consumed);
    }

    bytesConsumed = totalConsumed;
    bytesWritten = totalWritten;
    return status;
}

// This scans potentially to the end. There could be a more robust approach.
// Maybe try IndexOfAnyValues here?
static bool TrySkipWhitespace(ReadOnlySpan<byte> utf8, out int consumed)
{
    for (int i = 0; i < utf8.Length; ++i)
    {
        if (!IsByteToBeIgnored(utf8[i]))
        {
            consumed = i;
            return true;
        }
    }

    consumed = 0;
    return false;
}

I hope you get the idea.
Note: the code shown doesn't handle edge cases, so those still need to be considered. In particular, if truly invalid data is present, the loop must not get stuck endlessly.

gfoidl · 2022-12-07T12:52:16Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs


+                while (src + 4 <= srcEnd)


Suggested change

while (src + 4 <= srcEnd)

while (src <= srcEnd - 4)

src will be incremented, so src + 4 needs to be evaluated in each iteration. With srcEnd - 4 it's the same condition, but that value can be kept w/o re-evaluation.

gfoidl · 2022-12-07T12:55:09Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

@@ -46,7 +46,7 @@ public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Spa
            fixed (byte* srcBytes = &MemoryMarshal.GetReference(utf8))
            fixed (byte* destBytes = &MemoryMarshal.GetReference(bytes))
            {
-                int srcLength = utf8.Length & ~0x3;  // only decode input up to the closest multiple of 4.
+                int srcLength = utf8.Length;  // only decode input up to the closest multiple of 4.


Shouldn't the & ~0x3 be kept?
It's needed for invalid input, i.e. if it's not a multiple of 4. Or is this handled elsewhere?

gfoidl · 2022-12-07T12:57:12Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

-                        goto InvalidDataExit;
+                    {
+                        int firstInvalidIndex = GetIndexOfFirstByteToBeIgnored(src);
+                        if (firstInvalidIndex != -1)


Suggested change

if (firstInvalidIndex != -1)

if (firstInvalidIndex >= 0)

A comparison with 0 is a tiny little bit faster than other comparisons.

gfoidl · 2022-12-07T12:57:41Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

+
+                            for (int currentBlockIndex = firstInvalidIndex; currentBlockIndex < 4; currentBlockIndex++)
+                            {
+                                while (src + validBytesSearchIndex < srcEnd


Suggested change

while (src + validBytesSearchIndex < srcEnd

while (src < srcEnd - validBytesSearchIndex

gfoidl · 2022-12-07T12:58:19Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

+                                    totalBytesIgnored++;
+                                }
+
+                                if (src + validBytesSearchIndex >= srcEnd)


Suggested change

if (src + validBytesSearchIndex >= srcEnd)

if (src >= srcEnd - validBytesSearchIndex)

gfoidl · 2022-12-08T11:27:45Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+            }
+
+            // Remove padding to get exact length
+            decodedLength = (length / 4 * 3) - paddingCount;


Suggested change

decodedLength = (length / 4 * 3) - paddingCount;

decodedLength = (int)((uint)length / 4 * 3) - paddingCount;

Ugly, but better codegen.

gfoidl · 2022-12-08T11:28:26Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+        }
+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static bool IsCharToBeIgnored(char aChar)


See above.
Or will the compilers be able to optimize this (in the near future)?

gfoidl · 2022-12-08T11:29:58Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+        }
+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static bool IsCharToBeIgnored(char aChar)


If this method takes an int as argument, then it could be used here and in the decoder, so moved to a shared place, thus avoiding the duplication.

gfoidl · 2022-12-08T11:30:40Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+                }
+            }
+
+            if (length % 4 != 0)


TIL: there is no need for the uint-cast here, the JIT will emit good code 👍🏻.

gfoidl · 2022-12-08T11:31:30Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+            }
+
+            // Remove padding to get exact length
+            decodedLength = (length / 4 * 3) - paddingCount;


Suggested change

decodedLength = (length / 4 * 3) - paddingCount;

decodedLength = (int)((uint)length / 4 * 3) - paddingCount;

heathbm · 2022-12-14T08:08:02Z

@gfoidl Thank you very much for the example. I had this idea with the IsValid method too, where I sliced until I hit padding, then proceeded to loop over the remaining bytes.

However, with the decode method, one of the goals is "The Base64 decoding methods should handle whitespace chars as the Convert.ToBase64". To elaborate, Convert can handle any number of whitespace in any position.

So your suggestion will not work:

...
if (!TrySkipWhitespace(utf8, out consumed))
{
    break;
}

utf8 = utf8.Slice(consumed);
...

This would fail for input such as " Y Q = = ", which Convert is able to parse.
As a compromise, if invalid bytes are hit while decoding with the vectorized paths, block by block decoding will take over, until invalid bytes have been skipped. Then, vectorized decoding will try to resume if possible.

I have implemented or addressed your other comments also.

I hope my approach is not too 'bloaty'. "The Base64 decoding methods should handle whitespace chars as the Convert.ToBase64" + "Existing tests should not need to be altered" resulted in more code than I initially expected.
I would also like to reiterate, if a complete PR is submitted while this one is still open, I would be happy to close this one.

gfoidl · 2022-12-14T11:01:52Z

> So your suggestion will not work:

Well, it works... at least under the presupposition that this is a suggestion to get you on the right road 😉

> invalid bytes are hit while decoding with the vectorized paths, block by block decoding will take over, until invalid bytes have been skipped. Then, vectorized decoding will try to resume if possible.

Just to specify: only whitespace is skipped, not any invalid bytes.
(I know you know this, but as written it can be misread.)

> Convert can handle any number of whitespace in any position.
> ...
> " Y Q = = ". Yet Convert would be able to parse it.

Yep, but that's an artificial example which (I assume) won't occur often in real-world base64 encoded bytes.
Basically there will be two major sets of base64 encoded bytes:

- no whitespace at all
- line-breaks after 76 chars

So I'd be happy if we can optimize for these cases. All other cases, like the one given in the quote, should work, but can be penalized perf-wise (as they are very uncommon).

In my suggestion TrySkipWhitespace needs to be replaced with a correct / better implementation that

- skips whitespace and moves the pointer forward in the input-data
- signals "end" when truly invalid data is detected

That's not done in the suggestion (for simplicity, and just to outline how it could work).

Thus -- at least I hope so -- the actual code change is quite minimal. Keep the workhorse method as is, just renamed with a Core suffix. Then call that workhorse method from the "driver" method, which handles the checks given in the two bullets above.

So for the case "no whitespace at all" there's only one method call more than the actual code has. So perf should be on-par.
For the case "line-breaks after 76 chars" 76 chars can be processed fast, then the line-break is skipped, another 76 chars are processed, and so on.

The example " Y Q = = " should work too, but you're right that this is a case I didn't consider in my suggestion so far. So let's extend that suggestion (point 4, last ->):

1. Core called with " Y Q = = " -> returns OperationStatus.InvalidData
2. check if truly invalid or whitespace -> whitespace, so skip it -> "Y Q = = " remains
3. Core called with "Y Q = = " -> returns OperationStatus.InvalidData
4. check if truly invalid or whitespace -> first char "Y" is valid -> fallback to slow method

The "slow method" creates a buffer of block size (e.g. stackalloc byte[4]) and fills it with the valid non-whitespace bytes, stopping when an invalid byte is found. For the example that buffer would be filled with YQ==.
After filling that block buffer, Core is called to do the actual decoding. The buffer is valid and has padding, so it should be the final "block" to process. After that only one byte of the input remains, so check whether it is whitespace (-> OK) or invalid data, where in this case invalid data is anything other than whitespace (-> KO).

This also means that once such a sequence -- invalid data, where the first byte is actually a valid base64 byte -- is detected, processing will be purely scalar, w/o going back to vectorized again. I'd take this, as

- it's an unlikely input -- constructed artificially for tests
- it keeps the code streamlined, so the common inputs aren't penalized
- the code change is also streamlined
- as whitespace wasn't allowed before, it's not a perf regression; there is simply nothing to compare with
- bring in a PR that enables whitespace, optimize it later with another PR if needed
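
The flow described above (fast path, skip whitespace, block-wise slow fallback) could look roughly like the following. This is a minimal sketch under stated assumptions, not the PR's implementation: DecodeIgnoringWhitespace and SkipWhitespace are hypothetical names, the public Base64.DecodeFromUtf8 stands in for the Core method, and isFinalBlock is assumed to be true. On runtimes where Base64.DecodeFromUtf8 already ignores whitespace, the first call is expected to do all the work. Unlike the outline, this variant retries the fast path after each slow block.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

static bool IsWhitespace(byte b) => b is 9 or 10 or 13 or 32;

// Slices leading whitespace off the input and returns how many bytes were skipped.
static int SkipWhitespace(ref ReadOnlySpan<byte> utf8)
{
    int i = 0;
    while (i < utf8.Length && IsWhitespace(utf8[i])) i++;
    utf8 = utf8.Slice(i);
    return i;
}

// Hypothetical driver; Base64.DecodeFromUtf8 plays the role of "Core".
static OperationStatus DecodeIgnoringWhitespace(
    ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten)
{
    bytesConsumed = 0;
    bytesWritten = 0;
    Span<byte> block = stackalloc byte[4];

    while (true)
    {
        // Fast path: let the existing decoder consume as much as it can.
        OperationStatus status = Base64.DecodeFromUtf8(utf8, bytes, out int consumed, out int written);
        bool paddingSeen = consumed > 0 && utf8[consumed - 1] == (byte)'=';
        utf8 = utf8.Slice(consumed);
        bytes = bytes.Slice(written);
        bytesConsumed += consumed;
        bytesWritten += written;
        if (status != OperationStatus.InvalidData) return status;

        int skipped = SkipWhitespace(ref utf8);
        bytesConsumed += skipped;
        // After padding, only whitespace may follow.
        if (paddingSeen) return utf8.IsEmpty ? OperationStatus.Done : OperationStatus.InvalidData;
        if (skipped > 0) continue; // progress was made, retry the fast path

        // Slow path: the next byte is not whitespace, so gather one block of
        // four non-whitespace bytes into a scratch buffer and decode that.
        int filled = 0, scanned = 0;
        while (scanned < utf8.Length && filled < block.Length)
        {
            byte b = utf8[scanned++];
            if (!IsWhitespace(b)) block[filled++] = b;
        }
        if (filled < block.Length) return OperationStatus.InvalidData; // incomplete final block

        status = Base64.DecodeFromUtf8(block, bytes, out _, out written);
        if (status != OperationStatus.Done) return status; // truly invalid, or destination too small
        utf8 = utf8.Slice(scanned);
        bytes = bytes.Slice(written);
        bytesConsumed += scanned;
        bytesWritten += written;

        if (block[3] == (byte)'=')
        {
            bytesConsumed += SkipWhitespace(ref utf8);
            return utf8.IsEmpty ? OperationStatus.Done : OperationStatus.InvalidData;
        }
    }
}

byte[] destination = new byte[16];
OperationStatus result = DecodeIgnoringWhitespace(
    Encoding.ASCII.GetBytes(" Y Q = = "), destination, out int totalConsumed, out int totalWritten);
Console.WriteLine($"{result}: consumed {totalConsumed}, written {totalWritten}");
```

Every loop iteration either returns, skips at least one whitespace byte, or consumes at least four bytes in the slow path, so truly invalid data cannot cause an endless loop.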

> I would also like to reiterate, if a complete PR is submitted while this one is still open, I would be happy to close this one.

I don't follow. Let's bring your / this PR to a successful end.

heathbm · 2022-12-15T04:44:28Z

Thank you for the details, I'm 100% onboard with those 3 scenarios. I will update the PR accordingly.

gfoidl · 2022-12-15T10:20:37Z

Side note: please don't do the full-quotes -- that content is in the comment history anyway.

heathbm · 2022-12-20T07:22:49Z

@gfoidl Regarding:

> For the case "line-breaks after 76 chars" 76 chars can be processed fast, then the line-break is skipped, another 76 chars are processed, and so on.

and

> In my suggestion TrySkipWhitespace needs to be replaced with a correct / better implementation that
>
> - skips whitespace and moves the pointer forward in the input-data
> - signals "end" when truly invalid data is detected

Suggested code for reference:

...
if (!TrySkipWhitespace(utf8, out consumed))
{
    break;
}

utf8 = utf8.Slice(consumed);
...

...
static bool TrySkipWhitespace(ReadOnlySpan<byte> utf8, out int consumed)
{
    for (int i = 0; i < utf8.Length; ++i)
    {
        if (!IsByteToBeIgnored(utf8[i]))
        {
            consumed = i;
            return true;
        }
    }

    consumed = 0;
    return false;
}
...

TrySkipWhitespace skips valid bytes until whitespace is found. Those valid bytes need to be decoded also. I did not see a straightforward way to fill in the gaps with this suggestion, since those skipped bytes would be lost after the slice. However, I would like to raise another similar approach:

We can avoid scanning, since we should be able to check for the 2 whitespace chars (\r\n) after every 76 bytes. I currently do this in:

runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

Line 559 in f389289

    
           private static unsafe bool TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(ReadOnlySpan<byte> utf8, ref byte* src, ref byte* dest, ref byte* end, int destLength, byte* srcBytes, byte* destBytes, byte groupSize)

I'd also like to annotate the vectorized code path:

runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

Line 72 in f389289

if (Avx2.IsSupported)

if (Avx2.IsSupported)
{
    while (end >= src)
    {
        // ---------- Scenario 1: no whitespace, will complete and exit
        bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
        if (isComplete)
        {
            break;
        }

        // ---------- Scenario 2: whitespace after every 76 chars, method above will eventually fail, here whitespace is omitted in the next vectorized decode, then the path jumps to the start of the loop to repeat the process.
        isComplete = TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(utf8, ref src, ref dest, ref end, destLength, srcBytes, destBytes, 32);
        if (isComplete)
        {
            continue;
        }

        // ---------- Scenario 3: whitespace is somewhere else or invalid bytes are present, as soon as whitespace has been skipped, this path jumps to the start of the loop to reattempt vectorized decoding.
        // Process 4 bytes, until first set of invalid bytes are skipped.
        lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
        if (lastBlockStatus != OperationStatus.Done)
        {
            break;
        }
    }
}

gfoidl · 2022-12-20T22:11:30Z

> Suggested code for reference:

It is not, and was not meant as, "suggested code" that can be used 1:1. It's a piece of code to show you the road on how it could be done. Thus I don't understand

> TrySkipWhitespace skips valid bytes until whitespace is found. Those valid bytes need to be decoded also.

as we had this already before. In the code fragment that I've shown above some parts were missing, which should be worked out.

For me your code with TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation and the % 76 looks too complicated.

In my Base64-project I did a quick stab based on the idea outlined above. The code for the actual decoding isn't touched -- similar to the outline (not suggestion) in #79334 (review). All work is done on top of the core decoding method.
The basic part is this loop, which tries to decode as much as possible. If invalid data is found -- whitespace counts as invalid at that point -- it tries to skip possible whitespace, then loops over. So cases with regular whitespace patterns (e.g. line break after 76 chars) are handled fast and vectorized.

If that doesn't work, it falls back to a slow method that decodes in blocks of four. Perf isn't great here, but when we land here the input is degenerate (in my opinion), so it's not worth having super fast perf as long as correct decoding can be done.

Note: at the moment there are some basic tests and a fuzz-run. I don't know if my PR for the change in that repo is good enough to merge or not -- anyway that's independent from this PR here.

heathbm · 2022-12-20T23:28:47Z

I just checked out the code in your project. I stepped through the code and it appears that my previous commit essentially does the same as yours:

1. Vectorized decode, until whitespace is hit
2. Decode 4-byte blocks until whitespace is hit, then try to go back to vectorized decoding. My method advances the pointer, yours slices from a worker method.

if (Avx2.IsSupported)
{
    while (end >= src)
    {
        // ---------- Step 1: Decode fast, until whitespace is hit
        bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
        if (isComplete)
        {
            break;
        }

        // ---------- Step 2: Process 4 bytes, until first set of invalid bytes are skipped. Then go back to fast decoding.
        lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
        if (lastBlockStatus != OperationStatus.Done)
        {
            break;
        }
    }
}

You mentioned this needed to be optimized, which is why I added Step 2 below:

if (Avx2.IsSupported)
{
    while (end >= src)
    {
        // ---------- Step 1: no whitespace, will complete and exit
        bool isComplete = Avx2Decode(ref src, ref dest, end, maxSrcLength, destLength, srcBytes, destBytes);
        if (isComplete)
        {
            break;
        }

        // ---------- Step 2: whitespace after every 76 chars, method above will eventually fail, here whitespace is omitted in the next vectorized decode, then the path jumps to the start of the loop to repeat the process.
        isComplete = TryDecodeCurrentGroupIfWhitespaceIsSeparatingValidBytesInCommonLocation(utf8, ref src, ref dest, ref end, destLength, srcBytes, destBytes, 32);
        if (isComplete)
        {
            continue;
        }

        // ---------- Step 3: whitespace is somewhere else or invalid bytes are present, as soon as whitespace has been skipped, this path jumps to the start of the loop to reattempt vectorized decoding.
        // Process 4 bytes, until first set of invalid bytes are skipped.
        lastBlockStatus = IgnoreWhitespaceAndConsumeValidBytesUntilInvalidBlock(ref src, srcEnd, ref dest, destEnd, ref decodingMap, isFinalBlock, lastBlockStatus, true, out pendingSrcIncrement);
        if (lastBlockStatus != OperationStatus.Done)
        {
            break;
        }
    }
}

Step 2 here attempts to get rid of the whitespace (without scanning as we should be able to quickly detect if this is in fact the whitespace every 76 chars scenario) and go straight back to vector decoding instead of 4 byte blocks. Is this optimization not necessary?

heathbm · 2022-12-20T23:52:24Z

To add a little more context as to why my PR appears quite large, a lot of my approach was also influenced by the requirement that the Base64 decoding methods handle whitespace chars the same way the Convert decoding methods do, e.g. if Base64.IsValid returns true, then Base64.Decode should return OperationStatus.Done.
Currently, the code in your project returns OperationStatus.InvalidData for " ", whereas Convert.FromBase64String just returns an empty array. I have been using https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ec to test many scenarios. Currently, 397,488 should be passing. However, with your code, 133,538 pass and 263,902 fail. I believe the intention should be to have all of these pass, such that anything Convert.FromBase64String can decode, Base64.Decode should also be able to decode.
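
To make the IsValid side of that requirement concrete, here is a minimal sketch of a whitespace-tolerant validity check. It is an illustration only: IsValidSketch and IsAlphabetChar are hypothetical names rather than the PR's Base64Validator, and the check is purely structural (alphabet, padding position, length multiple of four); it does not inspect trailing bits before padding.

```csharp
using System;

// Hypothetical validator sketch: ignores tab, LF, CR and space, accepts at most
// two padding chars at the end, and reports the exact decoded length.
static bool IsValidSketch(ReadOnlySpan<char> text, out int decodedLength)
{
    int length = 0;        // significant (non-whitespace) chars seen so far
    int paddingCount = 0;  // '=' chars seen so far
    decodedLength = 0;

    foreach (char c in text)
    {
        if (c is '\t' or '\n' or '\r' or ' ')
        {
            continue; // whitespace is allowed anywhere, including between padding chars
        }

        if (c == '=')
        {
            paddingCount++;
        }
        else if (paddingCount > 0 || !IsAlphabetChar(c))
        {
            return false; // data after padding, or not a base64 alphabet char
        }

        length++;
    }

    if (paddingCount > 2 || length % 4 != 0)
    {
        return false;
    }

    // Remove padding to get the exact length (uint cast as suggested in the review).
    decodedLength = (int)((uint)length / 4 * 3) - paddingCount;
    return true;
}

static bool IsAlphabetChar(char c) =>
    c is (>= 'A' and <= 'Z') or (>= 'a' and <= 'z') or (>= '0' and <= '9') or '+' or '/';

bool valid = IsValidSketch(" Y Q = = ", out int decoded);
Console.WriteLine($"{valid} {decoded}");
```

Under this rule, whitespace-only input such as " " is valid with a decoded length of 0, which lines up with Convert.FromBase64String returning an empty array for it.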

This fixes a failing test.

gfoidl

Some nits about naming.

My only open question is about #79334 (comment)
A test with "AQ== " (trailing space) should have consumed = 5. If this test (and all others) pass, then I'm happy with this PR.

When existing tests with invalid input start to fail, then these tests should be validated for the correct behavior.

gfoidl · 2023-03-09T09:57:16Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+        }
+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static bool IsValid<T, T2>(ReadOnlySpan<T> base64Text, out int decodedLength)


Suggested change

private static bool IsValid<T, T2>(ReadOnlySpan<T> base64Text, out int decodedLength)

private static bool IsValid<T, TBase64Validatable>(ReadOnlySpan<T> base64Text, out int decodedLength)

to give it a more handy name?

gfoidl · 2023-03-09T09:58:22Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+            static abstract bool IsEncodingPad(T value);
+        }
+
+        internal readonly struct Base64CharValidationHandler : IBase64Validatable<char>


Suggested change

internal readonly struct Base64CharValidationHandler : IBase64Validatable<char>

internal readonly struct Base64CharValidatable : IBase64Validatable<char>

Same for the Byte-type.

gfoidl · 2023-03-09T10:08:38Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+            static abstract int IndexOfAnyExcept(ReadOnlySpan<T> span);
+            static abstract bool IsWhitespace(T value);
+            static abstract bool IsEncodingPad(T value);


I really ❤️ this possibility in the language.

heathbm · 2023-03-10T06:24:52Z

> A test with AQ== should have consumed = 5

@gfoidl Here is a test that covers some scenarios with extra whitespace at the end: 5848058

gfoidl

Only one question regarding tests left + 2 nits, otherwise LGTM.

gfoidl · 2023-03-10T10:25:05Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

+            if (length < 0)
+                ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.length);


Suggested change

-            if (length < 0)
-                ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.length);
+            ArgumentOutOfRangeException.ThrowIfNegative(length);

I'm not sure if the throw-helper should be used here or not.

This method just got moved around, it is untouched from the original: https://github.com/dotnet/runtime/blob/f52e277f54a9413d2f0bd42b8b957c8e4fd263ad/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs#LL238C24-L238C24

Yep, I know. It's just as this PR touches that code already that piece can be re-visited to take the new throw-helpers.
I'm not sure what the general guidance is here in runtime, i.e. whether to keep the current implementation or use the new helpers (for the latter, it should be validated that this method still inlines).

I'll leave that to the maintainers though.
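To make the two options in this thread concrete, here is a small sketch comparing them. It assumes .NET 8 or later, where ArgumentOutOfRangeException.ThrowIfNegative is available. CheckManually and CheckWithHelper are hypothetical names used only for illustration, and because the internal ThrowHelper is not public, the manual variant throws directly as a stand-in.

```csharp
using System;

// Stand-in for the existing check. The real code calls the internal ThrowHelper,
// which is not public, so this sketch throws directly instead.
static void CheckManually(int length)
{
    if (length < 0)
        throw new ArgumentOutOfRangeException(nameof(length));
}

// The suggested replacement, using the public throw helper.
static void CheckWithHelper(int length)
{
    ArgumentOutOfRangeException.ThrowIfNegative(length);
}

foreach (Action<int> check in new Action<int>[] { CheckManually, CheckWithHelper })
{
    try
    {
        check(-1);
    }
    catch (ArgumentOutOfRangeException ex)
    {
        // Both variants report the offending parameter name.
        Console.WriteLine(ex.ParamName);
    }
}
```

Both variants throw the same exception type. Whether the helper keeps the surrounding method inlinable is the open question raised above, and this sketch does not answer it.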

gfoidl · 2023-03-10T10:27:16Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

+            {
+                int bufferIdx = 0;
+
+                while (sourceIndex < length


Suggested change

-                while (sourceIndex < length
+                while ((uint)sourceIndex < (uint)length

To elide the bound-check in L544?
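A minimal sketch of the pattern being suggested, outside the decoder: comparing the index and the length as unsigned values expresses both "index is not negative" and "index is below the length" in one comparison, which is the form the JIT is generally able to use to drop the bounds check on the indexer. CountWhitespace is a hypothetical helper, not code from this PR, and whether the check is actually elided should be confirmed by inspecting the generated code.

```csharp
using System;
using System.Text;

byte[] data = Encoding.ASCII.GetBytes("AQ ==\r\n");
Console.WriteLine(CountWhitespace(data));

// Hypothetical helper: counts the whitespace bytes that the decoder is meant to skip.
static int CountWhitespace(ReadOnlySpan<byte> source)
{
    int count = 0;
    int index = 0;

    // A negative index becomes a very large uint, so the single unsigned
    // comparison covers both ends of the valid range.
    while ((uint)index < (uint)source.Length)
    {
        byte b = source[index];
        if (b == (byte)' ' || b == (byte)'\t' || b == (byte)'\r' || b == (byte)'\n')
        {
            count++;
        }

        index++;
    }

    return count;
}
```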

gfoidl · 2023-03-10T10:32:03Z

src/libraries/System.Memory/tests/Base64/Base64DecoderUnitTests.cs

+        [InlineData("AQ==", 4, 1)]
+        [InlineData("AQ== ", 5, 1)]


Can you mix in some variants like

Suggested change

-        [InlineData("AQ==", 4, 1)]
-        [InlineData("AQ== ", 5, 1)]
+        [InlineData("AQ==", 4, 1)]
+        [InlineData("AQ== ", 5, 1)]
+        [InlineData("AQ ==", 5, 1)]
+        [InlineData("AQ= =", 5, 1)]

or something like that covered by the test ValidBase64Strings_WithCharsThatMustBeIgnored below?
(Sorry, that I don't remember the generated testcases, it's too long ago 😉)

Also in https://gist.github.com/heathbm/f59662bd2334761d28288755a34e29ec you had some cool tests that generated inputs with whitespace at various places. In my lib that gist got into unit tests. Is everything covered here or should these tests be added too?

I covered some of these cases here: https://github.com/dotnet/runtime/pull/79334/files#diff-6b4fec85572fdbcc9b03c81fe66cf8bf31cbb5620235abce7dab3c6dc034d862R11
It's a lighter version of the gist, as I was not sure if a test with that many loops/asserts would be acceptable in the pipeline.
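As a quick cross-check of the variants discussed in this thread, the reference behavior can be observed with Convert.FromBase64String, which is the parity target of this PR. This is only a sketch: Convert.FromBase64String has no notion of "consumed", so the input length is printed as a stand-in for the second InlineData argument, and the byte count corresponds to the third.

```csharp
using System;

// The four inputs from the suggested InlineData rows.
string[] inputs = { "AQ==", "AQ== ", "AQ ==", "AQ= =" };

foreach (string input in inputs)
{
    // Whitespace is ignored wherever it appears, including between padding chars.
    byte[] bytes = Convert.FromBase64String(input);
    Console.WriteLine($"'{input}' (length {input.Length}) -> {bytes.Length} byte(s)");
}
```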

tarekgh · 2023-04-23T19:24:10Z

@heathbm could you please resolve the merge conflict?

tarekgh · 2023-04-23T19:25:08Z

@heathbm could you please resolve the merge conflict?

tarekgh · 2023-04-23T19:26:42Z

@dotnet/area-system-memory can one of you review this change to unblock the PR and merge it?

tarekgh · 2023-04-26T01:17:49Z

@heathbm could you please resolve the merge conflict?

tarekgh · 2023-05-03T23:15:26Z

@heathbm any news?

heathbm · 2023-05-04T05:25:18Z

@tarekgh I would love to work on this, however I have no free cycles. I did check, and it looks like Base64Transforms.cs was refactored. The methods I modified no longer exist. This would require a different approach from what I initially proposed.

tarekgh · 2023-05-04T15:32:01Z

Thanks @heathbm for your feedback. I am closing this PR as it became stale and needs a fully different fix.

heathbm · 2023-05-04T15:35:17Z

@tarekgh Please be aware, only Base64Transforms.cs requires rework, this is very small in the context of this PR. All other work is not stale.

Allow Base64Decoder to ignore space chars, add IsValid methods and tests

6d72060

Chars now ignored: 9 (horizontal tab), 10 (line feed), 13 (carriage return), 32 (space). Vertical tab is omitted.

ghost added the community-contribution Indicates that the PR has been added by a community member label Dec 7, 2022

dotnet-issue-labeler bot added area-System.Memory new-api-needs-documentation labels Dec 7, 2022

MihaZupan reviewed Dec 7, 2022

Address PR feedback regarding Base64.IsValid

35302b4

gfoidl reviewed Dec 8, 2022

build-analysis bot mentioned this pull request Dec 8, 2022

System.Tests.ArrayTests.CreateInstance test failing in NativeAOT #79403

Closed

heathbm added 2 commits December 13, 2022 23:51

Address PR feedback: General optimizations

837b4bd

Address PR feedback: Use vectorized decoding while enough src

9b3b581

Address PR feedback: General optimization

9983889

Address PR feedback: Optimize for whitespace (\r\n) every 76 bytes

f389289

build-analysis bot mentioned this pull request Dec 21, 2022

emcc received SIGKILL #79874

Closed

runfoapp bot mentioned this pull request Mar 2, 2023

Test failure: System.Security.Cryptography.X509Certificates.Tests.CertificateCreation.CertificateRequestChainTests/CreateChain_Hybrid #25979

Closed

build-analysis bot mentioned this pull request Mar 2, 2023

System.Security.Cryptography.X509Certificates.Tests.ChainTests.BuildInvalidSignatureTwice failure #82837

Open

Throw Base64FormatException when whitespace should not be ignored

745fd41

This fixes a failing test.

This was referenced Mar 3, 2023

Roslyn source generator crash on mono/linux/arm64 #81123

Closed

Alpine System.Net.Security.Tests failing because of "Cannot load library libgssapi_krb5.so.2" #82945

Closed

heathbm requested a review from gfoidl March 8, 2023 19:58

gfoidl reviewed Mar 9, 2023

heathbm added 2 commits March 9, 2023 22:12

Adress PR feedback: Improve naming of Base64Validator.cs internals

b35ce12

Adress PR feedback: Add test to demonstrate extra whitespace is counted

5848058

gfoidl approved these changes Mar 10, 2023

MihaZupan reviewed Mar 10, 2023

heathbm added 2 commits March 13, 2023 20:59

Address PR feedback: avoid bound-check

7c022b0

Address PR feedback: Base64.IsValid: Return when no more invalid chars

04989b5

This was referenced Mar 14, 2023

[release/6.0] Doublelinklist GC failures on Mono #83245

Closed

[jitstress] HardwareIntrinsics_ro fails with "process cannot access the file" error #83298

Closed

Address PR feedback: Refactor Bas64.IsValid method

5007df7

heathbm requested a review from MihaZupan March 21, 2023 15:38

danmoseley added the partner-impact This issue impacts a partner who needs to be kept updated label Mar 21, 2023

tarekgh closed this May 4, 2023

stephentoub mentioned this pull request May 8, 2023

Add Base64.IsValid and allow Base64.DecodeXx methods to skip whitespace #85938

Merged

ghost locked as resolved and limited conversation to collaborators Jun 3, 2023

	while (src + validBytesSearchIndex < srcEnd
	while (src < srcEnd - validBytesSearchIndex

	if (src + validBytesSearchIndex >= srcEnd)
	if (src >= srcEnd - validBytesSearchIndex)

	decodedLength = (length / 4 * 3) - paddingCount;
	decodedLength = (int)((uint)length / 4 * 3) - paddingCount;
