Allow Base64Decoder to ignore space chars, add IsValid methods and tests #78951

heathbm · 2022-11-29T07:24:25Z

Implementation of api-approved issue: #76020

Chars now ignored: 9: Horizontal tab 10: Line feed 13: Carriage return 32: Space -- Vertical tab omitted

dotnet-issue-labeler · 2022-11-29T07:24:31Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2022-11-29T07:24:38Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Implementation of api-approved issue: #76020

Author:	heathbm
Assignees:	-
Labels:	`area-System.Memory`, `new-api-needs-documentation`
Milestone:	-

stephentoub · 2022-11-29T11:34:33Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

@@ -35,6 +35,215 @@ public static partial class Base64
        ///   or if the input is incomplete (i.e. not a multiple of 4) and <paramref name="isFinalBlock"/> is <see langword="true"/>.
        /// </returns>
        public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
+        {
+            // Validation must occur prior to decoding as the actual length will impact future calculations
+            bool containsIgnoredBytes = utf8.IndexOfAny(BytesToIgnore) != -1;


I expect this is going to regress performance, which we want to avoid. It should be possible to structure this such that we only pay additional cost in the existing implementation when it encounters an invalid character according to its current definition, which includes whitespace, at which point it can fall back to an implementation that allows for whitespace but is slower.

stephentoub · 2022-11-29T11:37:02Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

+            if (containsIgnoredBytes)
+            {
+                // Create a new span without bytes to be ignored
+                Span<byte> utf8WithIgnoredBytesRemoved = stackalloc byte[utf8.Length];


This is not safe and could easily stack overflow.

stephentoub

Thanks for your interest in working on this. I only took a cursory look, but I think the whole approach taken here needs to be revisited.

gfoidl · 2022-11-29T11:25:26Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

+            if (containsIgnoredBytes)
+            {
+                // Create a new span without bytes to be ignored
+                Span<byte> utf8WithIgnoredBytesRemoved = stackalloc byte[utf8.Length];


What will happen if utf8.Length is huge? A stack overflow. Stack allocation must be guarded by a reasonable threshold (e.g. 256 bytes); above that, fall back to renting an array from the ArrayPool or, if this is considered a rare path, just allocate (ToArray).
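A minimal sketch of the guard described above, assuming a hypothetical helper named `DecodeIgnoringWhitespace` (not part of the PR) and a threshold of 256 bytes. It only illustrates the stackalloc-or-rent pattern; it still copies the input, which the other review comments advise against.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

static class GuardedScratchBuffer
{
    // Threshold suggested in the review; the exact value is a tuning choice.
    const int StackallocThreshold = 256;

    // Hypothetical helper: strips ignored bytes into a scratch buffer, then decodes it.
    public static OperationStatus DecodeIgnoringWhitespace(
        ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesWritten)
    {
        byte[]? rented = null;

        // Small inputs use a fixed-size stack buffer; large inputs rent from the pool,
        // so the stack usage never depends on the input length.
        Span<byte> scratch = utf8.Length <= StackallocThreshold
            ? stackalloc byte[StackallocThreshold]
            : (rented = ArrayPool<byte>.Shared.Rent(utf8.Length));
        try
        {
            int length = 0;
            foreach (byte b in utf8)
            {
                // Skip tab (9), line feed (10), carriage return (13), and space (32).
                if (b is not (9 or 10 or 13 or 32))
                    scratch[length++] = b;
            }

            return Base64.DecodeFromUtf8(scratch.Slice(0, length), bytes, out _, out bytesWritten);
        }
        finally
        {
            // Return the rented array even if decoding throws.
            if (rented is not null)
                ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

The stack buffer has a constant size, so a caller cannot trigger a stack overflow by passing a large span.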

gfoidl · 2022-11-29T11:33:58Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs

@@ -35,6 +35,215 @@ public static partial class Base64
        ///   or if the input is incomplete (i.e. not a multiple of 4) and <paramref name="isFinalBlock"/> is <see langword="true"/>.
        /// </returns>
        public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
+        {
+            // Validation must occur prior to decoding as the actual length will impact future calculations


This will most likely cause a perf regression, as decoding changes from $O(n)$ to $O(2n)$ with this upfront check. If whitespace is encountered, it's even $O(3n)$ due to the "copy and remove space" loop.

Maybe it's better to just start decoding w/o any upfront check or copy, and if an invalid input is encountered fall back and decide what to do. If it's truly invalid then exit. If it's space, then ignore that position and continue. Thus the $O(n)$ remains.

To implement this you can split the decoding into two methods: one "driver" method and one "worker" method. This is only relevant for the invalid case, as then the driver can evaluate the state and call the worker again. Look at the buffer-chain examples / tests to see how something like this can be done.
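The driver/worker split described above could look roughly like the following. This is a simplified scalar sketch under stated assumptions, not the code from the PR or the runtime: the names `DecodeFast`, `TryDecode`, and `Sextet` are hypothetical, there is no vectorization, and the unused trailing bits of a padded group are not validated.

```csharp
using System;

static class WhitespaceTolerantBase64
{
    // Maps a byte to its 6-bit value, or -1 if it is not in the base64 alphabet.
    static int Sextet(byte b) => b switch
    {
        >= (byte)'A' and <= (byte)'Z' => b - 'A',
        >= (byte)'a' and <= (byte)'z' => b - 'a' + 26,
        >= (byte)'0' and <= (byte)'9' => b - '0' + 52,
        (byte)'+' => 62,
        (byte)'/' => 63,
        _ => -1,
    };

    static bool IsIgnored(byte b) => b is 9 or 10 or 13 or 32;

    // Worker: decodes complete, unpadded 4-byte groups and stops at the first
    // group containing anything else (whitespace, padding, or invalid data).
    static void DecodeFast(ReadOnlySpan<byte> src, Span<byte> dst, out int consumed, out int written)
    {
        consumed = 0;
        written = 0;
        while (src.Length - consumed >= 4 && dst.Length - written >= 3)
        {
            int a = Sextet(src[consumed]), b = Sextet(src[consumed + 1]);
            int c = Sextet(src[consumed + 2]), d = Sextet(src[consumed + 3]);
            if ((a | b | c | d) < 0) return; // any -1 makes the OR negative
            int v = a << 18 | b << 12 | c << 6 | d;
            dst[written++] = (byte)(v >> 16);
            dst[written++] = (byte)(v >> 8);
            dst[written++] = (byte)v;
            consumed += 4;
        }
    }

    // Driver: runs the worker, and only when it stops inspects the input more closely.
    public static bool TryDecode(ReadOnlySpan<byte> src, Span<byte> dst, out int written)
    {
        written = 0;
        int pos = 0;
        Span<byte> quad = stackalloc byte[4];
        while (true)
        {
            DecodeFast(src.Slice(pos), dst.Slice(written), out int c, out int w);
            pos += c;
            written += w;

            // Slow path: gather the next four significant bytes, skipping whitespace.
            int n = 0;
            while (pos < src.Length && n < 4)
            {
                byte b = src[pos++];
                if (!IsIgnored(b)) quad[n++] = b;
            }
            if (n == 0) return true;  // nothing but whitespace was left
            if (n < 4) return false;  // incomplete group

            int pad = quad[3] == '=' ? (quad[2] == '=' ? 2 : 1) : 0;
            int s0 = Sextet(quad[0]), s1 = Sextet(quad[1]);
            int s2 = pad == 2 ? 0 : Sextet(quad[2]);
            int s3 = pad >= 1 ? 0 : Sextet(quad[3]);
            if ((s0 | s1 | s2 | s3) < 0 || dst.Length - written < 3 - pad) return false;

            int v = s0 << 18 | s1 << 12 | s2 << 6 | s3;
            dst[written++] = (byte)(v >> 16);
            if (pad < 2) dst[written++] = (byte)(v >> 8);
            if (pad < 1) dst[written++] = (byte)v;

            if (pad > 0)
            {
                // Only whitespace may follow a padded group.
                while (pos < src.Length)
                    if (!IsIgnored(src[pos++])) return false;
                return true;
            }
        }
    }
}
```

Input without whitespace never leaves `DecodeFast` until the final group, so the extra cost is paid only when the worker stops. For example, decoding `"SGVsbG8g\r\nV2 9y\tbGQ="` with this sketch yields the bytes of `Hello World`.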

gfoidl · 2022-11-29T11:35:42Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+            }
+
+            // Check for invalid chars
+            int indexOfFirstNonBase64 = base64Text.IndexOfAnyExcept(ValidBase64CharsSortedAsc);


This should use IndexOfAnyValues; see #68328 (comment) and the linked PRs for examples.
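For reference, the `IndexOfAnyValues` type mentioned in this thread shipped in .NET 8 under the name `SearchValues`. The sketch below assumes that shipped name and C# 11 `u8` literals; the class and field names are hypothetical and not taken from the PR.

```csharp
using System;
using System.Buffers;

static class Base64Scan
{
    // Built once and reused, so each search can take a precomputed, vectorized path.
    static readonly SearchValues<byte> s_ignored = SearchValues.Create(" \t\r\n"u8);

    // Alphabet, padding, and the ignored whitespace bytes.
    static readonly SearchValues<byte> s_allowed = SearchValues.Create(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/= \t\r\n"u8);

    public static bool ContainsIgnoredBytes(ReadOnlySpan<byte> utf8) =>
        utf8.IndexOfAny(s_ignored) >= 0;

    // Returns the index of the first byte that can never appear in base64 text, or -1.
    public static int IndexOfFirstInvalid(ReadOnlySpan<byte> utf8) =>
        utf8.IndexOfAnyExcept(s_allowed);
}
```

This only replaces the membership search; it does not check group length or padding position.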

gfoidl · 2022-11-29T11:36:47Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+
+            // Check for invalid chars
+            int indexOfFirstNonBase64 = base64Text.IndexOfAnyExcept(ValidBase64CharsSortedAsc);
+            if (indexOfFirstNonBase64 > -1)


Suggested change

-            if (indexOfFirstNonBase64 > -1)
+            if (indexOfFirstNonBase64 >= 0)

This is a tiny bit better at the CPU level, since a comparison with 0 followed by a jump (branch) can be optimized by CPUs.

gfoidl · 2022-11-29T11:38:08Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+            int paddingCount = 0;
+
+            // Check if there are chars that need to be ignored while determining the length
+            if (base64Text.IndexOfAny(CharsToIgnore) > -1)


Suggested change

-            if (base64Text.IndexOfAny(CharsToIgnore) > -1)
+            if (base64Text.IndexOfAny(CharsToIgnore) >= 0)

And use IndexOfAnyValues (see above).

gfoidl · 2022-11-29T11:40:16Z

src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Validator.cs

+        /// <param name="base64TextUtf8">The input span which contains UTF-8 encoded text in base64 that needs to be validated.</param>
+        /// <param name="decodedLength">The maximum length (in bytes) if you were to decode the base 64 encoded text <paramref name="base64TextUtf8"/> within a byte span.</param>
+        /// <returns>true if <paramref name="base64TextUtf8"/> is decodable; otherwise, false.</returns>
+        public static unsafe bool IsValid(ReadOnlySpan<byte> base64TextUtf8, out int decodedLength)


Same comments as above apply.
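As an illustration of how such a check can stay single-pass, here is a rough sketch of a validator with the same shape as the `IsValid` signature above. It is an assumption-laden example, not the PR's implementation: the class name is hypothetical, and the four whitespace bytes listed in the PR description are skipped.

```csharp
using System;

static class Base64ValidationSketch
{
    static bool IsAlphabet(byte b) =>
        b is (>= (byte)'A' and <= (byte)'Z')
          or (>= (byte)'a' and <= (byte)'z')
          or (>= (byte)'0' and <= (byte)'9')
          or (byte)'+' or (byte)'/';

    public static bool IsValid(ReadOnlySpan<byte> base64TextUtf8, out int decodedLength)
    {
        decodedLength = 0;
        int significant = 0; // bytes that are not ignored whitespace
        int padding = 0;     // number of '=' seen so far

        foreach (byte b in base64TextUtf8)
        {
            if (b is 9 or 10 or 13 or 32) continue; // tab, LF, CR, space

            if (b == (byte)'=')
            {
                if (++padding > 2) return false;
            }
            else if (padding > 0 || !IsAlphabet(b))
            {
                // Data after padding, or a byte outside the alphabet.
                return false;
            }
            significant++;
        }

        if (significant % 4 != 0) return false;

        decodedLength = significant / 4 * 3 - padding;
        return true;
    }
}
```

With this sketch, `"SGVs bG8="` is reported as valid with a decoded length of 5, and `"SGVsbG8"` is rejected because its length is not a multiple of 4.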

gfoidl · 2022-11-29T11:42:37Z

@stephentoub race condition -> you won 😄

heathbm · 2022-11-29T17:01:43Z

> Thanks for your interest in working on this. I only took a cursory look, but I think the whole approach taken here needs to be revisited.

Thanks for the feedback, @stephentoub and @gfoidl. I will use it to make better contributions in the future.

stephentoub · 2022-11-29T17:26:11Z

Thanks, @heathbm. I do appreciate your efforts here. Are you planning to take another look at this one, or are you going to leave it for someone else and look at other things?

heathbm · 2022-11-30T08:21:42Z

> Thanks, @heathbm. I do appreciate your efforts here. Are you planning to take another look at this one, or are you going to leave it for someone else and look at other things?

@stephentoub I'm currently working on a new PR that retains the O(n) performance characteristic by using the existing implementation, as outlined in the feedback. Since I am not assigned to the issue, I will gladly step aside if anyone submits a complete PR before I do.

Allow Base64Decoder to ignore space chars, add IsValid methods and tests

99aaaee

Chars now ignored: 9: Horizontal tab 10: Line feed 13: Carriage return 32: Space -- Vertical tab omitted

dotnet-issue-labeler bot added area-System.Memory new-api-needs-documentation labels Nov 29, 2022

ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 29, 2022

stephentoub reviewed Nov 29, 2022

stephentoub requested changes Nov 29, 2022

ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Nov 29, 2022

gfoidl reviewed Nov 29, 2022

build-analysis bot mentioned this pull request Nov 29, 2022

CI build failure: Build MacCatalyst x64 Release AllSubsets_Mono - XcodeBuildApp task failed unexpectedly - Could not find System.Runtime.Tests.app #78778

Closed

heathbm closed this Nov 29, 2022

heathbm mentioned this pull request Dec 14, 2022

Allow Base64Decoder to ignore space chars, add IsValid methods and tests #79334

Closed

ghost locked as resolved and limited conversation to collaborators Dec 30, 2022
