Improve XmlDictionaryWriter UTF8 encoding performance #73336

Merged
merged 27 commits into from
Apr 4, 2023

Daniel-Svensson
Contributor

@Daniel-Svensson Daniel-Svensson commented Aug 3, 2022

Summary

  • Remove allocations when writing longer strings (42/85+ characters) that contain non-ASCII characters
  • Improve UTF-8 encoding performance for all strings of length 8 or longer
    • especially when they contain a mix of ASCII and non-ASCII characters (for the "mixed chars" variants with a few non-ASCII characters the speedup was up to ~3x, from not calling into encoding.GetBytes multiple times)
  • If anyone wants to run the benchmarks (original, encoding, int32, long and at least Vector256) on ARM hardware, I can consider making the implementation fall back to generic SIMD on non-x86 platforms. Since I have no idea how unaligned loads/stores affect performance there, I feel it is safer to fall back to System.Text.Encoding earlier on ARM.
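
As a rough illustration of the "mixed chars" point in the summary, here is a minimal sketch and not the code in this PR: it narrows the leading ASCII characters by hand and then hands the whole remainder to the encoder in a single call, rather than calling it once per non-ASCII run. The class and method names are hypothetical, and it uses spans instead of the pointer-based code in XmlStreamNodeWriter.

```csharp
using System;
using System.Text;

static class Utf8PrefixSketch
{
    // Hypothetical helper for illustration only; not part of dotnet/runtime.
    // Writes the UTF-8 form of 'chars' into 'bytes' and returns the byte count.
    // The caller must supply a buffer of at least 3 bytes per char.
    public static int GetUtf8Bytes(ReadOnlySpan<char> chars, Span<byte> bytes)
    {
        int i = 0;

        // ASCII fast path: each char below 0x80 maps to exactly one byte.
        while (i < chars.Length && chars[i] < 0x80)
        {
            bytes[i] = (byte)chars[i];
            i++;
        }

        if (i == chars.Length)
        {
            return i; // Pure ASCII input, nothing left to encode.
        }

        // First non-ASCII char found: encode everything that remains with ONE call.
        // The split happens in front of that char, so a surrogate pair is never cut in half.
        return i + Encoding.UTF8.GetBytes(chars.Slice(i), bytes.Slice(i));
    }
}
```

For a call such as `Utf8PrefixSketch.GetUtf8Bytes("abc\u00E9def", buffer)`, the first three characters go through the loop and the remaining four go through one `GetBytes` call.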

Feedback wanted on implementation to choose

  1. For UnsafeGetUTF8Chars, does it make sense to go with the AVX version, which is faster for all lengths > 8, even though it has a bit more code than the SSE version?
  2. For UnsafeGetUTF8Length, does it make sense to go with:
    2a. the Vector256 version (in this PR)
    * and if so, what cutoff to use before calling Encoding.GetByteCount (1024, 2048?)
    2b. just calling Encoding.GetByteCount (it is always faster than the current code)
    2c. the Vector version (with the risk of "AVX downclocking" on older Intel hardware when AVX512 support is added)
  3. After writing this I discovered the ASCII utilities and the Narrow method in CoreLib. Is there any internal, low-overhead call path for using those methods from the runtime libraries?
  4. What is a reasonable cutoff for calling into System.Text.Encoding in the non-accelerated case: 16, 24, 32?
Original UnsafeGetUTF8Chars benchmarks

UnsafeGetUTF8Chars benchmarks

Source: https://github.com/Daniel-Svensson/ClrExperiments/blob/master/BinaryXmlBenchmarks/ConsoleApp1/Utf8Benchmarks.cs

For the "mixed chars" variants, where only a few characters were non-ASCII, the speedup was up to ~3x (from not calling into encoding.GetBytes multiple times); those measurements are not shown below.

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
  Job-GOOLTT : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

MaxRelativeError=0.01  IterationTime=250.0000 ms  
| Method | StringLengthInChars | Scenario | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|---|
| Original | 5 | AsciiOnly | 3.667 ns | 0.0079 ns | 0.0066 ns | 3.669 ns |
| Encoding | 5 | AsciiOnly | 7.491 ns | 0.0127 ns | 0.0119 ns | 7.490 ns |
| SimdSSE_v4 | 5 | AsciiOnly | 3.199 ns | 0.0053 ns | 0.0050 ns | 3.199 ns |
| SimdAVX_2 | 5 | AsciiOnly | 3.495 ns | 0.0064 ns | 0.0057 ns | 3.493 ns |
| SimdVector256 | 5 | AsciiOnly | 3.487 ns | 0.0095 ns | 0.0153 ns | 3.483 ns |
| Original | 8 | AsciiOnly | 4.524 ns | 0.0129 ns | 0.0115 ns | 4.526 ns |
| Encoding | 8 | AsciiOnly | 7.694 ns | 0.0056 ns | 0.0044 ns | 7.696 ns |
| SimdSSE_v4 | 8 | AsciiOnly | 2.768 ns | 0.0051 ns | 0.0048 ns | 2.767 ns |
| SimdAVX_2 | 8 | AsciiOnly | 2.767 ns | 0.0070 ns | 0.0065 ns | 2.768 ns |
| SimdVector256 | 8 | AsciiOnly | 2.777 ns | 0.0037 ns | 0.0034 ns | 2.777 ns |
| Original | 10 | AsciiOnly | 5.115 ns | 0.0189 ns | 0.0177 ns | 5.108 ns |
| Encoding | 10 | AsciiOnly | 7.916 ns | 0.0122 ns | 0.0102 ns | 7.914 ns |
| SimdSSE_v4 | 10 | AsciiOnly | 2.973 ns | 0.0093 ns | 0.0087 ns | 2.970 ns |
| SimdAVX_2 | 10 | AsciiOnly | 2.772 ns | 0.0053 ns | 0.0047 ns | 2.773 ns |
| SimdVector256 | 10 | AsciiOnly | 3.067 ns | 0.0167 ns | 0.0148 ns | 3.065 ns |
| Original | 16 | AsciiOnly | 7.357 ns | 0.0669 ns | 0.0626 ns | 7.361 ns |
| Encoding | 16 | AsciiOnly | 8.554 ns | 0.0155 ns | 0.0137 ns | 8.554 ns |
| SimdSSE_v4 | 16 | AsciiOnly | 2.987 ns | 0.0098 ns | 0.0087 ns | 2.985 ns |
| SimdAVX_2 | 16 | AsciiOnly | 2.770 ns | 0.0056 ns | 0.0050 ns | 2.770 ns |
| SimdVector256 | 16 | AsciiOnly | 2.802 ns | 0.0077 ns | 0.0068 ns | 2.801 ns |
| Original | 20 | AsciiOnly | 8.159 ns | 0.0708 ns | 0.0695 ns | 8.145 ns |
| Encoding | 20 | AsciiOnly | 9.612 ns | 0.1040 ns | 0.2629 ns | 9.582 ns |
| SimdSSE_v4 | 20 | AsciiOnly | 3.229 ns | 0.0165 ns | 0.0147 ns | 3.225 ns |
| SimdAVX_2 | 20 | AsciiOnly | 3.438 ns | 0.0081 ns | 0.0076 ns | 3.436 ns |
| SimdVector256 | 20 | AsciiOnly | 3.227 ns | 0.0071 ns | 0.0063 ns | 3.228 ns |
| Original | 30 | AsciiOnly | 10.717 ns | 0.0926 ns | 0.0723 ns | 10.707 ns |
| SimdSSE_v4 | 30 | AsciiOnly | 3.661 ns | 0.0057 ns | 0.0051 ns | 3.660 ns |
| SimdAVX_2 | 30 | AsciiOnly | 3.411 ns | 0.0051 ns | 0.0048 ns | 3.410 ns |
| SimdVector256 | 30 | AsciiOnly | 3.288 ns | 0.0134 ns | 0.0125 ns | 3.287 ns |
| Original | 32 | AsciiOnly | 11.116 ns | 0.0687 ns | 0.0609 ns | 11.116 ns |
| SimdSSE_v4 | 32 | AsciiOnly | 3.652 ns | 0.0176 ns | 0.0156 ns | 3.644 ns |
| SimdAVX_2 | 32 | AsciiOnly | 3.425 ns | 0.0078 ns | 0.0073 ns | 3.424 ns |
| SimdVector256 | 32 | AsciiOnly | 3.203 ns | 0.0053 ns | 0.0047 ns | 3.204 ns |
| Original | 34 | AsciiOnly | 11.562 ns | 0.0515 ns | 0.0430 ns | 11.567 ns |
| SimdSSE_v4 | 34 | AsciiOnly | 3.901 ns | 0.0157 ns | 0.0122 ns | 3.898 ns |
| SimdAVX_2 | 34 | AsciiOnly | 3.848 ns | 0.0085 ns | 0.0071 ns | 3.850 ns |
| SimdVector256 | 34 | AsciiOnly | 3.651 ns | 0.0075 ns | 0.0066 ns | 3.652 ns |
| Original | 84 | AsciiOnly | 22.269 ns | 0.0708 ns | 0.0627 ns | 22.277 ns |
| SimdSSE_v4 | 84 | AsciiOnly | 5.958 ns | 0.0163 ns | 0.0145 ns | 5.955 ns |
| SimdAVX_2 | 84 | AsciiOnly | 5.148 ns | 0.0097 ns | 0.0091 ns | 5.150 ns |
| SimdVector256 | 84 | AsciiOnly | 5.155 ns | 0.0087 ns | 0.0081 ns | 5.154 ns |
| Original | 170 | AsciiOnly | 44.331 ns | 0.0684 ns | 0.0607 ns | 44.326 ns |
| SimdSSE_v4 | 170 | AsciiOnly | 10.980 ns | 0.0215 ns | 0.0179 ns | 10.985 ns |
| SimdAVX_2 | 170 | AsciiOnly | 7.546 ns | 0.0112 ns | 0.0105 ns | 7.549 ns |
| SimdVector256 | 170 | AsciiOnly | 9.863 ns | 0.0161 ns | 0.0143 ns | 9.863 ns |

The gains from AVX are somewhat smaller on older hardware.

32bit results:

Original: UnsafeGetUTF8Length benchmarks: removed

Source: https://github.com/Daniel-Svensson/ClrExperiments/blob/master/BinaryXmlBenchmarks/ConsoleApp1/Utf8BenchmarksLength.cs

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
  Job-QIEUWM : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

MaxRelativeError=0.01  IterationTime=250.0000 ms  
| Method | StringLengthInChars | Scenario | Mean | Error | StdDev |
|---|---|---|---|---|---|
| Encoding | 42 | AsciiOnly | 5.047 ns | 0.0167 ns | 0.0148 ns |
| VectorLength | 42 | AsciiOnly | 2.087 ns | 0.0069 ns | 0.0058 ns |
| VectorLength_Aligned | 42 | AsciiOnly | 2.077 ns | 0.0086 ns | 0.0076 ns |
| Encoding | 85 | AsciiOnly | 6.113 ns | 0.0233 ns | 0.0218 ns |
| VectorLength | 85 | AsciiOnly | 2.755 ns | 0.0149 ns | 0.0139 ns |
| VectorLength_Aligned | 85 | AsciiOnly | 2.761 ns | 0.0144 ns | 0.0134 ns |
| Encoding | 256 | AsciiOnly | 9.585 ns | 0.0339 ns | 0.0317 ns |
| VectorLength | 256 | AsciiOnly | 6.839 ns | 0.0227 ns | 0.0201 ns |
| VectorLength_Aligned | 256 | AsciiOnly | 6.944 ns | 0.0211 ns | 0.0197 ns |
| Encoding | 512 | AsciiOnly | 15.959 ns | 0.0308 ns | 0.0288 ns |
| VectorLength | 512 | AsciiOnly | 11.452 ns | 0.0327 ns | 0.0290 ns |
| VectorLength_Aligned | 512 | AsciiOnly | 11.632 ns | 0.0165 ns | 0.0147 ns |
| Encoding | 2048 | AsciiOnly | 59.927 ns | 0.1031 ns | 0.0914 ns |
| VectorLength | 2048 | AsciiOnly | 36.509 ns | 0.0851 ns | 0.0754 ns |
| VectorLength_Aligned | 2048 | AsciiOnly | 44.433 ns | 0.4511 ns | 0.7154 ns |

2023-03-26: Updated PR with vectorisation removed
For the latest results, see the comment below.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Aug 3, 2022
@Daniel-Svensson Daniel-Svensson marked this pull request as ready for review August 4, 2022 17:28
@Daniel-Svensson Daniel-Svensson changed the title Feedback wanted: Improve XmlDictionaryWriter UTF8 encoding performance Improve XmlDictionaryWriter (text and binary xml) UTF8 encoding performance Aug 5, 2022
@Daniel-Svensson Daniel-Svensson changed the title Improve XmlDictionaryWriter (text and binary xml) UTF8 encoding performance Improve XmlDictionaryWriter UTF8 encoding performance Aug 5, 2022
@danmoseley
Member

@HongGit who is the right reviewer for this PR?

@StephenMolloy
Member

@tannergooding and @stephentoub... can you guys take a peek at this one? This is an improvement we'd like to take if it looks good, but we wanted some more eyes on it with the use of Vector256.

@danmoseley
Member

@adamsitnik might be able to help review the vector stuff also.

int numRemaining = (int)(charsMax - chars);
int numAscii = charCount - numRemaining;

return numAscii + (_encoding ?? s_UTF8Encoding).GetByteCount(chars, numRemaining);
Member

What are the possible values of _encoding? Can it be something other than Utf8?

Note that it is better to call Encoding.UTF8.GetBytes directly, without caching the encoding locally. Calling Encoding.UTF8.GetBytes allows the devirtualization optimization to kick in, which eliminates the overhead of Encoding being an abstract type.

Contributor Author

It can be passed by the user when creating a text XmlDictionaryWriter, but it is only assigned to _encoding if its code page is the same as UTF-8's.
So in theory it can be any Encoding class, even if that is unlikely.

For s_encoding, it does not use the default constructor but passes in (false, true), so I did not dare to make that change.
If it does not change the behaviour, then that can be a simple follow-up fix.
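
To make the concern about the (false, true) constructor arguments concrete, here is a small sketch (the helper name is hypothetical). The two arguments are `encoderShouldEmitUTF8Identifier` and `throwOnInvalidBytes`. With `throwOnInvalidBytes: true` the encoder throws on an unpaired surrogate, whereas `Encoding.UTF8` substitutes the replacement character, so swapping one for the other is not behaviour-neutral for invalid input.

```csharp
using System;
using System.Text;

static class Utf8CtorSketch
{
    // Hypothetical helper: reports whether 'encoding' rejects an unpaired surrogate
    // instead of replacing it with U+FFFD.
    public static bool ThrowsOnLoneSurrogate(Encoding encoding)
    {
        try
        {
            encoding.GetBytes("\uD800"); // a lone high surrogate is not valid UTF-16
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }
}
```

Passing `Encoding.UTF8` returns false, while passing `new UTF8Encoding(false, true)` returns true. The two also differ in `GetPreamble()`: `Encoding.UTF8` reports the three-byte BOM and the (false, true) instance reports an empty preamble.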

Daniel-Svensson and others added 2 commits August 12, 2022 20:04
…tem/Xml/XmlStreamNodeWriter.cs

Co-authored-by: Stephen Toub <stoub@microsoft.com>
@danmoseley
Member

as an aside @Daniel-Svensson how is the coverage of this code in dotnet/performance? ie., are there scenario/s there that will show improvement, and thus protect the improvement from future regression?

@Daniel-Svensson
Contributor Author

Daniel-Svensson commented Mar 26, 2023

I've removed the improved UTF-8 encoding logic, so the 25% speedup is gone, but it is at least a bit faster for inputs with mixed ASCII / non-ASCII characters.

While it is more than 3 times slower than the original proposal, it is still faster than the current version for larger strings and for all cases where the input contains one or more non-ASCII characters.

I do not expect any large performance changes for the normal "all ASCII" cases, since many input strings fall in the range of 8-25 characters and the change should hopefully not affect performance there.

I have moved the vectorization code to a separate branch, where it now lives in Utf8Encoding. The results there would be somewhat better than here, and I might (or might not) create a separate PR from it.

The input to the method in question will be at most 170 characters (512/3) in length, but depending on the entry point the limit can also be 42 or 85.
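
The 512/3 figure follows from a UTF-16 char needing at most 3 bytes in UTF-8 (a surrogate pair is 2 chars and 4 bytes). The sketch below just spells out that arithmetic. The 512-byte size comes from the comment above; the 256- and 128-byte sizes are an assumption inferred from the limits of 85 and 42, not something stated in this thread, and the class name is hypothetical.

```csharp
static class BufferMathSketch
{
    // Worst-case UTF-8 bytes produced per UTF-16 char.
    public const int MaxUtf8BytesPerChar = 3;

    // Largest char count that is guaranteed to fit in a buffer of 'bufferBytes' bytes.
    public static int MaxCharsFor(int bufferBytes) => bufferBytes / MaxUtf8BytesPerChar;
}
```

With these assumptions, `MaxCharsFor(512)` is 170, `MaxCharsFor(256)` is 85 and `MaxCharsFor(128)` is 42.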

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1413/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-preview.1.23115.2
  [Host]     : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2
  Job-XXUKJT : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2

MaxRelativeError=0.01  IterationTime=250.0000 ms  

Ascii Only Case

| Method | StringLengthInChars | Scenario | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|---|
| Original | 5 | AsciiOnly | 3.726 ns | 0.0475 ns | 0.0397 ns | 3.717 ns |
| New | 5 | AsciiOnly | 3.059 ns | 0.0156 ns | 0.0139 ns | 3.059 ns |
| Encoding_GetBytes | 5 | AsciiOnly | 6.959 ns | 0.0192 ns | 0.0179 ns | 6.951 ns |
| Vector128 (original proposal) | 5 | AsciiOnly | 4.007 ns | 0.0507 ns | 0.0676 ns | 4.006 ns |
| Vector256 (AVX2) | 5 | AsciiOnly | 4.357 ns | 0.0537 ns | 0.0851 ns | 4.316 ns |
| Original | 8 | AsciiOnly | 4.373 ns | 0.0206 ns | 0.0183 ns | 4.374 ns |
| New | 8 | AsciiOnly | 4.214 ns | 0.0191 ns | 0.0178 ns | |
| Encoding_GetBytes | 8 | AsciiOnly | 7.208 ns | 0.0802 ns | 0.0751 ns | 7.159 ns |
| Vector128 (original proposal) | 8 | AsciiOnly | 2.138 ns | 0.0102 ns | 0.0095 ns | 2.138 ns |
| Vector256 (AVX2) | 8 | AsciiOnly | 2.723 ns | 0.0080 ns | 0.0067 ns | 2.722 ns |
| Original | 34 | AsciiOnly | 11.589 ns | 0.0808 ns | 0.0631 ns | 11.589 ns |
| New | 34 | AsciiOnly | 9.909 ns | 0.0292 ns | 0.0273 ns | 9.909 ns |
| Encoding_GetBytes | 34 | AsciiOnly | 9.087 ns | 0.0260 ns | 0.0203 ns | 9.091 ns |
| Vector128 (original proposal) | 34 | AsciiOnly | 3.202 ns | 0.0103 ns | 0.0092 ns | 3.202 ns |
| Vector256 (AVX2) | 34 | AsciiOnly | 3.675 ns | 0.0237 ns | 0.0221 ns | 3.673 ns |
| Original | 50 | AsciiOnly | 17.487 ns | 0.0404 ns | 0.0358 ns | 17.479 ns |
| New | 50 | AsciiOnly | 10.246 ns | 0.0301 ns | 0.0267 ns | 10.241 ns |
| Encoding_GetBytes | 50 | AsciiOnly | 9.519 ns | 0.0257 ns | 0.0214 ns | 9.515 ns |
| Vector128 (original proposal) | 50 | AsciiOnly | 3.953 ns | 0.0508 ns | 0.0564 ns | 3.953 ns |
| Vector256 (AVX2) | 50 | AsciiOnly | 4.010 ns | 0.0270 ns | 0.0239 ns | 4.002 ns |
| Original | 84 | AsciiOnly | 21.932 ns | 0.0669 ns | 0.0626 ns | 21.931 ns |
| New | 84 | AsciiOnly | 11.791 ns | 0.0478 ns | 0.0399 ns | 11.789 ns |
| Encoding_GetBytes | 84 | AsciiOnly | 10.306 ns | 0.0329 ns | 0.0257 ns | 10.306 ns |
| Vector128 (original proposal) | 84 | AsciiOnly | 5.586 ns | 0.0646 ns | 0.0863 ns | 5.609 ns |
| Vector256 (AVX2) | 84 | AsciiOnly | 4.795 ns | 0.0156 ns | 0.0146 ns | 4.790 ns |
| Original | 170 | AsciiOnly | 43.624 ns | 0.1264 ns | 0.1056 ns | 43.634 ns |
| New | 170 | AsciiOnly | 13.350 ns | 0.0441 ns | 0.0391 ns | 13.349 ns |
| SealedEncoding_If_Ptr | 170 | AsciiOnly | 10.782 ns | 0.1034 ns | 0.0916 ns | 10.763 ns |
| Vector128 (original proposal) | 170 | AsciiOnly | 10.608 ns | 0.1151 ns | 0.1859 ns | 10.627 ns |
| Vector256 (AVX2) | 170 | AsciiOnly | 7.153 ns | 0.0321 ns | 0.0251 ns | 7.156 ns |

Mostly Ascii case

| Method | StringLengthInChars | Scenario | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|---|
| Original | 5 | Mixed | 3.668 ns | 0.0232 ns | 0.0217 ns | 3.661 ns |
| New | 5 | Mixed | 3.061 ns | 0.0112 ns | 0.0099 ns | 3.061 ns |
| Encoding_GetBytes | 5 | Mixed | 6.941 ns | 0.0301 ns | 0.0252 ns | 6.933 ns |
| Vector128 (original proposal) | 5 | Mixed | 3.979 ns | 0.0208 ns | 0.0174 ns | 3.981 ns |
| Vector256 (AVX2) | 5 | Mixed | 4.312 ns | 0.0533 ns | 0.0473 ns | 4.306 ns |
| Original | 8 | Mixed | 14.868 ns | 0.1622 ns | 0.4385 ns | 14.713 ns |
| New | 8 | Mixed | 12.059 ns | 0.0981 ns | 0.0869 ns | 12.049 ns |
| Encoding_GetBytes | 8 | Mixed | 10.136 ns | 0.0419 ns | 0.0392 ns | 10.140 ns |
| Vector128 (original proposal) | 8 | Mixed | 13.309 ns | 0.0433 ns | 0.0405 ns | 13.313 ns |
| Vector256 (AVX2) | 8 | Mixed | 14.186 ns | 0.0908 ns | 0.0759 ns | 14.168 ns |
| Original | 34 | Mixed | 32.739 ns | 0.3621 ns | 1.0563 ns | 32.527 ns |
| New | 34 | Mixed | 17.102 ns | 0.0567 ns | 0.0503 ns | 17.094 ns |
| SealedEncoding_If_Ptr | 34 | Mixed | 18.330 ns | 0.0560 ns | 0.0468 ns | 18.327 ns |
| Vector128 (original proposal) | 34 | Mixed | 19.595 ns | 0.0523 ns | 0.0436 ns | 19.607 ns |
| Vector256 (AVX2) | 34 | Mixed | 20.805 ns | 0.2183 ns | 0.3462 ns | 20.880 ns |
| Original | 50 | Mixed | 47.258 ns | 0.5794 ns | 1.7083 ns | 46.619 ns |
| New | 50 | Mixed | 20.686 ns | 0.0815 ns | 0.0722 ns | 20.677 ns |
| Encoding_GetBytes | 50 | Mixed | 20.020 ns | 0.0785 ns | 0.0696 ns | 20.011 ns |
| Vector128 (original proposal) | 50 | Mixed | 23.761 ns | 0.1688 ns | 0.1579 ns | 23.730 ns |
| Vector256 (AVX2) | 50 | Mixed | 24.034 ns | 0.0707 ns | 0.0662 ns | 24.053 ns |
| Original | 84 | Mixed | 77.078 ns | 0.7535 ns | 1.6696 ns | 76.631 ns |
| New | 84 | Mixed | 29.049 ns | 0.0639 ns | 0.0597 ns | 29.047 ns |
| Encoding_GetBytes | 84 | Mixed | 28.414 ns | 0.0752 ns | 0.0666 ns | 28.416 ns |
| Vector128 (original proposal) | 84 | Mixed | 31.854 ns | 0.3257 ns | 0.4235 ns | 31.655 ns |
| Vector256 (AVX2) | 84 | Mixed | 32.396 ns | 0.0955 ns | 0.0847 ns | 32.380 ns |
| Original | 170 | Mixed | 165.831 ns | 1.1964 ns | 1.1191 ns | 166.444 ns |
| New | 170 | Mixed | 53.783 ns | 0.2215 ns | 0.1964 ns | 53.753 ns |
| Encoding_GetBytes | 170 | Mixed | 52.745 ns | 0.1562 ns | 0.1461 ns | 52.748 ns |
| Vector128 (original proposal) | 170 | Mixed | 55.608 ns | 0.2007 ns | 0.1877 ns | 55.603 ns |
| Vector256 (AVX2) | 170 | Mixed | 57.323 ns | 0.5780 ns | 0.9333 ns | 56.901 ns |

}

internal static SealedUTF8Encoding UTF8NoBom { get; } = new SealedUTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: false);
internal static SealedUTF8Encoding ValidatingUTF8 { get; } = new SealedUTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
Contributor Author

@Daniel-Svensson Daniel-Svensson Mar 26, 2023

This was originally a temporary part of moving vector code to Encoding class.

It does not seem to make any impact to datacontract serialisation at the moment so I can revert the changes if you want that. From the code it looks like improvements would mainly be from classes calling into XmlConverter which uses this encoding directly


while (true)
// Fast path for small strings, use Encoding.GetBytes for larger strings since it is faster when vectorization is possible
if ((uint)charCount < 25)
Contributor Author

Calling into encoding actually seems to be faster from 25 characters up, but that is when we don't need to handle branch mispredictions, so I increased the cutoff to 32 to handle mispredictions without too much effect on performance.

The "microbenchmarks" showed a >5% regression for the (text-based) DataContractSerializer when a long class name was mixed with many short strings and names and encoding was called from 25 chars and up. (In the same case the binary serializer was 10% faster.) Now they show maybe a 1% regression and a 5% improvement, but other things might differ since there is no R2R or PGO in a local build.

@StephenMolloy
Member

Test failure appears to be unrelated. #64227

Getting back to these serializer PR's... I was going to suggest that the vectorization stuff would be better handled at the encoding layer. I am sure the folks watching over the encoding classes would welcome the kind of improvement your initial testing was showing. But I see that's already been updated.

I would remove the sealed encoding classes. The calls to them generate 'callvirt's anyway, so there isn't really a performance win there as you've already noticed.

@StephenMolloy
Member

/azp run runtime-community

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StephenMolloy
Member

/azp run runtime-community

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StephenMolloy StephenMolloy dismissed GrabYourPitchforks’s stale review April 4, 2023 22:10

Vectorizing of UTF8 was removed. This is just tweaking the hand-rolling that has already existed here for years.

@StephenMolloy StephenMolloy merged commit e0c94f8 into dotnet:main Apr 4, 2023
@Daniel-Svensson Daniel-Svensson deleted the binary_xml_text branch April 5, 2023 17:44
@ghost ghost locked as resolved and limited conversation to collaborators May 5, 2023
Labels
area-Serialization community-contribution Indicates that the PR has been added by a community member

8 participants