When calculating with xxhash128, it is slower. #103539

DogFortune · 2024-06-16T13:23:57Z

Hi

I am building an application that calculates the hash value of a file using xxhash.
https://github.com/DogFortune/F2Checker/blob/63f6402ab041db50363a2e37bb2ccd180d04f9e4/F2Checker/Models/AppContext.cs#L29

I know that xxhash is very fast. However, we are currently encountering a problem where the speed is not as fast as we would like.
hanging the hash algorithm to SHA256 improves speed.

These are the results measured in the release build. Measurements were taken on 14 GiB mp4 files.

xxhash128: 350MB/s
SHA256: 1.63GB/s

Please tell me why performance is not improving.

I am Japanese and not good at English, so I am sorry if I am wrong.

Configuration

.NET 8
.NET MAUI
macOS Sonoma 14.5
Apple M1 Pro

stephentoub · 2024-06-16T14:00:48Z

Please share a standalone benchmark using benchmarkdotnet. My guess is you're measuring the native code behind SHA256 against unoptimized C# due to not having tiered up yet to optimized code. Measuring a single run with DateTime is not going to give you a representative answer for sustained throughput.

vcsjones · 2024-06-16T15:03:04Z

Using XxHash128 on an Apple M1 Ultra, I get better throughput.

using System.IO;
using System.IO.Hashing;
using System.Security.Cryptography;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>();

[MemoryDiagnoser]
public class Bench {
    [Benchmark]
    public byte[] BenchSHA256()
    {
        using var fs = File.OpenRead("/Users/vcsjones/Projects/scratch/input.bin");
        return SHA256.HashData(fs);
    }

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        using var fs = File.OpenRead("/Users/vcsjones/Projects/scratch/input.bin");
        XxHash128 hash = new();
        hash.Append(fs);
        return hash.GetHashAndReset();
    }
}

Method	Mean	Error	StdDev	Allocated
BenchSHA256	8.118 s	0.0091 s	0.0081 s	5.46 KB
BenchXxHash128	3.032 s	0.0139 s	0.0123 s	5.77 KB

Where input.bin is a 14 GB random binary file.

I will also note that in your example, you are reading from the file in chunks and then appending it to either SHA256 or XxHash128. If all you need is one of them, then both SHA256 and XxHash128 have the ability to read directly from a stream, as demonstrated above. In the case of SHA256, using the static SHA256.HashData will be significantly more efficient. There are also Async versions that accept a cancellation token to use if you need them.

EgorBo · 2024-06-16T23:04:36Z

I ran this bench on Azure arm64 vs x64 and arm64 seems to be 3x slower (x64 has avx2 though), see #99871 (comment)

after inspecting asm, it seems that arm64 impl suffers a lot from Vector128<ulong> * Vector128<ulong> operation which is not accelerated

EgorBo · 2024-06-16T23:16:33Z

Ah, same on x64 when Avx512 is not presented. Basically, XxHash128 hits this issue (in ScrambleAccumulator*):

Vector128<ulong> Mul(Vector128<ulong> a, Vector128<ulong> b) => a * b;

Avx512:

; Method Bench:Mul
       vmovups  xmm0, xmmword ptr [r8]
       vpmullq  xmm0, xmm0, xmmword ptr [r9]
       vmovups  xmmword ptr [rdx], xmm0
       mov      rax, rdx
       ret      
; Total bytes of code: 19

Avx2 only:

; Method Bench:Mul
       push     rsi
       push     rbx
       sub      rsp, 104
       mov      rbx, rdx
       mov      rdx, qword ptr [r8]
       mov      qword ptr [rsp+0x58], rdx
       mov      rdx, qword ptr [r9]
       mov      qword ptr [rsp+0x50], rdx
       mov      rdx, qword ptr [rsp+0x58]
       imul     rdx, qword ptr [rsp+0x50]
       mov      qword ptr [rsp+0x60], rdx
       mov      rsi, qword ptr [rsp+0x60]
       mov      rdx, qword ptr [r8+0x08]
       mov      qword ptr [rsp+0x40], rdx
       mov      rdx, qword ptr [r9+0x08]
       mov      qword ptr [rsp+0x38], rdx
       mov      rcx, qword ptr [rsp+0x40]
       mov      rdx, qword ptr [rsp+0x38]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+0x48], rax
       mov      rax, qword ptr [rsp+0x48]
       mov      qword ptr [rsp+0x20], rsi
       mov      qword ptr [rsp+0x28], rax
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx
       add      rsp, 104
       pop      rbx
       pop      rsi
       ret      
; Total bytes of code: 120

Arm64:

; Method Bench:Mul
G_M34924_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x60]!
            str     d8, [sp, #0x58]
            mov     fp, sp
            mov     v16.16b, v0.16b
            str     d16, [fp, #0x48]	// [V08 tmp5]
            mov     v16.16b, v1.16b
            str     d16, [fp, #0x40]	// [V09 tmp6]
            ldp     x0, x1, [fp, #0x40]	// [V09 tmp6], [V08 tmp5]
            mul     x1, x0, x1
            str     x1, [fp, #0x50]	// [V06 tmp3]
            ldr     d8, [fp, #0x50]	// [V06 tmp3]
            ext     v16.16b, v0.16b, v0.16b, #8
            str     d16, [fp, #0x30]	// [V25 tmp22]
            ext     v16.16b, v1.16b, v1.16b, #8
            str     d16, [fp, #0x28]	// [V26 tmp23]
            ldp     x1, x0, [fp, #0x28]	// [V26 tmp23], [V25 tmp22]
            movz    x2, #0xD740      // code for System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong
            movk    x2, #0xB2A5 LSL #16
            movk    x2, #0x7FFA LSL #32
            ldr     x2, [x2]
            blr     x2 // calling Scalar`1[ulong]:Multiply(ulong,ulong):ulong helper
            str     x0, [fp, #0x38]	// [V23 tmp20]
            ldr     d0, [fp, #0x38]	// [V23 tmp20]
            ldr     q16, [fp, #0x10]	// [V38 tmp35]
            ins     v16.d[0], v8.d[0]
            ins     v16.d[1], v0.d[0]
            mov     v0.16b, v16.16b
            ldr     d8, [sp, #0x58]
            ldp     fp, lr, [sp], #0x60
            ret     lr
; Total bytes of code: 120

TL;DR: XxHash128 hits some terrible penalty on Arm64 and x64 without Avx512

DogFortune · 2024-06-17T00:49:37Z

The only hash function required is xxhash. I added this tentatively to see what happens if the hash function is changed to SHA256.
I plan to implement a progress bar in the future, so I load it in chunks.
I intend to measure benchmarks in my environment as well. I will share the results when they are available.

Where input.bin is a 14 GB random binary file.

I will also note that in your example, you are reading from the file in chunks and then appending it to either SHA256 or XxHash128. If all you need is one of them, then both SHA256 and XxHash128 have the ability to read directly from a stream, as demonstrated above. In the case of SHA256, using the static SHA256.HashData will be significantly more efficient. There are also Async versions that accept a cancellation token to use if you need them.

DogFortune · 2024-06-17T01:34:32Z

    [Benchmark]
    public byte[] BenchSHA256()
    {
        using var fs = File.OpenRead("test.mp4");
        return SHA256.HashData(fs);
    }
    
    [Benchmark]
    public string BenchXxHash128()
    {
        using var fs = File.OpenRead("test.mp4");
        XxHash128 xxhash = new();
        xxhash.Append(fs);
        var hash = Convert.ToHexString(xxhash.GetHashAndReset()).ToLower();
        Console.WriteLine(hash);
        return hash;
    }

	[Benchmark]
    public async Task<string> GetHashAsync()
    {
        var filename = "test.mp4";
        int _bufferSize = 1024 * 1024 * 10;
        var Buffer = new byte[_bufferSize];
        var Status = string.Empty;
        using (var entryStream = File.OpenRead(filename))
        {
            var xxhash = new XxHash128();
            long totalBytesRead = 0;
            var bytesRead = await entryStream.ReadAsync(Buffer, 0, Buffer.Length)
                .ConfigureAwait(false);
            var startTime = DateTime.Now;
            var speed = new ByteSize();
            while (bytesRead > 0)
            {
                xxhash.Append(Buffer.AsSpan(0, bytesRead));
                totalBytesRead += bytesRead;
                speed = ByteSize.FromBytes(totalBytesRead / (DateTime.Now - startTime).TotalSeconds);
                Status = $"{speed.LargestWholeNumberValue:#.##}{speed.LargestWholeNumberSymbol}/s";
                Console.WriteLine(Status);
                bytesRead = await entryStream.ReadAsync(Buffer, 0, Buffer.Length)
                    .ConfigureAwait(false);
            }
            Array.Clear(Buffer, 0, Buffer.Length);
            return Convert.ToHexString(xxhash.GetHashAndReset()).ToLower();
        }
    }

// * Summary *

BenchmarkDotNet v0.13.12, macOS Sonoma 14.5 (23F79) [Darwin 23.5.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD

Method	Mean	Error	StdDev	Allocated
BenchSHA256	9.978 s	0.1320 s	0.1235 s	5.46 KB
BenchXxHash128	4.203 s	0.0452 s	0.0401 s	5.94 KB
GetHashAsync	2.623 s	0.0396 s	0.0309 s	10.27 MB

The speed indicated by GetHashAsync was 5.5GB/s. Benchmark measured results were very high performance, so we can conclude that there is no problem with the code.
Is the fact that it is a GUI application using net8.0-maccatalyst the cause of the performance degradation?

huoyaoyuan · 2024-06-17T05:50:03Z

@DogFortune Your benchmark code contains too many mixture of unrelated stuffs.

However, it seems that the default buffer size for Append(Stream) is too small:

    [Benchmark]
    [Arguments(4096)]
    [Arguments(4096 * 1024)]
    public UInt128 GetHash(int bufferSize)
    {
        using var fs = File.OpenRead(@"D:\data.zip"); // 162 MB
        XxHash128 hash = new XxHash128();
        byte[] buffer = ArrayPool<byte>.Shared.Rent(bufferSize);
        int bytesRead;
        while ((bytesRead = fs.Read(buffer)) != 0)
        {
            hash.Append(buffer.AsSpan(0, bytesRead));
        }
        ArrayPool<byte>.Shared.Return(buffer);

        return hash.GetCurrentHashAsUInt128();
    }

Method	bufferSize	Mean	Error	StdDev
GetHash	4096	131.24 ms	2.343 ms	2.077 ms
GetHash	4194304	45.92 ms	0.911 ms	2.130 ms

4KB is the default buffer size used for Append(Stream). Increasing it to 4MB significantly improves the performance.
Your code is using a large buffer size (10MB) and it's probably the reason of performance difference.

DogFortune · 2024-06-17T06:03:28Z

@huoyaoyuan I set the buffer to 4MB but no improvement.

When I run the benchmark in my environment, I get the desired performance, but when I run the same code in MAUI, the performance is extremely poor.
The hash calculation process was separated as a .NET Standard 2.0 project and referenced from the console application, which resulted in high performance.

huoyaoyuan · 2024-06-17T06:24:28Z

I set the buffer to 4MB but no improvement.

To be consistent and comparable with Append(Stream), you should set the buffer size to 4KB.
Append(Stream) doesn't allow setting buffer size currently, but it would perform better than your code under the same buffer size.

huoyaoyuan · 2024-06-17T06:26:53Z

but when I run the same code in MAUI, the performance is extremely poor.

Are you targeting iOS or macOS? iOS will hit optimization of Mono runtime, which is yet another topic.

DogFortune · 2024-06-17T06:29:24Z

@huoyaoyuan The target is macOS and Windows, IOS is not included.
Right now, I would like to build a satisfactory app on macOS and then start on Windows.

am11 · 2024-06-17T07:59:25Z

Ah, same on x64 when Avx512 is not presented. Basically, XxHash128 hits this issue (in ScrambleAccumulator*):

@EgorBo, could it be just https://godbolt.org/z/eqsrf341M?

EgorBo · 2024-06-17T10:29:51Z

Ah, same on x64 when Avx512 is not presented. Basically, XxHash128 hits this issue (in ScrambleAccumulator*):

@EgorBo, could it be just https://godbolt.org/z/eqsrf341M?

it looks like the x64 impl is not correct? we need _mm_mullo_epi64 which is only available on avx512 (-mavx512vl -mavx512dq)

NOTE: we need to accelerate multiplication for both ulong and signed long

am11 · 2024-06-17T11:08:55Z

it looks like the x64 impl is not correct? we need _mm_mullo_epi64 which is only available on avx512 (-mavx512vl -mavx512dq)

Ah, here is the special casing for avx512dq, avx2 and sse2: https://godbolt.org/z/aWr6rhbxh

vcsjones · 2024-06-17T12:42:26Z

MAUI

MAUI uses the Mono runtime in macOS.

For Android, iOS, and macOS, the environment is implemented by Mono

So it seems the that target runtime in question here is Mono, not CoreCLR.

EgorBo · 2024-06-17T14:55:45Z

it looks like the x64 impl is not correct? we need _mm_mullo_epi64 which is only available on avx512 (-mavx512vl -mavx512dq)

Ah, here is the special casing for avx512dq, avx2 and sse2: https://godbolt.org/z/aWr6rhbxh

tried these, still don't match what avx512 instruction produces when I run them through tests

dotnet-policy-service · 2024-07-17T14:18:41Z

Tagging subscribers to this area: @lambdageek, @steveisok
See info in area-owners.md if you want to be subscribed.

adamsitnik · 2024-07-17T14:19:08Z

It seems to be a Mono codegen issue, so I am moving it to area-Codegen-JIT-mono area.

adamsitnik · 2024-07-17T14:26:12Z

I took @stephentoub benchmark from #103669 and run it for .NET 9 CLR vs Mono:

dotnet run -c Release --filter * --runtimes net9.0 mono9.0

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-preview.6.24277.6
  [Host]     : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-CFIFYM : .NET 9.0.0 (9.0.24.27202), X64 RyuJIT AVX2
  Job-MISQRF : .NET 9.0.0 (9.0.24.27202) using MonoVM, X64 X86Base

Method	Runtime	Kind	Mean	Ratio	Allocated	Alloc Ratio
Hash	.NET 9.0	FileStream	111.71 us	1.00	600 B	1.00
Hash	Mono with .NET 9.0	FileStream	3,160.01 us	28.29	602 B	1.00

Hash	.NET 9.0	MemoryStream	63.70 us	1.00	600 B	1.00
Hash	Mono with .NET 9.0	MemoryStream	3,195.97 us	50.17	602 B	1.00

lambdageek · 2024-07-17T14:42:33Z

BDN numbers using the same benchmark as #103539 (comment) on osx-arm64 with .NET 9 Preview 6:

Method	Runtime	Kind	Mean	Ratio	Gen0	Allocated	Alloc Ratio
Hash	.NET 9.0	FileStream	85.75 us	1.00	-	600 B	1.00
Hash	Mono with .NET 9.0	FileStream	1,559.28 us	18.17	-	857 B	1.43

Hash	.NET 9.0	MemoryStream	45.72 us	1.00	0.0610	600 B	1.00
Hash	Mono with .NET 9.0	MemoryStream	1,512.97 us	33.10	-	601 B	1.00

Still not great. Whatever it is that we're not doing, we're not doing it on both x64 and arm64

DogFortune added the tenet-performance Performance related issue label Jun 16, 2024

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 16, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jun 16, 2024

vcsjones added area-System.IO.Hashing and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 16, 2024

neon-sunset mentioned this issue Jun 17, 2024

Accelerate Vector128<long>::op_Multiply on x64 #103555

Merged

stephentoub mentioned this issue Jun 18, 2024

Use Stream.CopyTo{Async} in NonCryptographicHashAlgorithm.Append{Async} #103669

Merged

adamsitnik added area-Codegen-JIT-mono and removed area-System.IO.Hashing labels Jul 17, 2024

lambdageek removed the untriaged New issue has not been triaged by the area owner label Jul 17, 2024

lambdageek added this to the 9.0.0 milestone Jul 17, 2024

steveisok modified the milestones: 9.0.0, 10.0.0 Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When calculating with xxhash128, it is slower. #103539

When calculating with xxhash128, it is slower. #103539

DogFortune commented Jun 16, 2024 •

edited

Loading

stephentoub commented Jun 16, 2024

vcsjones commented Jun 16, 2024

EgorBo commented Jun 16, 2024 •

edited

Loading

EgorBo commented Jun 16, 2024 •

edited

Loading

DogFortune commented Jun 17, 2024

DogFortune commented Jun 17, 2024 •

edited

Loading

huoyaoyuan commented Jun 17, 2024

DogFortune commented Jun 17, 2024

huoyaoyuan commented Jun 17, 2024

huoyaoyuan commented Jun 17, 2024

DogFortune commented Jun 17, 2024 •

edited

Loading

am11 commented Jun 17, 2024

EgorBo commented Jun 17, 2024 •

edited

Loading

am11 commented Jun 17, 2024

vcsjones commented Jun 17, 2024 •

edited

Loading

EgorBo commented Jun 17, 2024

dotnet-policy-service bot commented Jul 17, 2024

adamsitnik commented Jul 17, 2024

adamsitnik commented Jul 17, 2024

lambdageek commented Jul 17, 2024

When calculating with xxhash128, it is slower. #103539

When calculating with xxhash128, it is slower. #103539

Comments

DogFortune commented Jun 16, 2024 • edited Loading

Configuration

stephentoub commented Jun 16, 2024

vcsjones commented Jun 16, 2024

EgorBo commented Jun 16, 2024 • edited Loading

EgorBo commented Jun 16, 2024 • edited Loading

DogFortune commented Jun 17, 2024

DogFortune commented Jun 17, 2024 • edited Loading

huoyaoyuan commented Jun 17, 2024

DogFortune commented Jun 17, 2024

huoyaoyuan commented Jun 17, 2024

huoyaoyuan commented Jun 17, 2024

DogFortune commented Jun 17, 2024 • edited Loading

am11 commented Jun 17, 2024

EgorBo commented Jun 17, 2024 • edited Loading

am11 commented Jun 17, 2024

vcsjones commented Jun 17, 2024 • edited Loading

EgorBo commented Jun 17, 2024

dotnet-policy-service bot commented Jul 17, 2024

adamsitnik commented Jul 17, 2024

adamsitnik commented Jul 17, 2024

lambdageek commented Jul 17, 2024

DogFortune commented Jun 16, 2024 •

edited

Loading

EgorBo commented Jun 16, 2024 •

edited

Loading

EgorBo commented Jun 16, 2024 •

edited

Loading

DogFortune commented Jun 17, 2024 •

edited

Loading

DogFortune commented Jun 17, 2024 •

edited

Loading

EgorBo commented Jun 17, 2024 •

edited

Loading

vcsjones commented Jun 17, 2024 •

edited

Loading