Add gxhash #279

ogxd · 2023-10-15T12:12:58Z

I have been working on a non-cryptographic hash algorithm with performance in mind. The idea is to leverage modern hardware capabilities as much as possible for maximum throughput (SIMD intrinsics, ILP, and some tricks).
For context, I don't consider myself an expert in cryptography; I simply fell down the rabbit hole... At the beginning, my objective was to improve a simple hash algorithm in C#. Limited by what C# offers, but teased by the possibilities I envisioned along the way, I rewrote it in Rust. Optimization after optimization, the algorithm started to outperform many of its counterparts, so at that point I decided to give it a name and to write a paper on it (it's still a rough draft!).

The algorithm is named GxHash and has the following features:

- The fastest non-cryptographic algorithm of all? It seems to outperform xxHash and t1ha (all versions) on both ARM64 and x86-64, and by quite a margin
- Quite simple codewise
- Can output 32-bit and 64-bit wide hashes (possibly 128-bit also)

The algorithm isn't stabilized yet: the version I have reimplemented in C and included in this PR is a variation with a more robust compress implementation. With this tweak, the algorithm seems to pass all SMHasher tests, but at the cost of performance. Even so, on my laptop (MacBook M1 Pro) it still seems to outperform all of its counterparts. It can possibly be made as robust without giving up as much performance; this is WIP.

One important thing about this algorithm is that, in its current form, it will generate different hashes on two machines with different SIMD register widths. So it's best used "in-process", for hashtables for instance.
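To illustrate why the register width can leak into the output, here is a toy sketch. It is not GxHash: `toy_hash`, `rotl64`, `load_le64` and the mixing steps are hypothetical and deliberately weak, chosen only so the effect is easy to follow. The input is folded into as many 64-bit lanes as fit in the simulated register, so the same bytes land in different lanes at 16 and 32 bytes of width and the final fold differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate left; r must be in 1..63. */
static uint64_t rotl64(uint64_t x, unsigned r) {
    return (x << r) | (x >> (64 - r));
}

/* Read one little-endian 64-bit word, independent of host endianness. */
static uint64_t load_le64(const unsigned char *p) {
    uint64_t w = 0;
    for (int i = 7; i >= 0; i--) {
        w = (w << 8) | p[i];
    }
    return w;
}

/* Toy hash, NOT GxHash. `width` is the simulated SIMD register width in
 * bytes (16 or 32); `len` must be a multiple of `width`. */
static uint64_t toy_hash(const unsigned char *data, size_t len, size_t width) {
    assert(width == 16 || width == 32);
    assert(len % width == 0);

    uint64_t lane[4] = {0, 0, 0, 0};
    size_t lanes = width / 8;

    /* Compress: each block is mixed lane by lane into the state. */
    for (size_t off = 0; off < len; off += width) {
        for (size_t i = 0; i < lanes; i++) {
            lane[i] = rotl64(lane[i], 7) ^ load_le64(data + off + 8 * i);
        }
    }

    /* Finalize: fold the lanes into a single 64-bit value. */
    uint64_t h = 0;
    for (size_t i = 0; i < lanes; i++) {
        h = rotl64(h, 13) ^ lane[i];
    }
    return h;
}
```

Calling `toy_hash` on the same 32-byte buffer with `width` set to 16 and then 32 yields two different values, which is the same kind of discrepancy described above, only in miniature.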

Regarding SMHasher and this PR, I had to cheat a bit to be able to build on my ARM MacBook. What's the recommended way to build / testall on this platform? @rurban

Any feedback is welcome on the PR and even the algorithm itself. I did all of this by myself, so I may have missed something!

rurban · 2023-10-15T12:17:48Z

I'll check what we can do. Esp. the SIMD width discrepancy is unfortunate.

ogxd · 2023-10-15T12:50:18Z

> The SIMD width discrepancy is unfortunate

We could stick to 128-bit SIMD intrinsics on x86, but at the cost of lower throughput. I think it's best in terms of usability to hide this from the user (have the algorithm pick whatever instruction set is available to reach the maximum throughput it can).

> the algorithm seems to pass all SMHasher tests, but at the cost of performance. Still, on my laptop (Macbook M1 pro) it seems like it still outperforms all of its counterparts

Here are some raw numbers for this platform:

| | xxHash64 | gxHash64 |
|---|---|---|
| Small key speed test (average) | 29 cycles/hash | 19 cycles/hash |
| Bulk speed test (average) | 27458 MiB/sec | 78359 MiB/sec |

ogxd · 2023-10-16T12:07:31Z

For now I see two options:

Option 1: Make `GxHash` only use 128-bit SIMD, with a derived version `GxHash_VAES` using 256-bit SIMD

Given the quite constraining VAES requirement for the 256-bit SIMD version, have GxHash only use 128-bit SIMD, even on x86, and add another version, GxHash_VAES (or GxHash_256?), which uses 256-bit SIMD for even better performance when the platform supports it. This is similar to what has been done with the several t1ha0 versions.

#  ifndef _MSC_VER
{ t1ha0_ia32aes_noavx_test,  64, 0xF07C4DA5, "t1ha0_aes_noavx", "Fast Positive Hash (AES-NI)", GOOD, {} },
#  endif                         ^ different verification hashes
#  if defined(__AVX__)
{ t1ha0_ia32aes_avx1_test,   64, 0xF07C4DA5, "t1ha0_aes_avx1",  "Fast Positive Hash (AES-NI & AVX)", GOOD, {} },
#  endif /* __AVX__ */           ^ different verification hashes
#  if defined(__AVX2__)
{ t1ha0_ia32aes_avx2_test,   64, 0x8B38C599, "t1ha0_aes_avx2",  "Fast Positive Hash (AES-NI & AVX2)", GOOD, {} },
#  endif /* __AVX2__ */          ^ different verification hashes

Pros

The GxHash base version now returns the same hashes, independently of the platform it runs on. The GxHash_VAES version will perform better, but requires VAES and AVX2 and returns different hashes than GxHash (which now makes sense, since we can consider that it's not exactly the same algorithm).

Cons

GxHash will not leverage the full width of SIMD on hardware that supports VAES and AVX2. In that case GxHash_VAES can be used instead for maximum throughput, but then usage is platform-specific and the user must be aware that it even exists. Possibly, though, the need for that extreme performance is so niche that we can assume someone looking for crazy optimization is well informed about the variants available.

Option 2: Accept the SIMD width discrepancy as a design choice (it was initially)

There will be a single version of GxHash that chooses the highest SIMD width available.
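As a rough sketch of what "choose the highest SIMD width" could look like at compile time, the snippet below keys off the compiler's predefined feature macros. The names `GX_STATE_BITS`, `GX_BACKEND`, `gx_state_bits` and `gx_backend` are hypothetical, and it assumes the 128-bit x86 path relies on AES-NI; the actual selection logic in the PR may differ. Note that MSVC does not define `__AES__` or `__VAES__`, so a real build would need extra handling there.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. The widest path that the
 * compiler flags allow is selected first. */
#if defined(__VAES__) && defined(__AVX2__)
#  define GX_STATE_BITS 256
#  define GX_BACKEND "x86 VAES + AVX2"
#elif defined(__AES__) && defined(__SSE2__)
#  define GX_STATE_BITS 128
#  define GX_BACKEND "x86 AES-NI"
#elif defined(__ARM_NEON) && \
      (defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#  define GX_STATE_BITS 128
#  define GX_BACKEND "ARM NEON + AES"
#else
#  define GX_STATE_BITS 0
#  define GX_BACKEND "none (no AES SIMD path compiled in)"
#endif

/* The state is either absent or a whole number of 128-bit registers. */
static_assert(GX_STATE_BITS % 128 == 0, "unexpected state width");

/* Width of the hash state that this build would use, in bits. */
static int gx_state_bits(void) { return GX_STATE_BITS; }

/* Human-readable name of the selected code path. */
static const char *gx_backend(void) { return GX_BACKEND; }
```

With this kind of selection, two builds of the same source can report different values of `gx_state_bits()`, which is exactly where the differing hashes come from.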

Pros

A user always gets the maximum performance on their platform

Cons

The algorithm is not "portable" (not sure if that's the right terminology). I mean that the algorithm may output different hashes depending on the platform it runs on.

Other options?

ogxd · 2023-10-16T23:32:43Z

After the latest tweaks I was able to squeeze out a little more performance while still passing all tests. It would be difficult for me to make it any faster without affecting quality. I think I'll stick with that implementation.

| | xxHash64 | gxHash64 |
|---|---|---|
| Small key speed test (average) | 29 cycles/hash | 18 cycles/hash |
| Bulk speed test (average) | 27458 MiB/sec | 89243 MiB/sec |

Now the question is whether I should opt for option 1 (like t1ha) or option 2 (and how to fix the CI 😅)

ogxd · 2023-10-27T19:16:57Z

I just pushed one more optimization (still passing the SMHasher tests ofc). I also took the opportunity to run https://github.com/tkaitchuck/aHash since it was advertised as "the fastest, DOS-resistant hash currently available in Rust".

Here are the results, still running on my Macbook M1 pro (ARM64):

| | aHash64 | xxHash64 | gxHash64 |
|---|---|---|---|
| Small key speed test (average) | 40 cycles/hash | 29 cycles/hash | 19 cycles/hash |
| Bulk speed test (average) | 21706 MiB/sec | 27458 MiB/sec | 120124 MiB/sec |

Here are the numbers I got for the same 128-bit state version of the algorithm but on x86 (Ryzen 5).

| | xxHash64 | gxHash64 |
|---|---|---|
| Small key speed test (average) | 23 cycles/hash | 20 cycles/hash |
| Bulk speed test (average) | 15711 MiB/sec | 88662 MiB/sec |

The 256-bit state version should be even faster, but my PC does not support VAES instructions, so I need to find another beast to experiment with.

ogxd · 2023-11-03T13:21:04Z

Hi @rurban, while finalizing my work on gxHash64 I noticed very different timings between my benchmark setup and the SMHasher speed test. For instance, with SMHasher I get worse performance for FNV1a on small inputs, with throughput rising up to inputs of 128 bytes and only then degrading as the size continues to increase. In all the FNV1a benchmarks I've seen, FNV1a 64 performs best for inputs of size 8, and throughput then decreases as size increases.

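| Input size (bytes) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | 65536 | 131072 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FNV1a | 865.39 | 1090.12 | 1238.12 | 1349.01 | 1407.81 | 2201.56 | 1768.16 | 1612.67 | 1528.07 | 1492.53 | 1478.12 | 1468.69 | 1462.33 | 1454.60 | 1460.20 | 1456.04 |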

I think this is because of the loop overhead itself in timehash_small.
On my side, I changed the speed test to always use timehash_small, but manually unrolled 10 times instead of using a loop, and I get very different, but I think more representative, results:

| Input size (bytes) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | 65536 | 131072 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FNV1a | 1907.35 | 1978.48 | 1700.04 | 1523.83 | 1481.14 | 1476.72 | 1462.21 | 1470.60 | 1460.58 | 1466.06 | 1460.80 | 1463.39 | 1458.17 | 1451.01 | 1458.13 |
| metroHash64 | 1907.35 | 3814.70 | 4332.77 | 4715.88 | 7344.15 | 11374.00 | 17087.64 | 23065.38 | 27846.92 | 31166.06 | 33266.34 | 34663.77 | 35170.26 | 35584.41 | 35896.26 | 35933.13 |
| xxHash64 | 616.31 | 1124.22 | 2041.58 | 3185.89 | 5683.20 | 9258.81 | 14101.60 | 18999.46 | 22045.01 | 24024.22 | 25137.27 | 25972.35 | 26499.82 | 26554.76 | 26752.84 | 26853.30 |
| gxHash64 | 1907.35 | 3814.70 | 7629.39 | 15258.79 | 18947.76 | 22370.52 | 29933.30 | 45861.16 | 65807.79 | 91524.45 | 110892.09 | 126706.16 | 136029.49 | 141768.25 | 145343.30 | 145712.61 |
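For readers who want to try the manual-unrolling idea, here is a minimal sketch. It does not reproduce SMHasher's timehash_small; `ns_per_hash`, `HASH_STEP` and `REPEAT10` are hypothetical names, and it uses the standard `clock()` rather than a cycle counter, so the absolute numbers will not match the tables. Each result is fed back into the buffer so the compiler cannot collapse the ten calls into one, which means it measures latency rather than peak throughput.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* FNV-1a 64 with the standard offset basis and prime. */
static uint64_t fnv1a64(const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* One timed step: hash, then feed the result back into the input so
 * consecutive calls depend on each other. */
#define HASH_STEP() do { h = fnv1a64(buf, len); buf[0] ^= (unsigned char)h; } while (0)

/* Repeat a statement ten times without a loop counter or branch. */
#define REPEAT10(s) do { s; s; s; s; s; s; s; s; s; s; } while (0)

/* Keeps the final hash observable so the work is not optimized away. */
static volatile uint64_t g_sink;

/* Average nanoseconds per hash call over rounds * 10 calls. */
static double ns_per_hash(unsigned char *buf, size_t len, int rounds) {
    assert(len > 0 && rounds > 0);
    uint64_t h = 0;
    clock_t t0 = clock();
    for (int r = 0; r < rounds; r++) {
        REPEAT10(HASH_STEP());  /* loop overhead is paid once per 10 calls */
    }
    clock_t t1 = clock();
    g_sink = h;
    double total_ns = (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC;
    return total_ns / (rounds * 10.0);
}
```

Because `clock()` has coarse resolution, `rounds` needs to be large (millions for small inputs) before the average becomes meaningful.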

Benchmarking methods that execute in a very small number of cycles seems quite difficult. Unrolling may prevent inlining at some point, but I'm guessing there are other ways, like subtracting the loop overhead from the timing (getting the overhead from an empty warmup or by estimating it; I'm not sure which is the most accurate way).

What do you think?

Note: This was run on a MacBook, so aarch64.

Add gxhash

1544c07

Fix X86 build

a6e9f0e

ogxd force-pushed the gxhash branch from 0625dbc to a6e9f0e on October 15, 2023 16:00

List CPU features in CI

0e4ce59

ogxd marked this pull request as draft October 15, 2023 21:46

Add no VAES X86 version

efa649e

ogxd force-pushed the gxhash branch from 6fe999b to efa649e on October 15, 2023 22:19

ogxd added 2 commits October 16, 2023 00:23

Fix preprocessor and expected hash for sanity test

2dda43f

Try fix armv7 build

b99764e

rurban self-assigned this Oct 16, 2023

rurban added the hash_new label Oct 16, 2023

rurban and others added 3 commits October 17, 2023 01:18

gxhash: fix MSVC cflags

a895848

chmod -x aesni-hash.h

773bc7c

Tweaks for +20% performance for small and large inputs (on ARM)

bce1da3

ogxd added 2 commits October 17, 2023 01:35

Revert "List CPU features in CI"

169d6be

This reverts commit 0e4ce59.

Improve performances for small and large hashes

f261be5

ogxd and others added 3 commits October 27, 2023 22:14

Don't use distinct state and output

688f38e

Fix several issues with gxHash on x86

4baff5b

Fix endianness (128-bit state hashes are now the same accross all platforms)

80391a4

This was referenced Nov 16, 2023

Is it time to replace Marvin? dotnet/runtime#85206

Open

Add interfaces to popular Rust hash libraries #276

Open

rurban marked this pull request as ready for review November 23, 2024 12:06

rurban merged commit b081a0a into rurban:master Nov 23, 2024
8 of 10 checks passed
