XXH32/XXH64 modernization #607
Comments
We can probably make multiple small steps progressively in this direction. Also, this effort might be partially linked to #550.
(╯°□°)╯︵ ┻━┻
Why is MSVC x86 allergic to unrolling fixed-iteration loops? Edit: Outlining and extracting without rerolling seems to be fine though...
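For concreteness, a "fixed iteration loop" here just means a loop with a compile-time-constant trip count, e.g. the following (a generic illustration, not xxHash code):

```c
/* A generic fixed-trip-count loop of the kind being discussed: the bound is
 * a compile-time constant, so most compilers replace the loop with four
 * copies of the body (unrolling); MSVC targeting x86 reportedly keeps it as
 * an actual loop. */
static unsigned sum4(const unsigned v[4])
{
    unsigned total = 0;
    int i;
    for (i = 0; i < 4; i++) {   /* constant trip count of 4 */
        total += v[i];
    }
    return total;
}
```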
This is what I was thinking. It uses some of the naming styles from XXH3.

```c
/*!
 * @internal
 * @brief Seeds the accumulator lanes for @ref XXH32().
 *
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @param seed The initial seed for the hash.
 */
XXH_FORCE_INLINE void
XXH32_resetAccs(xxh_u32 acc[4], xxh_u32 const seed)
{
    XXH_ASSERT(acc != NULL);
    acc[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2;
    acc[1] = seed + XXH_PRIME32_2;
    acc[2] = seed + 0;
    acc[3] = seed - XXH_PRIME32_1;
}

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH32().
 *
 * @param input, len Directly passed from @ref XXH32(). @p len must be >= 16.
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 16]`
 */
XXH_FORCE_INLINE xxh_u8 const*
XXH32_hashLong(xxh_u8 const* input, size_t len, xxh_u32 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 16;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    do {
        /* Note: MSVC x86 refuses to unroll this automatically. */
        acc[0] = XXH32_round(acc[0], XXH_get32bits(input + 0));
        acc[1] = XXH32_round(acc[1], XXH_get32bits(input + 4));
        acc[2] = XXH32_round(acc[2], XXH_get32bits(input + 8));
        acc[3] = XXH32_round(acc[3], XXH_get32bits(input + 12));
        input += 16;
    } while (--nbBlocks);
    return input;
}

/*!
 * @internal
 * @brief Merges the accumulator lanes to a single value for @ref XXH32()
 *
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @return The merged value
 */
XXH_FORCE_INLINE xxh_u32
XXH32_mergeAccs(xxh_u32 const acc[4])
{
    XXH_ASSERT(acc != NULL);
    return XXH_rotl32(acc[0], 1)  + XXH_rotl32(acc[1], 7)
         + XXH_rotl32(acc[2], 12) + XXH_rotl32(acc[3], 18);
}

/*!
 * @internal
 * @brief The implementation for @ref XXH32().
 *
 * @param input, len, seed Directly passed from @ref XXH32().
 * @param align Whether @p input is aligned.
 * @return The calculated hash.
 */
XXH_FORCE_INLINE xxh_u32
XXH32_endian_align(xxh_u8 const* input, size_t len, xxh_u32 seed, XXH_alignment align)
{
    xxh_u32 h32;
    if (input == NULL) XXH_ASSERT(len == 0);

    if (len >= 16) {
        xxh_u32 acc[4];
        XXH32_resetAccs(acc, seed);
        input = XXH32_hashLong(input, len, acc, align);
        h32 = XXH32_mergeAccs(acc);
    } else {
        h32 = seed + XXH_PRIME32_5;
    }
    h32 += (xxh_u32)len;
    return XXH32_finalize(h32, input, len % 16, align);
}
```
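For reference, the round function these helpers delegate to is unchanged by this refactor; in xxhash.h it looks roughly like this:

```c
/* The existing XXH32 round function (approximately as in xxhash.h); shown
 * only for context, the extraction above calls it unchanged. */
static xxh_u32 XXH32_round(xxh_u32 acc, xxh_u32 input)
{
    acc += input * XXH_PRIME32_2;
    acc  = XXH_rotl32(acc, 13);
    acc *= XXH_PRIME32_1;
    return acc;
}
```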
It looks good to me.
I think for XXH64, we should just use a nested loop for the bulk loop, as long as MSVC x64 unrolls it (but MSVC x64 is more liberal about unrolling anyway). 64-bit arithmetic is going to be hot garbage on MSVC x86 anyway thanks to

Side note: Extracting XXH64's internals in the same way somehow gave a slight boost to ARMv7-A with Clang 13 (1.5 GB/s -> 1.7 GB/s), even though it was inlined and unrolled just like before. 🤔
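A minimal sketch of that rerolled XXH64 bulk loop, assuming the same helper names as the XXH32 code above (XXH64_round, XXH_get64bits; the align parameter is consumed by the XXH_get64bits macro):

```c
/* Sketch: rerolled XXH64 bulk loop. Compilers that unroll fixed-count inner
 * loops (MSVC x64, GCC, Clang) should emit the same code as a hand-unrolled
 * version, with one round per accumulator lane per 32-byte block. */
static xxh_u8 const*
XXH64_hashLong(xxh_u8 const* input, size_t len, xxh_u64 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 32;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    do {
        size_t i;
        for (i = 0; i < 4; i++) {   /* fixed trip count: unrollable */
            acc[i] = XXH64_round(acc[i], XXH_get64bits(input));
            input += 8;
        }
    } while (--nbBlocks);
    return input;
}
```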
- Extract copy/pasted blocks into clean, inlined subroutines.
- Reroll XXH64's internal loop. Compilers unroll this anyways, and it allows the code to be more concise.
  - XXH32 is still unrolled because MSVC x86 refuses to do it automatically. :(
- Apply some stylistic choices from XXH3 (e.g. east const)
- Rename some state fields to match XXH3's state
- Convert the mem32 and mem64 fields to unsigned char arrays.
- Remove some dead macros.

None of these changes should break ABI, since the fields are the same size.
Draft at
I also changed the mem32/mem64 fields to unsigned char arrays, which shouldn't break binary ABI.
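A rough before/after of that buffer change (struct and field names here are illustrative only; the real state structs have more fields):

```c
/* Before (sketch): buffered partial input stored as four 32-bit words. */
typedef struct {
    xxh_u32 acc[4];
    xxh_u32 mem32[4];          /* 16 bytes of buffered input, as words */
    xxh_u32 memsize;
} XXH32_state_sketch_old;

/* After (sketch): the same 16 bytes typed as raw bytes, which is what the
 * streaming update path actually memcpy()s into. Size and offsets are
 * unchanged, so the binary ABI is preserved. */
typedef struct {
    xxh_u32 acc[4];
    unsigned char buffer[16];  /* was mem32[4] */
    xxh_u32 memsize;
} XXH32_state_sketch_new;
```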
Should we remove XXH_OLD_NAMES as well?
Let's plan that for
On a side note, I was toying with a mixed NEON/scalar XXH64. On my Pixel 4a, Clang and GCC get the same 2804 MB/s normally, but with half NEON and half scalar, Clang gets 3156 MB/s and GCC gets 2925 MB/s. Since I already have the code, I might as well make ARMv7-A do full NEON, and that actually gets 2704 MB/s on Clang compared to ~1 GB/s normally. However, the implementation is pretty ugly:

hybrid xxh64 neon:

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
#  define XXH_HAS_NEON
#  if defined(__GNUC__) || defined(__clang__)
#    include <arm_neon.h>
#  else
#    include <intrin.h>
#  endif

XXH_FORCE_INLINE uint64x2_t XXH_neon_mul64(uint32x2x2_t x, uint32x2_t y)
{
    uint64x2_t cross;
    /* grade school truncating multiply */
    cross = vmull_lane_u32(x.val[0], y, 1);
    cross = vmlal_lane_u32(cross, x.val[1], y, 0);
    cross = vshlq_n_u64(cross, 32);
    return vmlal_lane_u32(cross, x.val[0], y, 0);
}
#endif

#if defined(__aarch64__) || defined(__arm64__) || defined(_M_ARM64) || defined(_M_ARM64EC)
/* aarch64 does half NEON and half scalar */
#  define XXH64_SCALAR_ROUNDS 2
#  define XXH64_NEON_ROUNDS 1
#elif defined(XXH_HAS_NEON)
/* armv7-a uses full NEON */
#  define XXH64_SCALAR_ROUNDS 0
#  define XXH64_NEON_ROUNDS 2
#else
/* Everything else uses full scalar */
#  define XXH64_SCALAR_ROUNDS 4
#endif

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH64().
 *
 * @param input, len Directly passed from @ref XXH64(). @p len must be >= 32.
 * @param acc The 4 accumulator lanes from XXH64's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 32]`
 */
static xxh_u8 const*
XXH64_hashLong(xxh_u8 const* input, size_t len, xxh_u64 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 32;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    {
        size_t i;
#ifdef XXH_HAS_NEON
        uint64x2_t accNeon[XXH64_NEON_ROUNDS];
        uint32x2_t const prime2 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_2));
        uint32x2_t const prime1 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_1));
        /* Load NEON lanes */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            accNeon[i] = vld1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i]);
        }
#endif
        do {
            for (i = 0; i < XXH64_SCALAR_ROUNDS; i++) {
                acc[i] = XXH64_round(acc[i], XXH_get64bits(input));
                input += 8;
            }
#ifdef XXH_HAS_NEON
            for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
                /* interleaved load, putting input in place for mul64 */
                uint32x2x2_t pair = vld2_u32((uint32_t const *)input);
                /* input * PRIME64_2 */
                uint64x2_t tmp = XXH_neon_mul64(pair, prime2);
                uint64x2_t xacc = accNeon[i];
                /* acc += input */
                xacc = vaddq_u64(xacc, tmp);
                /* rotl(xacc, 31) >> 32 without dependency */
                pair.val[1] = vshrn_n_u64(xacc, 64 - 31 - 32);
                /* rotl(xacc, 31) */
                tmp = vshlq_n_u64(xacc, 31);
                xacc = vsriq_n_u64(tmp, xacc, 64 - 31);
                /* xacc & 0xFFFFFFFF */
                pair.val[0] = vmovn_u64(xacc);
                /* xacc *= PRIME64_1 */
                accNeon[i] = XXH_neon_mul64(pair, prime1);
                input += 16;
            }
#endif
        } while (--nbBlocks);
#ifdef XXH_HAS_NEON
        /* Store NEON lanes back */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            vst1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i], accNeon[i]);
        }
#endif
    }
    return input;
}
```

Side side note: Would a mixed SIMD/scalar split benefit XXH3 as well? The integer pipeline is basically unused during hashLong, and we might benefit from doing a few lanes scalar.

Edit: Holy shit, it does (at least on aarch64). Doing a 6:2 split on the NEON path on Clang makes it jump from 8.8 GB/s to 10.2 GB/s.
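The shape of that split idea for XXH3 might look something like this (a hypothetical sketch: the XXH3_NEON_LANES macro and the function name are made up for illustration, the NEON half is elided, and only the scalar tail follows XXH3's existing scalar round):

```c
/* Hypothetical sketch of a NEON/scalar lane split for XXH3's accumulate
 * step. XXH3 keeps 8 accumulator lanes; process the first XXH3_NEON_LANES
 * with NEON and the remainder with plain integer code, so the SIMD and
 * scalar pipelines run in parallel. A 6:2 split is the ratio quoted above. */
#define XXH3_NEON_LANES 6   /* assumed name; must be even and <= 8 */

static void
XXH3_accumulate_512_hybrid(xxh_u64* XXH_RESTRICT acc,
                           const xxh_u8* XXH_RESTRICT input,
                           const xxh_u8* XXH_RESTRICT secret)
{
    size_t i;

    /* Lanes [0, XXH3_NEON_LANES): handled by a NEON kernel, two lanes per
     * uint64x2_t (omitted here; only the split structure is shown). */

    /* Lanes [XXH3_NEON_LANES, 8): the regular scalar round. */
    for (i = XXH3_NEON_LANES; i < 8; i++) {
        xxh_u64 const data_val = XXH_readLE64(input + 8 * i);
        xxh_u64 const data_key = data_val ^ XXH_readLE64(secret + 8 * i);
        acc[i ^ 1] += data_val;                 /* swap adjacent lanes */
        acc[i]     += XXH_mult32to64(data_key & 0xFFFFFFFF, data_key >> 32);
    }
}
```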
For
For
It only seems to affect AArch64, but XXH3 runs incredibly well with a 6:2 ratio in #632, even (mostly) fixing the lackluster performance from GCC (30% faster, but still slower than Clang lol). XXH64 definitely isn't worth it, especially if it still can't beat XXH32.
Is this topic still active? Should we keep this issue open?
Idea: XXH32 and XXH64 could be enhanced like so:

mem32/mem64 not being unsigned char

Pros:

Cons: