XXH32/XXH64 modernization #607
Comments
We can probably make multiple small steps progressively in this direction. Also, this effort might be partially linked to #550.
(╯°□°)╯︵ ┻━┻
Why is MSVC x86 allergic to unrolling fixed-iteration loops? Edit: Outlining and extracting without rerolling seems to be fine though...
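For concreteness, a "fixed iteration loop" here just means a loop with a compile-time-constant trip count, e.g. the following (a generic illustration, not xxHash code):

```c
/* A generic fixed-trip-count loop of the kind being discussed: the bound is
 * a compile-time constant, so most compilers replace the loop with four
 * copies of the body (unrolling); MSVC targeting x86 reportedly keeps it as
 * an actual loop. */
static unsigned sum4(const unsigned v[4])
{
    unsigned total = 0;
    int i;
    for (i = 0; i < 4; i++) {   /* constant trip count of 4 */
        total += v[i];
    }
    return total;
}
```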
This is what I was thinking. It uses some of the naming styles from XXH3.

```c
/*!
 * @internal
 * @brief Seeds the accumulator lanes for @ref XXH32().
 *
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @param seed The initial seed for the hash.
 */
XXH_FORCE_INLINE void
XXH32_resetAccs(xxh_u32 acc[4], xxh_u32 const seed)
{
    XXH_ASSERT(acc != NULL);
    acc[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2;
    acc[1] = seed + XXH_PRIME32_2;
    acc[2] = seed + 0;
    acc[3] = seed - XXH_PRIME32_1;
}

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH32().
 *
 * @param input, len Directly passed from @ref XXH32(). @p len must be >= 16.
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 16]`
 */
XXH_FORCE_INLINE xxh_u8 const*
XXH32_hashLong(xxh_u8 const* input, size_t len, xxh_u32 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 16;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    do {
        /* Note: MSVC x86 refuses to unroll this automatically. */
        acc[0] = XXH32_round(acc[0], XXH_get32bits(input + 0));
        acc[1] = XXH32_round(acc[1], XXH_get32bits(input + 4));
        acc[2] = XXH32_round(acc[2], XXH_get32bits(input + 8));
        acc[3] = XXH32_round(acc[3], XXH_get32bits(input + 12));
        input += 16;
    } while (--nbBlocks);
    return input;
}

/*!
 * @internal
 * @brief Merges the accumulator lanes to a single value for @ref XXH32()
 *
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @return The merged value
 */
XXH_FORCE_INLINE xxh_u32
XXH32_mergeAccs(xxh_u32 const acc[4])
{
    XXH_ASSERT(acc != NULL);
    return XXH_rotl32(acc[0], 1)  + XXH_rotl32(acc[1], 7)
         + XXH_rotl32(acc[2], 12) + XXH_rotl32(acc[3], 18);
}

/*!
 * @internal
 * @brief The implementation for @ref XXH32().
 *
 * @param input, len, seed Directly passed from @ref XXH32().
 * @param align Whether @p input is aligned.
 * @return The calculated hash.
 */
XXH_FORCE_INLINE xxh_u32
XXH32_endian_align(xxh_u8 const* input, size_t len, xxh_u32 seed, XXH_alignment align)
{
    xxh_u32 h32;
    if (input == NULL) XXH_ASSERT(len == 0);

    if (len >= 16) {
        xxh_u32 acc[4];
        XXH32_resetAccs(acc, seed);
        input = XXH32_hashLong(input, len, acc, align);
        h32 = XXH32_mergeAccs(acc);
    } else {
        h32 = seed + XXH_PRIME32_5;
    }
    h32 += (xxh_u32)len;
    return XXH32_finalize(h32, input, len % 16, align);
}
```
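For reference, the round function these helpers delegate to is unchanged by this refactor; in xxhash.h it looks roughly like this:

```c
/* The existing XXH32 round function (approximately as in xxhash.h); shown
 * only for context, the extraction above calls it unchanged. */
static xxh_u32 XXH32_round(xxh_u32 acc, xxh_u32 input)
{
    acc += input * XXH_PRIME32_2;
    acc  = XXH_rotl32(acc, 13);
    acc *= XXH_PRIME32_1;
    return acc;
}
```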
It looks good to me.
I think for XXH64, we should just use a nested loop for the bulk loop, as long as MSVC x64 unrolls it (but MSVC x64 is more liberal about unrolling anyway). 64-bit arithmetic is going to be hot garbage on MSVC x86 anyway thanks to

Side note: Extracting XXH64's internals in the same way somehow gave a slight boost to ARMv7-A with Clang 13 (1.5 GB/s -> 1.7 GB/s), even though it was inlined and unrolled just like before. 🤔
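A minimal sketch of that rerolled XXH64 bulk loop, assuming the same helper names as the XXH32 code above (XXH64_round, XXH_get64bits; the align parameter is consumed by the XXH_get64bits macro):

```c
/* Sketch: rerolled XXH64 bulk loop. Compilers that unroll fixed-count inner
 * loops (MSVC x64, GCC, Clang) should emit the same code as a hand-unrolled
 * version, with one round per accumulator lane per 32-byte block. */
static xxh_u8 const*
XXH64_hashLong(xxh_u8 const* input, size_t len, xxh_u64 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 32;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    do {
        size_t i;
        for (i = 0; i < 4; i++) {   /* fixed trip count: unrollable */
            acc[i] = XXH64_round(acc[i], XXH_get64bits(input));
            input += 8;
        }
    } while (--nbBlocks);
    return input;
}
```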
- Extract copy/pasted blocks into clean, inlined subroutines.
- Reroll XXH64's internal loop. Compilers unroll this anyways, and it allows the code to be more concise.
  - XXH32 is still unrolled because MSVC x86 refuses to do it automatically. :(
- Apply some stylistic choices from XXH3 (e.g. east const)
- Rename some state fields to match XXH3's state
- Convert the mem32 and mem64 fields to unsigned char arrays.
- Remove some dead macros.

None of these changes should break ABI, since the fields are the same size.
Draft at
I also changed the mem32/mem64 fields to unsigned char arrays, which shouldn't break binary ABI.
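A rough before/after of that buffer change (struct and field names here are illustrative only; the real state structs have more fields):

```c
/* Before (sketch): buffered partial input stored as four 32-bit words. */
typedef struct {
    xxh_u32 acc[4];
    xxh_u32 mem32[4];          /* 16 bytes of buffered input, as words */
    xxh_u32 memsize;
} XXH32_state_sketch_old;

/* After (sketch): the same 16 bytes typed as raw bytes, which is what the
 * streaming update path actually memcpy()s into. Size and offsets are
 * unchanged, so the binary ABI is preserved. */
typedef struct {
    xxh_u32 acc[4];
    unsigned char buffer[16];  /* was mem32[4] */
    xxh_u32 memsize;
} XXH32_state_sketch_new;
```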
Should we remove XXH_OLD_NAMES as well?
Let's plan that for
On a side note, I was toying with a mixed NEON/scalar XXH64. On my Pixel 4a, Clang and GCC get the same 2804 MB/s normally, but with half NEON and half scalar, Clang gets 3156 MB/s and GCC gets 2925 MB/s. Since I already have the code, I might as well make ARMv7-A do full NEON, and that actually gets 2704 MB/s on Clang compared to ~1 GB/s normally. However, the implementation is pretty ugly:

hybrid xxh64 neon:

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
#  define XXH_HAS_NEON
#  if defined(__GNUC__) || defined(__clang__)
#    include <arm_neon.h>
#  else
#    include <intrin.h>
#  endif

XXH_FORCE_INLINE uint64x2_t XXH_neon_mul64(uint32x2x2_t x, uint32x2_t y)
{
    uint64x2_t cross;
    /* grade school truncating multiply */
    cross = vmull_lane_u32(x.val[0], y, 1);
    cross = vmlal_lane_u32(cross, x.val[1], y, 0);
    cross = vshlq_n_u64(cross, 32);
    return vmlal_lane_u32(cross, x.val[0], y, 0);
}
#endif

#if defined(__aarch64__) || defined(__arm64__) || defined(_M_ARM64) || defined(_M_ARM64EC)
/* aarch64 does half NEON and half scalar */
#  define XXH64_SCALAR_ROUNDS 2
#  define XXH64_NEON_ROUNDS 1
#elif defined(XXH_HAS_NEON)
/* armv7-a uses full NEON */
#  define XXH64_SCALAR_ROUNDS 0
#  define XXH64_NEON_ROUNDS 2
#else
/* Everything else uses full scalar */
#  define XXH64_SCALAR_ROUNDS 4
#endif

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH64().
 *
 * @param input, len Directly passed from @ref XXH64(). @p len must be >= 32.
 * @param acc The 4 accumulator lanes from XXH64's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 32]`
 */
static xxh_u8 const*
XXH64_hashLong(xxh_u8 const* input, size_t len, xxh_u64 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 32;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    {
        size_t i;
#ifdef XXH_HAS_NEON
        uint64x2_t accNeon[XXH64_NEON_ROUNDS];
        uint32x2_t const prime2 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_2));
        uint32x2_t const prime1 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_1));
        /* Load NEON lanes */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            accNeon[i] = vld1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i]);
        }
#endif
        do {
            for (i = 0; i < XXH64_SCALAR_ROUNDS; i++) {
                acc[i] = XXH64_round(acc[i], XXH_get64bits(input));
                input += 8;
            }
#ifdef XXH_HAS_NEON
            for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
                /* interleaved load, putting input in place for mul64 */
                uint32x2x2_t pair = vld2_u32((uint32_t const *)input);
                /* input * PRIME64_2 */
                uint64x2_t tmp = XXH_neon_mul64(pair, prime2);
                uint64x2_t xacc = accNeon[i];
                /* acc += input */
                xacc = vaddq_u64(xacc, tmp);
                /* rotl(xacc, 31) >> 32 without dependency */
                pair.val[1] = vshrn_n_u64(xacc, 64 - 31 - 32);
                /* rotl(xacc, 31) */
                tmp = vshlq_n_u64(xacc, 31);
                xacc = vsriq_n_u64(tmp, xacc, 64 - 31);
                /* xacc & 0xFFFFFFFF */
                pair.val[0] = vmovn_u64(xacc);
                /* xacc *= PRIME64_1 */
                accNeon[i] = XXH_neon_mul64(pair, prime1);
                input += 16;
            }
#endif
        } while (--nbBlocks);
#ifdef XXH_HAS_NEON
        /* Store NEON lanes back */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            vst1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i], accNeon[i]);
        }
#endif
    }
    return input;
}
```

Side side note: Would a mixed SIMD/scalar split benefit XXH3 as well? The integer pipeline is basically unused during hashLong, and we might benefit from doing a few lanes scalar.

Edit: Holy shit, it does (at least on aarch64). Doing a 6:2 split on the NEON path on Clang makes it jump from 8.8 GB/s to 10.2 GB/s.
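The shape of that split idea for XXH3 might look something like this (a hypothetical sketch: the XXH3_NEON_LANES macro and the function name are made up for illustration, the NEON half is elided, and only the scalar tail follows XXH3's existing scalar round):

```c
/* Hypothetical sketch of a NEON/scalar lane split for XXH3's accumulate
 * step. XXH3 keeps 8 accumulator lanes; process the first XXH3_NEON_LANES
 * with NEON and the remainder with plain integer code, so the SIMD and
 * scalar pipelines run in parallel. A 6:2 split is the ratio quoted above. */
#define XXH3_NEON_LANES 6   /* assumed name; must be even and <= 8 */

static void
XXH3_accumulate_512_hybrid(xxh_u64* XXH_RESTRICT acc,
                           const xxh_u8* XXH_RESTRICT input,
                           const xxh_u8* XXH_RESTRICT secret)
{
    size_t i;

    /* Lanes [0, XXH3_NEON_LANES): handled by a NEON kernel, two lanes per
     * uint64x2_t (omitted here; only the split structure is shown). */

    /* Lanes [XXH3_NEON_LANES, 8): the regular scalar round. */
    for (i = XXH3_NEON_LANES; i < 8; i++) {
        xxh_u64 const data_val = XXH_readLE64(input + 8 * i);
        xxh_u64 const data_key = data_val ^ XXH_readLE64(secret + 8 * i);
        acc[i ^ 1] += data_val;                 /* swap adjacent lanes */
        acc[i]     += XXH_mult32to64(data_key & 0xFFFFFFFF, data_key >> 32);
    }
}
```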
For
For
It only seems to affect AArch64, but XXH3 runs incredibly well with a 6:2 ratio in #632, even (mostly) fixing the lackluster performance from GCC (30% faster, but still slower than Clang lol). XXH64 definitely isn't worth it, especially if it still can't beat XXH32.
Is this topic still active? Should we keep this issue open?
Idea: XXH32 and XXH64 could be enhanced like so:

mem32/mem64 not being unsigned char

Pros:

Cons: