
XXH32/XXH64 modernization #607

Closed
easyaspi314 opened this issue Nov 27, 2021 · 12 comments

@easyaspi314
Contributor

Idea: XXH32 and XXH64 could be enhanced like so:

  • Use an array for the single-shot code as well (see the sketch after this list)
  • Use small, fixed iteration loops instead of manual unrolling
  • Extract common code into smaller sub-functions
  • Fix some extra cruft in the state like mem32/mem64 not being unsigned char

Pros:

  • Follows the "trust the compiler by default" design of XXH3.
    • Modern GCC/Clang can pick up on the patterns and appear to emit code of equivalent speed
  • More readable (and we can rename variables to be more clear)
  • Better for size optimization
  • If we keep the function signatures the same it shouldn't break ABI

Cons:

  • Might be slower on dumb compilers
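
For reference, a rough sketch of what the first bullet means (variable names are illustrative; the "before" form is roughly the 0.8.1-era single-shot path):

/* Before: four named locals, which forces manually unrolled rounds */
xxh_u32 v1 = seed + XXH_PRIME32_1 + XXH_PRIME32_2;
xxh_u32 v2 = seed + XXH_PRIME32_2;
xxh_u32 v3 = seed + 0;
xxh_u32 v4 = seed - XXH_PRIME32_1;

/* After: one array, indexable from small shared subroutines and loops */
xxh_u32 acc[4];
acc[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2;
acc[1] = seed + XXH_PRIME32_2;
acc[2] = seed + 0;
acc[3] = seed - XXH_PRIME32_1;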
@Cyan4973
Owner

Cyan4973 commented Nov 27, 2021

We can probably make multiple small steps progressively in this direction.
Performance matters, since it's an essential property attached to xxhash.
I'm sure there are several improvements or code simplifications that wouldn't impact performance, or only barely,
but if a "dumb" compiler, say MSVC /O2, sees significant speed regressions, that matters too.
So, as usual, it's a matter of balance.

Also, this effort might be partially linked to #550 .

@easyaspi314
Contributor Author

easyaspi314 commented Nov 29, 2021

but if a "dumb" compiler, say MSVC /O2, sees significant speed regressions...

(╯°□°)╯︵ ┻━┻

C:\code\xxhash> xxhsum.exe -b1
xxhsum.exe 0.8.1 by Yann Collet
compiled as 32-bit i386 + SSE2 little endian with MSVC 19.29.30137.00
Sample of 100 KB...
 1#XXH32      :     102400 ->    48731  it/s ( 4758.9 MB/s)
C:\code\xxhash> xxhsum-outline-reroll.exe -b1
xxhsum-outline-reroll.exe 0.8.1 by Yann Collet
compiled as 32-bit i386 + SSE2 little endian with MSVC 19.29.30137.00
Sample of 100 KB...
 1#XXH32      :     102400 ->    26091  it/s ( 2548.0 MB/s)

Why is MSVC x86 allergic to unrolling fixed-iteration loops?

Edit: Outlining and extracting without rerolling seems to be fine though...
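
For context, the rerolled inner loop being benchmarked above presumably looks something like this (a reconstruction from the discussion, not the exact branch contents); GCC/Clang unroll the fixed-iteration loop, while MSVC x86 keeps it as an actual loop:

do {
    size_t i;
    for (i = 0; i < 4; i++) {
        acc[i] = XXH32_round(acc[i], XXH_get32bits(input));
        input += 4;
    }
} while (--nbBlocks);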

@easyaspi314
Contributor Author

easyaspi314 commented Nov 29, 2021

This is what I was thinking. It uses some of the naming styles from XXH3.

/*!
 * @internal
 * @brief Seeds the accumulator lanes for @ref XXH32().
 *
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @param seed The initial seed for the hash.
 */
XXH_FORCE_INLINE void
XXH32_resetAccs(xxh_u32 acc[4], xxh_u32 const seed)
{
    XXH_ASSERT(acc != NULL);
    acc[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2;
    acc[1] = seed + XXH_PRIME32_2;
    acc[2] = seed + 0;
    acc[3] = seed - XXH_PRIME32_1;
}

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH32().
 *
 * @param input, len Directly passed from @ref XXH32(). @p len must be >= 16.
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 16]`
 */
XXH_FORCE_INLINE xxh_u8 const*
XXH32_hashLong(xxh_u8 const* input, size_t len, xxh_u32 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 16;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    do {
        /* Note: MSVC x86 refuses to unroll this automatically. */
        acc[0] = XXH32_round(acc[0], XXH_get32bits(input +  0));
        acc[1] = XXH32_round(acc[1], XXH_get32bits(input +  4));
        acc[2] = XXH32_round(acc[2], XXH_get32bits(input +  8));
        acc[3] = XXH32_round(acc[3], XXH_get32bits(input + 12));
        input += 16;
    } while (--nbBlocks);
    return input;
}

/*!
 * @internal
 * @brief Merges the accumulator lanes to a single value for @ref XXH32()
 *
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @return The merged value
 */
XXH_FORCE_INLINE xxh_u32
XXH32_mergeAccs(xxh_u32 const acc[4])
{
    XXH_ASSERT(acc != NULL);
    return XXH_rotl32(acc[0],  1) + XXH_rotl32(acc[1],  7)
         + XXH_rotl32(acc[2], 12) + XXH_rotl32(acc[3], 18);
}

/*!
 * @internal
 * @brief The implementation for @ref XXH32().
 *
 * @param input, len, seed Directly passed from @ref XXH32().
 * @param align Whether @p input is aligned.
 * @return The calculated hash.
 */
XXH_FORCE_INLINE xxh_u32
XXH32_endian_align(xxh_u8 const* input, size_t len, xxh_u32 seed, XXH_alignment align)
{
    xxh_u32 h32;

    if (input == NULL) XXH_ASSERT(len == 0);

    if (len >= 16) {
        xxh_u32 acc[4];
        XXH32_resetAccs(acc, seed);
        input = XXH32_hashLong(input, len, acc, align);
        h32 = XXH32_mergeAccs(acc);
    } else {
        h32  = seed + XXH_PRIME32_5;
    }

    h32 += (xxh_u32)len;

    return XXH32_finalize(h32, input, len % 16, align);
}
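
Since XXH32_endian_align keeps its signature, the public entry point does not need to change. It boils down to something like this (simplified; the real XXH32() in xxhash.h may also try an XXH_aligned variant when XXH_FORCE_ALIGN_CHECK is enabled and the pointer is 4-byte aligned):

XXH_PUBLIC_API XXH32_hash_t XXH32(const void* input, size_t len, XXH32_hash_t seed)
{
    return XXH32_endian_align((const xxh_u8*)input, len, seed, XXH_unaligned);
}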

@Cyan4973
Owner

It looks good to me

@easyaspi314
Contributor Author

I think for XXH64, we should just use a nested loop for the bulk loop, as long as MSVC x64 unrolls it (but MSVC x64 is more liberal in unrolling anyways).

64-bit arithmetic is going to be hot garbage on MSVC x86 anyways thanks to _allmul calls, and GCC and Clang know how to unroll it.
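
A sketch of that rerolled XXH64 bulk loop, reusing the existing XXH64_round and XXH_get64bits helpers (names follow the XXH32 draft above; this is an illustration, not the branch contents):

do {
    size_t i;
    for (i = 0; i < 4; i++) {
        acc[i] = XXH64_round(acc[i], XXH_get64bits(input));
        input += 8;
    }
} while (--nbBlocks);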

Side note: Extracting XXH64's internals in the same way somehow gave a slight boost to ARMv7-a with Clang 13 (1.5GB/s -> 1.7GB/s), even though it was inlined and unrolled just like before. 🤔

easyaspi314 added a commit to easyaspi314/xxHash that referenced this issue Nov 29, 2021
 - Extract copy/pasted blocks into clean, inlined subroutines.
 - Reroll XXH64's internal loop. Compilers unroll this anyways, and it allows the code to be more concise.
   - XXH32 is still unrolled because MSVC x86 refuses to do it automatically. :(
 - Apply some stylistic choices from XXH3 (e.g. east const)
 - Rename some state fields to match XXH3's state
 - Convert the mem32 and mem64 fields to unsigned char arrays.
 - Remove some dead macros.

None of these changes should break ABI, since the fields are the same size.
@easyaspi314
Contributor Author

easyaspi314 commented Nov 29, 2021

Draft at easyaspi314:modern_xxh32_xxh64. I will make a PR once I do some benchmarking.

I also changed the mem32/mem64 fields to unsigned char arrays, which shouldn't break the ABI.
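
A compile-time way to convince yourself of the ABI claim (illustrative only; xxh_u32_sketch stands in for the real XXH32_hash_t):

#include <stdint.h>
typedef uint32_t xxh_u32_sketch;  /* stand-in for XXH32_hash_t */
/* old field: xxh_u32 mem32[4];   new field: unsigned char mem32[16]; */
typedef char xxh_mem32_size_check[
    sizeof(xxh_u32_sketch[4]) == sizeof(unsigned char[16]) ? 1 : -1];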

@easyaspi314
Contributor Author

Should we remove XXH_OLD_NAMES as well?

@Cyan4973
Owner

Cyan4973 commented Nov 30, 2021

Should we remove XXH_OLD_NAMES as well?

Let's plan that for v0.9.0

@easyaspi314
Contributor Author

easyaspi314 commented Nov 30, 2021

On a side note, I was toying with a mixed NEON/scalar XXH64.

On my Pixel 4a, clang and GCC get the same 2804 MB/s normally, but with half NEON and half scalar, Clang gets 3156 MB/s and GCC gets 2925 MB/s.

Since I already have the code I might as well make ARMv7-A do full NEON, and that actually gets 2704 MB/s on Clang compared to ~1GB/s normally.

However, the implementation is pretty ugly:

Hybrid XXH64 NEON:
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
#  define XXH_HAS_NEON
#  if defined(__GNUC__) || defined(__clang__)
#     include <arm_neon.h>
#  else
#     include <intrin.h>
#  endif
XXH_FORCE_INLINE uint64x2_t XXH_neon_mul64(uint32x2x2_t x, uint32x2_t y)
{
    uint64x2_t cross;
    /* grade school truncating multiply */
    cross = vmull_lane_u32(x.val[0], y, 1);
    cross = vmlal_lane_u32(cross, x.val[1], y, 0);
    cross = vshlq_n_u64(cross, 32);
    return vmlal_lane_u32(cross, x.val[0], y, 0);
}
#endif
#if defined(__aarch64__) || defined(__arm64__) || defined(_M_ARM64) || defined(_M_ARM64EC)
/* aarch64 does half NEON and half scalar */
#  define XXH64_SCALAR_ROUNDS 2
#  define XXH64_NEON_ROUNDS 1
#elif defined(XXH_HAS_NEON)
/* armv7-a uses full NEON */
#  define XXH64_SCALAR_ROUNDS 0
#  define XXH64_NEON_ROUNDS 2
#else
/* Everything else uses full scalar */
#  define XXH64_SCALAR_ROUNDS 4
#endif

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH64().
 *
 * @param input, len Directly passed from @ref XXH64(). @p len must be >= 16.
 * @param acc The 4 accumulator lanes from XXH64's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 32]`
 */
static xxh_u8 const*
XXH64_hashLong(xxh_u8 const* input, size_t len, xxh_u64 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 32;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    {
        size_t i;
#ifdef XXH_HAS_NEON
        uint64x2_t accNeon[XXH64_NEON_ROUNDS];
        uint32x2_t const prime2 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_2));
        uint32x2_t const prime1 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_1));
        /* Load NEON lanes */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            accNeon[i] = vld1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i]);
        }
#endif
        do {
            for (i = 0; i < XXH64_SCALAR_ROUNDS; i++) {
                acc[i] = XXH64_round(acc[i], XXH_get64bits(input));
                input += 8;
            }
#ifdef XXH_HAS_NEON
            for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
                /* interleaved load, putting input in place for mul64 */
                uint32x2x2_t pair = vld2_u32((uint32_t const *)input);
                /* input * PRIME64_2 */
                uint64x2_t tmp = XXH_neon_mul64(pair, prime2);
                uint64x2_t xacc = accNeon[i];
                /* acc += input */
                xacc = vaddq_u64(xacc, tmp);
                /* rotl(xacc, 31) >> 32 without dependency */
                pair.val[1] = vshrn_n_u64(xacc, 64 - 31 - 32);
                /* rotl(xacc, 31) */
                tmp = vshlq_n_u64(xacc, 31);
                xacc = vsriq_n_u64(tmp, xacc, 64 - 31);
                /* xacc & 0xFFFFFFFF */
                pair.val[0] = vmovn_u64(xacc);
                /* xacc *= PRIME64_1 */
                accNeon[i] = XXH_neon_mul64(pair, prime1);
                input += 16;
            }
#endif
        } while (--nbBlocks);
#ifdef XXH_HAS_NEON
        /* Store NEON lanes back */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            vst1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i], accNeon[i]);
        }
#endif
    }
    return input;
}

Side side note: Would a mixed SIMD/scalar benefit XXH3 as well? The integer pipeline is basically unused during hashLong, and we might benefit from doing a few lanes scalar.

Edit: Holy shit, it does (at least on aarch64). Doing a 6:2 split on the NEON path on clang makes it jump from 8.8 GB/s to 10.2 GB/s.
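
For what it's worth, the split can be sketched structurally like this (heavily simplified: the per-lane work is a placeholder add rather than XXH3's real round, and NEON_LANES is an illustrative constant; the shipped XXH3 later grew a similar tuning macro, XXH3_NEON_LANES):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

#define NEON_LANES 6   /* lanes 0..5 on the vector pipe, lanes 6..7 on the integer pipe */

static void accumulate_split(uint64_t acc[8], uint64_t const in[8])
{
    size_t i;
    for (i = 0; i < NEON_LANES; i += 2) {      /* NEON handles two lanes per iteration */
        uint64x2_t a = vld1q_u64(acc + i);
        uint64x2_t d = vld1q_u64(in + i);
        vst1q_u64(acc + i, vaddq_u64(a, d));   /* placeholder for the real round */
    }
    for (i = NEON_LANES; i < 8; i++) {         /* remaining lanes stay on the scalar pipe */
        acc[i] += in[i];                       /* placeholder for the real round */
    }
}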

@Cyan4973
Owner

Cyan4973 commented Dec 1, 2021

For XXH64, I would rather preserve code simplicity; the very minor performance difference doesn't seem worth it.

For XXH3 on the other hand, since we already manage multiple specialized code paths, a ~+15% performance increase is definitely large enough to justify updating the aarch64 implementation. A more complex bonus question though: will it be beneficial (to varying degrees) on all aarch64 chips, or beneficial for some and detrimental for others? Difficult to tell.

@easyaspi314
Contributor Author

easyaspi314 commented Dec 1, 2021

It only seems to affect AArch64, but XXH3 runs incredibly well with a 6:2 ratio in #632, even (mostly) fixing the lackluster performance from GCC (30% faster, but still slower than Clang lol).

XXH64 definitely isn't worth it, especially if it still can't beat XXH32.

@Cyan4973
Owner

Cyan4973 commented May 6, 2022

Is this topic still active? Should we keep this issue open?
Referring to the XXH32/XXH64 modernization effort in the title, not the later topics appearing in the thread.
