feat: Implement more efficient version of xxhash64 #575
Conversation
@parthchandra @advancedxy This is ready for review now
NOTICE.txt (Outdated)

```text
This product includes software from the twox-hash project
 * Copyright https://github.com/shepmaster/twox-hash
 * Licensed under the MIT License;
```
Hmm, so does this mean that Comet would be dual licensed? I'm not sure about the legal part, especially the Copyright GitHub URL part...
Apache licensed projects can include MIT licensed software without being MIT licensed. Apache Arrow already does this, for example.
I copied the Copyright URL part from Apache Arrow as well (https://github.com/apache/arrow/blob/main/NOTICE.txt)
It would be good to get another opinion on this though. Perhaps @alamb could offer some thoughts.
Thanks @andygrove. The code looks correct to me and the performance improvement is exciting. I left a few comments on style issues.
```rust
        offset_u64_4 += 1;
    }
}
let total_len = data.len() as u64;
```
This seems to be a duplicate calculation, since `length_bytes` has already been computed. I think this is purely to cast to `u64`? How about `let total_len = length_bytes as u64;`?
```rust
pub const PRIME_1: u64 = 11_400_714_785_074_694_791;
pub const PRIME_2: u64 = 14_029_467_366_897_019_727;
pub const PRIME_3: u64 = 1_609_587_929_392_839_161;
pub const PRIME_4: u64 = 9_650_029_242_287_828_579;
pub const PRIME_5: u64 = 2_870_177_450_012_600_261;
```
FYI, Spark also has its own variant of XXHash64; see org.apache.spark.sql.catalyst.expressions.XXH64. I checked all the steps in your PR, and they should match both twox-hash and Spark's XXHash64 process, if I'm not mistaken. That's great, and we should be good to go. Of course, it's always good to have more eyes on this.
I have one small nit about this function: it has grown quite large. It would be better if we grouped the XXHash64-related functions and constants into a separate file and split them into smaller functions. I believe that would help with understanding and maintenance.
Thanks for the review @advancedxy. I moved xxhash64 into its own file under execution.datafusion.expressions, which also makes sense because this is a regular SQL function that users can call.
lgtm, pending ci passes
I think deriving code from MIT licensed code is fine, and consistent with how the Arrow project does things: https://github.com/apache/arrow/blob/main/LICENSE.txt
FYI @shepmaster
@parthchandra @kazuyukitanimura could I get a committer review?
I read it and compared it with https://github.com/shepmaster/twox-hash/blob/5f3f3a345be5f65680c5f2b9ed5950c85e9a0ccf/src/sixty_four.rs#L27. It looks like the same logic. The random comparison test also passes.
The improvement is nice. 💯
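A randomized comparison test of this kind might look roughly like the sketch below; `spark_compatible_xxhash64` is the function name used in the benchmark code later in this thread, and the iteration count, lengths, and seed are illustrative assumptions, not the PR's actual test:

```rust
use rand::{Rng, RngCore, SeedableRng};
use std::hash::Hasher;
use twox_hash::XxHash64;

// Hash random inputs with random seeds and check the one-shot
// implementation against the twox-hash streaming implementation.
#[test]
fn random_comparison_against_twox_hash() {
    let mut rng = rand::rngs::StdRng::seed_from_u64(42); // seed is arbitrary
    for _ in 0..1_000 {
        let len = rng.gen_range(0..1024);
        let mut data = vec![0u8; len];
        rng.fill_bytes(&mut data);
        let seed: u64 = rng.gen();

        let mut reference = XxHash64::with_seed(seed);
        reference.write(&data);

        assert_eq!(spark_compatible_xxhash64(&data, seed), reference.finish());
    }
}
```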
Just some comments on the license.
```diff
@@ -210,3 +211,26 @@ This project includes code from Apache Aurora.
 Copyright: 2016 The Apache Software Foundation.
 Home page: https://aurora.apache.org/
 License: http://www.apache.org/licenses/LICENSE-2.0

+--------------------------------------------------------------------------------
```
Looks like arrow-rs mentions the file names. Should we mention `core/src/execution/datafusion/expressions/xxhash64.rs` here?
This looks good to me.
I am not sure we have exactly the same implementation as the original (or the same as Spark, for that matter). If the fuzz tests pass, I have no concerns.
```rust
let ptr_u64 = data.as_ptr() as *const u64;
unsafe {
    while offset_u64_4 * CHUNK_SIZE + CHUNK_SIZE <= length_bytes {
        v1 = ingest_one_number(v1, ptr_u64.add(offset_u64_4 * 4).read_unaligned().to_le());
```
Does this produce the right result? The original implementation of XXHash64 processes 64 bits at a time.
This is also processing 64 bits at a time. We are calling `read_unaligned()` on a `ptr_u64`.
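A minimal standalone sketch of what that pointer read does, with an illustrative buffer (not code from the PR):

```rust
// Read one 64-bit lane from a byte buffer, regardless of alignment, and
// normalize it to little-endian byte order as xxhash64 expects.
let data: [u8; 8] = [1, 2, 3, 4, 5, 6, 7, 8];
let ptr_u64 = data.as_ptr() as *const u64;
let lane = unsafe { ptr_u64.read_unaligned() }.to_le();
assert_eq!(lane, u64::from_le_bytes(data));
```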
```rust
#[inline(always)]
fn ingest_one_number(mut current_value: u64, mut value: u64) -> u64 {
    value = value.wrapping_mul(PRIME_2);
```
Should this be `current_value = value.wrapping_mul(PRIME_2)`?
Oh, nvm. I was reading this wrong.
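For context, the quoted diff shows only the first line of the function; the full XXH64 round, reconstructed as a sketch from the reference steps quoted later in this thread (not the PR's exact code):

```rust
#[inline(always)]
fn ingest_one_number(mut current_value: u64, mut value: u64) -> u64 {
    // acc += input * XXH_PRIME64_2
    value = value.wrapping_mul(PRIME_2);
    current_value = current_value.wrapping_add(value);
    // acc = XXH_rotl64(acc, 31)
    current_value = current_value.rotate_left(31);
    // acc *= XXH_PRIME64_1
    current_value.wrapping_mul(PRIME_1)
}
```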
```rust
}

#[inline(always)]
fn mix_one(mut hash: u64, mut value: u64) -> u64 {
```
I know this is based on twox-hash, which also does what you're doing here. However, this looks like the XXH64_mergeRound function, which calls XXH64_round and is slightly different. XXH64_round seems to be ingest_one_number.
I think the logic looks the same? XXH64_mergeRound calls XXH64_round, which does this:

```c
acc += input * XXH_PRIME64_2;
acc = XXH_rotl64(acc, 31);
acc *= XXH_PRIME64_1;
```

Then XXH64_mergeRound does this:

```c
acc ^= val;
acc = acc * XXH_PRIME64_1 + XXH_PRIME64_4;
```
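Putting those two reference steps together, a sketch of how a twox-hash-style `mix_one` covers both (reconstructed from the C steps above, using the `PRIME_*` constants quoted earlier; not copied from the PR):

```rust
#[inline(always)]
fn mix_one(mut hash: u64, mut value: u64) -> u64 {
    // XXH64_round applied to the lane, with an accumulator of zero:
    value = value.wrapping_mul(PRIME_2);
    value = value.rotate_left(31);
    value = value.wrapping_mul(PRIME_1);
    // The merge step: acc ^= val; acc = acc * PRIME64_1 + PRIME64_4
    hash ^= value;
    hash.wrapping_mul(PRIME_1).wrapping_add(PRIME_4)
}
```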
```rust
const CHUNK_SIZE: usize = 32;

const PRIME_1: u64 = 11_400_714_785_074_694_791;
```
Just curious: is this the preferred way to write these, or would hex be preferred? (One is just as unreadable as the other.)
🤷♂️ I copied this directly from xxhash64. I think either style works in this case.
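For reference, the same constants written in hexadecimal, as they appear in the xxHash reference header:

```rust
pub const PRIME_1: u64 = 0x9E37_79B1_85EB_CA87; // 11_400_714_785_074_694_791
pub const PRIME_2: u64 = 0xC2B2_AE3D_27D4_EB4F; // 14_029_467_366_897_019_727
pub const PRIME_3: u64 = 0x1656_67B1_9E37_79F9; // 1_609_587_929_392_839_161
pub const PRIME_4: u64 = 0x85EB_CA77_C2B2_AE63; // 9_650_029_242_287_828_579
pub const PRIME_5: u64 = 0x27D4_EB2F_1656_67C5; // 2_870_177_450_012_600_261
```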
Are your benchmarks available for me to try out? I actually reimplemented the algorithm from scratch to see if there were any changes I would make in today's Rust. When I benchmarked it, I didn't see much difference compared to my existing implementation, but the numbers do noticeably differ from yours.
My guess is that this is hashing a buffer of 8192 bytes, but that would imply ~9 MiB/s, which would be really slow. Maybe it's some other unit? I grabbed the implementation code from this PR and popped it into my benchmark and got:
This is on a MacBook Pro M1 Max, so if anything I'd expect it to be slower than your test machine.

Benchmark code:

```rust
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use rand::{Rng, RngCore, SeedableRng};
use std::hash::Hasher;
use std::hint::black_box;
use twox_hash::XxHash64;

fn head_to_head(c: &mut Criterion) {
    let (seed, data) = gen_data();

    let mut g = c.benchmark_group("head-to-head");

    for size in [8192] {
        let data = &data[..size];
        g.throughput(Throughput::Bytes(data.len() as _));

        let id = format!("twox-hash/{size}");
        g.bench_function(id, |b| {
            b.iter(|| {
                let hash = {
                    let mut hasher = XxHash64::with_seed(seed);
                    hasher.write(&data);
                    hasher.finish()
                };
                black_box(hash);
            })
        });

        let id = format!("spark/{size}");
        g.bench_function(id, |b| {
            b.iter(|| {
                let hash = extracto::spark_compatible_xxhash64(&data, seed);
                black_box(hash);
            })
        });
    }

    g.finish();
}

const SEED: u64 = 0xc651_4843_1995_363f;
const DATA_SIZE: usize = 100 * 1024 * 1024;

fn gen_data() -> (u64, Vec<u8>) {
    let mut rng = rand::rngs::StdRng::seed_from_u64(SEED);
    let seed = rng.gen();
    let mut data = vec![0; DATA_SIZE];
    rng.fill_bytes(&mut data);
    (seed, data)
}

criterion_group!(benches, head_to_head);
criterion_main!(benches);
```
I think the code is available in core/benches/hash.rs.
Thanks for spending time looking at this @shepmaster. I just ran the benchmarks again using the following steps and still see the same results I initially posted here.
Results from step 2:
Results from step 6:
Thank you! I was able to run the benchmarks and tweak my local version of the code to match the performance of your PR. I'll work on rolling that into twox-hash proper and release a new version at some point. Thanks for the fresh perspective on how to improve performance!
Thanks for all the reviews @viirya @parthchandra @kazuyukitanimura @advancedxy. I will go ahead and merge this now, but let me know if you have any more comments or questions.
* reimplement xxhash64
* move twox_hash to build dep
* bug fix
* more tests
* clippy
* bug fix
* attribution
* improve test
* bug fix
* test with random seed
* remove comment
* more updated to license/notice
* remove redundant variable
* refactor to move xxhash64 into separate file
* refactor
* add copyright
Which issue does this PR close?
Closes #547
Rationale for this change
The twox-hash crate seems to be designed for use cases such as processing files, where it is important to process data in chunks rather than load it fully into memory first. In our use case, the data is already in memory, so we can optimize for that.
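To illustrate the difference in API shape (the one-shot signature below is taken from the benchmark code earlier in this thread; the function bodies are a sketch, not the PR's code):

```rust
use std::hash::Hasher;
use twox_hash::XxHash64;

// Streaming API (twox-hash): state persists across write() calls, so the
// hasher must buffer partial 32-byte stripes internally between calls.
fn hash_streaming(chunks: &[&[u8]], seed: u64) -> u64 {
    let mut hasher = XxHash64::with_seed(seed);
    for chunk in chunks {
        hasher.write(chunk);
    }
    hasher.finish()
}

// One-shot shape used by this PR: the whole input is already in memory,
// so the main loop can walk the slice directly with no buffering.
// fn spark_compatible_xxhash64(data: &[u8], seed: u64) -> u64 { ... }
```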
Before
Running on MacBook Pro M3 Max.
After
What changes are included in this PR?
How are these changes tested?