Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement more efficient version of xxhash64 #575

Merged
merged 18 commits into from
Jun 25, 2024

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jun 14, 2024

Which issue does this PR close?

Closes #547

Rationale for this change

The twox-hash crate seems to be designed for use cases such as processing files where it is important to process data in chunks and not need to load data fully into memory first. In our use case, our data is already in memory, so we can optimize for that.

Before

Running on MacBook Pro M3 Max.

hash/xxhash64/8192      time:   [892.53 µs 893.12 µs 893.85 µs]

After

hash/xxhash64/8192      time:   [209.25 µs 209.50 µs 209.78 µs]
                        change: [-76.607% -76.579% -76.552%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

  • Implement simpler version of xxhash64

How are these changes tested?

@andygrove andygrove changed the title Implement more efficient version of xxhash64 feat: Implement more efficient version of xxhash64 Jun 14, 2024
@andygrove andygrove marked this pull request as ready for review June 18, 2024 16:08
@andygrove
Copy link
Member Author

@parthchandra @advancedxy This is ready for review now

NOTICE.txt Outdated
Comment on lines 1 to 3
This product includes software from the twox-hash project
* Copyright https://github.com/shepmaster/twox-hash
* Licensed under the MIT License;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm so does this mean that comet would be dual licensed?

Not sure the legal part... Especially the Copyright github url part...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache licensed projects can include MIT licensed software without being MIT licensed. Apache Arrow already does this, for example.

I copied the Copyright URL part from Apache Arrow as well (https://github.com/apache/arrow/blob/main/NOTICE.txt)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to get another opinion on this though. Perhaps @alamb could offer some thoughts.

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove. The code looks correct to me and the performance improvement is exciting. Left some style issue comment.

offset_u64_4 += 1;
}
}
let total_len = data.len() as u64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a duplicate calculation as length bytes has already been calculated. I think this is purely to cast to u64?
How about let total_len = length_bytes as u64;?

Comment on lines 103 to 107
pub const PRIME_1: u64 = 11_400_714_785_074_694_791;
pub const PRIME_2: u64 = 14_029_467_366_897_019_727;
pub const PRIME_3: u64 = 1_609_587_929_392_839_161;
pub const PRIME_4: u64 = 9_650_029_242_287_828_579;
pub const PRIME_5: u64 = 2_870_177_450_012_600_261;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, Spark also has its own variant of XXHash64, see org.apache.spark.sql.catalyst.expressions.XXH64.

I checked all the steps in your pr, they should be the same as twox-hash or Spark's XXHash's process if I'm not wrong. That's great and we should be good to go. Of course it's always good to have more eyes on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one small nit about this function, it gets quite big now. It would be better if we can grouped the XXHash64 related functions and constants into a separate file and divide that into small functions. I believe that would be helpful for understanding and maintaining.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @advancedxy. I moved xxhash64 into it's own file under execution.datafusion.expressions which makes sense also because this is a regular SQL function that users can call.

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, pending ci passes

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think deriving code from MIT licensed code is fine, and consistent with how the Arrow project does things: https://github.com/apache/arrow/blob/main/LICENSE.txt

FYI @shepmaster

@andygrove
Copy link
Member Author

@parthchandra @kazuyukitanimura could I get a commiter review?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read it and compare with https://github.com/shepmaster/twox-hash/blob/5f3f3a345be5f65680c5f2b9ed5950c85e9a0ccf/src/sixty_four.rs#L27. Looks the same logic. The random comparison test is also passed.

@viirya
Copy link
Member

viirya commented Jun 20, 2024

The improvement is nice. 💯

Copy link
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some comments on license.

@@ -210,3 +211,26 @@ This project includes code from Apache Aurora.
Copyright: 2016 The Apache Software Foundation.
Home page: https://aurora.apache.org/
License: http://www.apache.org/licenses/LICENSE-2.0

--------------------------------------------------------------------------------

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like arrow-rs mentions the file names. Should we mention core/src/execution/datafusion/expressions/xxhash64.rs here?

LICENSE.txt Show resolved Hide resolved
Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.
I am not sure we have exactly the same implementation as the original (or the same as Spark, for that matter). If the fuzz tests pass I have no concerns.

let ptr_u64 = data.as_ptr() as *const u64;
unsafe {
while offset_u64_4 * CHUNK_SIZE + CHUNK_SIZE <= length_bytes {
v1 = ingest_one_number(v1, ptr_u64.add(offset_u64_4 * 4).read_unaligned().to_le());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this produce the right result? The original implementation of XXHash64 processes 64 bits at a time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also processing 64 bits at a time. We are calling read_unaligned() on a ptr_u64.


#[inline(always)]
fn ingest_one_number(mut current_value: u64, mut value: u64) -> u64 {
value = value.wrapping_mul(PRIME_2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be current_value = value.wrapping_mul(PRIME_2) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nvm. I was reading this wrong.

}

#[inline(always)]
fn mix_one(mut hash: u64, mut value: u64) -> u64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is based on twox-hash which also does what you're doing here. However, this looks like this is the XXH64_mergeRound function which calls XXH64_Round which is slightly different.
XXH64_Round seems to be ingest_one_number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic looks the same?

XXH64_mergeRound calles XXH64_Round which does this:

acc += input * XXH_PRIME64_2;
acc  = XXH_rotl64(acc, 31);
acc *= XXH_PRIME64_1;

Then XXH64_mergeRound does this:

acc ^= val;
acc  = acc * XXH_PRIME64_1 + XXH_PRIME64_4;


const CHUNK_SIZE: usize = 32;

const PRIME_1: u64 = 11_400_714_785_074_694_791;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is this the preferred way to write these or would hex be preferred (one is just as unreadable as the other).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂️ I copied this directly from xxhash64. I think either style works in this case

@shepmaster
Copy link

Are your benchmarks available for me to try out? I actually reimplemented the algorithm from scratch to see if there were any changes I would make in today's Rust. When I benchmarked it, I didn't have much difference compared to my existing implementation, but the numbers do noticeably differ from yours.

Before

Running on MacBook Pro M3 Max.

hash/xxhash64/8192      time:   [892.53 µs 893.12 µs 893.85 µs]

My guess is that this is hashing a buffer of 8192 bytes, but that would result in ~9 MiB/s which would be really slow. Maybe it's some other unit?

I grabbed the implementation code from this PR and popped it into my benchmark and got:

head-to-head/twox-hash/8192
                        time:   [565.59 ns 567.14 ns 569.04 ns]
                        thrpt:  [13.408 GiB/s 13.452 GiB/s 13.489 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

head-to-head/spark/8192 time:   [565.39 ns 566.90 ns 568.63 ns]
                        thrpt:  [13.417 GiB/s 13.458 GiB/s 13.494 GiB/s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

This is on a MacBook Pro M1 Max, so if anything I'd expect it to be slower than your test machine.

Benchmark code
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use rand::{Rng, RngCore, SeedableRng};
use std::hash::Hasher;
use std::hint::black_box;
use twox_hash::XxHash64;

fn head_to_head(c: &mut Criterion) {
    let (seed, data) = gen_data();
    let mut g = c.benchmark_group("head-to-head");

    for size in [8192] {
        let data = &data[..size];
        g.throughput(Throughput::Bytes(data.len() as _));

        let id = format!("twox-hash/{size}");
        g.bench_function(id, |b| {
            b.iter(|| {
                let hash = {
                    let mut hasher = XxHash64::with_seed(seed);
                    hasher.write(&data);
                    hasher.finish()
                };
                black_box(hash);
            })
        });

        let id = format!("spark/{size}");
        g.bench_function(id, |b| {
            b.iter(|| {
                let hash = extracto::spark_compatible_xxhash64(&data, seed);
                black_box(hash);
            })
        });
    }

    g.finish();
}

const SEED: u64 = 0xc651_4843_1995_363f;
const DATA_SIZE: usize = 100 * 1024 * 1024;

fn gen_data() -> (u64, Vec<u8>) {
    let mut rng = rand::rngs::StdRng::seed_from_u64(SEED);

    let seed = rng.gen();

    let mut data = vec![0; DATA_SIZE];
    rng.fill_bytes(&mut data);

    (seed, data)
}

criterion_group!(benches, head_to_head);
criterion_main!(benches);

@advancedxy
Copy link
Contributor

Are your benchmarks available for me to try out?

I think the code is available in core/benches/hash.rs.

@andygrove
Copy link
Member Author

Thanks for spending time looking at this @shepmaster. I just ran the benchmarks again using the following steps and still see the same results I initially posted here.

  1. checkout andygrove:xxhash64-stateless (this PR)
  2. run cargo bench xxhash64
  3. take backup of core/benches/hash.rs
  4. checkout main branch
  5. restore backup of core/benches/hash.rs
  6. run cargo bench xxhash64

Results from step 2:

hash/xxhash64/8192      time:   [210.40 µs 210.55 µs 210.72 µs]
                        change: [-0.1589% -0.0179% +0.1328%] (p = 0.81 > 0.05)
                        No change in performance detected.

Results from step 6:

hash/xxhash64/8192      time:   [891.98 µs 894.49 µs 898.04 µs]
                        change: [+323.31% +324.24% +325.23%] (p = 0.00 < 0.05)
                        Performance has regressed.

@shepmaster
Copy link

Thank you! I was able to run the benchmarks and tweak my local version of the code to match the performance of your PR. I'll work on rolling that into twox-hash proper and release a new version at some point. Thanks for the fresh perspective on how to improve performance!

@andygrove
Copy link
Member Author

Thanks for all the reviews @viirya @parthchandra @kazuyukitanimura @advancedxy. I will go ahead and merge this now but let me know if you have any more comments or questions.

@andygrove andygrove merged commit 91b14ef into apache:main Jun 25, 2024
66 checks passed
@andygrove andygrove deleted the xxhash64-stateless branch June 25, 2024 23:32
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* reimplement xxhash64

* move twox_hash to build dep

* bug fix

* more tests

* clippy

* bug fix

* attribution

* improve test

* bug fix

* test with random seed

* remove comment

* more updated to license/notice

* remove redundant variable

* refactor to move xxhash64 into separate file

* refactor

* add copyright
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid stack allocation in xxhash64
7 participants