feat: Implement more efficient version of xxhash64 #575
Conversation
@parthchandra @advancedxy This is ready for review now
NOTICE.txt (Outdated)

```text
This product includes software from the twox-hash project
 * Copyright https://github.com/shepmaster/twox-hash
 * Licensed under the MIT License;
```
Hmm, so does this mean that Comet would be dual licensed? I'm not sure about the legal part, especially the Copyright GitHub URL part...
Apache licensed projects can include MIT licensed software without being MIT licensed. Apache Arrow already does this, for example.
I copied the Copyright URL part from Apache Arrow as well (https://github.com/apache/arrow/blob/main/NOTICE.txt)
It would be good to get another opinion on this though. Perhaps @alamb could offer some thoughts.
Thanks @andygrove. The code looks correct to me and the performance improvement is exciting. I left a few comments on style issues.
```rust
        offset_u64_4 += 1;
    }
}
let total_len = data.len() as u64;
```
This seems to be a duplicate calculation, since `length_bytes` has already been computed. I think this is purely to cast to `u64`? How about `let total_len = length_bytes as u64;`?
```rust
pub const PRIME_1: u64 = 11_400_714_785_074_694_791;
pub const PRIME_2: u64 = 14_029_467_366_897_019_727;
pub const PRIME_3: u64 = 1_609_587_929_392_839_161;
pub const PRIME_4: u64 = 9_650_029_242_287_828_579;
pub const PRIME_5: u64 = 2_870_177_450_012_600_261;
```
FYI, Spark also has its own variant of XXHash64; see org.apache.spark.sql.catalyst.expressions.XXH64. I checked all the steps in your PR, and they should match both twox-hash and Spark's XXHash64 process, if I'm not mistaken. That's great, and we should be good to go. Of course, it's always good to have more eyes on this.
I have one small nit about this function: it has grown quite large. It would be better if we grouped the XXHash64-related functions and constants into a separate file and split them into smaller functions. I believe that would help with understanding and maintenance.
Thanks for the review @advancedxy. I moved xxhash64 into its own file under execution.datafusion.expressions, which also makes sense because this is a regular SQL function that users can call.
lgtm, pending ci passes
I think deriving code from MIT licensed code is fine, and consistent with how the Arrow project does things: https://github.com/apache/arrow/blob/main/LICENSE.txt
FYI @shepmaster
@parthchandra @kazuyukitanimura could I get a committer review?
I read it and compared it with https://github.com/shepmaster/twox-hash/blob/5f3f3a345be5f65680c5f2b9ed5950c85e9a0ccf/src/sixty_four.rs#L27. It looks like the same logic. The random comparison test also passes.
The improvement is nice. 💯
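A randomized comparison test of this kind might look roughly like the sketch below; `spark_compatible_xxhash64` is the function name used in the benchmark code later in this thread, and the iteration count, lengths, and seed are illustrative assumptions, not the PR's actual test:

```rust
use rand::{Rng, RngCore, SeedableRng};
use std::hash::Hasher;
use twox_hash::XxHash64;

// Hash random inputs with random seeds and check the one-shot
// implementation against the twox-hash streaming implementation.
#[test]
fn random_comparison_against_twox_hash() {
    let mut rng = rand::rngs::StdRng::seed_from_u64(42); // seed is arbitrary
    for _ in 0..1_000 {
        let len = rng.gen_range(0..1024);
        let mut data = vec![0u8; len];
        rng.fill_bytes(&mut data);
        let seed: u64 = rng.gen();

        let mut reference = XxHash64::with_seed(seed);
        reference.write(&data);

        assert_eq!(spark_compatible_xxhash64(&data, seed), reference.finish());
    }
}
```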
Just some comments on the license.
```diff
@@ -210,3 +211,26 @@ This project includes code from Apache Aurora.
 Copyright: 2016 The Apache Software Foundation.
 Home page: https://aurora.apache.org/
 License: http://www.apache.org/licenses/LICENSE-2.0

+--------------------------------------------------------------------------------
```
Looks like arrow-rs mentions the file names. Should we mention `core/src/execution/datafusion/expressions/xxhash64.rs` here?
This looks good to me.
I am not sure we have exactly the same implementation as the original (or the same as Spark, for that matter). If the fuzz tests pass, I have no concerns.
```rust
let ptr_u64 = data.as_ptr() as *const u64;
unsafe {
    while offset_u64_4 * CHUNK_SIZE + CHUNK_SIZE <= length_bytes {
        v1 = ingest_one_number(v1, ptr_u64.add(offset_u64_4 * 4).read_unaligned().to_le());
```
Does this produce the right result? The original implementation of XXHash64 processes 64 bits at a time.
This is also processing 64 bits at a time. We are calling `read_unaligned()` on a `ptr_u64`.
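A minimal standalone sketch of what that pointer read does, with an illustrative buffer (not code from the PR):

```rust
// Read one 64-bit lane from a byte buffer, regardless of alignment, and
// normalize it to little-endian byte order as xxhash64 expects.
let data: [u8; 8] = [1, 2, 3, 4, 5, 6, 7, 8];
let ptr_u64 = data.as_ptr() as *const u64;
let lane = unsafe { ptr_u64.read_unaligned() }.to_le();
assert_eq!(lane, u64::from_le_bytes(data));
```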
```rust
#[inline(always)]
fn ingest_one_number(mut current_value: u64, mut value: u64) -> u64 {
    value = value.wrapping_mul(PRIME_2);
```
Should this be `current_value = value.wrapping_mul(PRIME_2)`?
Oh, nvm. I was reading this wrong.
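For context, the quoted diff shows only the first line of the function; the full XXH64 round, reconstructed as a sketch from the reference steps quoted later in this thread (not the PR's exact code):

```rust
#[inline(always)]
fn ingest_one_number(mut current_value: u64, mut value: u64) -> u64 {
    // acc += input * XXH_PRIME64_2
    value = value.wrapping_mul(PRIME_2);
    current_value = current_value.wrapping_add(value);
    // acc = XXH_rotl64(acc, 31)
    current_value = current_value.rotate_left(31);
    // acc *= XXH_PRIME64_1
    current_value.wrapping_mul(PRIME_1)
}
```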
```rust
}

#[inline(always)]
fn mix_one(mut hash: u64, mut value: u64) -> u64 {
```
I know this is based on twox-hash, which also does what you're doing here. However, this looks like the XXH64_mergeRound function, which calls XXH64_round and is slightly different. XXH64_round seems to be ingest_one_number.
I think the logic looks the same? XXH64_mergeRound calls XXH64_round, which does this:

```c
acc += input * XXH_PRIME64_2;
acc = XXH_rotl64(acc, 31);
acc *= XXH_PRIME64_1;
```

Then XXH64_mergeRound does this:

```c
acc ^= val;
acc = acc * XXH_PRIME64_1 + XXH_PRIME64_4;
```
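Putting those two reference steps together, a sketch of how a twox-hash-style `mix_one` covers both (reconstructed from the C steps above, using the `PRIME_*` constants quoted earlier; not copied from the PR):

```rust
#[inline(always)]
fn mix_one(mut hash: u64, mut value: u64) -> u64 {
    // XXH64_round applied to the lane, with an accumulator of zero:
    value = value.wrapping_mul(PRIME_2);
    value = value.rotate_left(31);
    value = value.wrapping_mul(PRIME_1);
    // The merge step: acc ^= val; acc = acc * PRIME64_1 + PRIME64_4
    hash ^= value;
    hash.wrapping_mul(PRIME_1).wrapping_add(PRIME_4)
}
```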
```rust
const CHUNK_SIZE: usize = 32;

const PRIME_1: u64 = 11_400_714_785_074_694_791;
```
Just curious: is this the preferred way to write these, or would hex be preferred? (One is just as unreadable as the other.)
🤷♂️ I copied this directly from xxhash64. I think either style works in this case.
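For reference, the same constants written in hexadecimal, as they appear in the xxHash reference header:

```rust
pub const PRIME_1: u64 = 0x9E37_79B1_85EB_CA87; // 11_400_714_785_074_694_791
pub const PRIME_2: u64 = 0xC2B2_AE3D_27D4_EB4F; // 14_029_467_366_897_019_727
pub const PRIME_3: u64 = 0x1656_67B1_9E37_79F9; // 1_609_587_929_392_839_161
pub const PRIME_4: u64 = 0x85EB_CA77_C2B2_AE63; // 9_650_029_242_287_828_579
pub const PRIME_5: u64 = 0x27D4_EB2F_1656_67C5; // 2_870_177_450_012_600_261
```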
Are your benchmarks available for me to try out? I actually reimplemented the algorithm from scratch to see if there were any changes I would make in today's Rust. When I benchmarked it, I didn't see much difference compared to my existing implementation, but the numbers do noticeably differ from yours.
My guess is that this is hashing a buffer of 8192 bytes, but that would imply ~9 MiB/s, which would be really slow. Maybe it's some other unit? I grabbed the implementation code from this PR and popped it into my benchmark and got:
This is on a MacBook Pro M1 Max, so if anything I'd expect it to be slower than your test machine.

Benchmark code:

```rust
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use rand::{Rng, RngCore, SeedableRng};
use std::hash::Hasher;
use std::hint::black_box;
use twox_hash::XxHash64;

fn head_to_head(c: &mut Criterion) {
    let (seed, data) = gen_data();

    let mut g = c.benchmark_group("head-to-head");

    for size in [8192] {
        let data = &data[..size];
        g.throughput(Throughput::Bytes(data.len() as _));

        let id = format!("twox-hash/{size}");
        g.bench_function(id, |b| {
            b.iter(|| {
                let hash = {
                    let mut hasher = XxHash64::with_seed(seed);
                    hasher.write(&data);
                    hasher.finish()
                };
                black_box(hash);
            })
        });

        let id = format!("spark/{size}");
        g.bench_function(id, |b| {
            b.iter(|| {
                let hash = extracto::spark_compatible_xxhash64(&data, seed);
                black_box(hash);
            })
        });
    }

    g.finish();
}

const SEED: u64 = 0xc651_4843_1995_363f;
const DATA_SIZE: usize = 100 * 1024 * 1024;

fn gen_data() -> (u64, Vec<u8>) {
    let mut rng = rand::rngs::StdRng::seed_from_u64(SEED);
    let seed = rng.gen();
    let mut data = vec![0; DATA_SIZE];
    rng.fill_bytes(&mut data);
    (seed, data)
}

criterion_group!(benches, head_to_head);
criterion_main!(benches);
```
I think the code is available in core/benches/hash.rs.
Thanks for spending time looking at this @shepmaster. I just ran the benchmarks again using the following steps and still see the same results I initially posted here.
Results from step 2:
Results from step 6:
Thank you! I was able to run the benchmarks and tweak my local version of the code to match the performance of your PR. I'll work on rolling that into twox-hash proper and release a new version at some point. Thanks for the fresh perspective on how to improve performance!
Thanks for all the reviews @viirya @parthchandra @kazuyukitanimura @advancedxy. I will go ahead and merge this now, but let me know if you have any more comments or questions.
* reimplement xxhash64
* move twox_hash to build dep
* bug fix
* more tests
* clippy
* bug fix
* attribution
* improve test
* bug fix
* test with random seed
* remove comment
* more updated to license/notice
* remove redundant variable
* refactor to move xxhash64 into separate file
* refactor
* add copyright
Which issue does this PR close?
Closes #547
Rationale for this change
The twox-hash crate seems to be designed for use cases such as processing files, where it is important to process data in chunks rather than load it fully into memory first. In our use case, the data is already in memory, so we can optimize for that.
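To illustrate the difference in API shape (the one-shot signature below is taken from the benchmark code earlier in this thread; the function bodies are a sketch, not the PR's code):

```rust
use std::hash::Hasher;
use twox_hash::XxHash64;

// Streaming API (twox-hash): state persists across write() calls, so the
// hasher must buffer partial 32-byte stripes internally between calls.
fn hash_streaming(chunks: &[&[u8]], seed: u64) -> u64 {
    let mut hasher = XxHash64::with_seed(seed);
    for chunk in chunks {
        hasher.write(chunk);
    }
    hasher.finish()
}

// One-shot shape used by this PR: the whole input is already in memory,
// so the main loop can walk the slice directly with no buffering.
// fn spark_compatible_xxhash64(data: &[u8], seed: u64) -> u64 { ... }
```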
Before
Running on MacBook Pro M3 Max.
After
What changes are included in this PR?
How are these changes tested?