`Hasher::write` should clarify its "whole unit" behaviour #94026
Comments
While we are at it, it would be good to also clarify what prefix-freedom means in the presence of a mix of `write` and the typed `write_u8`/`write_u32`/etc. calls. I propose spelling out explicit rules for `write`; there should also be analogous rules for the typed methods. Either way, I think the guarantees for `Hasher` need to be written down.
I strongly disagree with the notion that the same bytes passed to `write` in different chunks must produce the same hash. Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and last 4 bytes (with 1 byte of overlap) and hashing those two values. This is much more efficient than hashing each byte individually or buffering the bytes until they form a full word.

We should absolutely optimize SipHasher to be faster.
Wouldn't that mean that strings "abcdefg" and "abcddefg" always hash to the same value? If so, that would make a poor hash function for hash tables.
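A minimal sketch of the overlapping-read trick and of the collision the question above worries about; `mix`, `hash7`, and `hash8` are made-up helpers, not SipHash:

```rust
// Sketch only: illustrates the overlapping-read idea from the comment above.
// `mix` is a stand-in for a real compression function.
fn mix(state: u64, word: u32) -> u64 {
    (state ^ word as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Hash a 7-byte slice as two overlapping 4-byte reads (1 byte of overlap),
/// instead of byte-by-byte or via buffering.
fn hash7(bytes: &[u8; 7], state: u64) -> u64 {
    let head = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
    let tail = u32::from_le_bytes(bytes[3..7].try_into().unwrap());
    mix(mix(state, head), tail)
}

/// Hash an 8-byte slice as two exact 4-byte reads.
fn hash8(bytes: &[u8; 8], state: u64) -> u64 {
    let head = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
    let tail = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    mix(mix(state, head), tail)
}

fn main() {
    // "abcdefg"  -> reads "abcd" and "defg" (overlapping);
    // "abcddefg" -> reads "abcd" and "defg" (exact).
    // Same word pairs, so the states collide unless the length is also mixed in.
    assert_eq!(hash7(b"abcdefg", 0), hash8(b"abcddefg", 0));
}
```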
I think more generally, beyond the optimization for unaligned memory access, it seems easy to assume that a simple hasher (e.g., FxHash from rustc) is not going to keep a buffer around for 'partial' writes prior to inserting them into the hash function. If you write a partial slice, it'll still end up hashing a full usize, so writing a series of slices rather than one large one can have a large impact. (Effectively, this is a form of zero-padding the input buffer to fit an 8-byte block.)
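A toy model of that behaviour, assuming a hasher that zero-pads every `write` to whole 8-byte blocks (illustrative only, not FxHash's actual code):

```rust
use std::hash::Hasher;

// Toy hasher: every `write` is consumed in 8-byte blocks, with the final
// partial block zero-padded. No buffer is carried across calls, so the
// chunking of writes changes the result.
struct ToyBlockHasher(u64);

impl Hasher for ToyBlockHasher {
    fn write(&mut self, bytes: &[u8]) {
        for chunk in bytes.chunks(8) {
            let mut block = [0u8; 8]; // zero padding for a partial chunk
            block[..chunk.len()].copy_from_slice(chunk);
            self.0 = (self.0 ^ u64::from_le_bytes(block))
                .wrapping_mul(0x9E37_79B9_7F4A_7C15);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    let (mut h1, mut h2) = (ToyBlockHasher(0), ToyBlockHasher(0));
    h1.write(b"hello world"); // two blocks: "hello wo" + "rld" padded
    h2.write(b"hello ");      // one block:  "hello \0\0"
    h2.write(b"world");       // one block:  "world\0\0\0"
    assert_ne!(h1.finish(), h2.finish()); // chunking changed the hash
}
```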
If multiple calls to `write` get merged into a single byte stream, the `Hash` impl has to provide its own disambiguation; that is why `Hash for str` writes a trailing 0xff byte.

Why that? As long as each value's writes can be told apart somehow, a terminator doesn't seem necessary.

That doesn't explain why the 0xff byte is required.
It is required for the property that unequal values write different sequences to the `Hasher`. For instance, suppose that a hasher (say, a SipHash-style one) merges everything it is given into one byte stream. Then, without the terminator, this would cause a guaranteed collision between ("ab", "c") and ("a", "bc"), since both write the bytes a b c.
To ensure that hashing ("ab", "c") and ("a", "bc") produces different results.
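A throwaway sketch that makes those byte streams visible; `ByteStream` is a made-up recorder, not a real hasher:

```rust
use std::hash::{Hash, Hasher};

// Records the raw byte stream a `Hash` impl produces, so the effect of the
// 0xff terminator written by `Hash for str` is directly visible.
#[derive(Default)]
struct ByteStream(Vec<u8>);

impl Hasher for ByteStream {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // irrelevant here; we only inspect the recorded bytes
    }
}

fn stream<T: Hash>(value: &T) -> Vec<u8> {
    let mut h = ByteStream::default();
    value.hash(&mut h);
    h.0
}

fn main() {
    // ("ab", "c") -> [a, b, FF, c, FF]; ("a", "bc") -> [a, FF, b, c, FF].
    // Without the terminator both would be [a, b, c]: a guaranteed collision.
    assert_ne!(stream(&("ab", "c")), stream(&("a", "bc")));
}
```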
Hmm, if hashers don't merge writes, the terminator doesn't seem to buy anything. (And the length still has to be hashed somewhere.)
That sounds like a job for the slice hash function (that is, hashing the length up front, as `Hash for [T]` does).
No, it wouldn't, since the lengths of the strings are also hashed.
This is not true. The `Hash` impl for `str` does not hash the length; it writes a trailing 0xff byte instead.
It's worse than that -- if it doesn't merge the FF byte with the previous write but instead does something else on a whole block, you might lose the prefix-free property for strings (depending on the details of how it gets expanded to a block and what the block hash is).
Oh, I missed that it calls `write_u8(0xff)` rather than hashing the length. Maybe strings should just be hashed as byte slices though; that would resolve all the concerns above... wouldn't it?
Well, if the hasher does merge writes, like SipHash does, then the 0xff terminator does keep strings prefix-free. But yes, just using the normal slice hash for `str` would sidestep the problem.
It would solve the issue for `str` itself, but not in general. For instance, if somebody implements their own custom string type, they run into exactly the same question. It is also convenient to have writes merged for other reasons, as described in the original ticket description. For instance, if the custom type stores its bytes in two chunks (like a circular buffer), it can simply `write` each chunk; see the sketch below. If on the other hand writes are not merged, it has to feed the bytes through one at a time.
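A sketch of that convenience with a hypothetical two-chunk buffer (`RingBuf` is illustrative, not `VecDeque`'s actual code); note that its `Hash` impl is only correct if consecutive `write`s are merged:

```rust
use std::hash::{Hash, Hasher};

// Hypothetical circular byte buffer: logically contiguous bytes stored as
// two chunks. Two buffers with the same logical contents must hash equal,
// no matter where the split falls.
struct RingBuf {
    head: Vec<u8>, // first part of the logical contents
    tail: Vec<u8>, // second part of the logical contents
}

impl Hash for RingBuf {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Length prefix for prefix-freedom, mirroring `Hash for [T]`.
        state.write_usize(self.head.len() + self.tail.len());
        // Writing the two chunks separately is only equivalent to writing
        // their concatenation if the hasher merges consecutive `write`s.
        state.write(&self.head);
        state.write(&self.tail);
    }
}
```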
cc @Amanieu for summary of libs discussion and reply |
It seems that the key question here is whether the responsibility for ensuring that unequal values feed different byte sequences to the hasher lies with the `Hash` implementation or with the `Hasher`.
If we were to choose the first option then nothing needs to change except some documentation. If we were to choose the second option then all hashers that guarantee HashDoS resistance will be required to hash the length of each slice passed to `write`.

I personally favor the second option since it would allow for faster hashing in cases where HashDoS resistance is not required (FxHash). There might be some performance loss on SipHash when hashing strings since a full 8-byte length needs to be hashed instead of a single terminator character, but there is a way around this. We could add an unstable `write_str` method to `Hasher` so that implementations can opt into a cheaper string-specific encoding (such as a terminator byte).

Incidentally, I noticed an optimization for SipHash that would apply no matter which approach is taken. SipHash hashes the low 8 bits of the total length, but this doesn't actually guarantee prefix-freedom: playground. This should be removed in favor of a full length hash (either in `write` or in the `Hash` impls).
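A rough sketch of what such an opt-in method could look like; the `HasherExt` trait and its default body are assumptions for illustration, not the actual API that was proposed:

```rust
use std::hash::Hasher;

// Hypothetical extension: hashers may override `write_str` with a cheaper
// string-specific encoding, as long as it stays prefix-free.
trait HasherExt: Hasher {
    fn write_str(&mut self, s: &str) {
        // Default: full length prefix, safe for any hasher.
        self.write_usize(s.len());
        self.write(s.as_bytes());
    }
}

// A write-merging hasher (like SipHash) could instead override it with a
// single terminator byte, avoiding the 8-byte length:
//
//     fn write_str(&mut self, s: &str) {
//         self.write(s.as_bytes());
//         self.write_u8(0xff); // cheaper than hashing the full length
//     }
```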
I've made a proposal partway between those two approaches as PR #94598. I think just leaving it to documentation wouldn't settle who is responsible, so the PR adds dedicated methods instead.

Edit: I think the method details are best discussed on the PR itself.
On the other hand this would require siphasher (used for hash tables by default) to hash the length, which would slow it down, as lengths now get hashed even if not necessary for disambiguation. Maybe we could have a third option where the `Hash` impl is responsible for calling a new method with the disambiguation data, and the `Hasher` can then decide to hash it or not depending on whether it needs HashDoS protection.

Edit: raced with the comment above.
You're misreading that code.
This idea is basically to skip over some relevant information when hashing. To me this seems to defeat the whole purpose of hashing. It's like only hashing the first half of an `(i32, i32)` pair for performance. Is this really such a performance win? It could be a big performance loss due to avoidable collisions.
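For concreteness, a sketch of the "third option" mentioned two comments up, with an assumed method name (`write_disambiguator`); the real follow-up in #94598 used different names:

```rust
use std::hash::Hasher;

// Hypothetical API: `Hash` impls always report disambiguation data (e.g. a
// slice length), and each `Hasher` decides whether to actually mix it in.
trait DisambiguatingHasher: Hasher {
    fn write_disambiguator(&mut self, len: usize) {
        // HashDoS-resistant hashers keep this default and hash the length.
        self.write_usize(len);
    }
}

// A speed-oriented hasher without HashDoS guarantees (in the spirit of
// FxHash) could override it as a no-op:
//
//     fn write_disambiguator(&mut self, _len: usize) {}
```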
Further elaborate the lack of guarantees from `Hasher`: I realized that I got too excited in rust-lang#94598 by adding new methods, and forgot to do the documentation to really answer the core question in rust-lang#94026. This PR just has that doc update. r? `@Amanieu`
Inspired by https://users.rust-lang.org/t/hash-prefix-collisions/71823/10?u=scottmcm
`Hash::hash_slice` has a bunch of text clarifying that `h.hash_slice(&[a, b]); h.hash_slice(&[c]);` is not guaranteed to be the same as `h.hash_slice(&[a]); h.hash_slice(&[b, c]);`.
However, `Hasher::write` is unclear about whether that same rule applies to it. It's very clear that `.write(&[a])` is not the same as `.write_u8(a)`, but not whether the same sequence of bytes passed to `write` is supposed to be the same thing even if they're in different groupings, like `h.write(&[a, b]); h.write(&[c]);` vs `h.write(&[a]); h.write(&[b, c]);`.
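A quick illustration of the two groupings in question, using the standard library's default hasher; the assert passes with today's implementation (which buffers bytes across `write` calls), but the issue is precisely whether that is guaranteed:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

fn main() {
    let (mut h1, mut h2) = (DefaultHasher::new(), DefaultHasher::new());
    h1.write(&[1, 2]);
    h1.write(&[3]);
    h2.write(&[1]);
    h2.write(&[2, 3]);
    // Passes with the current SipHasher-based implementation, which buffers
    // bytes across `write` calls -- but is this behaviour guaranteed?
    assert_eq!(h1.finish(), h2.finish());
}
```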
This is important for the same kind of things as the `VecDeque` example mentioned on `hash_slice`. If I have a circular byte buffer, is it legal for its `Hash` to just `.write` the two parts? Or does it need to `write_u8` all the individual bytes, since two circular buffers should compare equal regardless of where the split happens to be?
Given that `Hash for str` and `Hash for [T]` are doing prefix-freedom already, it feels to me like `write` should not be doing it again.
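Concretely, here is roughly how those two impls achieve prefix-freedom, paraphrased on wrapper types rather than quoted from the std sources:

```rust
use std::hash::{Hash, Hasher};

// Paraphrase of the two std impls mentioned above (not the literal sources).

struct MyStr<'a>(&'a str);
struct MySlice<'a, T>(&'a [T]);

impl Hash for MyStr<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.0.as_bytes());
        state.write_u8(0xff); // terminator: cheap prefix-freedom for strings
    }
}

impl<T: Hash> Hash for MySlice<'_, T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_usize(self.0.len()); // length prefix: prefix-freedom for slices
        Hash::hash_slice(self.0, state);
    }
}
```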
Also, our `SipHasher` implementation is going out of its way to maintain the "different chunkings of `write` are fine" behaviour:

rust/library/core/src/hash/sip.rs
Lines 264 to 308 in 6bf3008
SipHasher
to be faster.cc #80303 which lead to this text in
hash_slice