-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change MurmurHash 3 to a faster hash without sacrificing quality #528
Comments
The MurmurHash3 was not only selected for speed but also for simplicity. Since MurmurHash3 is already fast enough and hashing comprises only a tiny fraction of the total execution time in most applications, doubling the hashing speed will not bring noticeable improvements in speed. For applications that really need fast hashing, providing appropriate modules will be a better solution. |
I agree, that it's not needed at the moment and that the linked library is quite horrible. I'm though still convinced the algorithm should be built in the DaoVM reference implementation. My Dao source code uses hash maps a lot (and for critical stuff as fast network data processing) and internally Dao uses also hashing, so I believe in bigger applications this will play a significant role. xxHash is also quite simple (look at a "high-level" implementation in JS - 32bit 64bit - which is way nicer than the original C implementation full of ifdefs and other unnecessary clutter). In my opinion it's worth it to rewrite this high-level implementation in C99 just for the DaoVM purpose - it'll be small. The linked xxHash C library should not be incorporated as you said because it's bloated and might become a Dao module at most. |
xxHash is only fast for large buffers. For small buffers or usage in a hash table FNV1 is still the fastest. Murmur 3 is also not really recommended. |
Due to its simplicity, it should be quite fast. Just curious, are there many FNV primes for a given bit length? If not, there will be not be enough hash seeds to choose randomly, when it is necessary to defend agains hash key collision attacks (e.g. in web application).
Why is that? Any benchmarks to compare FNV1 and Murmur3? |
I'm certainly missing something as both arguments seem void based on the tests in your, @rurban, repository: https://github.com/rurban/smhasher . |
You have to use the right benchmarks :) If you just test the hash function by itself, the longer wordwise ones dominate of course. But this is only useful for file or networking digests. If you test them inside a typical hash table, usually inlined if small enough, the smaller ones dominate. You have to follow the link to |
Well, the primary motivation for the switch was speed, but I do trust quite a lot the Should this come again to question, it should be quite easy to implement xxHash portably based on the xxHash specification. |
dumblob <notifications@github.com> schrieb am Do., 17. Aug. 2017, 19:42:
xxHash is only fast for large buffers. For small buffers or usage in a
hash table FNV1 is still the fastest. Murmur 3 is also not really
recommended.
I'm certainly missing something as both arguments seem void based on the
tests in your, @rurban <https://github.com/rurban>, repository:
https://github.com/rurban/smhasher .
Then read it again, what is tested there.
—
… You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#528 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AACjUdZo-GhBrxLQfC0OQMgVbL8jDbaSks5sZHtVgaJpZM4KxLxW>
.
|
It seems you mean I can see in your smhasher-list though, that even faster than xxHash (especially for smaller keys as you noted) is t1ha - and without sacrificing any quality (according to |
Partially. FNV1a and it's mult. variants are the fastest for hash tables, besides _mm_crc32_u64 and t1ha. Even if the quality is a bit worse, the icache (the hash function) and dcache (the hash data structure) are more important than everything else. Regarding security against DOS attacks, forget about the hash function, as even siphash can be brute forced. Only the collision resolution scheme can protect against DOS attacks. |
There is a new hash function wyhash, which seems to raise the bar of hashing (including PRNG) by a giant step. It's less complicated than xxHash, but more than MurmurHash. So if there will be any need, then there is something we can grab immediately as it's significantly faster and with a state of the art distribution of keys. See also smhasher tests which @rurban regularly updates. |
Where did you see this? Please note speed is not the most important factor in choosing a hash function. wyhash seems quite new, not many people have looked into it to spot potential problems. It would be very unwise to switch to it. Also, just with a glimpse of the wyhash source code, it is obvious that it will produce different results on machines with different endianness. This is quite alarming for me. |
https://github.com/rurban/smhasher/blob/master/doc/wyhash
Sure, that's why I noted "if there will be any need" 😉.
This is interesting. Are there any places in Dao which require that? Or do we expose it somewhere, so that it allows e.g. sending a computed hash over network and blindly using it on another machine? Bytecode? Or something else? |
师兄好,我是2000级复旦生科的。wyhash是小弟开发的。 ^_^ |
dumblob <notifications@github.com> schrieb am Do., 13. Juni 2019, 12:44:
Where did you see this?
https://github.com/rurban/smhasher/blob/master/doc/wyhash
https://github.com/rurban/smhasher/blob/master/doc/wyhash32low
Please note speed is not the most important factor in choosing a hash
function.
Sure, that's why I noted "if there will be any need" 😉.
This is quite alarming for me.
Murmurhash3 is also not endian compatible. Besides much slower, bigger and
worse quality than wyhash. And both are 64bit only.
There are some good endian-portable and 32bit only hashes also.
This is interesting. Are there any places in Dao which require that? Or do
… we expose it somewhere, so that it allows e.g. sending a computed hash over
network and blindly using it on another machine? Bytecode? Or something
else?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#528?email_source=notifications&email_token=AAAKGUNWQJ4VPAWELHWWVZLP2IQHDA5CNFSM4CWEXRLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXTJJ7Q#issuecomment-501650686>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAAKGULA4HGEMHMAUIVEIGTP2IQHDANCNFSM4CWEXRLA>
.
|
Dao does not really require that. In fact, I didn't pay attention to this until recently. In some networked applications (such as certain types of games), the clients and server are required to have deterministic and identical behavior, which cannot be guaranteed if endian-dependent hashes are used for map keys.
I had a look at MurmurHash3, it turns out it is not endian-neutral either, but it can be easily fixed. I will also look into theses hashes when I am less busy. @rurban Curious, what platforms and compilers do you use for smhasher tests? I noticed that wyhash uses intrinsics, could its speed is mainly a result of this? If this is the case, it should not be very difficult to modify some other hash functions to use intrinsics too. |
你好,不错:thumbsup: |
Well, this can be (and shall be IMHO) easily ensured in the serialization/deserialization routines from the Dao standard library before/after sending/receiving it over network. Another (even better) solution would be to use e.g. Cap'n Proto or sproto for such "shared" structures to avoid having to pay attention to such details. So in the end I'm still not convinced endianess neutrality really matters in Dao 😉. |
It is not about sending data over network. I was talking about something like Deterministic Lockstep for Networked Games。 |
Thanks for the article. I'm though still not there as I'm not aware of any Dao/Dao_modules API which would expose the internals of a hash function and I can't imagine how e.g. the hash table API would cause a different result when replaying inputs as in the deterministic lockstep scenario (assuming just the Dao APIs were involved). I must be still overlooking something - I'm eager to get enlightened 😉. |
You indeed overlooked something: for-in iteration over hash map😄. |
Ok, maybe you are unaware that Dao hash map is an unordered associative array. If a hash map has multiple keys, the keys will be sorted by their hash values (which may appear unordered if the ordering by hash values is different from ordering by their actual values). If the hash function is not endian-neutral, different endianness can generate different hash values for the keys, so they can be visited in different orders in for-in iteration. And forget about data over network, the application could use hash maps for internal data that are not communicated over network. As an example, game engine such as Urho3D uses hash maps for scene node components and event registration etc., their handling will depend on the ordering of the keys. |
Oh yeah, thanks for explanation. I must say I didn't see anywhere in Dao doc, that hash maps are ordered and that they're even deterministically iterable and that the order is guaranteed cross-machine. And because it's a hash map I would never get the idea, that there is some defined ordering and - crosses himself - never expected that a hash map will be iterated in the same order even if it's the same invariable map immediately iterated two times in a row. With a tree-based (e.g. weighted red-black) map that's a different story, but with hash map it's really surprising that Dao needs something like endianess independence. In the worst case I would call it an optimization for the case when one wants to sort the iterated hash map pairs (key + value) using an algorithm which strongly benefits from a partially sorted sequence (e.g. Tim sort), but even then I'm not sure whether the requirement of having an endianess independent hash function is justified. |
Hash map in many languages/implementations including Dao does guarantee a deterministic ordering of the keys, it's just that it is not a meaningful ordering. It is also called unordered map or unordered associative array so that users are reminded that they should not depend on the ordering of hash keys for their programs. Nevertheless, hash map in Dao as well as in many other languages/implementations is deterministically iterable, for tasks that does not depend on the ordering of hash keys.
It's not a requirement, but a feature that is nice to have in Dao. This feature will remove some pitfalls related to hash map. Maybe most users will not encounter such pitfalls, but who knows how they will use hash maps and what kind of problems they will try to solve. I believe language designers should never speculate how users will use their languages and expect them to use their languages properly or in certain ways, and programming languages should be designed and implemented to contain as less pitfalls as possible. |
Just checked, Lua and the game engine I am using uses endianness-neutral hash functions for hash map. I haven't look into other languages, I think many of them are also using endianness-neutral hash functions (at least any serious language should). Interesting, for string keys, Lua (5.2.3) does not use the whole strings for hashing. |
Limin Fu <notifications@github.com> schrieb am Mi., 19. Juni 2019, 17:15:
Oh yeah, thanks for explanation. I must say I didn't see anywhere in Dao
doc, that hash maps are ordered and that they're even deterministically
iterable and that the order is guaranteed cross-machine. And because it's a
hash map I would never get the idea, that there is some defined ordering
and - crosses himself - never expected that a hash map will be iterated in
the same order even if it's the same invariable map immediately iterated
two times in a row.
Hash map in many languages/implementations including Dao does guarantee a
deterministic ordering of the keys, it's just that it is *not a
meaningful ordering*. It is also called unordered map or unordered
associative array so that users are reminded that they should not depend on
the ordering of hash keys for their programs. Nevertheless, hash map in Dao
as well as in many other languages/implementations is deterministically
iterable, for tasks that does not depend on the ordering of hash keys.
deterministic ordering leads to your insecure hashmaps. secure hashmaps
need to be fully indeterministic, randomly ordered and randomly seeded. Any
language which provide ordered iterations are either slow or insecure.
but with hash map it's really surprising that Dao needs something like
endianess independence.
but even then I'm not sure whether the requirement of having an endianess
independent hash function is justified.
It's not a requirement, but a feature that is nice to have in Dao. This
feature will remove some pitfalls related to hash map. Maybe most users
will not encounter such pitfalls, but who knows how they will use hash maps
and what kind of problems they will try to solve.
platform independent deterministic hashing has the same problems as above.
best is not to expose any random ordering, eg in JSON or YAML exported
maps. clients need to assume fully random ordering.
I believe language designers should never speculate how users will use
their languages and expect them to use their languages properly or in
certain ways, and programming languages should be designed and implemented
to contain as less pitfalls as possible.
Just checked, Lua and the game engine I am using uses endianness-neutral
hash functions for hash map. I haven't look into other languages, I think
many of them are also using endianness-neutral hash functions (at least any
serious language should).
no, for sure not. serious languages should not expose security-critical
internals, should not rely on internal secrets.
random old popular languages do, but they were not properly designed and
have to carry now the burden of old mistakes.
|
I think the default behaviour should be deterministic. If you want non-deterministic behaviour, you can introduce it explicitly. In Dao, you can actually construct randomly seed hash maps using the explicit constructor: map<@K=any,@V=any>( count: int, hashing: enum<none,auto,random>|int = $none )
[index: int => tuple<@K,@V>] => map<@K,@V> or reset the hashing seed randomly using: reset( self: map<@K,@V>, hashing: enum<none,auto,random> ) Also, Dao modules web.cgi and web.http use randomly seeded hash maps for http get, post and cookie data. |
Understood. I also feel it's a nice to have feature.
I tend to agree with @rurban that the default setting shall be secure rather then deterministically iterable. One can then use the explicit constructor or I'm also certain, that if we really want to support deterministically iterable maps, we can implement both hashing alogrithms. By default the faster one with random ordering with the constraint, that calling |
TL;DR
Jump to the end of this thread to see the latest proposal.
Original proposal:
As per tests on https://github.com/Cyan4973/xxHash it seems xxHash is a way faster (2x) option to the currently used MurmurHash 3 while offering the same quality. xxHash is BSD licensed (i.e. compatible with MIT).
The text was updated successfully, but these errors were encountered: