Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change MurmurHash 3 to a faster hash without sacrificing quality #528

Open
dumblob opened this issue Nov 14, 2016 · 29 comments
Open

Change MurmurHash 3 to a faster hash without sacrificing quality #528

dumblob opened this issue Nov 14, 2016 · 29 comments

Comments

@dumblob
Copy link

dumblob commented Nov 14, 2016

TL;DR

Jump to the end of this thread to see the latest proposal.

Original proposal:

As per tests on https://github.com/Cyan4973/xxHash it seems xxHash is a way faster (2x) option to the currently used MurmurHash 3 while offering the same quality. xxHash is BSD licensed (i.e. compatible with MIT).

@daokoder
Copy link
Owner

The MurmurHash3 was not only selected for speed but also for simplicity. Since MurmurHash3 is already fast enough and hashing comprises only a tiny fraction of the total execution time in most applications, doubling the hashing speed will not bring noticeable improvements in speed. For applications that really need fast hashing, providing appropriate modules will be a better solution.

@dumblob
Copy link
Author

dumblob commented Jan 16, 2017

I agree, that it's not needed at the moment and that the linked library is quite horrible. I'm though still convinced the algorithm should be built in the DaoVM reference implementation. My Dao source code uses hash maps a lot (and for critical stuff as fast network data processing) and internally Dao uses also hashing, so I believe in bigger applications this will play a significant role.

xxHash is also quite simple (look at a "high-level" implementation in JS - 32bit 64bit - which is way nicer than the original C implementation full of ifdefs and other unnecessary clutter).

In my opinion it's worth it to rewrite this high-level implementation in C99 just for the DaoVM purpose - it'll be small. The linked xxHash C library should not be incorporated as you said because it's bloated and might become a Dao module at most.

@rurban
Copy link
Contributor

rurban commented Aug 15, 2017

xxHash is only fast for large buffers. For small buffers or usage in a hash table FNV1 is still the fastest. Murmur 3 is also not really recommended.

@daokoder
Copy link
Owner

For small buffers or usage in a hash table FNV1 is still the fastest.

Due to its simplicity, it should be quite fast. Just curious, are there many FNV primes for a given bit length? If not, there will be not be enough hash seeds to choose randomly, when it is necessary to defend agains hash key collision attacks (e.g. in web application).

Murmur 3 is also not really recommended.

Why is that? Any benchmarks to compare FNV1 and Murmur3?

@dumblob
Copy link
Author

dumblob commented Aug 17, 2017

xxHash is only fast for large buffers. For small buffers or usage in a hash table FNV1 is still the fastest. Murmur 3 is also not really recommended.

I'm certainly missing something as both arguments seem void based on the tests in your, @rurban, repository: https://github.com/rurban/smhasher .

@rurban
Copy link
Contributor

rurban commented Aug 17, 2017

You have to use the right benchmarks :)

If you just test the hash function by itself, the longer wordwise ones dominate of course. But this is only useful for file or networking digests.

If you test them inside a typical hash table, usually inlined if small enough, the smaller ones dominate. You have to follow the link to
"When used in a hash table the instruction cache will usually beat the CPU and throughput measured here. In my tests the smallest FNV1A beats the fastest crc32_hw1 with Perl 5 hash tables."
https://github.com/rurban/perl-hash-stats#average-case-perl-core-testsuite

@dumblob
Copy link
Author

dumblob commented May 21, 2018

Well, the primary motivation for the switch was speed, but I do trust quite a lot the smhasher "benchmark" and mainly the quality classification whereas I would never accept any hash function, which doesn't perform at least at the level GOOD (otherwise there are security risks with severe effects like DoS originating in bad distribution of keys or O(n^2) worst-case scenarios triggered by a specially crafted inputs/packets/..., which is not acceptable for default built-in algorithms of general purpose languages like Dao).

Should this come again to question, it should be quite easy to implement xxHash portably based on the xxHash specification.

@rurban
Copy link
Contributor

rurban commented May 21, 2018 via email

@dumblob
Copy link
Author

dumblob commented May 21, 2018

It seems you mean FNV1a_YT instead of FNV1a. Then if we'll ignore CRC-based hashes it's the fastest. But also one of the very few with the absolute worst quality 😉 .

I can see in your smhasher-list though, that even faster than xxHash (especially for smaller keys as you noted) is t1ha - and without sacrificing any quality (according to smshasher). I have to admit I'm really surprised, that something can be so quick without sacrificing quality while being totally portable (yeah, it'll be slower on HW without native 64bit arithmetics, but that's a negligible issue nowadays).

@rurban
Copy link
Contributor

rurban commented May 21, 2018

Partially. FNV1a and it's mult. variants are the fastest for hash tables, besides _mm_crc32_u64 and t1ha. Even if the quality is a bit worse, the icache (the hash function) and dcache (the hash data structure) are more important than everything else.

Regarding security against DOS attacks, forget about the hash function, as even siphash can be brute forced. Only the collision resolution scheme can protect against DOS attacks.

@dumblob
Copy link
Author

dumblob commented Jun 12, 2019

There is a new hash function wyhash, which seems to raise the bar of hashing (including PRNG) by a giant step. It's less complicated than xxHash, but more than MurmurHash. So if there will be any need, then there is something we can grab immediately as it's significantly faster and with a state of the art distribution of keys.

See also smhasher tests which @rurban regularly updates.

@dumblob dumblob changed the title Change MurmurHash 3 to xxHash which is 2x that fast Change MurmurHash 3 to a faster hash without sacrificing quality Jun 12, 2019
@daokoder
Copy link
Owner

as it's significantly faster and with a state of the art distribution of keys.

Where did you see this?

Please note speed is not the most important factor in choosing a hash function. wyhash seems quite new, not many people have looked into it to spot potential problems. It would be very unwise to switch to it.

Also, just with a glimpse of the wyhash source code, it is obvious that it will produce different results on machines with different endianness. This is quite alarming for me.

@dumblob
Copy link
Author

dumblob commented Jun 13, 2019

Where did you see this?

https://github.com/rurban/smhasher/blob/master/doc/wyhash
https://github.com/rurban/smhasher/blob/master/doc/wyhash32low

Please note speed is not the most important factor in choosing a hash function.

Sure, that's why I noted "if there will be any need" 😉.

This is quite alarming for me.

This is interesting. Are there any places in Dao which require that? Or do we expose it somewhere, so that it allows e.g. sending a computed hash over network and blindly using it on another machine? Bytecode? Or something else?

@wangyi-fudan
Copy link

师兄好,我是2000级复旦生科的。wyhash是小弟开发的。 ^_^

@rurban
Copy link
Contributor

rurban commented Jun 13, 2019 via email

@daokoder
Copy link
Owner

This is interesting. Are there any places in Dao which require that? Or do we expose it somewhere, so that it allows e.g. sending a computed hash over network and blindly using it on another machine? Bytecode? Or something else?

Dao does not really require that. In fact, I didn't pay attention to this until recently. In some networked applications (such as certain types of games), the clients and server are required to have deterministic and identical behavior, which cannot be guaranteed if endian-dependent hashes are used for map keys.

There are some good endian-portable and 32bit only hashes also.

I had a look at MurmurHash3, it turns out it is not endian-neutral either, but it can be easily fixed. I will also look into theses hashes when I am less busy.

@rurban Curious, what platforms and compilers do you use for smhasher tests? I noticed that wyhash uses intrinsics, could its speed is mainly a result of this? If this is the case, it should not be very difficult to modify some other hash functions to use intrinsics too.

@daokoder
Copy link
Owner

师兄好,我是2000级复旦生科的。wyhash是小弟开发的。 ^_^

你好,不错:thumbsup:

@dumblob
Copy link
Author

dumblob commented Jun 14, 2019

In some networked applications (such as certain types of games), the clients and server are required to have deterministic and identical behavior, which cannot be guaranteed if endian-dependent hashes are used for map keys.

Well, this can be (and shall be IMHO) easily ensured in the serialization/deserialization routines from the Dao standard library before/after sending/receiving it over network. Another (even better) solution would be to use e.g. Cap'n Proto or sproto for such "shared" structures to avoid having to pay attention to such details.

So in the end I'm still not convinced endianess neutrality really matters in Dao 😉.

@daokoder
Copy link
Owner

Well, this can be (and shall be IMHO) easily ensured in the serialization/deserialization routines from the Dao standard library before/after sending/receiving it over network.

It is not about sending data over network. I was talking about something like Deterministic Lockstep for Networked Games

@dumblob
Copy link
Author

dumblob commented Jun 17, 2019

It is not about sending data over network. I was talking about something like Deterministic Lockstep for Networked Games

Thanks for the article. I'm though still not there as I'm not aware of any Dao/Dao_modules API which would expose the internals of a hash function and I can't imagine how e.g. the hash table API would cause a different result when replaying inputs as in the deterministic lockstep scenario (assuming just the Dao APIs were involved). I must be still overlooking something - I'm eager to get enlightened 😉.

@daokoder
Copy link
Owner

I must be still overlooking something

You indeed overlooked something: for-in iteration over hash map😄.

@dumblob
Copy link
Author

dumblob commented Jun 18, 2019

Yeah, it's tedious, but again - look at the scenario below. In this case it absolutely doesn't matter, that one machine is little endian and the other big endian.

unstable_hash00

So what is the hidden secret I'm missing?

@daokoder
Copy link
Owner

So what is the hidden secret I'm missing?

Ok, maybe you are unaware that Dao hash map is an unordered associative array. If a hash map has multiple keys, the keys will be sorted by their hash values (which may appear unordered if the ordering by hash values is different from ordering by their actual values). If the hash function is not endian-neutral, different endianness can generate different hash values for the keys, so they can be visited in different orders in for-in iteration.

And forget about data over network, the application could use hash maps for internal data that are not communicated over network. As an example, game engine such as Urho3D uses hash maps for scene node components and event registration etc., their handling will depend on the ordering of the keys.

@dumblob
Copy link
Author

dumblob commented Jun 18, 2019

Oh yeah, thanks for explanation. I must say I didn't see anywhere in Dao doc, that hash maps are ordered and that they're even deterministically iterable and that the order is guaranteed cross-machine. And because it's a hash map I would never get the idea, that there is some defined ordering and - crosses himself - never expected that a hash map will be iterated in the same order even if it's the same invariable map immediately iterated two times in a row.

With a tree-based (e.g. weighted red-black) map that's a different story, but with hash map it's really surprising that Dao needs something like endianess independence. In the worst case I would call it an optimization for the case when one wants to sort the iterated hash map pairs (key + value) using an algorithm which strongly benefits from a partially sorted sequence (e.g. Tim sort), but even then I'm not sure whether the requirement of having an endianess independent hash function is justified.

@daokoder
Copy link
Owner

daokoder commented Jun 19, 2019

Oh yeah, thanks for explanation. I must say I didn't see anywhere in Dao doc, that hash maps are ordered and that they're even deterministically iterable and that the order is guaranteed cross-machine. And because it's a hash map I would never get the idea, that there is some defined ordering and - crosses himself - never expected that a hash map will be iterated in the same order even if it's the same invariable map immediately iterated two times in a row.

Hash map in many languages/implementations including Dao does guarantee a deterministic ordering of the keys, it's just that it is not a meaningful ordering. It is also called unordered map or unordered associative array so that users are reminded that they should not depend on the ordering of hash keys for their programs. Nevertheless, hash map in Dao as well as in many other languages/implementations is deterministically iterable, for tasks that does not depend on the ordering of hash keys.

but with hash map it's really surprising that Dao needs something like endianess independence.

but even then I'm not sure whether the requirement of having an endianess independent hash function is justified.

It's not a requirement, but a feature that is nice to have in Dao. This feature will remove some pitfalls related to hash map. Maybe most users will not encounter such pitfalls, but who knows how they will use hash maps and what kind of problems they will try to solve.

I believe language designers should never speculate how users will use their languages and expect them to use their languages properly or in certain ways, and programming languages should be designed and implemented to contain as less pitfalls as possible.

@daokoder
Copy link
Owner

daokoder commented Jun 19, 2019

Just checked, Lua and the game engine I am using uses endianness-neutral hash functions for hash map. I haven't look into other languages, I think many of them are also using endianness-neutral hash functions (at least any serious language should).

Interesting, for string keys, Lua (5.2.3) does not use the whole strings for hashing.

@rurban
Copy link
Contributor

rurban commented Jun 19, 2019 via email

@daokoder
Copy link
Owner

deterministic ordering leads to your insecure hashmaps. secure hashmaps
need to be fully indeterministic, randomly ordered and randomly seeded. Any
language which provide ordered iterations are either slow or insecure.

I think the default behaviour should be deterministic. If you want non-deterministic behaviour, you can introduce it explicitly. In Dao, you can actually construct randomly seed hash maps using the explicit constructor:

map<@K=any,@V=any>( count: int, hashing: enum<none,auto,random>|int = $none )
            [index: int => tuple<@K,@V>] => map<@K,@V>

or reset the hashing seed randomly using:

reset( self: map<@K,@V>, hashing: enum<none,auto,random> )

Also, Dao modules web.cgi and web.http use randomly seeded hash maps for http get, post and cookie data.

@dumblob
Copy link
Author

dumblob commented Jun 19, 2019

It's not a requirement, but a feature that is nice to have in Dao.

Understood. I also feel it's a nice to have feature.

This feature will remove some pitfalls related to hash map. Maybe most users will not encounter such pitfalls, but who knows how they will use hash maps and what kind of problems they will try to solve.

I tend to agree with @rurban that the default setting shall be secure rather then deterministically iterable. One can then use the explicit constructor or reset() to make the iteration deterministic or just use sort().

I'm also certain, that if we really want to support deterministically iterable maps, we can implement both hashing alogrithms. By default the faster one with random ordering with the constraint, that calling reset($auto) will make a complete rehashing due to switch to a different algorithm. I'm actually certain, that such "rehashing" won't be used much in practice and I'm also certain, that even the deterministic iteration will be used very sporadically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants