
lose entropy #49

Closed
gzm55 opened this issue Feb 5, 2020 · 12 comments

@gzm55
Contributor

gzm55 commented Feb 5, 2020

It should be noted that wyhash can lose entropy (similarly to what is described below), for instance when the input data is correlated with, or equal to, the seed ^ _wypN values.

@leo-yuriev described this issue in the README of the t1ha hash.

@wangyi-fudan
Owner

I don't think so, because we keep the mask secret; thus it can only lose entropy with 2^-64 probability.

@erthink

erthink commented Feb 6, 2020

The actual chance is 2^-63 per multiplication. Moreover, the later a zero rolls, the greater the loss will be. For example, if a zero rolls at the end of the main loop, then 1/4 of the data will be lost.

For many applications this is an unacceptable flaw. Therefore, users should know the price of that speed, so that they can draw their own conclusions and assess the risks (regardless of how you yourself rate the severity of the problem).

@wangyi-fudan
Owner

Thanks!
I also warn that DRAM is dangerous: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

@erthink

erthink commented Feb 6, 2020

It is clear that you can't fix the flaw without significantly slowing down wyhash. That is not a problem; wyhash is still good and fast for many applications. However, you should not excuse or gloss over this flaw.

On the contrary, it should be well documented, so that users understand exactly what is happening and can knowingly choose the speed that wyhash provides.

@wangyi-fudan
Owner

OK, my friend!

@erthink

erthink commented Feb 6, 2020

@wangyi-fudan, I don't want to nitpick, but the current wording is still not correct, since as the length of the hashed data increases, the probability of losing something tends to 100%.

I think it is better to state it as "at least 2^-63 per 64-bit word", give the exact analytical formula (i.e. 1 - (1 - 2^-63)^N, where N is the number of words) and specify the probability for 1,000,000 bytes (for instance).

$ echo "scale = 32; 1.0 - ((1.0 - (2.0 ^ -63)) ^ 125000)" | bc
.00000000000001355252715606865817

This value is small enough that there is nothing to be ashamed of ;)

@erthink

erthink commented Feb 6, 2020

Minor issue: "FastestHash" is not a hash, but a do-nothing stub.

@wangyi-fudan
Owner

The FastestHash does do things, providing useful hashes for short strings in a hash table.
I will update the number.
Cheers!
We are in a deadly flu season, so I imagine if one day I die, at least wyhash will be there.

@wangyi-fudan
Owner

wangyi-fudan commented Feb 6, 2020

The scale=32 confused me ;-).
It is 2^-66 per byte, so it will be 2^-26 per TB (2^40 bytes), i.e. approximately one loss per 64 million TB.

@gzm55
Contributor Author

gzm55 commented Feb 11, 2020

PR #52 should improve this issue with very little runtime cost.

@wangyi-fudan
Owner

Dear all:
The issue is fully solved. Based on @gzm55's idea, we can use the following mix function:
_wymix(A,B) = _wymum(A,B) ^ A ^ B
If A==0 and B!=0, it will return B. The cost is not bad.

| HashFunction | Words | Hashmap | Bulk64K | Bulk16M |
| --- | --- | --- | --- | --- |
| FastestHash | 337.38 | 51.33 | 14233.53 | 3435973.84 |
| std::hash | 96.53 | 36.12 | 7.33 | 7.35 |
| wyhash | 249.93 | 44.28 | 21.61 | 19.28 |
| xxHash64 | 109.42 | 35.19 | 14.71 | 14.49 |
| XXH3_scalar | 181.24 | 42.13 | 13.12 | 13.07 |
| t1ha2_atonce | 124.07 | 35.80 | 17.10 | 16.51 |

Cheers!

@rurban
Contributor

rurban commented Feb 12, 2020

> The FastestHash does do things, providing useful hashes for short strings in a hash table.
> I will update the number.
> Cheers!
> We are in a deadly flu season, so I imagine if one day I die, at least wyhash will be there.

It's not a deadly flu, just a very bad cold, much less deadly than SARS or MERS, not to mention a really dangerous flu like the Spanish flu. It will be over in May when warm temperatures come; hold on.
