-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use FNV1a for string hashing #1806
Conversation
Why does our benchmark only show ~2-fold difference for Numbers, but the linked issue was 1-2 mins vs 15 hours (so ~500x speed difference). Are we sure this fixes the issue as it sounds like something else was going on too. Edit: actually it looks like your "numbers" file was their fix, and not the cause of the problem. They had something akin to base64 encoding originally. |
We don't know exactly what values were used in the linked issue, but it's possible to make something similar with this perl 1-liner:
Running the benchmark on this gives a much bigger difference (8994 seconds is approx. 2.5 hours):
The probe chart for X31 is fairly spectacular: while the one for FNV1a is much more well-behaved: Assuming lookups that are evenly distributed between all the bins, on average you'd need something like 5563 comparisons to find something in an X31 hash table, but only 1.7 in one using FNV1a. That would easily account for the difference in speed seen. |
htslib/khash.h
Outdated
const khint_t offset_basis = 2166136261; | ||
const khint_t FNV_prime = 16777619; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get consistent indentation here please? Similarly for __ac_FNV1a_hash_kstring below. That looks like hard tabs for this file, although there's precedence for 4 spaces in the Wang hash.
I don't mind either way except for not switching half way through a function :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spacing has been made consistent.
The existing X31 hash propagates bits fairly slowly, resulting in a poor distribution of keys if most of the differences in strings are at the end. Fix by using FNV1a instead, which is a similar speed to calculate but distributes keys much more effectively. Includes kh_stats() function in khash which produces a histogram of probe chain lengths and a khash test framework. The test program can also be used to benchmark insertion and lookup times.
Marked as "Ready for review" now samtools/samtools#2081 has been merged. Both samtools' and bcftools' |
Note draft because samtools/samtools#2081 needs to be merged first, to fix a dependency on hash table ordering on some outputs.
The existing X31 hash propagates bits fairly slowly, resulting in a poor distribution of keys if most of the differences in strings are at the end. Fix by using FNV1a instead, which is a similar speed to calculate but distributes keys much more effectively. While this cannot completely solve the problem of certain inputs distributing badly, it hopefully makes it less likely to accidentally find one.
Includes kh_stats() function in khash which produces a histogram of probe chain lengths and a khash test framework. The test program can also be used to benchmark insertion and lookup times.
Some benchmarking results are listed below. The "Numbers" input is the default benchmark, consisting of strings
test0
through totest49999999
. "Human" is the reference names from GRCh38_full_analysis_set_plus_decoy_hla.fa. "Big" is the reference names from the file with a very large header linked in issue samtools/samtools#1105.So FNV1a performance is much better than X31 on bad cases, a little better for everything on lookups, and only slightly slower for insertions on the "Human" and "Big" tests.
Also for interest, here are probe length charts for the various tests (note log scales on y):
"Numbers", showing a very poor distribution for X31:
"Human":
"Big", in this case the names are long so X31 mixes better and the distributions are similar:
Fixes samtools/samtools#2066