Use FNV1a for string hashing #1806

Merged: 1 commit, Jul 18, 2024


daviesrob
Member

Note: this is a draft because samtools/samtools#2081 needs to be merged first, to fix a dependency on hash table ordering in some outputs.

The existing X31 hash propagates bits fairly slowly, which gives a poor distribution of keys when most of the differences between strings are at the end. Fix by using FNV1a instead, which takes a similar time to calculate but distributes keys much more effectively. While this cannot completely solve the problem of certain inputs distributing badly, it should make it less likely that one is found by accident.

Also includes a kh_stats() function in khash, which produces a histogram of probe chain lengths, and a khash test framework. The test program can also be used to benchmark insertion and lookup times.

Some benchmarking results are listed below. The "Numbers" input is the default benchmark, consisting of strings test0 through to test49999999. "Human" is the reference names from GRCh38_full_analysis_set_plus_decoy_hla.fa. "Big" is the reference names from the file with a very large header linked in issue samtools/samtools#1105.

| Input | Insert X31 (s) | Insert FNV1a (s) | Lookup X31 (s) | Lookup FNV1a (s) |
|---|---|---|---|---|
| Numbers | 17.95 | 13.79 | 14.08 | 8.39 |
| Human | 0.000435 | 0.000468 | 0.000134 | 0.000117 |
| Big | 13.22 | 14.30 | 6.77 | 6.63 |

So FNV1a performance is much better than X31 on bad cases, a little better for everything on lookups, and only slightly slower for insertions on the "Human" and "Big" tests.

Also for interest, here are probe length charts for the various tests (note log scales on y):

"Numbers", showing a very poor distribution for X31:
[image: kh_benchmark]

"Human":
[image: kh_human]

"Big", where the names are long, so X31 mixes better and the distributions are similar:
[image: kh_big]

Fixes samtools/samtools#2066

@jkbonfield
Contributor

jkbonfield commented Jul 15, 2024

Why does our benchmark only show a ~2-fold difference for Numbers, when the linked issue was 1-2 mins vs 15 hours (so a ~500x speed difference)?

Are we sure this fixes the issue? It sounds like something else was going on too.

Edit: actually it looks like your "numbers" file was their fix, and not the cause of the problem. They had something akin to base64 encoding originally.

@daviesrob
Member Author

We don't know exactly what values were used in the linked issue, but it's possible to make something similar with this Perl one-liner:

perl -e '@x = ("A"..."Z", "a"..."z","0"..."9"); for ($i = 0; $i < 40000000; $i++) { $j = $i; $s = ""; for ($k = 0; $k < 5; $k++) { $s = $x[$j % 62] . $s; $j /= 62; } print "$s\n"; }' > /tmp/names

Running the benchmark on this gives a much bigger difference (8994 seconds is approx. 2.5 hours):

| Input | Insert X31 (s) | Insert FNV1a (s) | Lookup X31 (s) | Lookup FNV1a (s) |
|---|---|---|---|---|
| names | 8994 | 7.02 | 8878 | 4.36 |

The probe chart for X31 is fairly spectacular:

[image: kh_names_x31]

while the one for FNV1a is much better behaved:

[image: kh_names_fnv]

Assuming lookups that are evenly distributed between all the bins, on average you'd need something like 5563 comparisons to find something in an X31 hash table, but only 1.7 in one using FNV1a. That would easily account for the difference in speed seen.

htslib/khash.h Outdated
Comment on lines 446 to 447
const khint_t offset_basis = 2166136261;
const khint_t FNV_prime = 16777619;
Contributor

Can we get consistent indentation here please? Similarly for __ac_FNV1a_hash_kstring below. It looks like hard tabs are used in this file, although there's precedent for 4 spaces in the Wang hash.

I don't mind either way, except for not switching halfway through a function :-)

Member Author

Spacing has been made consistent.

The existing X31 hash propagates bits fairly slowly, resulting in
a poor distribution of keys if most of the differences in strings
are at the end.  Fix by using FNV1a instead, which is a similar
speed to calculate but distributes keys much more effectively.

Includes kh_stats() function in khash which produces a histogram
of probe chain lengths and a khash test framework.  The test
program can also be used to benchmark insertion and lookup
times.
@daviesrob daviesrob marked this pull request as ready for review July 17, 2024 16:10
@daviesrob
Member Author

Marked as "Ready for review" now that samtools/samtools#2081 has been merged. Both samtools' and bcftools' make check should work with this change.

@jkbonfield jkbonfield merged commit a135bc0 into samtools:develop Jul 18, 2024
9 checks passed
Successfully merging this pull request may close these issues.

Header loading time depends on header names