Bloom filters using a flawed double hashing scheme #4120
Comments
See prototyping+testing in https://github.com/pdillinger/rocksdb/tree/bloom-filters, commit 90d1ab9
Thanks @pdillinger for the well-researched, well-written proposal. Benchmarks would have the final word about any such change. The simplest benchmark we have is db_bench. Do you think you can run db_bench readrandom with and without the change and see how much the difference is?
Awesome analysis! I just want to point out that if we change the Bloom filter format, we might also take this opportunity to re-evaluate the choice of hash function, which is quite dated.
@maysamyabandeh My experience with db_bench so far is that it's slow to get results, and there's so much variance in the results (testing up to num = 20M, still just hitting disk cache for reads) that I don't have faith I'll get meaningful data from some big runs. Why so much variance? Up to 33% change in ops/sec from one run to the next. The level stats are even different; I would expect similar behavior given that the --seed option suggests deterministic seeding of random generators. And I obviously can't just run readrandom on the same existing db across various changes/settings, because they affect its construction/formatting in incompatible ways.

./db_bench --bloom_bits 10 --num 20000000 --benchmarks fillrandom,readrandom,readmissing,stats,levelstats,sstables

@ot I don't know how common it is for keys to already be uniformly distributed, like UUIDs, but perhaps it's worth an option to essentially skip hashing on the assumption that keys are already uniformly distributed--or that a sufficient portion of the key is. You could imagine sharding based on some portion of a UUID making that portion predictable, but XORing together the whole key should still be super fast and take care of that case. You just need at least 32 contiguous, uniformly distributed bits somewhere in the key. Hypothetically, you could even fall back on real hashing if the XOR approach is found not to be accurate enough, perhaps by tracking how many keys added to the structure did not change it (i.e. would have been an FP query at the time they were added).

In related news (possibly to put Bloom filter fixing/enhancement in context), I've baked up a Bloom filter alternative discussed in my dissertation ("static compacted table" or SCT, Section 7.2.3, http://peterd.org/pcd-diss.pdf) that is remarkably compact and accurate when the set of keys is known ahead of time. Despite a pre-pass on the keys to put them in buckets, I've got it screaming (cache locality!), squeaking ahead-ish in speed and accuracy even when limited to 9 bits/key, compared to 10 in the Bloom filters (implementation not yet shared - ugly at the moment):

My recent literature search suggests that the Cuckoo filter is state of the art for compact+accurate+fast-ish, especially if deletion is needed. However, SCT beats the Cuckoo filter soundly in cases where (a) the key set is known ahead of time, (b) deletion is not needed, and (c) no worst-case bounds on look-up time are needed - only the average case (with no adversary) matters. I unfortunately never got the cycles to publish this structure outside my dissertation. Extending the code used in the Cuckoo filter paper for Table 3, I get these results:

Note: Unlike Table 3 in the paper, here I have fixed the # of items added, to best match our usage paradigm, and because SCT can add elements until 100% load factor but is best below 100%, e.g. around 92% for 8-bit entries (8.69 bits/key).
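For what it's worth, the XOR-folding idea above can be sketched as follows. `xor_fold32` is a hypothetical helper (not RocksDB code), and the scheme is only sound under the stated assumption that the key contains at least 32 contiguous, uniformly distributed bits:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical sketch of "XOR together the whole key": fold a key into
// 32 bits with XOR. Only useful if the key already carries at least 32
// contiguous uniformly distributed bits (e.g. part of a UUID).
uint32_t xor_fold32(const std::string& key) {
  uint32_t acc = 0;
  size_t i = 0;
  for (; i + 4 <= key.size(); i += 4) {
    uint32_t word;
    std::memcpy(&word, key.data() + i, 4);  // read 4 bytes as a word
    acc ^= word;
  }
  // Fold in any trailing bytes, shifted to distinct byte positions.
  for (; i < key.size(); ++i) {
    acc ^= static_cast<uint32_t>(static_cast<unsigned char>(key[i]))
           << (8 * (i & 3));
  }
  return acc;
}
```

Note that identical 4-byte words cancel under XOR, which is exactly why this is only safe given the uniformity assumption above.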
@pdillinger you can use readrandom[X10] to run the benchmark many times, discard the first run, and average over the next runs. "-duration 60" could also be used to control the duration of the benchmark. Also make sure block_cache (default 8MB) is proportional to the DB size.
Very rare for most of the use cases I'm aware of; in fact, it is quite common to encode a hierarchy (for example, table name, column name) or other kind of structure into the key. I don't think the cost of hashing is really an issue here: hashing can be amortized over the multiple lookups (at least one per level), and I assume memory lookups take the majority of the cost.
This is very nice, I'll look into it. I have done some work on perfect hashing, which can also be used to represent approximate sets with provable false positive guarantees; I wonder how it compares with your method. BTW, vanilla Cuckoo filters require a power-of-2 table size (because of the XOR trick to find the other slot), but it's possible to modify the method to work with any table size - have you considered that?
(At risk of going a bit off topic)
That seems quite doable, such as this ("tag", also called "fingerprint", is the value stored in a cell):

int incr = (tag << 1) + 1 - (1 << (bits_per_tag - 1)); // small, odd, and (positive or negative)
other_index = mod(index & 1 ? index + incr : index - incr, size);

where the only thing we assume about size is that it is even. In fact, my test results seriously call into question this suspicious bit from the cuckoo filter paper:
The first thing suspicious about this is that since h_2 is a function of h_1 + the tag, if h_1(x) = h_1(y) and h_2(x) = h_2(y), then y should have the same tag, at least with high probability, and "merge" with x (adding y after x does not change the structure). Therefore, we don't have to worry about first-order clustering, where h_1(x) = h_1(y) and h_2(x) = h_2(y) but tag(x) != tag(y). We only have to worry about higher-order clustering, such as h_2(x) = h_1(y), h_2(y) = h_1(z), h_2(z) = h_1(x). This is certainly possible with the XOR scheme if tag(x) XOR tag(y) = tag(z). But such second-order clustering is not possible with my even-odd scheme; you have to get to third-order (four items) to get clustering effects, and this should be negligible. Also note that using the tag literally, rather than a hash, excludes first-order clustering entirely (rather than with high probability), and improves locality. :-D

To test this idea, we need to test where the cuckoo filter rejects an add, because this is how clustering would affect the structure (not FP rate assuming a specific number of additions). We do that here, and let the SCT get to 98% load (higher than normal, but still competitive!). Comparing the reference implementation of the cuckoo filter (*CF) to ones using my new scheme for computing the alternate index (*CF-alt), and SCT (from earlier comments):
Pay most attention to # items added, though the difference in the -alt versions is likely sampling error (based on inspecting more runs). FP rate and speed of SCT don't look as good now because it's adding more elements than CF - look at bits per item. And if we fix the number of items added, CF-alt and SCT show their strengths:

I'm still using power-of-two sized structures here, but there's no reason to believe they'll behave any differently with arbitrary (even) sizes when using my scheme.
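As a sanity check on the even-odd alternate-index scheme above, here is a small sketch (my own restatement, assuming `size` is even and that `mod` is a floor modulus that wraps negative values). Because `incr` is odd, the mapping flips the parity of the index, and applying it twice returns the original index, so each item's two candidate buckets are well defined:

```cpp
#include <cstdint>

// Floor modulus: wraps negative intermediate values into [0, size).
static uint32_t floor_mod(int64_t a, int64_t size) {
  int64_t r = a % size;
  return static_cast<uint32_t>(r < 0 ? r + size : r);
}

// Sketch of the even-odd alternate-index scheme described above.
// Assumes size is even. incr is small, odd, and possibly negative.
uint32_t alt_index(uint32_t index, uint32_t tag, int bits_per_tag,
                   uint32_t size) {
  int64_t incr = (static_cast<int64_t>(tag) << 1) + 1
                 - (int64_t{1} << (bits_per_tag - 1));
  int64_t other = (index & 1) ? index + incr : index - incr;
  return floor_mod(other, size);
}
```

Since `incr` is odd and `size` is even, `alt_index` is an involution: `alt_index(alt_index(i, t, b, s), t, b, s) == i`.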
@maysamyabandeh I'm still getting too much noise with db_bench, I believe because there seems to be significant randomness in how the resulting db from N insertions is laid out, based at least on levelstats. For example, since readrandom seems to query stored keys only (not clear from docs), 9-bit-per-element Bloom filters should have slightly faster readrandom times (and slower readmissing times due to FP rate) than 10-bit-per-element, but one of the 9-bit dbs I built had consistently faster readrandom time than the 10-bit using the same Bloom algorithm. Sampling over many database constructions, not just many reads, gets very expensive (for the resources I have handy - a Linux laptop).

We could control for the randomness of the db layout with a new db_bench option that rebuilds and replaces all the filters in the database. This would allow us to swap in a new filter implementation without building a new db from scratch, and the performance of this step could be a good simulation for the cost of building filters that contributes to the cost of building the db, without the other noise of db building. Perhaps we can't easily support dropping in replacement filters of different size, but replacement of the same size (bits / item) should be easy and sufficiently control the noise to compare filter algorithms head-to-head, roughly in-situ.
@pdillinger The randomness in the experiments could also imply that the impact of the change is not significant. But still, if the change is sensible we could apply it anyway. However, since this would be a format change, as @ot suggested, if we go with it we could as well revisit the Bloom filters with any pending ideas for improvement. There is another part of the code base that could benefit from this without a format change, and that would also be less affected by experimental error: memtable prefix blooms. The code is in util/dynamic_bloom.h. Can you take a look to see if the suggested changes are applicable there too, and whether the expected improvements show up in benchmarks?
FWIW, my above proposed cuckoo hashing scheme shows up in this paper published Aug 2018: https://www.vldb.org/pvldb/vol11/p1041-breslow.pdf |
…ebook#6007)

Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from:
* Using fastrange instead of modulo (%).
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64-bit hashes for adding to the filter rather than 32-bit hashes.

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing  # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
Average FP rate %: 0.954651

$ ./filter_bench -quick -impl=0 -net_includes_hashing  # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes, as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" queries always need num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue facebook#5857) comes from a better design for probing indices within a cache line (re: Issue facebook#4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):

Memory bits per key: FP rate percent old impl -> FP rate percent new impl
 6: 5.70953  -> 5.69888
 8: 2.45766  -> 2.29709
10: 1.13977  -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes facebook#5857
Fixes facebook#4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.

Pull Request resolved: facebook#6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749
Pulled By: pdillinger
fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
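The "fastrange instead of modulo" item above refers to the well-known multiply-shift reduction (popularized by Lemire) for mapping a 32-bit hash to [0, range) without a division. A minimal sketch of the technique, not the exact RocksDB code:

```cpp
#include <cstdint>

// Map a 32-bit hash to [0, range) without division: widen to 64 bits,
// multiply by the range, and keep the high 32 bits. For a uniformly
// distributed hash the result is (nearly) uniform over [0, range),
// like hash % range, but much cheaper than a division.
inline uint32_t fast_range32(uint32_t hash, uint32_t range) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(hash) * static_cast<uint64_t>(range)) >> 32);
}
```

A design note: unlike `%`, this reduction consumes the high-order bits of the hash, so it rewards hash functions with well-mixed upper bits.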
For the record, the code used to look like this (lines 296 to 306 at commit 662b8fe):
The cited Kirsch+Mitzenmacher result is just an asymptotic result, as m and n go to infinity. The code seems to use that result to sidestep verifying the behavior in practice, and the practical issues were significant. Not mentioned above, but noted in https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#full-filters-new-format, is that this implementation would hit an FP rate "wall" and couldn't really get below about 0.1% FP rate even with 100 bits/key. This caused problems for some FIFO compaction users. I blame the Kirsch+Mitzenmacher paper more than the implementor, as research should make clear its limitations, which were known at publication time. My prior work, cited by the Kirsch+Mitzenmacher paper, warns of potential complications of naively applying double hashing in practice, and offers solutions to those issues.
Someone found a similar flaw in Google Guava (https://github.com/google/guava/blob/d3c306e9e4220d80f111348a92f5cb56d8ad965c/guava/src/com/google/common/hash/BloomFilterStrategies.java#L41) and wrote a paper about it (and more): https://arxiv.org/pdf/2009.11789.pdf (Thanks for citing my Ph.D. dissertation!)
Expected behavior
I expected the Bloom filter implementations to have behavior reasonably close to the accuracy predicted by formulas.
Actual behavior
With more thorough testing, the Bloom filters used for summarizing keys in an SST file show compromised accuracy. Fixing some implementation flaws can result in faster and more accurate Bloom filters.
Introduction
From my investigation, RocksDB appears to have three Bloom filter implementations:
All of these derive the requisite Bloom filter indices (often 6 of them for ~1% FP @ 10 bits/key) from a single 32-bit hash value using variations on double hashing. The original leveldb implementation cites [Kirsch,Mitzenmacher 2006] ("Less Hashing, Same Performance: Building a Better Bloom Filter") as justification, but that paper is about an asymptotic result, not about real-world performance on Bloom filters that might be medium-sized or small. My publications "Fast and Accurate Bitstate Verification for SPIN" and "Bloom Filters in Probabilistic Verification" established the idea underlying the Kirsch/Mitzenmacher work, along with many practical issues, but my dissertation has the best information/advice for implementing schemes like this. http://peterd.org/pcd-diss.pdf
I investigated the first two implementations above because the third appears to be most careful in converting hash information into indices. Both IMPL1 (BloomFilterPolicy::CreateFilter/KeyMayMatch) and IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader) suffer from two issues I present below, with relatively simple solutions.
By the way, I believe the technique in IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader) to ensure CPU cache locality was first described in "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al. (PDF available online), in what they call "Blocked Bloom Filters" (which is different from what RocksDB calls "Block-based Bloom filter"). I recommend removing the basic mathematical analysis on the "RocksDB Bloom Filter" page of the wiki and just referencing this paper. Because the number of shards can be very high, the variance in the number of keys landing in each shard is key to understanding the accuracy. Your implementations show the dramatic speed improvement that is possible, but it should be noted that in a good implementation, the accuracy hit is not lost in the noise (as exhibited below in detailed results).
First Issue
The first issue is "Issue 1" in section 6.5.1 (printed page 94 / pdf page 116) of http://peterd.org/pcd-diss.pdf, which is a standard problem for double hashing in open-addressed hash tables. Specifically, certain values for 'delta', such as zero, cause only one or few unique indices to be probed repeatedly. In open-addressed hash tables this could lead to infinite loops, but in Bloom filters such cases have a more subtle effect on the false positive rate. Standard solutions ensure that 'delta' is non-zero and relatively prime to the modulus.
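A minimal sketch (not RocksDB's actual code) makes the degenerate case concrete: with plain double hashing, index_i = (h + i * delta) mod m, a zero delta collapses all k probes onto a single bit, so the affected keys behave as if the filter had only one probe:

```cpp
#include <cstdint>
#include <set>

// Plain double-hashing probe sequence: index_i = (h + i * delta) % m.
// A sketch, not RocksDB's actual code. When delta == 0, or delta shares
// a large factor with m, the k probes land on one (or few) distinct
// bits, inflating the false positive rate for those keys.
std::set<uint32_t> probe_indices(uint32_t h, uint32_t delta, uint32_t m,
                                 int k) {
  std::set<uint32_t> indices;
  for (int i = 0; i < k; ++i) {
    indices.insert(h % m);  // the set collects only *distinct* probes
    h += delta;
  }
  return indices;
}
```

With delta = 0 the set has size 1; with delta = m/2 it has size 2; with a delta relatively prime to m, all k probes are distinct.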
If IMPL1 (BloomFilterPolicy::CreateFilter/KeyMayMatch) is to be fixed (backward compatibility?), I recommend switching it to enhanced double hashing to dodge entirely the complications of ensuring relative primality. The change (call it ENH) is simple (applied in two places):
This carries a small hit in execution speed (~3% longer running the relevant Bloom filter unit test) but has substantial accuracy benefit: just 1.5% of tested Bloom filters having a "mediocre" FP rate (IMPL1+ENH) rather than 31% (IMPL1 baseline) (based on my enhancements to the unit tests - see below). (More results and justification of speed trade-offs later.)
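The exact diff is not reproduced above, but enhanced double hashing in its general form (as described in the publications cited in the introduction) adds a growing increment to the stride, realizing index_i = h1 + i*h2 + (i^3 - i)/6 incrementally. A sketch under that formulation, with names of my own choosing:

```cpp
#include <cstdint>
#include <vector>

// Enhanced double hashing, incremental form: after each probe, advance
// by the current stride, then grow the stride. This realizes
// index_i = h1 + i*h2 + (i^3 - i)/6 (mod m) and avoids the fixed-cycle
// degeneracies of plain double hashing (e.g. delta == 0) without
// requiring delta to be relatively prime to m.
std::vector<uint32_t> enh_probe_indices(uint32_t h, uint32_t delta,
                                        uint32_t m, int k) {
  std::vector<uint32_t> indices;
  for (uint32_t i = 0; i < static_cast<uint32_t>(k); ++i) {
    indices.push_back(h % m);
    h += delta;      // double hashing step
    delta += i + 1;  // "enhanced": stride grows each probe
  }
  return indices;
}
```

Even in the worst start (h = 0, delta = 0), the cubic term soon separates the probes, which is why no relative-primality check is needed.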
We can use the ENH change similarly for IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader), but the accuracy benefit is not as significant (1 mediocre FP case for IMPL2+ENH vs. 3 mediocre for IMPL2 baseline). But here we can get a similar accuracy boost with almost no runtime overhead, because the effective modulus (CPU cache line size) is a power of 2. Ensuring 'delta' is odd suffices to ensure unique indices. Call this change ODD (applied in Builder and Reader):
IMPL2+ODD yields 2 mediocre FP cases.
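The reason ODD suffices with a power-of-two modulus: an odd delta is invertible mod 2^b, so i*delta takes distinct values for up to 2^b steps. A sketch of the idea with hypothetical names (not the actual patch):

```cpp
#include <cstdint>
#include <set>

// With a power-of-two modulus (e.g. bit positions within a cache line),
// forcing delta odd makes it invertible mod 2^b, so the probes
// h, h + delta, h + 2*delta, ... are pairwise distinct for up to 2^b
// steps. Sketch only; names are hypothetical.
std::set<uint32_t> odd_probe_indices(uint32_t h, uint32_t delta, int k,
                                     uint32_t modulus_pow2) {
  delta |= 1;  // the ODD fix: ensure delta is odd
  std::set<uint32_t> indices;
  for (int i = 0; i < k; ++i) {
    indices.insert(h & (modulus_pow2 - 1));  // cheap mod for power of 2
    h += delta;
  }
  return indices;
}
```

Note this is nearly free at runtime (a single OR), unlike a relative-primality check against an arbitrary modulus.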
Second Issue
These implementations only work with 32 bits of hash information and if bits of that data are reused with virtually no mixing, subtle associations between starting index and delta can affect accuracy. This appears to be fixable by throwing in one or two XOR operations between the existing modulus operations. IMPL1+XOR (I didn't put any engineering into the specific value used here):
This carries a minimal speed hit (< 1%) but is itself sufficient to lower mediocre Bloom filters in the unit test from 31% (IMPL1 baseline) to 9% (IMPL1+XOR). Combined with the previous change we get no mediocre Bloom filters (0% for IMPL1+ENH+XOR).
IMPL2+XOR inserts two operations, because parts of the hash value are used three times:
Obviously we could get the best accuracy with more significant mixing and/or with (re-)distributing parts of the hash over subsequent indices (like IMPL3), but I'm trying to close the accuracy gap with minimal additional running time.
IMPL2+XOR carries a modest speed hit (~3%) with an accuracy benefit similar to IMPL2+ENH, but combining IMPL2+ODD+XOR works better, with no mediocre FP Bloom filters and the same modest speed hit (~3%). IMPL2+ENH+XOR is slightly more accurate (results below) with about a 5% speed hit vs. IMPL2 baseline.
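Since the exact diffs are not reproduced above, here is a hypothetical illustration of the XOR idea: perturb the 32-bit hash between its two uses, so the starting index and the delta are no longer derived from identically reused bits. The constant and placement here are illustrative only (the proposal itself notes no engineering went into the specific value):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical illustration of the XOR mixing idea, not the proposal's
// exact diff: XOR in a constant between the two uses of the hash so
// that the start index and the probe stride are decorrelated.
// Returns {start_index, delta}.
std::pair<uint32_t, uint32_t> split_hash_xor(uint32_t h, uint32_t num_bits) {
  uint32_t start_index = h % num_bits;  // first use of the hash
  h ^= 0x9E3779B9u;                     // cheap mixing between the two uses
  uint32_t delta = ((h >> 17) | (h << 15)) % num_bits;  // rotate-then-mod
  return {start_index, delta};
}
```

The rotate mirrors the style of the existing implementations; the single XOR is the only added work, consistent with the < 1% speed hit reported above.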
Accuracy vs. speed
You might say "ok, an improvement in accuracy is nice, but not if it's slower in unit tests." First of all, the purpose of these Bloom filters is to skip costly disk reads, so depending on the specific deployment (I don't have a typical deployment handy, just an SSD laptop) a small improvement in Bloom filter accuracy could pay for a small regression in speed.
But even more generally, Bloom filters have a built-in way of trading accuracy for speed: the number of probes (bits set) per key. Manipulating this parameter might enable finding a configuration incorporating a new change that is at least as good in all dimensions (speed, accuracy, memory use) as the old configuration and better in at least one dimension.
So I tested the new configurations, IMPL1+ENH+XOR, IMPL2+ODD+XOR, and IMPL2+ENH+XOR, with 5 probes per key instead of 6, keeping the same memory setting. All three modified implementations beat the baseline in accuracy with the same or better speed. (Details below.) Thus, the accuracy boosts are worth the runtime overhead, as you can negate the runtime overhead, with net accuracy benefit remaining, by setting fewer indices, at least in the common 5 vs. 6 case.
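The trade-off can be checked against the standard formula p = (1 - e^(-k/b))^k for k probes at b bits per key: at 10 bits/key, k = 6 gives roughly 0.84% and k = 5 roughly 0.94%, so the theoretical cost of dropping one probe is small:

```cpp
#include <cmath>

// Theoretical Bloom filter FP rate for k probes at b bits per key,
// from the standard formula p = (1 - e^(-k/b))^k (assuming uniform
// random hashing, large filter).
double bloom_fp_rate(int k, double bits_per_key) {
  return std::pow(1.0 - std::exp(-k / bits_per_key), k);
}
```

So if a fix costs a few percent in speed but closes a larger accuracy gap, dropping from 6 to 5 probes can recover the speed while staying ahead of the flawed baseline in accuracy, which is what the results above show.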
Detailed results
Working from commit 926f3a7, I extended bloom_test.cc to do more thorough testing. In short, Bloom filters are tested up to 10M keys, with more independence between testing of each filter, and more statistics gathered and printed.
The existing IMPL1 fares poorly with this more thorough testing, and IMPL2 does not fare especially well either. (For some tests to finish, I also had to disable the limit of 2% FP rate.)
Running times are shortest of 2-3 runs for VaryingLength unit test. Mediocre filters are any with FP rate over 1.25%, out of 64 differently-sized Bloom filters. Average FP rate excludes lengths < 1000 because they sometimes have much more than 10 bits / key and are not strong samples.
The theoretical entry comes from standard FP rate formulas for Bloom filters, assuming uniform random hash functions. The same theoretical ideal FP applies to IMPL2 except that the sharding for CPU cache locality affects it somewhat. I don't have a simple formula handy to compute a theoretical prediction with that (assuming quality hashing and index computation).
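One rough way to get such a prediction (my own approximation, not a formula from the sources above): with uniform sharding, the number of keys landing in a cache line is approximately Poisson-distributed with mean n/s, and the overall FP rate is the Poisson-weighted average of per-line FP rates. Because the per-line FP rate is convex in the number of keys near typical loads, this average exceeds the unsharded ideal:

```cpp
#include <cmath>

// Approximate FP rate of a blocked (cache-line-sharded) Bloom filter:
// keys per line ~ Poisson(lambda) with lambda = n / s, so the overall
// FP rate is the Poisson-weighted average of the per-line rates.
// An approximation sketch, not an exact formula.
double blocked_bloom_fp(double lambda, int line_bits, int k) {
  double total = 0.0;
  double pmf = std::exp(-lambda);  // P[K = 0]
  // Sum until the Poisson tail is negligible.
  for (int keys = 0; keys < 8 * static_cast<int>(lambda) + 64; ++keys) {
    double per_line_fp = std::pow(
        1.0 - std::exp(-static_cast<double>(k) * keys / line_bits), k);
    total += pmf * per_line_fp;
    pmf *= lambda / (keys + 1);  // advance pmf to P[K = keys + 1]
  }
  return total;
}
```

For example, 512-bit lines at an average of 10 bits/key (lambda = 51.2) with k = 6 comes out to roughly 1%, modestly above the ~0.84% unsharded ideal, matching the observation that sharding "affects it somewhat."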
Tests were run on a Debian 8 laptop with an Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz and 8 GB RAM.