Bloom filters using a flawed double hashing scheme #4120

Closed
pdillinger opened this issue Jul 12, 2018 · 12 comments

@pdillinger (Contributor)

Expected behavior

I expected the Bloom filter implementations to have behavior reasonably close to the accuracy predicted by formulas.

Actual behavior

With more thorough testing, the Bloom filters used for summarizing keys in an SST file show compromised accuracy. Fixing some implementation flaws can result in faster and more accurate Bloom filters.

Introduction

From my investigation, RocksDB appears to have three Bloom filter implementations:

  • IMPL1: BloomFilterPolicy::CreateFilter/KeyMayMatch (essentially unchanged from leveldb, for backward compatibility? still used?)
  • IMPL2: FullFilterBitsBuilder/FullFilterBitsReader (modified from that, mostly for CPU cache locality)
  • IMPL3: DynamicBloom (for in-memory only? lacking comments on purpose of yet another implementation)

All of these derive the requisite Bloom filter indices (often 6 of them for ~1% FP @ 10 bits/key) from a single 32-bit hash value using variations on double hashing. The original leveldb implementation cites [Kirsch,Mitzenmacher 2006] ("Less Hashing, Same Performance: Building a Better Bloom Filter") as justification, but that paper is about an asymptotic result, not about real-world performance on Bloom filters that might be medium-sized or small. My publications "Fast and Accurate Bitstate Verification for SPIN" and "Bloom Filters in Probabilistic Verification" established the idea underlying the Kirsch/Mitzenmacher work, along with many practical issues, but my dissertation has the best information/advice for implementing schemes like this. http://peterd.org/pcd-diss.pdf
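
For concreteness, here is the general scheme in generic notation (not the exact variable names in the RocksDB code), with $h_1$ the starting hash value and $h_2$ the rotated "delta" used in the code excerpts below. Standard double hashing derives the $i$-th index into a filter of $m$ bits as

$$ g_i(x) = \left(h_1(x) + i \cdot h_2(x)\right) \bmod m, \qquad i = 0, 1, \ldots, k-1, $$

while the "enhanced" variant recommended below effectively probes

$$ g_i(x) = \left(h_1(x) + i \cdot h_2(x) + \frac{i^3 - i}{6}\right) \bmod m, $$

so the probe sequence is no longer a pure arithmetic progression.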

I investigated the first two implementations above because the third appears to be most careful in converting hash information into indices. Both IMPL1 (BloomFilterPolicy::CreateFilter/KeyMayMatch) and IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader) suffer from two issues I present below, with relatively simple solutions.

By the way, I believe the technique in IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader) to ensure CPU cache locality was first described in "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al. (PDF available online), in what they call "Blocked Bloom Filters" (which is different from what RocksDB calls "Block-based Bloom filter"). I recommend removing the basic mathematical analysis on the "RocksDB Bloom Filter" page of the wiki and just referencing this paper. Because the number of shards can be very high, the variance in the number of keys landing in each shard is key to understanding the accuracy. Your implementations show the dramatic speed improvement that is possible, but it should be noted that even in a good implementation, the accuracy hit is not lost in noise (as exhibited below in the detailed results).

First Issue

The first issue is "Issue 1" in section 6.5.1 (printed page 94 / pdf page 116) of http://peterd.org/pcd-diss.pdf, which is a standard problem for double hashing in open-addressed hash tables. Specifically, certain values for 'delta', such as zero, cause only one or a few unique indices to be probed repeatedly. In open-addressed hash tables this could lead to infinite loops, but in Bloom filters such cases have a more subtle effect on the false positive rate. Standard solutions ensure that 'delta' is non-zero and relatively prime to the modulus.
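
To make the failure mode concrete: the probe sequence is the arithmetic progression $h, h+\delta, h+2\delta, \ldots \pmod{m}$, which can ever visit at most

$$ \frac{m}{\gcd(\delta, m)} $$

distinct indices. So $\delta = 0$ pins every probe to a single index, and any large common factor between $\delta$ and $m$ silently reduces the effective number of probes, inflating the FP rate for the affected keys.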

If IMPL1 (BloomFilterPolicy::CreateFilter/KeyMayMatch) is to be fixed (backward compatibility?), I recommend switching it to enhanced double hashing to dodge entirely the complications of ensuring relative primality. The change (call it ENH) is simple (applied in two places):

-      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+      uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
       for (size_t j = 0; j < num_probes_; j++) {
+        delta += j; // Enhanced double hashing
         const uint32_t bitpos = h % bits;
         array[bitpos/8] |= (1 << (bitpos % 8));
         h += delta;
       }

This carries a small hit in execution speed (~3% longer running the relevant Bloom filter unit test) but has substantial accuracy benefit: just 1.5% of tested Bloom filters having a "mediocre" FP rate (IMPL1+ENH) rather than 31% (IMPL1 baseline) (based on my enhancements to the unit tests - see below). (More results and justification of speed trade-offs later.)

We can use the ENH change similarly for IMPL2 (FullFilterBitsBuilder/FullFilterBitsReader), but the accuracy benefit is not as significant (1 mediocre FP case for IMPL2+ENH vs. 3 mediocre for IMPL2 baseline). Here, though, we can get a similar accuracy boost with almost no runtime overhead, because the effective modulus (the CPU cache line size in bits) is a power of 2: ensuring 'delta' is odd suffices to ensure unique indices. Call this change ODD (applied in Builder and Reader):

-  const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+  const uint32_t delta = (h >> 17) | (h << 15) | 1;  // Rotate right 17 bits, ensure odd

IMPL2+ODD yields 2 mediocre FP cases.

Second Issue

These implementations work with only 32 bits of hash information, and if bits of that data are reused with virtually no mixing, subtle associations between the starting index and delta can affect accuracy. This appears to be fixable by throwing in one or two XOR operations between the existing modulus operations. IMPL1+XOR (I didn't put any engineering into the specific value used here):

       uint32_t h = hash_func_(keys[i]);
       const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+      h ^= 0x6740bca3; // random mixing
       for (size_t j = 0; j < num_probes_; j++) {
         const uint32_t bitpos = h % bits;

This carries a minimal speed hit (< 1%) but is itself sufficient to lower mediocre Bloom filters in the unit test from 31% (IMPL1 baseline) to 9% (IMPL1+XOR). Combined with the previous change we get no mediocre Bloom filters (0% for IMPL1+ENH+XOR).
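
For clarity, here is a self-contained sketch of what the IMPL1 add path looks like with both changes combined (ENH + XOR), assembled from the two diffs above. It is illustrative only, not the exact RocksDB code; the function name and parameters are mine.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of IMPL1-style filter building with ENH + XOR applied.
// 'array' is the filter bit array of 'bits' bits; 'h' is the 32-bit key hash.
inline void AddHashEnhXor(uint32_t h, char* array, uint32_t bits,
                          size_t num_probes) {
  uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
  h ^= 0x6740bca3;                         // random mixing (Second Issue)
  for (size_t j = 0; j < num_probes; j++) {
    delta += static_cast<uint32_t>(j);     // Enhanced double hashing (First Issue)
    const uint32_t bitpos = h % bits;
    array[bitpos / 8] |= (1 << (bitpos % 8));
    h += delta;
  }
}
```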

IMPL2+XOR inserts two operations, because parts of the hash value are used three times:

   uint32_t h = hash;
   const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
+  h ^= 0x6740bca3; // random mixing
   uint32_t b = (h % num_lines) * (cache_line_size * 8);
+  h ^= 0x41fc0926; // random mixing
...
     const uint32_t bitpos = b + (h % (cache_line_size * 8));

Obviously we could get the best accuracy with more significant mixing and/or with (re-)distributing parts of the hash over subsequent indices (like IMPL3), but I'm trying to close the accuracy gap with minimal additional running time.

IMPL2+XOR carries a modest speed hit (~3%) with an accuracy benefit similar to IMPL2+ENH, but combining IMPL2+ODD+XOR works better, with no mediocre FP Bloom filters and the same modest speed hit (~3%). IMPL2+ENH+XOR is slightly more accurate (results below) with about a 5% speed hit vs. IMPL2 baseline.
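
Similarly, a sketch of the cache-local (IMPL2-style) index computation with ODD + XOR applied. Again this is illustrative, not the exact FullFilterBitsBuilder code; the elided loop from the excerpt above is reconstructed from the same pattern, and it assumes a power-of-two cache_line_size so that an odd delta guarantees distinct in-line positions.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of IMPL2-style (cache-local) probing with ODD + XOR applied.
// All probes for one key land in a single cache line of the filter.
inline void AddHashCacheLocal(uint32_t hash, char* data, uint32_t num_lines,
                              uint32_t cache_line_size, size_t num_probes) {
  uint32_t h = hash;
  const uint32_t delta = ((h >> 17) | (h << 15)) | 1;  // Rotate right 17 bits, ensure odd
  h ^= 0x6740bca3;                                     // random mixing
  const uint32_t b = (h % num_lines) * (cache_line_size * 8);  // start of chosen line, in bits
  h ^= 0x41fc0926;                                     // random mixing
  for (size_t i = 0; i < num_probes; ++i) {
    const uint32_t bitpos = b + (h % (cache_line_size * 8));
    data[bitpos / 8] |= (1 << (bitpos % 8));
    h += delta;
  }
}
```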

Accuracy vs. speed

You might say "ok, an improvement in accuracy is nice, but not if it's slower in unit tests." First of all, the purpose of these Bloom filters is to skip costly disk reads, so depending on the specific deployment (I don't have a typical deployment handy, just an SSD laptop) a small improvement in Bloom filter accuracy could pay for a small regression in speed.

But even more generally, Bloom filters have a built-in way of trading accuracy for speed: the number of probes (bits set) per key. Manipulating this parameter might enable finding a configuration incorporating a new change that is at least as good in all dimensions (speed, accuracy, memory use) as the old configuration, and better in at least one.

So I tested the new configurations IMPL1+ENH+XOR, IMPL2+ODD+XOR, and IMPL2+ENH+XOR with 5 probes per key instead of 6, keeping the same memory setting (the "@5" rows below). All three modified implementations beat the baseline in accuracy with the same or better speed. (Details below.) Thus, the accuracy boosts are worth the runtime overhead: you can negate the overhead, while keeping a net accuracy benefit, by setting fewer bits per key, at least in the common 5 vs. 6 case.

Detailed results

Working from commit 926f3a7, I extended bloom_test.cc to do more thorough testing. In short, Bloom filters are tested up to 10M keys, with more independence between testing of each filter, and more statistics gathered and printed.

The existing IMPL1 fares poorly with this more thorough testing, and IMPL2 does not fare especially well. (For some tests to finish, I also had to disable the 2% FP rate limit.)

Running times are the shortest of 2-3 runs of the VaryingLength unit test. Mediocre filters are any with an FP rate over 1.25%, out of 64 differently-sized Bloom filters. The average FP rate excludes lengths < 1000 because they sometimes have much more than 10 bits/key and are not strong samples.

Configuration   | Time (% vs base) | Mediocre | Avg FP rate | Max FP rate
----------------|------------------|----------|-------------|------------
IMPL1 base      | 32283 ms         | 20       | 1.16%       | 1.97%
IMPL1+ENH       | 33155 ms (+2.7%) | 1        | 1.06%       | 1.35%
IMPL1+XOR       | 32459 ms (+0.5%) | 6        | 0.85%       | 2.06%
IMPL1+ENH+XOR   | 33239 ms (+2.9%) | 0        | 0.86%       | 1.00%
IMPL1+ENH+XOR@5 | 30256 ms (-6.3%) | 0        | 0.96%       | 1.08%
Theoretical     |                  |          | 0.84%       |

Configuration   | Time (% vs base) | Mediocre | Avg FP rate | Max FP rate
----------------|------------------|----------|-------------|------------
IMPL2 base      | 16542 ms         | 3        | 1.13%       | 1.51%
IMPL2+ODD       | 16642 ms (+0.6%) | 2        | 1.07%       | 1.42%
IMPL2+ENH       | 17268 ms (+4.3%) | 1        | 0.96%       | 1.27%
IMPL2+XOR       | 16996 ms (+2.7%) | 2        | 1.14%       | 1.26%
IMPL2+ODD+XOR   | 16830 ms (+1.7%) | 0        | 1.06%       | 1.24%
IMPL2+ENH+XOR   | 17445 ms (+5.4%) | 0        | 0.95%       | 1.05%
IMPL2+ODD+XOR@5 | 16386 ms (-0.9%) | 1        | 1.10%       | 1.28%
IMPL2+ENH+XOR@5 | 16522 ms (-0.1%) | 0        | 1.03%       | 1.17%

The theoretical entry comes from standard FP rate formulas for Bloom filters, assuming uniform random hash functions. The same theoretical ideal FP applies to IMPL2 except that the sharding for CPU cache locality affects it somewhat. I don't have a simple formula handy to compute a theoretical prediction with that (assuming quality hashing and index computation).
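
For reference, the standard formula behind the "Theoretical" row, for a filter of $m$ bits holding $n$ keys with $k$ probes per key, is

$$ p \approx \left(1 - e^{-kn/m}\right)^{k}. $$

At 10 bits/key this evaluates to roughly $(1 - e^{-0.6})^6 \approx 0.84\%$ for 6 probes and $(1 - e^{-0.5})^5 \approx 0.94\%$ for 5 probes, which is why the @5 configurations can give back a little theoretical accuracy in exchange for speed and still beat the baselines.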

Tests were run on a Debian 8 laptop with an Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz and 8 GB RAM.

@pdillinger (Contributor Author)

See prototyping+testing in https://github.com/pdillinger/rocksdb/tree/bloom-filters, commit 90d1ab9

@maysamyabandeh (Contributor)

Thanks @pdillinger for the well-researched, well-written proposal. Benchmarks would have the final word about any such change. The simplest benchmark that we have is db_bench. Do you think you can run db_bench readrandom with and without the change and see how big the difference is?

@ot (Contributor) commented Jul 14, 2018

Awesome analysis! I just want to point out that if we change the Bloom filters format we may also take this opportunity to re-evaluate the choice of the hash function, which is quite dated.

@pdillinger (Contributor Author)

@maysamyabandeh My experience with db_bench so far is that it's slow to get results, and there's so much variance in the results, in testing up to num = 20M (still just hitting disk cache for reads), that I don't have faith that I'm going to get meaningful data from some big runs. Why so much variance? Up to 33% change in ops/sec from one run to the next. The level stats are even different. I would have expected repeatable behavior, given that the --seed option suggests deterministic seeding of the random generators.

And I obviously can't just readrandom on the same existing db across various changes/settings, because they affect its construction/formatting in incompatible ways.

./db_bench --bloom_bits 10 --num 20000000 --benchmarks fillrandom,readrandom,readmissing,stats,levelstats,sstables

@ot I don't know how common it is for keys to already be uniformly distributed, like UUIDs, but perhaps it's worth an option to essentially skip hashing on the assumption that keys are already uniformly distributed--or that a sufficient portion of the key is uniformly distributed. You could imagine sharding based on some portion of a UUID making that portion predictable, but XORing together the whole key should still be super fast and take care of that case. You just need at least 32 contiguous, uniformly-distributed bits somewhere in the key. Hypothetically, you could even fall back on real hashing if the XOR approach is found not to be accurate enough, perhaps by tracking how many keys added to the structure did not change it (i.e. would have been an FP query at time added).
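
A minimal sketch of what I mean by "XORing together the whole key" (hypothetical, not an existing RocksDB option; it assumes rocksdb::Slice as the key type and a made-up function name): fold the key into 32 bits and use that directly as the "hash", which is only sound if at least 32 contiguous bits somewhere in the key are uniformly distributed.

```cpp
#include <cstdint>
#include <cstring>

#include "rocksdb/slice.h"  // assuming rocksdb::Slice as the key type

// Hypothetical hash bypass: XOR-fold all bytes of the key into 32 bits.
// Only sensible when keys (e.g. UUIDs) already contain >= 32 uniformly
// distributed contiguous bits; otherwise fall back on real hashing.
inline uint32_t XorFold32(const rocksdb::Slice& key) {
  uint32_t acc = 0;
  size_t i = 0;
  // Consume aligned 4-byte chunks.
  for (; i + 4 <= key.size(); i += 4) {
    uint32_t word;
    std::memcpy(&word, key.data() + i, 4);
    acc ^= word;
  }
  // Fold in any trailing bytes.
  for (; i < key.size(); ++i) {
    acc ^= static_cast<uint32_t>(static_cast<unsigned char>(key.data()[i]))
           << (8 * (i % 4));
  }
  return acc;
}
```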

In related news (possibly to put Bloom filter fixing/enhancement in context), I've baked up a Bloom filter alternative discussed in my dissertation ("static compacted table" or SCT, Section 7.2.3, http://peterd.org/pcd-diss.pdf) that is remarkably compact and accurate when the set of keys is known ahead of time. Despite a pre-pass on the keys to put them in buckets, I've got it screaming (cache locality!), squeaking ahead-ish in speed and accuracy even when limiting it to 9 bits/key compared to 10 in the Bloom filters (implementation not yet shared - ugly at the moment):

Configuration | Bits/key | Time (% vs base) | Mediocre | Avg FP rate | Max FP rate
--------------|----------|------------------|----------|-------------|------------
IMPL2 base    | 10       | 16542 ms         | 3        | 1.13%       | 1.51%
IMPL2+ENH+XOR | 10       | 17445 ms (+5.4%) | 0        | 0.95%       | 1.05%
IMPL2 SCT     | 9        | 16145 ms (-2.4%) | 2        | 0.87%       | 1.60%

My recent literature search suggests that the Cuckoo filter is state-of-the-art for compact+accurate+fast-ish, especially if deletion is needed. However, SCT beats the Cuckoo filter soundly in cases where (a) the key set is known ahead of time, (b) deletion is not needed, and (c) no worst-case bounds on look-up time are needed; only the average case (with no adversary) matters. I unfortunately never got the cycles to publish this structure outside my dissertation.

Extending the code used in the cuckoo filter paper for Table 3, I get these results:

metrics                          | CF     | ss-CF  | SCT
---------------------------------|--------|--------|-------
# of items (million)             | 127.78 | 127.78 | 127.78
bits per item                    | 12.60  | 12.60  | 12.60
false positive rate              | 0.18%  | 0.10%  | 0.07%
constr. speed (million keys/sec) | 5.88   | 3.86   | 7.20

Note: Unlike Table 3 in the paper, here I have fixed the # of items added, to best match our usage paradigm, and because SCT can add elements until 100% load factor but is best below 100%, e.g. around 92% for 8-bit entries (8.69 bits / key).

@maysamyabandeh (Contributor)

@pdillinger you can use readrandom[X10] to run the benchmark many times, discard the first run, and average over the next runs. "-duration 60" could also be used to control the duration of the benchmark. Also make sure block_cache (default 8MB) is proportional to the DB size.
If a significant benefit does not show up in reading from SST files, we might still be able to benefit from the improvement in memtable bloom filters, which would not require a format change and thus would be easier to adopt.

@ot (Contributor) commented Jul 16, 2018

I don't know how common it is for keys to already be uniformly distributed

Very rare for most of the use cases I'm aware of; in fact, it is quite common to encode a hierarchy (for example table name, column name) or other kind of structure into the key. I don't think the cost of hashing is really an issue here: hashing can be amortized over the multiple lookups (at least one per level), and I assume memory lookups account for the majority of the cost.

My recent literature search suggests that the Cuckoo filter is state-of-the-art for compact+accurate+fast-ish, especially if deletion is needed. However, SCT beats the Cuckoo filter soundly in cases where (a) the key set is known ahead, (b) deletion is not needed, and (c) no worst-case bounds on look-up time are needed, just average case (with no adversary) matters. I unfortunately never got the cycles to publish this structure outside my dissertation.

This is very nice, I'll look into it. I have done some work on perfect hashing, which can also be used to represent approximate sets with provable false positive guarantees, I wonder how it compares with your method. BTW, vanilla Cuckoo filters require power-of-2 table size (because of the XOR trick to find the other slot) but it's possible to modify the method to work with any table size, have you considered that?

@pdillinger (Contributor Author)

(At risk of going a bit off topic)

but it's possible to modify the method to work with any table size, have you considered that?

That seems quite doable, for example the following ("tag", also called "fingerprint", is the value stored in a cell; let mod() be a true mathematical modulus, fixing the undesired behavior of % on negative operands):

int incr = (tag << 1) + 1 - (1 << (bits_per_tag - 1)); // small, odd, and (positive or negative)
other_index = mod(index & 1 ? index + incr : index - incr, size);

where the only thing we assume about size is that it's even. This effectively partitions the cells between evens and odds, but that should have no noticeable impact (other than trivially ensuring the alternate index is not the same as the current one).
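
Spelled out as a self-contained sketch (the mod() helper and the names are mine, not from the cuckoo filter reference code; incr is odd whenever bits_per_tag >= 2, so moving between parities always changes the index, and applying the function twice returns the original index):

```cpp
#include <cstdint>

// True mathematical modulus: result is always in [0, m).
inline int64_t Mod(int64_t x, int64_t m) {
  int64_t r = x % m;
  return r < 0 ? r + m : r;
}

// Alternate-index scheme that works for any even table size:
// 'incr' is small, odd, and may be negative; we add it to odd indices and
// subtract it from even ones, so the two indices of a pair have opposite
// parity and the alternate index always differs from the current one.
inline uint64_t AltIndex(uint64_t index, uint32_t tag, uint32_t bits_per_tag,
                         uint64_t size /* even number of buckets */) {
  const int64_t incr = (static_cast<int64_t>(tag) << 1) + 1 -
                       (int64_t{1} << (bits_per_tag - 1));
  const int64_t signed_index = static_cast<int64_t>(index);
  return static_cast<uint64_t>(
      Mod((index & 1) ? signed_index + incr : signed_index - incr,
          static_cast<int64_t>(size)));
}
```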

In fact, my test results seriously call into question this suspicious bit from the cuckoo filter paper:

Hashing the fingerprints ensures that these items can
be relocated to buckets in an entirely different part of the hash
table, hence reducing hash collisions and improving the table
utilization.

The first suspicious thing is that, since h_2 is a function of h_1 and the tag, if h_1(x) = h_1(y) and h_2(x) = h_2(y), then y should have the same tag as x, at least with high probability, and "merge" with x (adding y after x does not change the structure). Therefore, we don't have to worry about first-order clustering, where h_1(x) = h_1(y) and h_2(x) = h_2(y) but tag(x) != tag(y). We only have to worry about higher-order clustering, such as h_2(x) = h_1(y), h_2(y) = h_1(z), h_2(z) = h_1(x). This is certainly possible with the XOR scheme if tag(x) XOR tag(y) = tag(z). But such second-order clustering is not possible with my even-odd scheme; you have to get to third-order (four items) to get clustering effects, and this should be negligible. Also note that using the tag literally, rather than a hash, excludes first-order clustering entirely (rather than with high probability), and improves locality. :-D
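
To spell out the second-order condition (under the standard assumption that the XOR scheme computes the alternate bucket as $alt(i, t) = i \oplus H(t)$ for a tag-hash $H$): XORing the three chained equations $h_2(x) = h_1(y)$, $h_2(y) = h_1(z)$, $h_2(z) = h_1(x)$ gives

$$ H(\mathrm{tag}(x)) \oplus H(\mathrm{tag}(y)) \oplus H(\mathrm{tag}(z)) = 0, $$

which is the more precise form of the tag(x) XOR tag(y) = tag(z) condition above. In the even-odd scheme, each hop flips the parity of the index, so three hops cannot return to the starting parity, and the second-order case is impossible by construction.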

To test this idea, we need to measure when the cuckoo filter rejects an add, because this is how clustering would affect the structure (not the FP rate for a fixed number of additions). We do that here, and let the SCT get to 98% load factor (higher than normal, but still competitive!). Comparing the reference implementation of the cuckoo filter (*CF) to ones using my new scheme for computing the alternate index (*CF-alt), and SCT (from earlier comments):

metrics                          | CF     | ss-CF  | SCT    | CF-alt | ss-CF-alt
---------------------------------|--------|--------|--------|--------|----------
# of items (million)             | 127.76 | 127.68 | 131.61 | 128.01 | 127.87
bits per item                    | 12.61  | 12.61  | 12.24  | 12.58  | 12.60
false positive rate              | 0.187% | 0.095% | 0.097% | 0.183% | 0.090%
constr. speed (million keys/sec) | 5.33   | 3.48   | 5.40   | 5.48   | 3.51

Pay most attention to # items added, though the difference in the -alt versions is likely sampling error (based on inspecting more runs). FP rate and speed of SCT don't look as good now because it's adding more elements than CF--look at bits per item.

And if we fix the number of items added, CF-alt and SCT show their strengths:

metrics                          | CF     | ss-CF  | SCT    | CF-alt | ss-CF-alt
---------------------------------|--------|--------|--------|--------|----------
# of items (million)             | 127.78 | 127.78 | 127.78 | 127.78 | 127.78
bits per item                    | 12.60  | 12.60  | 12.60  | 12.60  | 12.60
false positive rate              | 0.179% | 0.094% | 0.077% | 0.191% | 0.094%
constr. speed (million keys/sec) | 5.36   | 3.45   | 6.28   | 5.83   | 3.54

I'm still using power-of-two sized structures here, but there's no reason to believe they'll behave any differently with arbitrary (even) sizes when using my scheme.

@pdillinger (Contributor Author)

@maysamyabandeh I'm still getting too much noise with db_bench, I believe because there seems to be significant randomness in how the resulting db from N insertions is laid out, based at least on levelstats. For example, since readrandom seems to query stored keys only (not clear from docs), 9-bit-per-element Bloom filters should have slightly faster readrandom times (and slower readmissing times due to FP rate) than 10-bit-per-element, but one of the 9-bit dbs I built had consistently faster readrandom time than the 10-bit using the same Bloom algorithm. Sampling over many database constructions, not just many reads, gets very expensive (for the resources I have handy - a Linux laptop).

We could control for the randomness of the db layout with a new db_bench option that rebuilds and replaces all the filters in the database. This would allow us to swap in a new filter implementation without building a new db from scratch, and the performance of this step could be a good simulation for the cost of building filters that contributes to the cost of building the db, without the other noise of db building. Perhaps we can't easily support dropping in replacement filters of different size, but replacement of the same size (bits / item) should be easy and sufficiently control the noise to compare filter algorithms head-to-head, roughly in-situ.

@maysamyabandeh (Contributor)

@pdillinger The randomness in the experiments could also imply that the impact of the change is not significant. But if the change is sensible we could still apply it anyway. However, since this would be a format change, then as @ot suggested, if we go with it we might as well revisit the bloom filters with any other pending ideas for improvement.

There is another part of the code base that could benefit from this without a format change, and that would also be less affected by experiment errors: memtable prefix blooms. The code is in util/dynamic_bloom.h. Can you take a look to see if the suggested changes are applicable there too, and whether the expected improvement shows up in benchmarks?

@pdillinger (Contributor Author)

FWIW, my above proposed cuckoo hashing scheme shows up in this paper published Aug 2018: https://www.vldb.org/pvldb/vol11/p1041-breslow.pdf

pdillinger self-assigned this Sep 27, 2019
merryChris pushed a commit to merryChris/rocksdb that referenced this issue Nov 18, 2019
…ebook#6007)

Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%) (see the sketch after this list)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
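
For context, "fastrange" maps a 32-bit hash to [0, range) with a multiply and a shift instead of a division. This is a minimal sketch of the general technique, not the exact RocksDB helper:

```cpp
#include <cstdint>

// Map a 32-bit hash to [0, range) without the cost of a modulo.
// Requires the hash to be roughly uniform over the full 32-bit space,
// which is why this pairs with a strong hash function such as XXH3.
inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(hash) * static_cast<uint64_t>(range)) >> 32);
}
```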

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
    Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
    Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue facebook#5857) comes from a better design for probing indices within a cache line (re: Issue facebook#4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes facebook#5857
Fixes facebook#4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: facebook#6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
@pdillinger (Contributor Author)

For the record, the code used to look like this:

rocksdb/util/bloom.cc

Lines 296 to 306 in 662b8fe

  for (size_t i = 0; i < (size_t)n; i++) {
    // Use double-hashing to generate a sequence of hash values.
    // See analysis in [Kirsch,Mitzenmacher 2006].
    uint32_t h = hash_func_(keys[i]);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < num_probes_; j++) {
      const uint32_t bitpos = h % bits;
      array[bitpos/8] |= (1 << (bitpos % 8));
      h += delta;
    }
  }

The cited Kirsch+Mitzenmacher result is just an asymptotic result, as m and n go to infinity. This code seems to use the result as license to avoid actually verifying the practical limitations of the approach, and the practical issues were significant. Not mentioned above, but noted in https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#full-filters-new-format, is that this implementation would hit an FP rate "wall" and couldn't really get below about a 0.1% FP rate even with 100 bits/key. This caused problems for some FIFO compaction users.

I blame the Kirsch+Mitzenmacher paper more than the implementor, as the research should make clear its limitations, which were known at publication time. My prior work cited by the Kirsch+Mitzenmacher paper warns of potential complications of naively applying double hashing in practice, and offers solutions to those issues.

@pdillinger (Contributor Author)

Someone found a similar flaw in Google code (https://github.com/google/guava/blob/d3c306e9e4220d80f111348a92f5cb56d8ad965c/guava/src/com/google/common/hash/BloomFilterStrategies.java#L41) and wrote a paper about it (and more), https://arxiv.org/pdf/2009.11789.pdf (Thanks for citing my Ph.D. dissertation!)
