New bloom filter #243

Merged 1 commit into boyter:master on Mar 22, 2021
Conversation

dbaggerman (Collaborator)

Looking at #241 made me wonder how the extra tokens would affect our crude bloom filtering technique.

It turns out it only takes a handful of ASCII characters before the old bloom filter becomes 01111111 and matches everything. This implements a better bloom filter, which drastically reduces collisions and makes the filter far more effective.

Since the new hashing takes a little more computation, using a lookup table is another modest improvement over recalculating the hash for every character processed.

Calculating hash values isn't an area of mathematics that I'm terribly familiar with, so there may be a better formula. But this seems to be a big improvement over the current implementation at least.
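For readers skimming the thread, here is a minimal, self-contained sketch of the lookup-table approach in Go. The hash is an assumed reconstruction pieced together from the "prime number less than 256" comment and the fragment quoted in the review below; the names bloomHash and bloomTable, and the token set, are mine, not necessarily what the PR itself uses.

```go
package main

import "fmt"

// bloomHash maps a byte to a 64-bit mask with up to three bits set.
// Assumed form: multiply by a prime below 256, then take three 6-bit
// slices of the result as bit positions (matching the fragment quoted
// in the review below).
func bloomHash(b byte) uint64 {
	k := uint64(b) * 251 // 251 is prime and less than 256
	k1 := k & 0x3f
	k2 := k >> 1 & 0x3f
	k3 := k >> 2 & 0x3f
	return (1 << k1) | (1 << k2) | (1 << k3)
}

// Precomputing one mask per possible byte value turns the per-character
// hash in the hot loop into a single array lookup.
var bloomTable [256]uint64

func init() {
	for i := range bloomTable {
		bloomTable[i] = bloomHash(byte(i))
	}
}

func main() {
	// Build a filter from the bytes we want to detect, e.g. comment and
	// string delimiters for a C-like language (illustrative set).
	var filter uint64
	for _, b := range []byte(`/*"`) {
		filter |= bloomTable[b]
	}

	// A byte can only be a token start if every bit of its mask is set
	// in the filter; anything else can be skipped without further checks.
	for _, c := range []byte{'/', 'a', '9'} {
		hit := bloomTable[c]&filter == bloomTable[c]
		fmt.Printf("%q possible match: %v\n", c, hit)
	}
}
```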

% hyperfine -r 50 './scc-master ../../torvalds/linux' './scc-new-bloom ../../torvalds/linux' './scc-bloom-table ../../torvalds/linux'
Benchmark #1: ./scc-master ../../torvalds/linux
  Time (mean ± σ):     972.0 ms ±   9.7 ms    [User: 13.557 s, System: 0.846 s]
  Range (min … max):   955.9 ms … 1004.0 ms    50 runs

Benchmark #2: ./scc-new-bloom ../../torvalds/linux
  Time (mean ± σ):     742.7 ms ±   9.0 ms    [User: 9.798 s, System: 0.905 s]
  Range (min … max):   723.9 ms … 765.9 ms    50 runs

Benchmark #3: ./scc-bloom-table ../../torvalds/linux
  Time (mean ± σ):     708.7 ms ±   7.6 ms    [User: 9.250 s, System: 0.913 s]
  Range (min … max):   690.6 ms … 733.6 ms    50 runs

Summary
  './scc-bloom-table ../../torvalds/linux' ran
    1.05 ± 0.02 times faster than './scc-new-bloom ../../torvalds/linux'
    1.37 ± 0.02 times faster than './scc-master ../../torvalds/linux'

dbaggerman requested a review from boyter on Mar 22, 2021 at 06:26
@@ -0,0 +1,24 @@
package processor

// Prime number less than 256
boyter (Owner)

The only comment I have is: could you put in WHY a prime, and WHY below 256? Because I am going to scratch my head at that sometime in the future.

dbaggerman (Collaborator, Author)

That probably doesn't strictly need to be a prime number anymore, but it seemed to work reasonably well so I left it.

Since ASCII characters aren't well distributed in the 0-255 range, I was originally trying to do something with multiplying by a large prime and taking the modulus to "scatter" the values better. I eventually realised that multiplying the number by itself worked better for this purpose; the xor was thrown in to deal with small numbers having small squares.

The "less than 256" was because I was originally going to just increase the size to uint16, until I realised I was going to need a lot more bits. The thinking was that the maximum byte value (255) multiplied by the prime wouldn't overflow a uint16.

Ultimately, the whole thing is just based on my own maths, not a well-known hashing algorithm. I was trying to keep the calculation lightweight since it runs in a tight loop, but with the addition of a prebuilt lookup table that's not such a problem anymore.

Coming back to it after sleeping on it: we could probably replace the arbitrary prime + multiplication with math/rand, i.e. rand.New(rand.NewSource(int64(b))).Uint64(). It might even produce more random results and function better. I'll put in a new PR to swap it out so it doesn't lead to confusion in future (and fill out some more comments to explain things better).
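One reading of that math/rand suggestion, sketched out below: keep the three 6-bit slices, but derive k from a PRNG seeded with the byte value rather than from the prime multiplication. This is an interpretation of the comment above, not the actual follow-up change.

```go
package main

import (
	"fmt"
	"math/rand"
)

// bloomHash derives k from a PRNG seeded with the byte value, replacing
// the prime multiplication. The per-byte output is deterministic, so it
// can still be baked into a [256]uint64 lookup table at init time.
func bloomHash(b byte) uint64 {
	k := rand.New(rand.NewSource(int64(b))).Uint64()
	k1 := k & 0x3f
	k2 := k >> 1 & 0x3f
	k3 := k >> 2 & 0x3f
	return (1 << k1) | (1 << k2) | (1 << k3)
}

func main() {
	fmt.Printf("%064b\n", bloomHash('/')) // mask with up to three bits set
}
```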

k2 := k >> 1 & 0x3f
k3 := k >> 2 & 0x3f

return (1 << k1) | (1 << k2) | (1 << k3)
boyter (Owner)

The hash function... what's it based on? It looks familiar, but it's one of those things that's impossible to search for.


boyter commented Mar 22, 2021

Funny you mention bloom filters... because that's exactly what I have been looking at a lot recently. I didn't think about overfilling in scc though. Looks like a worthy inclusion!

boyter merged commit 95d766e into boyter:master on Mar 22, 2021
foxdd (Contributor) commented Apr 18, 2021

@dbaggerman the old bit-map filter could probably be 01111111 when the no-complexity argument is false. How do the current and old filters compare on accuracy and efficiency if the processing mask runs without the complexity mask?

dbaggerman (Collaborator, Author)

@foxdd, for languages with C-like syntax the three characters /, * and " result in the mask 00101111, which is not a whole lot better. However, anything with the 01000000 bit set will be in the 64-127 range, which due to the way ASCII is laid out does exclude alphabetic characters; that's probably the only reason it provided a performance benefit in the first place. It would still produce false positives on all numerals and most punctuation, leading to quite a few characters being processed unnecessarily.

In comparison, I just threw together a quick demo (https://gist.github.com/dbaggerman/833dc9c3593c4cd37fb7f4d66795a0f7) which indicates that the new implementation has 0 false positives in the printable ASCII range. My intuition tells me that the reduction in false positives is at least enough to make up for the small overhead of the table lookup (but I haven't gone and rebuilt previous versions to compare).
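A sketch of roughly that kind of check, using the same assumed hash as earlier in the thread and an illustrative token set rather than scc's real one: build the filter from the token bytes, then count printable ASCII bytes that pass the filter despite not being in the set.

```go
package main

import "fmt"

// Same assumed hash as sketched earlier in the thread.
func bloomHash(b byte) uint64 {
	k := uint64(b) * 251
	return (1 << (k & 0x3f)) | (1 << (k >> 1 & 0x3f)) | (1 << (k >> 2 & 0x3f))
}

func main() {
	// Hypothetical token-start bytes; scc's real set depends on language.
	tokens := []byte(`/*"`)

	var filter uint64
	inSet := map[byte]bool{}
	for _, b := range tokens {
		filter |= bloomHash(b)
		inSet[b] = true
	}

	// Count printable ASCII bytes (0x20-0x7e) that pass the filter
	// without actually being in the token set.
	falsePositives := 0
	for b := byte(0x20); b < 0x7f; b++ {
		mask := bloomHash(b)
		if mask&filter == mask && !inSet[b] {
			falsePositives++
		}
	}
	fmt.Println("printable ASCII false positives:", falsePositives)
}
```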
