New bloom filter #243

Merged 1 commit into boyter:master on Mar 22, 2021
Conversation

dbaggerman (Collaborator)

Looking at #241 made me wonder how the extra tokens would affect our crude bloom filtering technique.

It turns out it only takes a handful of ASCII characters before the old bloom filter becomes 01111111 and matches everything. This implements a better bloom filter, which drastically reduces collisions and makes the filter far more effective.

Since the new hashing takes a little more computation, using a lookup table is another modest improvement over recalculating the hash for every character processed.

Calculating hash values isn't an area of mathematics that I'm terribly familiar with, so there may be a better formula. But this seems to be a big improvement over the current implementation at least.
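For readers skimming the thread, here is a minimal, self-contained sketch of the lookup-table approach in Go. The hash is an assumed reconstruction pieced together from the "prime number less than 256" comment and the fragment quoted in the review below; the names bloomHash and bloomTable, and the token set, are mine, not necessarily what the PR itself uses.

```go
package main

import "fmt"

// bloomHash maps a byte to a 64-bit mask with up to three bits set.
// Assumed form: multiply by a prime below 256, then take three 6-bit
// slices of the result as bit positions (matching the fragment quoted
// in the review below).
func bloomHash(b byte) uint64 {
	k := uint64(b) * 251 // 251 is prime and less than 256
	k1 := k & 0x3f
	k2 := k >> 1 & 0x3f
	k3 := k >> 2 & 0x3f
	return (1 << k1) | (1 << k2) | (1 << k3)
}

// Precomputing one mask per possible byte value turns the per-character
// hash in the hot loop into a single array lookup.
var bloomTable [256]uint64

func init() {
	for i := range bloomTable {
		bloomTable[i] = bloomHash(byte(i))
	}
}

func main() {
	// Build a filter from the bytes we want to detect, e.g. comment and
	// string delimiters for a C-like language (illustrative set).
	var filter uint64
	for _, b := range []byte(`/*"`) {
		filter |= bloomTable[b]
	}

	// A byte can only be a token start if every bit of its mask is set
	// in the filter; anything else can be skipped without further checks.
	for _, c := range []byte{'/', 'a', '9'} {
		hit := bloomTable[c]&filter == bloomTable[c]
		fmt.Printf("%q possible match: %v\n", c, hit)
	}
}
```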

% hyperfine -r 50 './scc-master ../../torvalds/linux' './scc-new-bloom ../../torvalds/linux' './scc-bloom-table ../../torvalds/linux'
Benchmark #1: ./scc-master ../../torvalds/linux
  Time (mean ± σ):     972.0 ms ±   9.7 ms    [User: 13.557 s, System: 0.846 s]
  Range (min … max):   955.9 ms … 1004.0 ms    50 runs

Benchmark #2: ./scc-new-bloom ../../torvalds/linux
  Time (mean ± σ):     742.7 ms ±   9.0 ms    [User: 9.798 s, System: 0.905 s]
  Range (min … max):   723.9 ms … 765.9 ms    50 runs

Benchmark #3: ./scc-bloom-table ../../torvalds/linux
  Time (mean ± σ):     708.7 ms ±   7.6 ms    [User: 9.250 s, System: 0.913 s]
  Range (min … max):   690.6 ms … 733.6 ms    50 runs

Summary
  './scc-bloom-table ../../torvalds/linux' ran
    1.05 ± 0.02 times faster than './scc-new-bloom ../../torvalds/linux'
    1.37 ± 0.02 times faster than './scc-master ../../torvalds/linux'

dbaggerman requested a review from boyter on Mar 22, 2021 at 06:26
@@ -0,0 +1,24 @@
package processor

// Prime number less than 256
boyter (Owner)

The only comment I have is: could you put in WHY a prime, and WHY below 256? Because I am going to scratch my head at that sometime in the future.

dbaggerman (Collaborator, Author)

That probably doesn't strictly need to be a prime number anymore, but it seemed to work reasonably well so I left it.

Since ASCII characters aren't well distributed in the 0-255 range, I was originally trying to do something with multiplying by a large prime and taking the modulus to "scatter" the values better. I eventually realised that multiplying the number by itself worked better for this purpose; the xor was thrown in to deal with small numbers having small squares.

The "less than 256" was because I was originally going to just increase the size to uint16, until I realised I was going to need a lot more bits. The thinking was that the maximum byte value (255) multiplied by the prime wouldn't overflow a uint16.

Ultimately, the whole thing is just based on my own maths, not a well-known hashing algorithm. I was trying to keep the calculation lightweight since it runs in a tight loop, but with the addition of a prebuilt lookup table that's not such a problem anymore.

Coming back to it after sleeping on it: we could probably replace the arbitrary prime + multiplication with math/rand, i.e. rand.New(rand.NewSource(int64(b))).Uint64(). It might even produce more random results and function better. I'll put in a new PR to swap it out so it doesn't lead to confusion in future (and fill out some more comments to explain things better).
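One reading of that math/rand suggestion, sketched out below: keep the three 6-bit slices, but derive k from a PRNG seeded with the byte value rather than from the prime multiplication. This is an interpretation of the comment above, not the actual follow-up change.

```go
package main

import (
	"fmt"
	"math/rand"
)

// bloomHash derives k from a PRNG seeded with the byte value, replacing
// the prime multiplication. The per-byte output is deterministic, so it
// can still be baked into a [256]uint64 lookup table at init time.
func bloomHash(b byte) uint64 {
	k := rand.New(rand.NewSource(int64(b))).Uint64()
	k1 := k & 0x3f
	k2 := k >> 1 & 0x3f
	k3 := k >> 2 & 0x3f
	return (1 << k1) | (1 << k2) | (1 << k3)
}

func main() {
	fmt.Printf("%064b\n", bloomHash('/')) // mask with up to three bits set
}
```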

k2 := k >> 1 & 0x3f
k3 := k >> 2 & 0x3f

return (1 << k1) | (1 << k2) | (1 << k3)
boyter (Owner)

The hash function... what's it based on? It looks familiar, but it's one of those things that's impossible to search for.


boyter commented Mar 22, 2021

Funny you mention bloom filters... because that's exactly what I have been looking at a lot recently. I didn't think about overfilling in scc though. Looks like a worthy inclusion!

boyter merged commit 95d766e into boyter:master on Mar 22, 2021
foxdd (Contributor) commented Apr 18, 2021

@dbaggerman the old bit-map filter could probably be 01111111 when the no-complexity argument is false. How do the current and old filters compare on accuracy and efficiency if the processing mask runs without the complexity mask?

dbaggerman (Collaborator, Author)

@foxdd, for languages with C-like syntax the three characters /, * and " result in the mask 00101111, which is not a whole lot better. However, anything with the 01000000 bit set will be in the 64-127 range, which due to the way ASCII is laid out does exclude alphabetic characters; that's probably the only reason it provided a performance benefit in the first place. It would still produce false positives on all numerals and most punctuation, leading to quite a few characters being processed unnecessarily.

In comparison, I just threw together a quick demo (https://gist.github.com/dbaggerman/833dc9c3593c4cd37fb7f4d66795a0f7) which indicates that the new implementation has 0 false positives in the printable ASCII range. My intuition tells me that the reduction in false positives is at least enough to make up for the small overhead of the table lookup (but I haven't gone and rebuilt previous versions to compare).
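A sketch of roughly that kind of check, using the same assumed hash as earlier in the thread and an illustrative token set rather than scc's real one: build the filter from the token bytes, then count printable ASCII bytes that pass the filter despite not being in the set.

```go
package main

import "fmt"

// Same assumed hash as sketched earlier in the thread.
func bloomHash(b byte) uint64 {
	k := uint64(b) * 251
	return (1 << (k & 0x3f)) | (1 << (k >> 1 & 0x3f)) | (1 << (k >> 2 & 0x3f))
}

func main() {
	// Hypothetical token-start bytes; scc's real set depends on language.
	tokens := []byte(`/*"`)

	var filter uint64
	inSet := map[byte]bool{}
	for _, b := range tokens {
		filter |= bloomHash(b)
		inSet[b] = true
	}

	// Count printable ASCII bytes (0x20-0x7e) that pass the filter
	// without actually being in the token set.
	falsePositives := 0
	for b := byte(0x20); b < 0x7f; b++ {
		mask := bloomHash(b)
		if mask&filter == mask && !inSet[b] {
			falsePositives++
		}
	}
	fmt.Println("printable ASCII false positives:", falsePositives)
}
```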
