New bloom filter #243
@@ -0,0 +1,24 @@
package processor

// Prime number less than 256
My only comment: could you explain WHY a prime, and WHY below 256? I am going to scratch my head at that sometime in the future.
That probably doesn't strictly need to be a prime number anymore, but it seemed to work reasonably well so I left it.
Since ASCII characters aren't well distributed in the 0-255 range, I was originally trying to do something with multiplying by a large prime and modulus down again to "scatter" the values better. I eventually realised that multiplying the number by itself worked better for this purpose - the xor was thrown in to deal with small numbers having small squares.
Less than 256 was because I was originally going to just increase the size to uint16 until I realised I was going to need a lot more bits. The thinking was that the maximum byte (255) multiplied by the prime wouldn't overflow the uint16.
Ultimately, the whole thing is just based on my own maths, not a well-known hashing algorithm. I was trying to keep the calculation lightweight since it was going to be done in a tight loop, but with the addition of a prebuilt lookup table that's not such a problem anymore.
Coming back to it after sleeping on it, we could probably replace the arbitrary prime + multiplication with math/rand: rand.New(rand.NewSource(int64(b))).Uint64(). It might even produce more random results and function better. I'll put in a new PR to change it out so it doesn't lead to confusion in future (and fill out some more comments to explain things better).
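To make the math/rand idea concrete, here is a minimal sketch of what a precomputed table built that way might look like. The names bloomTable, hashByte and bitsSet are hypothetical, and the choice of which three 6-bit slices of the random value to use is an assumption made for illustration, not something taken from this PR.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// bloomTable (hypothetical name) holds one precomputed 64-bit mask per byte
// value, so the hot loop only needs an array lookup.
var bloomTable [256]uint64

// hashByte seeds a generator with the byte value, as suggested in the comment
// above, and turns the result into a mask with up to three bits set.
// Which slices of k are used is an assumption of this sketch.
func hashByte(b byte) uint64 {
	k := rand.New(rand.NewSource(int64(b))).Uint64()
	k1 := k & 0x3f
	k2 := (k >> 6) & 0x3f
	k3 := (k >> 12) & 0x3f
	return uint64(1)<<k1 | uint64(1)<<k2 | uint64(1)<<k3
}

// bitsSet reports how many bits are set in a mask.
func bitsSet(m uint64) int {
	return bits.OnesCount64(m)
}

func init() {
	// Build the table once at start-up.
	for b := 0; b < 256; b++ {
		bloomTable[b] = hashByte(byte(b))
	}
}

func main() {
	for _, c := range []byte("/*\"{") {
		fmt.Printf("%q -> %064b (%d bits)\n", c, bloomTable[c], bitsSet(bloomTable[c]))
	}
}
```

Two slices can occasionally pick the same bit position, so a mask may end up with fewer than three bits set. Creating a generator per byte is slow, which is fine here because it only happens 256 times while the table is built.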
k2 := k >> 1 & 0x3f
k3 := k >> 2 & 0x3f

return (1 << k1) | (1 << k2) | (1 << k3)
The hash function... what's it based on? It looks familiar, but one of those things impossible to search for.
Funny you mention bloom filters... because that's exactly what I have been looking at a lot recently. I didn't think about overfilling in scc though. Looks like a worthy inclusion!
@dbaggerman the old bitmap filter could probably become 01111111 when the no-complexity argument is false. How do the current and old filters compare on accuracy and efficiency if the process mask runs without the complexity mask?
@foxdd, for languages with C-like syntax the three characters In comparison, I just threw together a quick demo (https://gist.github.com/dbaggerman/833dc9c3593c4cd37fb7f4d66795a0f7) which indicates that the new implementation has 0 false positives in the printable ASCII range. My intuition tells me that the reduction in false positives is at least enough to make up for the small overhead of the table lookup (but I haven't gone and rebuilt previous versions to compare).
Looking at #241 made me wonder about how the extra tokens would affect our crude bloom filtering technique.
Turns out it only takes a handful of ASCII characters before the old bloom filter became 01111111 and matched everything. This implements a better bloom filter which drastically reduces collisions and makes the filter far more effective.
Since the new hashing takes a little more computation, using a lookup table is another modest improvement over recalculating every character processed.
Calculating hash values isn't an area of mathematics that I'm terribly familiar with, so there may be a better formula. But this seems to be a big improvement over the current implementation at least.
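The saturation problem is easy to reproduce in isolation. The sketch below assumes the old filter was a plain bitwise OR of the token bytes, which matches the description here but is not copied from the scc source. orMask and the token list are hypothetical.

```go
package main

import "fmt"

// orMask models the old filter as a bitwise OR of the token bytes.
// This is an assumption based on the PR description.
func orMask(tokens []byte) byte {
	var mask byte
	for _, t := range tokens {
		mask |= t
	}
	return mask
}

func main() {
	tokens := []byte{'/', '"', '{'}
	// Show how the mask fills up as each token is added.
	for i := range tokens {
		fmt.Printf("after %q: %08b\n", tokens[i], orMask(tokens[:i+1]))
	}
}
```

ASCII only uses the low seven bits, so once the mask reaches 01111111 every ASCII byte satisfies `b&mask == b` and the filter no longer rejects anything.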