-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BN: Use Bloom filter for heavy REST validators requests. #5212
Conversation
b149d1b
to
735566f
Compare
In general, "Bloom" is a person's name in this context:
So all the references to lowercase "bloom filters" are a bit suboptimal -- the proper name of this data structure is the "Bloom filter". |
proc getBitsCount*(itemsCount: int): int = | ||
## We are using 8 hashes and we want to get 0.0001 false positive rate, so we | ||
## should use `m/n == 21` according to Table 3 of | ||
## https://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, given a certain number of expected items (defined elsewhere in this PR), both optimal number of hash functions and optimal m/n
can be found. Not sure how worthwhile -- 8 hash functions typically gets pretty good, but figure 4 ("Probability of false positives (log scale). The top curve is for 4 hash functions. The bottom curve is for the optimum (integral) number of hash functions.") does show potentially useful improvements there.
|
||
func checkKey*(bf: BloomFilter, pubkey: ValidatorPubKey): bool = | ||
let hashes = pubkey.toHashes() | ||
if not(bf.getBit(hashes[0])): return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not better as a loop? If it has compile-time constant bounds, I'd expect suitable unrolling to be something a C compiler might do, if useful for performance, since compilers need to be able to handle this for auto-vectorization.
|
||
proc registerKey*(bf: var BloomFilter, pubkey: ValidatorPubKey) = | ||
let hashes = pubkey.toHashes() | ||
bf.raiseBit(hashes[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment/question about the loop.
# running. If number of validators will exceed this value we going | ||
# to lose bloom filter's efficiency only. | ||
itemsCount = validatorsCount + (validatorsCount div 3) | ||
bitsCount = getBitsCount(int(itemsCount)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's slightly distasteful to have to convert to int
here in any explicit way -- functionally, this is something like bitsCount = itemsCount * 21
but refracted through Nim's type system. But the int(foo)
conversion is a pure artifact here, and either unnecessary to begin with, or if Nim requires it, potentially unsafe in a runtime defect way.
when sizeof(uint) == 8: 6'u else: 5'u | ||
|
||
template modMask(): uint = | ||
when sizeof(uint) == 8: 63'u else: 31'u |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
divShift
and modMask
need to be consistent with each other -- wonder if it's better to explicitly define modMask
as something similar to (1'u shl divShift()) - 1
to make this both automatic and explicit.
const | ||
MINIMAL_VALIDATORS_BF_COUNT = 1_000_000 | ||
## Minimal size of validators bloom filter. Current mainnet number of | ||
## validators is near 700k, so you could update this number if it exceeds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## validators is near 700k, so you could update this number if it exceeds. | |
## validators is near 700k; this default could be increased when useful. |
template modMask(): uint = | ||
when sizeof(uint) == 8: 63'u else: 31'u | ||
|
||
func raiseBit*(bf: var BloomFilter, pos: Natural) {.inline.} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use setBit
from stew/bitops2
?
let index = uint(pos) shr divShift() | ||
bf.words[index] = bf.words[index] or (1'u shl (uint(pos) and modMask())) | ||
|
||
func getBit*(bf: BloomFilter, pos: Natural): bool {.inline.} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getBit
from stew/bitops2
?
words: seq[uint] | ||
length: int | ||
mask: uint | ||
used: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the utility of used
? It's maintained in a few places, but aside from being able to present this as a more or less normal Nim data structure, it seems to be just overhead not important for the functioning of the Bloom filter itself
bf.used | ||
|
||
template divShift(): uint = | ||
when sizeof(uint) == 8: 6'u else: 5'u |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semi-arbitrarily asking here: all of these disparate codepaths based on 32-bit or 64-bit uint
s seem to be predicated on there being enough of a performance (or memory usage? cache locality? etc) improvement based on not just picking one of byte
/uint8
, uint32
(native on some machines, though none the BN will run well on), or uint64
(and let 32-bit machines get suboptimal code) and using it consistently.
Debugging these issues is unfortunate, because anything 32-bit only relies on setting up a 32-bit environment first. But there doesn't seem to be a fundamental reason here not to go with uint64
or byte
(each has advantages; uint32
is an awkward middle ground, optimal for nothing) as the leaf size and avoid e.g., toHashes
doubling or tripling in size.
@@ -24,6 +26,15 @@ type | |||
ValidatorIndexError* {.pure.} = enum | |||
UnsupportedValue, TooHighValue | |||
|
|||
BloomFilter* = object | |||
words: seq[uint] | |||
length: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length
is used in a couple of asserts, and there's a bitsCount
function which returns it. But in general, the words
seq
has all the length information the Bloom filter functionally needs, so like used
, this is more overhead, both code and memory.
No description provided.