Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BN: Use Bloom filter for heavy REST validators requests. #5212

Closed
wants to merge 1 commit into from

Conversation

cheatfate
Copy link
Contributor

No description provided.

@github-actions
Copy link

github-actions bot commented Jul 26, 2023

Unit Test Results

         9 files    1 077 suites   38m 6s ⏱️
  3 710 tests   3 431 ✔️ 279 💤 0
15 826 runs  15 521 ✔️ 305 💤 0

Results for commit 735566f.

♻️ This comment has been updated with latest results.

@tersec
Copy link
Contributor

tersec commented Jul 28, 2023

In general, "Bloom" is a person's name in this context:

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970

So all the references to lowercase "bloom filters" are a bit suboptimal -- the proper name of this data structure is the "Bloom filter".

proc getBitsCount*(itemsCount: int): int =
## We are using 8 hashes and we want to get 0.0001 false positive rate, so we
## should use `m/n == 21` according to Table 3 of
## https://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, given a certain number of expected items (defined elsewhere in this PR), both optimal number of hash functions and optimal m/n can be found. Not sure how worthwhile -- 8 hash functions typically gets pretty good, but figure 4 ("Probability of false positives (log scale). The top curve is for 4 hash functions. The bottom curve is for the optimum (integral) number of hash functions.") does show potentially useful improvements there.


func checkKey*(bf: BloomFilter, pubkey: ValidatorPubKey): bool =
let hashes = pubkey.toHashes()
if not(bf.getBit(hashes[0])): return false
Copy link
Contributor

@tersec tersec Jul 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not better as a loop? If it has compile-time constant bounds, I'd expect suitable unrolling to be something a C compiler might do, if useful for performance, since compilers need to be able to handle this for auto-vectorization.


proc registerKey*(bf: var BloomFilter, pubkey: ValidatorPubKey) =
let hashes = pubkey.toHashes()
bf.raiseBit(hashes[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment/question about the loop.

# running. If number of validators will exceed this value we going
# to lose bloom filter's efficiency only.
itemsCount = validatorsCount + (validatorsCount div 3)
bitsCount = getBitsCount(int(itemsCount))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's slightly distasteful to have to convert to int here in any explicit way -- functionally, this is something like bitsCount = itemsCount * 21 but refracted through Nim's type system. But the int(foo) conversion is a pure artifact here, and either unnecessary to begin with, or if Nim requires it, potentially unsafe in a runtime defect way.

when sizeof(uint) == 8: 6'u else: 5'u

template modMask(): uint =
when sizeof(uint) == 8: 63'u else: 31'u
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

divShift and modMask need to be consistent with each other -- wonder if it's better to explicitly define modMask as something similar to (1'u shl divShift()) - 1 to make this both automatic and explicit.

const
MINIMAL_VALIDATORS_BF_COUNT = 1_000_000
## Minimal size of validators bloom filter. Current mainnet number of
## validators is near 700k, so you could update this number if it exceeds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## validators is near 700k, so you could update this number if it exceeds.
## validators is near 700k; this default could be increased when useful.

@cheatfate cheatfate changed the title BN: Use bloom filter for heavy REST validators requests. BN: Use Bloom filter for heavy REST validators requests. Jul 31, 2023
template modMask(): uint =
when sizeof(uint) == 8: 63'u else: 31'u

func raiseBit*(bf: var BloomFilter, pos: Natural) {.inline.} =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use setBit from stew/bitops2?

let index = uint(pos) shr divShift()
bf.words[index] = bf.words[index] or (1'u shl (uint(pos) and modMask()))

func getBit*(bf: BloomFilter, pos: Natural): bool {.inline.} =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getBit from stew/bitops2?

words: seq[uint]
length: int
mask: uint
used: int
Copy link
Contributor

@tersec tersec Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the utility of used? It's maintained in a few places, but aside from being able to present this as a more or less normal Nim data structure, it seems to be just overhead not important for the functioning of the Bloom filter itself

bf.used

template divShift(): uint =
when sizeof(uint) == 8: 6'u else: 5'u
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semi-arbitrarily asking here: all of these disparate codepaths based on 32-bit or 64-bit uints seem to be predicated on there being enough of a performance (or memory usage? cache locality? etc) improvement based on not just picking one of byte/uint8, uint32 (native on some machines, though none the BN will run well on), or uint64 (and let 32-bit machines get suboptimal code) and using it consistently.

Debugging these issues is unfortunate, because anything 32-bit only relies on setting up a 32-bit environment first. But there doesn't seem to be a fundamental reason here not to go with uint64 or byte (each has advantages; uint32 is an awkward middle ground, optimal for nothing) as the leaf size and avoid e.g., toHashes doubling or tripling in size.

@@ -24,6 +26,15 @@ type
ValidatorIndexError* {.pure.} = enum
UnsupportedValue, TooHighValue

BloomFilter* = object
words: seq[uint]
length: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length is used in a couple of asserts, and there's a bitsCount function which returns it. But in general, the words seq has all the length information the Bloom filter functionally needs, so like used, this is more overhead, both code and memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants