Add information about bloom filter to config.md #4924

Merged
merged 2 commits into from
Jul 16, 2018

Conversation

djdv
Contributor

@djdv djdv commented Apr 6, 2018

During an IRC conversation, this information came up. I figured it might make for a useful suggestion here.
yay or nay?

[2018.04.04] 14:28:47 <@hsanjuan> Can someone remind me the recommended values for bloom filter size in go-ipfs?
[2018.04.04] 14:46:58 <@Stebalien> - num_blocks * 1.44 * logtwo(probability_of_false_positive)
[2018.04.04] 14:48:27 < djdv> I wonder if that should be noted in https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#datastore
[2018.04.04] 14:49:02 <@Stebalien> So, for a 1% false positive rate and 1m blocks, you'd want a ~1MiB (mebibyte) filter.
[2018.04.04] 14:49:19 <@Stebalien> Note: I haven't tested that, I'm just going off of Wikipedia.
[2018.04.04] 14:49:28 <@Stebalien> https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
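
As a rough check of the rule of thumb quoted in the log above, here is a minimal Python sketch. It assumes the formula yields a size in bits, as in the linked Wikipedia article; the function name `approx_filter_bits` is made up for illustration and is not part of go-ipfs.

```python
import math


def approx_filter_bits(num_blocks: int, false_positive_rate: float) -> int:
    """Rule-of-thumb Bloom filter size in bits: -n * 1.44 * log2(p)."""
    if num_blocks <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("need num_blocks > 0 and 0 < false_positive_rate < 1")
    # log2(p) is negative for p < 1, so the leading minus makes the size positive.
    return math.ceil(-num_blocks * 1.44 * math.log2(false_positive_rate))


bits = approx_filter_bits(1_000_000, 0.01)
mib = bits / 8 / 2**20  # bits -> bytes -> MiB
print(f"{bits} bits is about {mib:.2f} MiB")
```

For 1m blocks at a 1% false positive rate this comes out a little over 1 MiB, which is consistent with the "~1MiB" estimate in the log.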

@djdv djdv requested a review from Kubuxu as a code owner April 6, 2018 19:57
@ghost ghost assigned djdv Apr 6, 2018
@ghost ghost added the status/in-progress In progress label Apr 6, 2018
@Kubuxu
Member

Kubuxu commented Apr 6, 2018

To clear up possible confusion, the config input is in bytes.

Also, this website is quite useful: https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7

@djdv djdv force-pushed the docs/config branch 2 times, most recently from 91cf85b to bdedec2 Compare April 8, 2018 23:27
@djdv
Contributor Author

djdv commented Apr 8, 2018

@Kubuxu
I added a reference to that tool. Does the new statement look accurate and helpful?

@djdv djdv removed the need_signoff label Apr 8, 2018
docs/config.md Outdated

This site generates useful graphs for various bloom filter values: <https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7>
You may use it to find a preferred optimal value, where 'm' is BloomFilterSize.
For example, for 1,000,000 blocks, expecting a 1% false positive rate, you'd end up with a filter size of 9592955 bytes. [Currently](https://github.com/ipfs/go-ipfs/blob/9c194aa7e2febeab0cbd895067d7d90d82b137f9/blocks/blockstore/caching.go), 7 hash functions are used by default, so the constant k is 7.
Member

It is 1199120 bytes. The number m in the tool is the number of bits.
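
The bits-to-bytes conversion can be sketched in Python as follows. This assumes the standard Bloom filter sizing formula for a fixed number of hash functions `k`, namely m = -k * n / ln(1 - p^(1/k)), which is my reading of what the linked tool computes when `k` is supplied. The helper names are hypothetical and not part of go-ipfs.

```python
import math


def filter_bits(num_blocks: int, false_positive_rate: float, hash_count: int) -> int:
    """Bloom filter size m in bits for n items, target rate p, and fixed k."""
    per_hash_rate = false_positive_rate ** (1.0 / hash_count)
    return math.ceil(-hash_count * num_blocks / math.log(1.0 - per_hash_rate))


def bits_to_bytes(bits: int) -> int:
    """Round up to whole bytes, the unit the config value is given in."""
    return (bits + 7) // 8


m_bits = filter_bits(1_000_000, 0.01, 7)
print(f"m = {m_bits} bits -> BloomFilterSize = {bits_to_bytes(m_bits)} bytes")
```

With 1,000,000 blocks, a 1% rate, and k = 7, the result should land very close to the 9592955 bits and 1199120 bytes quoted in this thread; the last digit may differ depending on rounding.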

@djdv djdv force-pushed the docs/config branch 2 times, most recently from 6013188 to ac92988 Compare April 9, 2018 12:18
License: MIT
Signed-off-by: Dominic Della Valle <ddvpublic@gmail.com>
@djdv
Contributor Author

djdv commented Apr 9, 2018

Okay, I've corrected the unit size and added a reminder for users to do that as well. This should be all the information a user needs to set an appropriate value. How does it look?

Contributor

@Mr0grog Mr0grog left a comment

This is way more informative than it was before! But reading it still left me with some questions:

  • Is there anything we can reliably say here about when & why you’d want to use this, e.g. the typical performance boost over the built-in ARC cache?

  • Are there use cases where this does or does not make sense to use?

docs/config.md Outdated
This site generates useful graphs for various bloom filter values: <https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7>
You may use it to find a preferred optimal value, where `m` is `BloomFilterSize` in bits. Remember to convert the value `m` from bits, into bytes for use as `BloomFilterSize` in the config file.
For example, for 1,000,000 blocks, expecting a 1% false positive rate, you'd end up with a filter size of 9592955 bits, so for `BloomFilterSize` we'd want to use 1199120 bytes.
[Currently](https://github.com/ipfs/go-ipfs/blob/9c194aa7e2febeab0cbd895067d7d90d82b137f9/blocks/blockstore/caching.go), 7 hash functions are used by default, so the constant `k` is 7 in the formula.
Contributor

@Mr0grog Mr0grog Apr 9, 2018

This file no longer exists in v0.4.14; should it be https://github.com/ipfs/go-ipfs-blockstore/blob/547442836ade055cc114b562a3cc193d4e57c884/caching.go#L22 ?

Saying 7 is the default makes it sound like another config option should be able to change it, but that doesn’t appear to be the case. As far as I can see from reading the code, this can only be adjusted if you are using go-ipfs-blockstore directly as a library (there’s no realistic path to changing it even when using go-ipfs directly unless you are willing to skip builder.setupNode() entirely, which seems pretty painful).

Member

7 is optimal for 1% false positive rate.
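
For context, the textbook optimum for the number of hash functions is k = -log2(p), rounded to a whole number. A minimal Python sketch of that relationship follows; the function name is illustrative only and does not come from go-ipfs-blockstore.

```python
import math


def optimal_hash_count(false_positive_rate: float) -> int:
    """Textbook optimum k = -log2(p), rounded to the nearest whole number."""
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    # At least one hash function is always needed.
    return max(1, round(-math.log2(false_positive_rate)))


for p in (0.1, 0.01, 0.001):
    print(f"p = {p}: k = {optimal_hash_count(p)}")
```

For p = 0.01 the unrounded value is about 6.64, which rounds to 7.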

License: MIT
Signed-off-by: Dominic Della Valle <ddvpublic@gmail.com>
@djdv
Contributor Author

djdv commented Apr 9, 2018

@Mr0grog
I've updated the link and omitted "by default".
In addition, since the link is permanent and the default is subject to change upstream (even if unlikely), I've reworded things a bit and changed the link anchor.

In relation to performance gains, I don't have any stats on hand. There's this but it's old and experimental:
#3479

> Are there use cases where this does or does not make sense to use?

I'm not sure myself; maybe low-memory machines would want to avoid this. If there's an inherent gain in all cases, maybe we should consider changing the default and adding a lowmem profile to init.

Deferring to @Kubuxu for more info.

@Kubuxu
Member

Kubuxu commented Apr 10, 2018

The reason this isn't used by default is that we don't have an estimate of the size of the blockstore, so we can't select a good bloom filter size. We could use 1 MiB as a reasonable default.

@Mr0grog
Contributor

Mr0grog commented Apr 10, 2018

That makes a lot of sense, so it seems like it would be good to say in the docs. Something like:

> The bloom filter is disabled by default because the most appropriate size depends heavily on how many blocks you expect to store. A value that works well for a small storage scenario could make performance worse in a large storage scenario.

I’m assuming that, because the false-positive probability rises so sharply, you could pretty easily exceed the block count the filter was sized for by enough that the work of doing the hashing and lookup in the bloom filter will be an overall waste of time. Is that realistic or unnecessarily alarmist?
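
To get a feel for how fast the false-positive rate climbs once a filter holds more blocks than it was sized for, here is a minimal Python sketch using the standard approximation p ≈ (1 - e^(-k*n/m))^k. The filter size and k = 7 are taken from the example discussed in this PR; the function name is hypothetical, and the numbers say nothing about measured go-ipfs performance.

```python
import math

FILTER_BITS = 9_592_955  # m from the example in this PR (1,000,000 blocks at 1%)
HASH_COUNT = 7           # k from the same example


def false_positive_rate(num_blocks: int,
                        filter_bits: int = FILTER_BITS,
                        hash_count: int = HASH_COUNT) -> float:
    """Approximate false-positive rate after inserting num_blocks items."""
    fill = 1.0 - math.exp(-hash_count * num_blocks / filter_bits)
    return fill ** hash_count


for n in (1_000_000, 2_000_000, 4_000_000, 8_000_000):
    print(f"{n:>9} blocks: {false_positive_rate(n):.1%}")
```

Under these assumptions, doubling the block count moves the rate from about 1% to well above 10%, and at four times the planned count more than half of all lookups for absent blocks would be false positives.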

Side note: do you have any sense of the practical average size of a block? (Has anyone ever done any analytics on the public gateway for this?) I know a typical file created with a typical IPFS configuration will have 256 KB blocks, but what about the node for the file itself, for directories, or for non-UnixFS nodes? (I’m assuming that the items in the filter are ultimately just the DAG node hashes, whether or not they are leaves with data. Is that right?) Having a rough sense of N nodes ≈ X MB of storage might help people estimate an ideal filter size.

@Kubuxu Kubuxu added the RFM label Jul 10, 2018
@whyrusleeping whyrusleeping merged commit 419bfdc into master Jul 16, 2018
@ghost ghost removed the status/in-progress In progress label Jul 16, 2018
@whyrusleeping whyrusleeping deleted the docs/config branch July 16, 2018 15:12
4 participants