Using cuckoo filter for new metric detection instead of cache #590
Context: Go-Carbon supports real-time indexing in the trie index in Carbonserver. If the realtime-index parameter in the config is > 0, it creates a dedicated channel of that size for new metrics. cache.go then populates that channel whenever a metric is missing from the cache. Carbonserver consumes this channel and populates the trie index from it during a file scan, creating index entries for the metric even if the whisper file doesn't exist yet.
Problem: The cache appears to be a poor predictor of new metrics. When the cache is empty and there is a lot of incoming traffic, the file scan thread is blocked for a long time and the scan never finishes.
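The flow described in the context above can be sketched roughly as follows. This is a simplified illustration, not the actual go-carbon code: the type `newMetricNotifier`, its methods, and the non-blocking send are all assumptions made for the example. It shows why a cache miss is a weak signal: once a metric leaves the cache, the next point for it looks "new" again.

```go
package main

// newMetricNotifier is a hypothetical stand-in for the cache-side logic.
// It reports a metric on a bounded channel whenever the metric is not
// currently present in the cache.
type newMetricNotifier struct {
	cached map[string]struct{} // stand-in for the real cache contents
	newCh  chan string         // buffered; sized like the realtime-index setting
}

func newNotifier(size int) *newMetricNotifier {
	return &newMetricNotifier{
		cached: make(map[string]struct{}),
		newCh:  make(chan string, size),
	}
}

// add records a point for metric and reports whether the metric was
// queued as "new" for the index.
func (n *newMetricNotifier) add(metric string) bool {
	if _, ok := n.cached[metric]; ok {
		return false // cache hit: not treated as new
	}
	n.cached[metric] = struct{}{}
	select {
	case n.newCh <- metric:
		return true
	default:
		return false // channel full: skipped in this sketch (an assumption)
	}
}

// evict simulates a metric leaving the cache, for example after a flush.
func (n *newMetricNotifier) evict(metric string) {
	delete(n.cached, metric)
}

// drain plays the consumer role and returns everything queued so far.
func (n *newMetricNotifier) drain() []string {
	var out []string
	for {
		select {
		case m := <-n.newCh:
			out = append(out, m)
		default:
			return out
		}
	}
}
```

In this sketch, calling `add`, then `evict`, then `add` again for the same metric queues it twice, so the consumer sees a metric it has already indexed.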
Solution: We can use a simpler structure than the cache (a map) to detect previously seen metrics. Bloom filters are a good fit for this and use bounded space. I use cuckoo filters, which are faster and support deletion.
I added cuckoo filter support to cache.go, with tests. I also added support for the bloom-size parameter in the cache config. If it is > 0, the cache uses a bloom filter of the specified size to detect new metrics. I also delete a metric from the filter when it leaves the cache. I'm not sure we need this, but it might help with long uptimes.
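To make the idea concrete, here is a minimal, self-contained cuckoo filter sketch with insert, lookup, and delete. It is not the code from this PR and not any particular library's API: the names (`cuckooFilter`, `isNew`, `remove`), the 8-bit fingerprints, and the FNV hash are all assumptions chosen to keep the example short.

```go
package main

import (
	"hash/fnv"
	"math/rand"
)

const (
	bucketSize = 4   // fingerprints per bucket
	maxKicks   = 500 // relocation attempts before giving up
)

// cuckooFilter stores 8-bit fingerprints; 0 marks an empty slot.
type cuckooFilter struct {
	buckets [][bucketSize]byte
	mask    uint32
}

// newCuckooFilter expects numBuckets to be a power of two.
func newCuckooFilter(numBuckets uint32) *cuckooFilter {
	return &cuckooFilter{
		buckets: make([][bucketSize]byte, numBuckets),
		mask:    numBuckets - 1,
	}
}

func hash32(b []byte) uint32 {
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

// alt returns the other candidate bucket for a fingerprint.
// Applying it twice yields the original index.
func (f *cuckooFilter) alt(i uint32, fp byte) uint32 {
	return (i ^ hash32([]byte{fp})) & f.mask
}

// indexes returns the fingerprint and both candidate buckets for key.
func (f *cuckooFilter) indexes(key string) (byte, uint32, uint32) {
	h := hash32([]byte(key))
	fp := byte(h >> 24)
	if fp == 0 {
		fp = 1 // 0 is reserved for empty slots
	}
	i1 := h & f.mask
	return fp, i1, f.alt(i1, fp)
}

func (f *cuckooFilter) has(i uint32, fp byte) bool {
	for _, v := range f.buckets[i] {
		if v == fp {
			return true
		}
	}
	return false
}

func (f *cuckooFilter) insertInto(i uint32, fp byte) bool {
	for s := range f.buckets[i] {
		if f.buckets[i][s] == 0 {
			f.buckets[i][s] = fp
			return true
		}
	}
	return false
}

// contains may return false positives but never false negatives
// for keys that were inserted and not removed.
func (f *cuckooFilter) contains(key string) bool {
	fp, i1, i2 := f.indexes(key)
	return f.has(i1, fp) || f.has(i2, fp)
}

// insert returns false when the filter is too full to place the key.
// In that case this sketch drops the last displaced fingerprint.
func (f *cuckooFilter) insert(key string) bool {
	fp, i1, i2 := f.indexes(key)
	if f.insertInto(i1, fp) || f.insertInto(i2, fp) {
		return true
	}
	i := i1
	for k := 0; k < maxKicks; k++ {
		s := rand.Intn(bucketSize)
		fp, f.buckets[i][s] = f.buckets[i][s], fp // evict a resident
		i = f.alt(i, fp)
		if f.insertInto(i, fp) {
			return true
		}
	}
	return false
}

// remove deletes one matching fingerprint, if present.
func (f *cuckooFilter) remove(key string) bool {
	fp, i1, i2 := f.indexes(key)
	for _, i := range []uint32{i1, i2} {
		for s := range f.buckets[i] {
			if f.buckets[i][s] == fp {
				f.buckets[i][s] = 0
				return true
			}
		}
	}
	return false
}

// isNew reports whether key has not been seen yet, and records it.
func (f *cuckooFilter) isNew(key string) bool {
	if f.contains(key) {
		return false
	}
	f.insert(key)
	return true
}
```

One caveat with deletion: `remove` should only be called for keys that were actually inserted. Removing a key that was never added can clear a fingerprint belonging to a different metric, which would make that metric look new again.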