fix: Improve drop cache performance #757
Merged
Which problem is this PR solving?
The trace drop cache was causing cascading failures under high load when stress relief triggered. The problem was traced to the cuckoofilter library, which repeatedly locked the system random number generator internally; the time spent locking and unlocking was significant.
This change includes a private fork of the cuckoofilter library that removes this lock by generating random values a different way. This alone cuts the time spent in the cuckoofilter by almost half.
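As a rough illustration of the idea (not the actual code in the fork), the change amounts to replacing calls to the globally locked `math/rand` package functions with a cheap, unsynchronized generator owned by the filter itself. The `xorshift64` step and the `rng` field below are assumptions for the sketch:

```go
package cuckoo

// Filter is a minimal stand-in for the cuckoo filter type; only the RNG
// state matters for this sketch.
type Filter struct {
	// ... buckets, counts, etc.
	rng uint64 // private PRNG state; must be seeded non-zero
}

// randIndex returns a pseudo-random value in [0, n) without touching the
// package-level rand lock. xorshift64 is used here purely as an example of
// a cheap, lock-free generator.
func (f *Filter) randIndex(n uint64) uint64 {
	x := f.rng
	x ^= x << 13
	x ^= x >> 7
	x ^= x << 17
	f.rng = x
	return x % n
}
```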
In addition, this change decouples the Add call for the drop cache: Add now simply puts an item into a large (1K-item) channel that is monitored by a separate goroutine, which periodically inserts the queued batch of items into the cuckoo cache under a single lock (see the sketch after the next paragraph).
The result is that Add() is now extremely fast until the cuckoo cache can no longer keep up. The combination of these two techniques should push that point out, but if it is reached, instead of blocking on the cache, Add simply drops the traceID. This can cause some trace coherence issues, but if Refinery is so heavily stressed that it can't handle the volume even under stress relief, dropping is better than crashing.
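A hedged sketch of the decoupled Add path described above; the names (`DropCache`, `addChan`, `drain`), the 1024 buffer size, and the 10ms drain interval are illustrative, not the exact Refinery implementation:

```go
package cache

import (
	"sync"
	"time"
)

// cuckooSet stands in for the cuckoo filter's insert API.
type cuckooSet interface {
	Insert(id string)
}

type DropCache struct {
	mut     sync.Mutex
	cuckoo  cuckooSet
	addChan chan string
}

func NewDropCache(c cuckooSet) *DropCache {
	d := &DropCache{
		cuckoo:  c,
		addChan: make(chan string, 1024), // large buffer so Add rarely has to drop
	}
	go d.drain()
	return d
}

// Add queues the traceID for insertion. If the queue is full (the cache
// can't keep up), the ID is dropped rather than blocking the caller.
func (d *DropCache) Add(traceID string) {
	select {
	case d.addChan <- traceID:
	default:
		// drop on the floor; better than blocking under extreme stress
	}
}

// drain periodically moves everything queued so far into the cuckoo filter
// under a single lock acquisition.
func (d *DropCache) drain() {
	ticker := time.NewTicker(10 * time.Millisecond) // interval is an assumption
	defer ticker.Stop()
	for range ticker.C {
		n := len(d.addChan)
		if n == 0 {
			continue
		}
		d.mut.Lock()
		for i := 0; i < n; i++ {
			d.cuckoo.Insert(<-d.addChan)
		}
		d.mut.Unlock()
	}
}
```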
The benchmarks don't fully show this because they include the time to actually land the items in the cache; excluding that, the Add calls take only single-digit nanoseconds each until the cache saturates and they start taking multiple microseconds.
Short description of the changes
Benchmarks
Note that this system's performance depends on volume, so it fools the benchmark engine's attempt to choose how many items to use to get stable results. I've therefore run tests with a fixed number of items: all tests are run with 1000 traces and 100_000 traces, i.e., undersaturated and oversaturated. After this PR, performance in the undersaturated case is much better, which should make it less likely that we reach saturation at all.
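The fixed-volume benchmarks take roughly this shape (reusing the `DropCache` sketch above; `newTestCuckoo` is a hypothetical test helper, and the counts match the two cases described):

```go
package cache

import (
	"fmt"
	"testing"
)

// benchmarkAdds exercises Add with a fixed trace volume per iteration, so the
// testing package's automatic scaling of b.N doesn't change the saturation
// behavior being measured.
func benchmarkAdds(b *testing.B, numTraces int) {
	d := NewDropCache(newTestCuckoo())
	ids := make([]string, numTraces)
	for i := range ids {
		ids[i] = fmt.Sprintf("trace-%d", i)
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		for _, id := range ids {
			d.Add(id)
		}
	}
}

func BenchmarkAdd_1000(b *testing.B)   { benchmarkAdds(b, 1000) }
func BenchmarkAdd_100000(b *testing.B) { benchmarkAdds(b, 100_000) }
```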
Before
After