fix: Improve drop cache performance #757

Merged 6 commits into main from kent.cuckoo_perf on Jul 6, 2023
Conversation

@kentquirk (Contributor) commented on Jun 30, 2023

Which problem is this PR solving?

The trace drop cache was causing cascading failures under high load when stress relief triggered. The problem was traced to the cuckoofilter library repeatedly locking the system random number generator internally; the time spent locking and unlocking was significant.

This change includes a private fork of the cuckoofilter library that removes this lock by using a different way to generate random values. This alone cuts the time spent in the cuckoofilter by almost half, which is definitely helpful.
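For illustration only (this is not the code in the fork), the kind of change involved is replacing calls to math/rand's top-level functions, which share a single mutex-guarded source, with a lock-free way to pick a random bucket index. The package and function names below are hypothetical:

```go
package cuckoo

import "hash/maphash"

// randIndex returns a pseudo-random index in [0, n) without taking the
// global lock behind math/rand's top-level functions. Each call hashes with
// a fresh, randomly seeded maphash.Hash, which the runtime provides without
// shared locking. Illustrative sketch only; n is assumed to be > 0, and the
// actual fork may use a different approach.
func randIndex(n uint) uint {
	return uint(new(maphash.Hash).Sum64() % uint64(n))
}
```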

In addition, this change decouples the Add call for the drop cache: Add now simply puts an item into a large (1K-item) channel that is monitored by a separate goroutine, which periodically drains the queued items and inserts the whole batch into the cuckoo cache under a single lock.

The result is that Add() is now super fast until it reaches the point where the cuckoo cache can't keep up. Hopefully, the combination of these two techniques delays that point, but if we do reach it, instead of blocking on the cache, Add now simply drops the traceID. This has the potential to cause some trace coherence issues, but if Refinery is so heavily stressed that it can't even handle the volume under stress relief, dropping is better than crashing.
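A minimal sketch of this pattern, assuming a buffered channel, a non-blocking send that drops on overflow, and a background goroutine that batches inserts under one lock; the type, field, and constant names are illustrative stand-ins (a plain map stands in for the cuckoo filter), not the actual Refinery code:

```go
package cuckoo

import (
	"sync"
	"time"
)

const maxDrainPerWake = 1000 // hypothetical cap: "drain up to N items" per wake

type CuckooTraceChecker struct {
	mut     sync.Mutex
	current map[string]struct{} // stand-in for the cuckoo filter itself
	addCh   chan string         // large buffer so Add rarely blocks
}

func NewCuckooTraceChecker() *CuckooTraceChecker {
	c := &CuckooTraceChecker{
		current: make(map[string]struct{}),
		addCh:   make(chan string, 1024), // the "large (1K item) channel"
	}
	go c.drainLoop()
	return c
}

// Add queues a traceID without touching the cache's lock. If the queue is
// full because the cache can't keep up, the traceID is dropped rather than
// blocking the caller.
func (c *CuckooTraceChecker) Add(traceID string) {
	select {
	case c.addCh <- traceID:
	default:
		// drop rather than block under extreme load
	}
}

// drainLoop wakes up periodically and moves whatever has been queued into
// the cache, taking the lock once per batch instead of once per item.
func (c *CuckooTraceChecker) drainLoop() {
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		c.drainBatch()
	}
}

func (c *CuckooTraceChecker) drainBatch() {
	c.mut.Lock()
	defer c.mut.Unlock()
	for i := 0; i < maxDrainPerWake; i++ {
		select {
		case id := <-c.addCh:
			c.current[id] = struct{}{}
		default:
			return // channel empty; release the lock until the next wake
		}
	}
}
```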

The benchmarks don't really show this because they include the time to actually land the items in the cache; excluding that time, the Add calls take only single-digit nanoseconds each until the cache saturates and they start taking multiple microseconds.

Short description of the changes

  • Write some benchmarks (see below)
  • Create a channel used by Add
  • Write a goroutine that wakes up periodically and drains up to N items from the channel in a single lock
  • Use our fork of cuckoofilter

Benchmarks

  • Add -- test the add function only
  • Check -- test the check function only
  • AddParallel -- test the add function in 70 goroutines in parallel, while constantly generating random integers in the background to create contention on the random number generator (a sketch of this pattern follows the list)
  • CheckParallel -- test the check function running in 70 goroutines in parallel, while calling Add on a single goroutine in the background
  • CheckAddParallel -- run both Check and Add simultaneously in 30 goroutines each, while also generating random integers in the background. This is an attempt to mimic a high-load situation. The timing is calculated after the last task exits.
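For reference, here is a hedged sketch of the shape of the AddParallel benchmark, reusing the illustrative CuckooTraceChecker sketch above; the worker count matches the description, but everything else is an assumption rather than the actual benchmark code:

```go
package cuckoo

import (
	"strconv"
	"sync"
	"testing"
)

// BenchmarkAddParallelSketch: many goroutines call Add concurrently and the
// timer covers the whole run, stopping only after the last worker exits.
func BenchmarkAddParallelSketch(b *testing.B) {
	c := NewCuckooTraceChecker()
	const workers = 70

	// Fixed pool of trace IDs so string construction isn't measured.
	ids := make([]string, 100_000)
	for i := range ids {
		ids[i] = "trace-" + strconv.Itoa(i)
	}

	b.ResetTimer()
	var wg sync.WaitGroup
	perWorker := b.N / workers
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Add(ids[(w*perWorker+i)%len(ids)])
			}
		}(w)
	}
	wg.Wait() // timing includes all workers finishing
}
```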

Note that this system's performance depends on volume, which fools the benchmark engine's attempt to pick an iteration count that yields stable results. As a result, I've run tests with a fixed number of items. All tests are run with 1000 traces and 100_000 traces -- basically, trying for undersaturated and oversaturated conditions. After this PR, the performance in the undersaturated case is much better, which hopefully means we're less likely to reach saturation.
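One way to pin the trace counts is with Go sub-benchmarks (b.Run), sketched below; the 1_000 and 100_000 values come from the description above, while the structure and names are assumptions:

```go
package cuckoo

import (
	"strconv"
	"testing"
)

// BenchmarkAddFixedVolumes runs the same body against fixed-size working
// sets so undersaturated and oversaturated behavior can be compared.
func BenchmarkAddFixedVolumes(b *testing.B) {
	for _, traces := range []int{1_000, 100_000} {
		traces := traces
		b.Run(strconv.Itoa(traces)+"_traces", func(b *testing.B) {
			ids := make([]string, traces)
			for i := range ids {
				ids[i] = "trace-" + strconv.Itoa(i)
			}
			c := NewCuckooTraceChecker()
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				c.Add(ids[i%traces]) // cycle through a fixed-size working set
			}
		})
	}
}
```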

Before

```
BenchmarkCuckooTraceChecker_Add-10                 	  163296	      7312 ns/op
BenchmarkCuckooTraceChecker_AddParallel-10         	  147472	      8028 ns/op
BenchmarkCuckooTraceChecker_Check-10               	79204432	        24.89 ns/op
BenchmarkCuckooTraceChecker_CheckParallel-10       	 3346725	       314.4 ns/op
BenchmarkCuckooTraceChecker_CheckAddParallel-10    	  139542	      9105 ns/op
```

After

```
BenchmarkCuckooTraceChecker_Add-10                 	365702535	        11.99 ns/op
BenchmarkCuckooTraceChecker_AddParallel-10         	 2504250	       658.3 ns/op
BenchmarkCuckooTraceChecker_Check-10               	78540847	        17.14 ns/op
BenchmarkCuckooTraceChecker_CheckParallel-10       	 2583024	       469.4 ns/op
BenchmarkCuckooTraceChecker_CheckAddParallel-10    	  234552	      5117 ns/op
```


@kentquirk requested a review from a team as a code owner on June 30, 2023 14:18
@TylerHelmuth (Contributor) commented:
We tested this change over the last 24 hours and feel confident that it fixes the problem. Will merge in for 2.0.

@TylerHelmuth merged commit 9e179ed into main on Jul 6, 2023
@TylerHelmuth deleted the kent.cuckoo_perf branch on July 6, 2023 16:03
Successfully merging this pull request may close these issues.

Fix bug with StressRelief where a single node in Stress causes others to enter stress and get stuck