fix: Improve drop cache performance #757
Merged
Which problem is this PR solving?
The trace drop cache was causing cascading failures under high load when stress relief triggered. The problem was traced to the cuckoofilter library, which repeatedly locked the system random number generator internally; the time spent locking and unlocking was significant.
This change includes a private fork of the cuckoofilter library that removes this lock by generating random values a different way. This alone cuts the time spent in the cuckoofilter by almost half.
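As a rough illustration of the idea (not the actual code in the fork), the change amounts to replacing calls to the globally locked `math/rand` package functions with a cheap, unsynchronized generator owned by the filter itself. The `xorshift64` step and the `rng` field below are assumptions for the sketch:

```go
package cuckoo

// Filter is a minimal stand-in for the cuckoo filter type; only the RNG
// state matters for this sketch.
type Filter struct {
	// ... buckets, counts, etc.
	rng uint64 // private PRNG state; must be seeded non-zero
}

// randIndex returns a pseudo-random value in [0, n) without touching the
// package-level rand lock. xorshift64 is used here purely as an example of
// a cheap, lock-free generator.
func (f *Filter) randIndex(n uint64) uint64 {
	x := f.rng
	x ^= x << 13
	x ^= x >> 7
	x ^= x << 17
	f.rng = x
	return x % n
}
```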
In addition, this change decouples the Add call for the drop cache: Add now simply puts an item into a large (1K-item) channel that is monitored by a separate goroutine, which periodically inserts the queued batch of items into the cuckoo cache under a single lock (see the sketch after the next paragraph).
The result is that Add() is now extremely fast until the cuckoo cache can no longer keep up. The combination of these two techniques should push that point out, but if it is reached, instead of blocking on the cache, Add simply drops the traceID. This can cause some trace coherence issues, but if Refinery is so heavily stressed that it can't handle the volume even under stress relief, dropping is better than crashing.
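A hedged sketch of the decoupled Add path described above; the names (`DropCache`, `addChan`, `drain`), the 1024 buffer size, and the 10ms drain interval are illustrative, not the exact Refinery implementation:

```go
package cache

import (
	"sync"
	"time"
)

// cuckooSet stands in for the cuckoo filter's insert API.
type cuckooSet interface {
	Insert(id string)
}

type DropCache struct {
	mut     sync.Mutex
	cuckoo  cuckooSet
	addChan chan string
}

func NewDropCache(c cuckooSet) *DropCache {
	d := &DropCache{
		cuckoo:  c,
		addChan: make(chan string, 1024), // large buffer so Add rarely has to drop
	}
	go d.drain()
	return d
}

// Add queues the traceID for insertion. If the queue is full (the cache
// can't keep up), the ID is dropped rather than blocking the caller.
func (d *DropCache) Add(traceID string) {
	select {
	case d.addChan <- traceID:
	default:
		// drop on the floor; better than blocking under extreme stress
	}
}

// drain periodically moves everything queued so far into the cuckoo filter
// under a single lock acquisition.
func (d *DropCache) drain() {
	ticker := time.NewTicker(10 * time.Millisecond) // interval is an assumption
	defer ticker.Stop()
	for range ticker.C {
		n := len(d.addChan)
		if n == 0 {
			continue
		}
		d.mut.Lock()
		for i := 0; i < n; i++ {
			d.cuckoo.Insert(<-d.addChan)
		}
		d.mut.Unlock()
	}
}
```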
The benchmarks don't fully show this because they include the time to actually land the items in the cache; excluding that, the Add calls take only single-digit nanoseconds each until the cache saturates and they start taking multiple microseconds.
Short description of the changes
Benchmarks
Note that this system's performance depends on volume, so it fools the benchmark engine's attempt to choose how many items to use to get stable results. I've therefore run tests with a fixed number of items: all tests are run with 1000 traces and 100_000 traces, i.e., undersaturated and oversaturated. After this PR, performance in the undersaturated case is much better, which should make it less likely that we reach saturation at all.
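The fixed-volume benchmarks take roughly this shape (reusing the `DropCache` sketch above; `newTestCuckoo` is a hypothetical test helper, and the counts match the two cases described):

```go
package cache

import (
	"fmt"
	"testing"
)

// benchmarkAdds exercises Add with a fixed trace volume per iteration, so the
// testing package's automatic scaling of b.N doesn't change the saturation
// behavior being measured.
func benchmarkAdds(b *testing.B, numTraces int) {
	d := NewDropCache(newTestCuckoo())
	ids := make([]string, numTraces)
	for i := range ids {
		ids[i] = fmt.Sprintf("trace-%d", i)
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		for _, id := range ids {
			d.Add(id)
		}
	}
}

func BenchmarkAdd_1000(b *testing.B)   { benchmarkAdds(b, 1000) }
func BenchmarkAdd_100000(b *testing.B) { benchmarkAdds(b, 100_000) }
```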
Before
After