Stress Relief
The problem this is designed to address is that refinery can become unstable under heavy load, and it goes from stable to unstable very quickly. A larger refinery cluster can be very hard to bring up -- we've seen what are effectively contagious restarts, where one node crashes, comes back up, and then another node crashes, and so on.
The general idea here is "let's make refinery be more stable by allowing it to shed load quickly under stress". This PR implements a "StressRelief" system that, when activated, deterministically samples each span based only on its trace ID, without attempting to direct traces to the appropriate shard first. The configuration determines the sample rate used.
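To make that concrete, here is a minimal sketch of deterministic trace-ID sampling (a sketch of the idea, not the actual refinery code; the hash and bound details may differ): the trace ID is hashed, and the span is kept only if the hash falls under a bound derived from the sample rate, so every node reaches the same decision without routing the trace to its shard.

```go
package stressrelief

import (
	"crypto/sha1"
	"encoding/binary"
	"math"
)

// shouldKeep decides whether to keep a span using only its trace ID and the
// configured sample rate. Because the decision is a pure function of the
// trace ID, every node agrees on it without consulting the owning shard.
func shouldKeep(traceID string, sampleRate uint64) bool {
	if sampleRate < 1 {
		sampleRate = 1
	}
	sum := sha1.Sum([]byte(traceID))
	// The first 4 bytes of the hash give a uniformly distributed value.
	v := binary.BigEndian.Uint32(sum[:4])
	return v <= uint32(math.MaxUint32/sampleRate)
}
```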
There is configuration to set this up, which allows the "mode" to be set to "never" (stress relief is never active), "monitor" (it activates when the stress indicator turns on), or "always" (it is always active).
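As a rough illustration of those knobs, the settings group together something like this (the field names below are assumptions, not necessarily the real config keys):

```go
// StressReliefConfig is an illustrative sketch of the settings described
// above; the actual config key names may differ.
type StressReliefConfig struct {
	// Mode is one of "never", "monitor", or "always".
	Mode string
	// ActivationLevel is the stress level at which relief turns on when
	// Mode is "monitor" (see the hysteresis discussion below).
	ActivationLevel uint
	// DeactivationLevel is the lower level at which relief turns back off.
	DeactivationLevel uint
	// SamplingRate is the deterministic sample rate applied while active.
	SamplingRate uint64
}
```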
I tested this system with the mode set to "never" and established a baseline instability level for a cluster of 3 refineries at a given "load index" (an arbitrary value indicating the amount of traffic). I found that nodes began restarting once the load index went over 600. When I set the mode to "always", the cluster was quite happy with a load of 1500, with no indication of stress -- it felt like it could go much higher, but I haven't yet tried higher numbers.
The next problem was to get an accurate measurement of stress so that we can automatically turn this system on. Many hours of watching refinery crash led to the realization that the internal queues seem to be critical, as does memory usage.
The strategy is to measure a few key metrics and compare each against a defined threshold, scaling the result onto a range from 0 to 1. We can then watch for any of them rising too close to its threshold and turn on stress relief (and also give the user some indication of which metric is the biggest problem).
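A sketch of that calculation (the names here are illustrative): each watched metric is divided by its threshold and clamped to [0, 1], and the overall stress level is the largest of those ratios, along with the name of the metric that produced it.

```go
// watchedMetric pairs a metric's current value with its configured threshold.
type watchedMetric struct {
	Name      string
	Value     float64
	Threshold float64
}

// stressLevel scales each metric against its threshold onto [0, 1] and
// returns the worst ratio plus the name of the metric responsible for it.
func stressLevel(metrics []watchedMetric) (level float64, worst string) {
	for _, m := range metrics {
		if m.Threshold <= 0 {
			continue
		}
		ratio := m.Value / m.Threshold
		if ratio > 1 {
			ratio = 1
		}
		if ratio > level {
			level, worst = ratio, m.Name
		}
	}
	return level, worst
}
```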
The StressRelief system uses a hysteresis model where it turns on at a much higher level than it turns off -- the current defaults are 50 and 20 (user-adjustable, of course).
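A minimal sketch of the hysteresis, assuming the overall stress level (the worst ratio from the sketch above, expressed as a percentage) is compared against the two thresholds: relief switches on only when the level crosses the higher activation threshold and switches off only when it drops below the lower deactivation threshold, which keeps it from flapping when the level hovers around a single cutoff.

```go
// stressRelief tracks whether relief is currently active, using two
// thresholds to provide hysteresis.
type stressRelief struct {
	activateAt   float64 // default 50
	deactivateAt float64 // default 20
	active       bool
}

// update feeds in the latest overall stress level and returns whether
// stress relief should be active.
func (s *stressRelief) update(level float64) bool {
	switch {
	case !s.active && level >= s.activateAt:
		s.active = true
	case s.active && level <= s.deactivateAt:
		s.active = false
	}
	return s.active
}
```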
The set of particular metrics to watch is not currently user-adjustable, although the code is structured to make that possible without too much difficulty. But it would add a lot of complexity to the config for probably little value, since most people won't have a good idea of which metrics they should worry about.
Memory
The existing cache system tries to manage memory, particularly with the newer "impact" cache overrun strategy. But we can also use excessive RAM usage as a signal for stress.
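For example (the maximum-allocation setting here is an assumption, not an existing config value), the signal could be the runtime's current heap allocation scaled against a configured ceiling, producing another 0-1 value for the stress calculation above:

```go
package stressrelief

import "runtime"

// memoryStress returns heap usage as a fraction of a configured ceiling
// (clamped to 1), suitable as one input to the overall stress level.
func memoryStress(maxAllocBytes uint64) float64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if maxAllocBytes == 0 {
		return 0
	}
	ratio := float64(ms.Alloc) / float64(maxAllocBytes)
	if ratio > 1 {
		ratio = 1
	}
	return ratio
}
```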
Queues
There are multiple queues that are tracked in metrics:
Once any of these queue lengths rises above zero, it means that refinery is not processing data as quickly as it is arriving, and in a high-volume situation it doesn't take long for that to overwhelm refinery and crash it. The configured length of each queue is definitely a factor.
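One way to fold the queues into the stress level (a sketch, not the exact implementation) is to scale each queue's current length by its configured capacity, so a larger configured queue tolerates more backlog before it pushes the stress level up:

```go
// queueStress returns a queue's contribution to the stress level: its
// current length as a fraction of its configured capacity, clamped to 1.
func queueStress(queueLen, queueCap int) float64 {
	if queueCap <= 0 {
		return 0
	}
	ratio := float64(queueLen) / float64(queueCap)
	if ratio > 1 {
		ratio = 1
	}
	return ratio
}
```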