Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Stress Relief system #594

Merged
merged 3 commits into from
Feb 1, 2023
Merged

feat: Stress Relief system #594

merged 3 commits into from
Feb 1, 2023

Conversation

kentquirk
Copy link
Contributor

Stress Relief

The problem this is designed to address is that refinery can become unstable under heavy load, and it goes from stable to unstable very quickly. A larger refinery cluster can be very hard to bring up -- we've seen what are effectively contagious restarts, where a given node crashes, wakes up, and then another node crashes, etc.

The general idea here is "let's make refinery be more stable by allowing it to shed load quickly under stress". This PR implements a "StressRelief" system that, when activated, deterministically samples each span based only on its trace ID, without attempting to direct traces to the appropriate shard first. The configuration determines the sample rate used.

There is configuration to set this up, and it allows the "mode" to be configured as "never" (it's not active), "monitor" (it's active when the stress indicator is on), or "always" (always active).

I tested this system with mode set to never and established a baseline instability level for a cluster of 3 refineries at a given "load index" (an arbitrary value indicating the amount of traffic). I found that nodes began restarting once the load index went over 600. When I set the mode to "always", the cluster was quite happy with a load of 1500, and there was no indication of stress -- it felt like it could go much higher, but I haven't yet tried higher numbers.

The next problem was to get an accurate measurement of stress so that we can automatically turn this system on. Many hours of watching refinery crash led to the realization that the internal queues seem to be critical, as does memory usage.

The desired strategy is to measure some key metrics and compare them with some defined threshold values on an appropriately adjusted scale from 0 to 1. We can then watch to see when any of them starts to rise too close to the threshold, and turn on the stress relief (and also give the user some indication of which metric is the biggest problem)

The StressRelief system uses a hysteresis model where it turns on at a much higher level than it turns off -- the current defaults are 50 and 20 (user-adjustable, of course).

The set of particular metrics to watch is not currently user-adjustable, although they're coded to make that possible without too much difficulty. But it's a lot of added complexity in the config for probably little value, since most people won't have a good idea which things they should worry about.

Memory

The existing cache system tries to manage memory, particularly with the newer "impact" cache overrun strategy. But we can also use excessive RAM usage as a signal for stress.

Queues

There are multiple queues that are tracked in metrics:

  • "collector_peer_queue_length"
  • "collector_incoming_queue_length"
  • "libhoney_peer_queue_length"
  • "libhoney_upstream_queue_length"

Once any of these queue lengths rise above zero, it means that refinery is not processing data as quickly as it is arriving, and in a high-volume situation, it doesn't take long for that to overwhelm refinery and crash it. But the configured length of the queues is definitely a factor.

@kentquirk kentquirk requested a review from a team as a January 31, 2023 18:07
@kentquirk kentquirk added type: enhancement New feature or request status: review needed Changes need review. version: bump minor A PR that adds behavior, but is backwards-compatible. labels Jan 31, 2023
@kentquirk kentquirk force-pushed the kent.metrics_for_stress branch from c1e4d74 to cbd967b Compare February 1, 2023 15:43
@kentquirk
Copy link
Contributor Author

kentquirk commented Feb 1, 2023

I'll resolve conflicts etc after #593 is merged, as this is dependent on it anyway.

Base automatically changed from kent.metrics_for_stress to main February 1, 2023 19:45
collect/collect.go Outdated Show resolved Hide resolved
collect/collect.go Show resolved Hide resolved
collect/stressRelief.go Show resolved Hide resolved
collect/stressRelief.go Show resolved Hide resolved
@kentquirk kentquirk merged commit aeb0467 into main Feb 1, 2023
@kentquirk kentquirk deleted the kent.stress_relief branch February 1, 2023 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: review needed Changes need review. type: enhancement New feature or request version: bump minor A PR that adds behavior, but is backwards-compatible.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants