Stress Relief
The problem this is designed to address is that refinery can become unstable under heavy load, and it goes from stable to unstable very quickly. A larger refinery cluster can be very hard to bring up -- we've seen what are effectively contagious restarts, where one node crashes, comes back up, and then another node crashes, and so on.
The general idea here is "let's make refinery be more stable by allowing it to shed load quickly under stress". This PR implements a "StressRelief" system that, when activated, deterministically samples each span based only on its trace ID, without attempting to direct traces to the appropriate shard first. The configuration determines the sample rate used.
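To make that concrete, here is a minimal sketch of deterministic trace-ID sampling (a sketch of the idea, not the actual refinery code; the hash and bound details may differ): the trace ID is hashed, and the span is kept only if the hash falls under a bound derived from the sample rate, so every node reaches the same decision without routing the trace to its shard.

```go
package stressrelief

import (
	"crypto/sha1"
	"encoding/binary"
	"math"
)

// shouldKeep decides whether to keep a span using only its trace ID and the
// configured sample rate. Because the decision is a pure function of the
// trace ID, every node agrees on it without consulting the owning shard.
func shouldKeep(traceID string, sampleRate uint64) bool {
	if sampleRate < 1 {
		sampleRate = 1
	}
	sum := sha1.Sum([]byte(traceID))
	// The first 4 bytes of the hash give a uniformly distributed value.
	v := binary.BigEndian.Uint32(sum[:4])
	return v <= uint32(math.MaxUint32/sampleRate)
}
```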
There is configuration to set this up, which allows the "mode" to be set to "never" (stress relief is never active), "monitor" (it activates when the stress indicator turns on), or "always" (it is always active).
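As a rough illustration of those knobs, the settings group together something like this (the field names below are assumptions, not necessarily the real config keys):

```go
// StressReliefConfig is an illustrative sketch of the settings described
// above; the actual config key names may differ.
type StressReliefConfig struct {
	// Mode is one of "never", "monitor", or "always".
	Mode string
	// ActivationLevel is the stress level at which relief turns on when
	// Mode is "monitor" (see the hysteresis discussion below).
	ActivationLevel uint
	// DeactivationLevel is the lower level at which relief turns back off.
	DeactivationLevel uint
	// SamplingRate is the deterministic sample rate applied while active.
	SamplingRate uint64
}
```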
I tested this system with the mode set to "never" and established a baseline instability level for a cluster of 3 refineries at a given "load index" (an arbitrary value indicating the amount of traffic). I found that nodes began restarting once the load index went over 600. When I set the mode to "always", the cluster was quite happy with a load of 1500, with no indication of stress -- it felt like it could go much higher, but I haven't yet tried higher numbers.
The next problem was to get an accurate measurement of stress so that we can automatically turn this system on. Many hours of watching refinery crash led to the realization that the internal queues seem to be critical, as does memory usage.
The strategy is to measure a few key metrics and compare each against a defined threshold, scaling the result onto a range from 0 to 1. We can then watch for any of them rising too close to its threshold and turn on stress relief (and also give the user some indication of which metric is the biggest problem).
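A sketch of that calculation (the names here are illustrative): each watched metric is divided by its threshold and clamped to [0, 1], and the overall stress level is the largest of those ratios, along with the name of the metric that produced it.

```go
// watchedMetric pairs a metric's current value with its configured threshold.
type watchedMetric struct {
	Name      string
	Value     float64
	Threshold float64
}

// stressLevel scales each metric against its threshold onto [0, 1] and
// returns the worst ratio plus the name of the metric responsible for it.
func stressLevel(metrics []watchedMetric) (level float64, worst string) {
	for _, m := range metrics {
		if m.Threshold <= 0 {
			continue
		}
		ratio := m.Value / m.Threshold
		if ratio > 1 {
			ratio = 1
		}
		if ratio > level {
			level, worst = ratio, m.Name
		}
	}
	return level, worst
}
```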
The StressRelief system uses a hysteresis model where it turns on at a much higher level than it turns off -- the current defaults are 50 and 20 (user-adjustable, of course).
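A minimal sketch of the hysteresis, assuming the overall stress level (the worst ratio from the sketch above, expressed as a percentage) is compared against the two thresholds: relief switches on only when the level crosses the higher activation threshold and switches off only when it drops below the lower deactivation threshold, which keeps it from flapping when the level hovers around a single cutoff.

```go
// stressRelief tracks whether relief is currently active, using two
// thresholds to provide hysteresis.
type stressRelief struct {
	activateAt   float64 // default 50
	deactivateAt float64 // default 20
	active       bool
}

// update feeds in the latest overall stress level and returns whether
// stress relief should be active.
func (s *stressRelief) update(level float64) bool {
	switch {
	case !s.active && level >= s.activateAt:
		s.active = true
	case s.active && level <= s.deactivateAt:
		s.active = false
	}
	return s.active
}
```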
The set of particular metrics to watch is not currently user-adjustable, although the code is structured to make that possible without too much difficulty. But it would add a lot of complexity to the config for probably little value, since most people won't have a good idea of which metrics they should worry about.
Memory
The existing cache system tries to manage memory, particularly with the newer "impact" cache overrun strategy. But we can also use excessive RAM usage as a signal for stress.
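For example (the maximum-allocation setting here is an assumption, not an existing config value), the signal could be the runtime's current heap allocation scaled against a configured ceiling, producing another 0-1 value for the stress calculation above:

```go
package stressrelief

import "runtime"

// memoryStress returns heap usage as a fraction of a configured ceiling
// (clamped to 1), suitable as one input to the overall stress level.
func memoryStress(maxAllocBytes uint64) float64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if maxAllocBytes == 0 {
		return 0
	}
	ratio := float64(ms.Alloc) / float64(maxAllocBytes)
	if ratio > 1 {
		ratio = 1
	}
	return ratio
}
```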
Queues
There are multiple queues that are tracked in metrics:
Once any of these queue lengths rises above zero, it means that refinery is not processing data as quickly as it is arriving, and in a high-volume situation it doesn't take long for that to overwhelm refinery and crash it. The configured length of each queue is definitely a factor.
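One way to fold the queues into the stress level (a sketch, not the exact implementation) is to scale each queue's current length by its configured capacity, so a larger configured queue tolerates more backlog before it pushes the stress level up:

```go
// queueStress returns a queue's contribution to the stress level: its
// current length as a fraction of its configured capacity, clamped to 1.
func queueStress(queueLen, queueCap int) float64 {
	if queueCap <= 0 {
		return 0
	}
	ratio := float64(queueLen) / float64(queueCap)
	if ratio > 1 {
		ratio = 1
	}
	return ratio
}
```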