New cache management strategy #547
Conversation
I think this looks really good.
I've left some suggestions, with the main one being how we expose the cache strategy via configuration.
Great stuff. Left a couple of questions here and there.
Updated config looks good - I've left a query regarding config validation and empty values.
Looks good to me as long as @MikeGoldsmith's question is addressed.
Implements a new cache management strategy that ejects "large" items from the cache rather than resizing the cache.
Which problem is this PR solving?
Currently, when Refinery is under memory pressure and exceeds the configured memory maximum, it attempts to resize the trace cache to 90% of its previous size and ejects the oldest traces. But because the trace cache is sized by trace count, a single very large trace can cause the cache to shrink repeatedly while discarding the smaller traces, which frees little memory. The result is that the cache can be resized to a tiny fraction of its original size for very little benefit.
Furthermore, the cache never recovers its original size until the configuration is manually reloaded.
This PR implements a different strategy (a code sketch follows the list):
* The memory size of individual spans is calculated when they are placed into the cache
* Spans also track their arrival time
* The "cacheImpact" of a span is a measure of how long the span has been in the cache, multiplied by the size of the span
* Traces keep track of the total impact of all the spans in the trace
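As a rough illustration of that bookkeeping, here is a minimal Go sketch; all type, field, and method names are illustrative, not necessarily the identifiers this PR actually uses.

```go
package cache

import "time"

// Span records an estimated memory footprint and an arrival time so
// its cache impact can be computed later. Names are illustrative.
type Span struct {
	ArrivalTime time.Time
	DataSize    int // estimated size in bytes, computed when the span is cached
}

// CacheImpact weights the span's size by how long it has occupied the
// cache, so old, large spans score highest.
func (s *Span) CacheImpact(now time.Time) int64 {
	age := now.Sub(s.ArrivalTime)
	return int64(age.Seconds()) * int64(s.DataSize)
}

// Trace aggregates the impact of all of its spans.
type Trace struct {
	Spans []*Span
}

// TotalImpact sums the cache impact of every span in the trace.
func (t *Trace) TotalImpact(now time.Time) int64 {
	var total int64
	for _, s := range t.Spans {
		total += s.CacheImpact(now)
	}
	return total
}
```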
When a memory overrun occurs, the system sorts the traces by cache impact and ejects (makes a sampling decision and drops or sends) those traces with the largest impact until memory usage falls below the maximum. The cache is not resized.
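Continuing the sketch above (same package, plus the `sort` import), the eviction pass might look roughly like this; `ejectUntilUnder`, `MemorySize`, and the `eject` callback are hypothetical names, and the PR's actual memory accounting may differ.

```go
// MemorySize is the trace's total estimated footprint: the sum of its
// spans' sizes.
func (t *Trace) MemorySize() int64 {
	var total int64
	for _, s := range t.Spans {
		total += int64(s.DataSize)
	}
	return total
}

// ejectUntilUnder sorts traces by cache impact, largest first, and
// ejects each one until estimated usage falls back under the maximum;
// the cache itself is never resized. The eject callback stands in for
// making the sampling decision and then sending or dropping the trace.
func ejectUntilUnder(traces []*Trace, memUsed, memMax int64, eject func(*Trace)) {
	now := time.Now()
	sort.Slice(traces, func(i, j int) bool {
		return traces[i].TotalImpact(now) > traces[j].TotalImpact(now)
	})
	for _, t := range traces {
		if memUsed <= memMax {
			break
		}
		memUsed -= t.MemorySize()
		eject(t)
	}
}
```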
This strategy leads to more stable memory usage and fewer overruns.
Short description of the changes
* The memory size of individual spans is calculated when they are placed into the cache
* Spans also track their arrival time
* The "cacheImpact" of a span is a measure of how long the span has been in the cache, multiplied by the size of the span
* Traces keep track of the total impact of all the spans in the trace
* There's some new telemetry
* There's a config value to control switching between the two modes dynamically, and it shows up in the sample config
* There are tests for some of the algorithmic calculations as well as the strategy modes
Note to reviewers -- this is on the large side, but I didn't see a great way to break it up.