
New cache management strategy #547

Merged · 4 commits merged from kent.stable_cache into main · Nov 9, 2022
Conversation

@kentquirk (Contributor) commented on Oct 28, 2022:

Implements a new cache management strategy that ejects "large" items from the cache rather than resizing the cache.

Which problem is this PR solving?

Currently, when Refinery is under memory pressure and exceeds the configured memory maximum, it attempts to resize the trace cache to 90% of its previous size, ejecting the oldest traces. But because the trace cache is sized by trace count, a single very large trace can cause the cache to shrink repeatedly while discarding only small traces, which frees little memory. Each resize multiplies the capacity by 0.9, so about two dozen consecutive overruns can leave the cache below 10% of its configured size (0.9^22 ≈ 0.098), to very little benefit.

Furthermore, the cache never recovers its original size until the configuration is manually reloaded.

This PR implements a different strategy:

  • The memory size of individual spans is calculated when they are placed into the cache
  • Spans also track their arrival time
  • The "cacheImpact" of a span is a measure of how long the span has been in the cache, multiplied by the size of the span
  • Traces keep track of the total impact of all the spans in the trace
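
To make the arithmetic concrete, here is a minimal Go sketch of the bookkeeping described above. It is an illustration, not Refinery's actual code: the names Span, DataSize, ArrivalTime, CacheImpact, Trace, and TotalImpact are assumptions for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified stand-in for a cached span. Field and method
// names are illustrative, not necessarily the identifiers Refinery uses.
type Span struct {
	DataSize    int       // estimated memory footprint, computed at insertion
	ArrivalTime time.Time // recorded when the span enters the cache
}

// CacheImpact is time-in-cache multiplied by span size, so large spans
// that linger accumulate impact fastest.
func (s *Span) CacheImpact(now time.Time) int64 {
	return int64(now.Sub(s.ArrivalTime).Seconds()) * int64(s.DataSize)
}

// Trace tracks the total impact of all of its spans.
type Trace struct {
	Spans []*Span
}

func (t *Trace) TotalImpact(now time.Time) int64 {
	var total int64
	for _, s := range t.Spans {
		total += s.CacheImpact(now)
	}
	return total
}

func main() {
	now := time.Now()
	tr := &Trace{Spans: []*Span{
		{DataSize: 4096, ArrivalTime: now.Add(-30 * time.Second)}, // big, old
		{DataSize: 512, ArrivalTime: now.Add(-5 * time.Second)},   // small, new
	}}
	fmt.Println("total cache impact:", tr.TotalImpact(now))
}
```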

When a memory overrun occurs, the system sorts the traces by cache impact and ejects (makes a sampling decision and drops or sends) those traces with the largest impact until memory usage falls below the maximum. The cache is not resized.
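
Continuing that sketch, the overrun handling might look like the following. The memUsed and eject parameters are hypothetical stand-ins for the collector's real memory accounting and its sample-then-send-or-drop step; this would share a file with the sketch above, with "sort" added to the imports.

```go
// handleOverrun ejects the highest-impact traces until memory use falls
// back under maxAlloc. Note that the cache itself is never resized.
func handleOverrun(traces []*Trace, memUsed func() int64, maxAlloc int64,
	eject func(*Trace), now time.Time) []*Trace {

	// Largest total impact first.
	sort.Slice(traces, func(i, j int) bool {
		return traces[i].TotalImpact(now) > traces[j].TotalImpact(now)
	})

	var kept []*Trace
	for _, t := range traces {
		if memUsed() > maxAlloc {
			eject(t) // make a sampling decision, then send or drop the trace
		} else {
			kept = append(kept, t)
		}
	}
	return kept
}
```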

This strategy leads to more stable memory usage and fewer overruns.

Short description of the changes

  • As described above, spans record their memory size and arrival time, and each trace accumulates the total cacheImpact of its spans
  • There's some new telemetry
  • There's a config value to control switching between the two modes dynamically, and it shows up in the sample config (see the sketch after this list)
  • There are tests for some of the algorithmic calculations as well as the strategy modes
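
For the config toggle mentioned in the list above, here is a sketch of what the switch might look like in the sample TOML config. The key name and values are assumptions for illustration; config_complete.toml in this PR has the authoritative spelling and documentation.

```toml
# Hypothetical sketch only; see config_complete.toml for the real key name.
# "resize" would keep the legacy shrink-the-cache behavior, while "impact"
# would enable the new impact-based ejection strategy described above.
CacheOverrunStrategy = "impact"
```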

Note to reviewers -- this is on the large side, but I didn't see a great way to break it up.

@kentquirk kentquirk marked this pull request as ready for review October 28, 2022 22:46
@kentquirk kentquirk requested review from a team and MikeGoldsmith October 28, 2022 22:46
@kentquirk kentquirk added the labels "type: enhancement" (New feature or request) and "version: bump minor" (A PR that adds behavior, but is backwards-compatible) on Oct 28, 2022
@MikeGoldsmith (Contributor) left a comment:

I think this looks really good.

I've left some suggestions, with the main one being how we expose the cache strategy via configuration.

(Review threads on collect/collect.go, rules_complete.toml, types/event.go, and config_complete.toml, all resolved.)
@TylerHelmuth (Contributor) left a comment:
Great stuff. I left a couple of questions here and there.

(Review threads on collect/cache/cache.go and collect/collect.go, all resolved.)
@MikeGoldsmith (Contributor) left a comment:
Updated config looks good - I've left a query regarding config validation and empty values.

(Review thread on config/file_config.go, resolved.)
@TylerHelmuth (Contributor) left a comment:
Looks good to me as long as @MikeGoldsmith's question is addressed.

(Review threads on collect/collect.go and types/event.go, all resolved.)
@kentquirk kentquirk merged commit 43b94c6 into main Nov 9, 2022
@kentquirk kentquirk deleted the kent.stable_cache branch November 9, 2022 14:27
ghost pushed a commit to opsramp/tracing-proxy that referenced this pull request on Jul 5, 2024; the commit message duplicates the PR description above.