New cache management strategy #547
Conversation
I think this looks really good.
I've left some suggestions, with the main one being how we expose the cache strategy via configuration.
Great stuff. Left a couple of questions here and there.
Updated config looks good - I've left a query regarding config validation and empty values.
Looks good to me as long as @MikeGoldsmith's question is addressed.
Implements a new cache management strategy that ejects "large" items from the cache rather than resizing the cache.
Which problem is this PR solving?
Currently, when Refinery is under memory pressure and exceeds the configured memory maximum, it attempts to resize the trace cache to 90% of its previous size and ejects the oldest traces. But because the trace cache is sized by trace count, a single very large trace can cause the cache to shrink repeatedly while discarding the smaller traces, which frees little memory. The result is that the cache can be resized to a tiny fraction of its original size for very little benefit.
Furthermore, the cache never recovers its original size until the configuration is manually reloaded.
This PR implements a different strategy (a code sketch follows the list):
* The memory size of individual spans is calculated when they are placed into the cache
* Spans also track their arrival time
* The "cacheImpact" of a span is a measure of how long the span has been in the cache, multiplied by the size of the span
* Traces keep track of the total impact of all the spans in the trace
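As a rough illustration of that bookkeeping, here is a minimal Go sketch; all type, field, and method names are illustrative, not necessarily the identifiers this PR actually uses.

```go
package cache

import "time"

// Span records an estimated memory footprint and an arrival time so
// its cache impact can be computed later. Names are illustrative.
type Span struct {
	ArrivalTime time.Time
	DataSize    int // estimated size in bytes, computed when the span is cached
}

// CacheImpact weights the span's size by how long it has occupied the
// cache, so old, large spans score highest.
func (s *Span) CacheImpact(now time.Time) int64 {
	age := now.Sub(s.ArrivalTime)
	return int64(age.Seconds()) * int64(s.DataSize)
}

// Trace aggregates the impact of all of its spans.
type Trace struct {
	Spans []*Span
}

// TotalImpact sums the cache impact of every span in the trace.
func (t *Trace) TotalImpact(now time.Time) int64 {
	var total int64
	for _, s := range t.Spans {
		total += s.CacheImpact(now)
	}
	return total
}
```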
When a memory overrun occurs, the system sorts the traces by cache impact and ejects (makes a sampling decision and drops or sends) those traces with the largest impact until memory usage falls below the maximum. The cache is not resized.
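Continuing the sketch above (same package, plus the `sort` import), the eviction pass might look roughly like this; `ejectUntilUnder`, `MemorySize`, and the `eject` callback are hypothetical names, and the PR's actual memory accounting may differ.

```go
// MemorySize is the trace's total estimated footprint: the sum of its
// spans' sizes.
func (t *Trace) MemorySize() int64 {
	var total int64
	for _, s := range t.Spans {
		total += int64(s.DataSize)
	}
	return total
}

// ejectUntilUnder sorts traces by cache impact, largest first, and
// ejects each one until estimated usage falls back under the maximum;
// the cache itself is never resized. The eject callback stands in for
// making the sampling decision and then sending or dropping the trace.
func ejectUntilUnder(traces []*Trace, memUsed, memMax int64, eject func(*Trace)) {
	now := time.Now()
	sort.Slice(traces, func(i, j int) bool {
		return traces[i].TotalImpact(now) > traces[j].TotalImpact(now)
	})
	for _, t := range traces {
		if memUsed <= memMax {
			break
		}
		memUsed -= t.MemorySize()
		eject(t)
	}
}
```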
This strategy leads to more stable memory usage and fewer overruns.
Short description of the changes
* The memory size of individual spans is calculated when they are placed into the cache
* Spans also track their arrival time
* The "cacheImpact" of a span is a measure of how long the span has been in the cache, multiplied by the size of the span
* Traces keep track of the total impact of all the spans in the trace
* There's some new telemetry
* There's a config value to control switching between the two modes dynamically, and it shows up in the sample config
* There are tests for some of the algorithmic calculations as well as the strategy modes
Note to reviewers -- this is on the large side, but I didn't see a great way to break it up.