Integrate seastar memory profiling into RP #10562
Why is the memory sampling service showing up in these apparently unrelated tests?
Does the fixture somehow require it?
Yes, exactly: all these tests use storage::api or the log_manager, which own the batch cache.
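For illustration, the extra wiring this implies in a test fixture might look roughly like the sketch below; the service name `memory_sampling` follows the discussion above, but the fixture shape and constructor parameters are assumptions, not the actual test code from this PR.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/sharded.hh>

namespace ss = seastar;

// Hypothetical fixture wiring: because storage::api / the log_manager own the
// batch cache, and the batch cache now needs the memory sampling service for
// low-watermark logging, every storage-based test fixture has to start one
// more sharded service before it can start storage.
struct storage_fixture {
    ss::sharded<memory_sampling> _memory_sampling; // assumed service type/name
    ss::sharded<storage::api> _storage;

    ss::future<> start() {
        co_await _memory_sampling.start(); // extra step in every fixture
        // Assumed: storage::api now additionally takes a reference to the
        // sampling service so it can forward it to the batch cache.
        co_await _storage.start(
          kv_config(), log_config(), std::ref(_memory_sampling));
    }
};
```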
It's a bit unfortunate: additional complexity for all these additional services and the tests that use them. I guess this was because of my suggestion to have the reclaim process trigger the LWM logger, right?
Probably something like @dotnwat's suggestion of a reclaim API, which could decouple these two, would be a good way forward: you could register a reclaim listener that gets notified when reclaim happens, without every reclaimer having to know about every listener.
I don't think we need to do this in this series though.
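A rough sketch of what such a reclaim-listener registration could look like; all names here (`reclaim_notifier`, `register_reclaim_listener`) are hypothetical and not part of this PR or of seastar.

```cpp
#include <seastar/util/noncopyable_function.hh>
#include <cstddef>
#include <utility>
#include <vector>

namespace ss = seastar;

// Hypothetical per-shard registry that decouples reclaimers from listeners:
// the batch cache (or any other reclaimer) only calls notify(), and the
// low-watermark logger just registers a callback, without either side
// knowing about the other.
class reclaim_notifier {
public:
    using listener = ss::noncopyable_function<void(size_t bytes_reclaimed)>;

    void register_reclaim_listener(listener l) {
        _listeners.push_back(std::move(l));
    }

    // Called by a reclaimer after it has released memory.
    void notify(size_t bytes_reclaimed) {
        for (auto& l : _listeners) {
            l(bytes_reclaimed);
        }
    }

private:
    std::vector<listener> _listeners;
};
```

With something like this, the memory sampling service would register its low-watermark logger once at startup, and the batch cache would only need a reference to the notifier rather than to the sampling service itself.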
Yeah, one of the reasons I did the timer-based thing before was that I wanted to avoid all these changes. Though that's really just being a bit lazy, and the notify-based approach certainly seems better. The reclaim API suggestion would certainly help.
I think it is confusing that "size" and "count" have a totally different basis: one upscaled and one not. I would not expect this, and I'd expect `size / count` to give the average (often the "only") allocation size, but it won't. If we want to provide upscaled values (I think it is quite nice to), I suggest we provide them for both.
`size` is also perhaps a bit unclear? Maybe `bytes_sampled`, `bytes_upscaled`, `count_sampled`, `count_upscaled`?
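For illustration, a per-site record with both bases could look roughly like this; the struct is a hypothetical sketch of the naming suggestion above, not the actual debug API type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-allocation-site record exposing both the raw sampled
// values and their upscaled estimates, so that e.g.
// bytes_upscaled / count_upscaled gives the average allocation size.
struct allocation_site_stats {
    uintptr_t backtrace_id;
    size_t bytes_sampled;  // sum of sizes of the allocations actually sampled
    size_t bytes_upscaled; // estimate of total bytes allocated at this site
    size_t count_sampled;  // number of samples taken at this site
    size_t count_upscaled; // estimate of total number of allocations
};
```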
I see your point. However, I don't really see how one could upscale the count; we are not sampling based on it.
Overall this is a bit of an artefact of the seastar implementation. I guess the proper names would be `bytes_upscaled` and `samples_taken`. I haven't made any use of the count value yet, so we could probably just drop it completely from the debug API.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the count is really useful, because the average size of the allocations is useful info.
I had assumed "count" gets scaled up in a similar way as size: by the inverse of the probability that any given allocation is sampled. But I see the problem: different allocation sizes may be mixed at the same site?
Still, doesn't `upscaled_size / sampled_size * sampled_count` give a meaningful number?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly.
It breaks if the allocations are very diverse, for example:
Sample rate: 16k (let's assume we sample exactly every 16k for simplicity)
Actual allocations: 16 * 1k (one gets sampled), 1 * 16k (gets sampled)
The math then works out as:
Upscaled size: 32k
Sampled size: 17k
Sampled count: 2
=> Upscaled count: 32k / 17k * 2 ≈ 3.7, vs. 17 actual allocations
Real avg. allocation size: 32k / 17 ≈ 1.9k
I will think a bit more over the weekend, but I can't really come up with anything meaningful.
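A quick way to sanity-check the arithmetic above (the variable names are just for illustration):

```cpp
#include <cstdio>

int main() {
    // Example from the discussion: sample every 16 KiB, 16 allocations of
    // 1 KiB (one of which gets sampled) plus one allocation of 16 KiB.
    double upscaled_size = 32.0; // KiB: both samples upscale to ~16 KiB each
    double sampled_size = 17.0;  // KiB: 1 KiB + 16 KiB actually sampled
    double sampled_count = 2.0;

    double upscaled_count = upscaled_size / sampled_size * sampled_count;
    double real_avg_size = 32.0 / 17.0;

    // Prints ~3.76 estimated allocations vs. 17 actual,
    // and ~1.88 KiB real average allocation size.
    std::printf("upscaled_count=%.2f real_avg=%.2f KiB\n",
                upscaled_count, real_avg_size);
}
```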
I was thinking about this a bit more but can't come up with something that breaks with a basic example.
You are right, I wasn't thinking this through carefully. I feel like we should track the min & max size for sampled allocations, which would help cover this hole, but this can wait for another change.
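A minimal sketch of what tracking min & max per site could look like (a hypothetical type, not the actual seastar or Redpanda structure):

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Hypothetical per-site accumulator: recording the smallest and largest
// sampled allocation size would expose how diverse the allocations at a
// site are, i.e. how much the upscaled-count estimate can be trusted.
struct site_sample_stats {
    size_t bytes_sampled = 0;
    size_t count_sampled = 0;
    size_t min_size = std::numeric_limits<size_t>::max();
    size_t max_size = 0;

    void record(size_t alloc_size) {
        bytes_sampled += alloc_size;
        ++count_sampled;
        min_size = std::min(min_size, alloc_size);
        max_size = std::max(max_size, alloc_size);
    }
};
```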