-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct/cg/fs2: use file + anon + swap for usage #3933
Conversation
libcontainer/cgroups/fs2/memory.go
Outdated
@@ -78,7 +78,7 @@ func setMemory(dirPath string, r *configs.Resources) error { | |||
return nil | |||
} | |||
|
|||
func statMemory(dirPath string, stats *cgroups.Stats) error { | |||
func statMemory(rootCgroupPath, dirPath string, stats *cgroups.Stats) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this, but it made unit testing simpler and keeps all exposed APIs unchanged.
Welcome better testing suggestions, especially some way to diff v1/v2 (either mock or e2e on real machines?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make UnifiedMountpoint
a variable that you can override in tests. We do this for some other stuff in the cgroup tests. The only issue is that it is publicly exposed and we probably don't want to expose allowing this to be changed by libcontainer users...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you prefer that?
I think this signature is awkward, but could probably be better resolved by more deliberate refactoring separate from this issue, while avoiding an exported/mutable var like you mentioned.
I was digging into the FS calls to see if I could drop in a mock, but quickly realized that wouldn't be so simple 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the only point of adding this extra argument is to make the function call statsFromMeminfo
. If that's the case, we can test statsFromMeminfo
directly in the unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah good call. I left a test for pod cgroup as an "anti-test" on the root cgroup, and added a simple test for v2 root, although I hoped to avoid mocking swap calculations.
80b6fb0
to
ff695e9
Compare
libcontainer/cgroups/fs2/memory.go
Outdated
// cgroup v1 `usage_in_bytes` reports memory usage as rss (NR_ANON_MAPPED) + cache (NR_FILE_PAGES). | ||
// cgroup v2 reports the same values as anon and file. | ||
// sum those to report the same value as `usage_in_bytes` in v1 for consistency. | ||
stats.MemoryStats.Usage.Usage = stats.MemoryStats.Stats["anon"] + stats.MemoryStats.Stats["file"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potentially we should add swapcached
here for consistency with v1 as well...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should emulate the what cgroupv1 gives you for the cgroupv2 (and only for the root case too) -- in general in OCI we just give you what the kernel provides. I suspect emulating cgroupv2 more correctly (which is what you seem to be doing here? I need to look into how the kernel does the accounting...).
We have to do cgroupv1 -> cgroupv2 mapping for writing to cgroupfs (because otherwise old configs wouldn't work on newer systems) but faking cgroupv1 stats output for cgroupv2 seems particularly odd to me.
@kolyshkin wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. my original intention was to match usage_in_bytes
which I think is NR_FILE_PAGES + NR_ANON_MAPPED (+ NR_SWAPCACHE) for the root cgroup. I'm not sure what the equivalent of memory.current
would be for the root cgroup in v2 if not something like the v1 estimate? I admit I don't know specifics of how the per-cg usage counter charges.
I guess part of the problem is "usage" is poorly defined vs. specific zones/memory types? part of my read was that runc -> cadvisor -> k8s basically aligned on "usage" == NR_FILE_PAGES + NR_ANON_MAPPED, and "working set" == "usage" - "inactive_file". which has issues, but is at least consistent in v1 + v2 short of renamed statistics.
I can definitely say total - free
hasn't been received well, but welcome alternatives 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolyshkin any thoughts on this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As described in https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v1/memory.rst?plain=1#L625-L633 , I think maybe we should add swapcached
here.
So if we want to merge this, we should make a decision for this question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with adding swapcached here. One reservation I had -- total - free
for swap also does not equal swapcached
for the root, but if you look at specific slices, memory.swap.current
for a given cgroup doesn't equal memory.stat.swapcached
anyway. I suppose it makes sense -- swap area can be used or not free in different ways (?), part of the swap space may not be writable (not the case for my demo however).
we also deliberately add memory + swap usage to match v1 later -
// As cgroup v1 reports SwapUsage values as mem+swap combined,
// while in cgroup v2 swap values do not include memory,
// report combined mem+swap for v1 compatibility.
swapUsage.Usage += memoryUsage.Usage
if swapUsage.Limit != math.MaxUint64 {
swapUsage.Limit += memoryUsage.Limit
}
stats.MemoryStats.SwapUsage = swapUsage
should potentially make the same adjustment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added swapcached, but what do you think about the swap adjustment?
I think we should add memory usage for consistency, e.g. you can see consumers expect it today https://github.com/google/cadvisor/blob/8164b38067246b36c773204f154604e2a1c962dc/container/libcontainer/handler.go#L808
but i'm not sure if we should stick with total - free
for swap usage, use root swapcached, or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should potentially make the same adjustment
I think so, maybe this is a bug before.
When we fall though to use statsFromMeminfo
to calc node's memory data, we return in here, so report combined mem+swap for v1 compatibility
is missing.
Please see:
https://github.com/opencontainers/runc/blob/main/libcontainer/cgroups/fs2/memory.go#L105-L126
@kolyshkin Am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think that might be better?
notably, I think we want nr_swap_pages from the kernel, but we only get proxies to that -
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/memcontrol.c#L3677-L3680
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/swapfile.c#L3257-L3258
wonder if we have the same "overcounting" for swap or not
@kolyshkin @cyphar ptal |
This is likely to break compatibility, so I'm not sure we can merge this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AkihiroSuda could you elaborate on the problem? Current code causes major pain as a behavioral change to downstream consumers for v1 vs v2. I would argue compatibility has already been broken, and this restores previous behavior. Can you clarify what you mean? Who breaks here with this change and in what way? |
There might be third party programs that are already depending on the existing output for v2. |
I can understand that this is a change to v2 consumers. I am not sure I see how it leads to breakage. You’d simply see a lower number for usage? The contract is unchanged. Alternative changes up the stack would lead to runc producing different stats for identical memory usage under v1 and v2, but cadvisor and Kubernetes would fudge it using the same calculation as this PR. It seems to me “more correct” to fix the actual meaning of usage. But if there is concern about this approach, then I would start the discussion at cadvisor level or potentially purely in k/k. The same argument you’re suggesting applies to cadvisor and k/k too, but at least Kubernetes folks seem to agree this is a bug for v1/v2. So I hope folks at some layer are willing to adapt and recognize the calculation is wildly different. Ps: what layer would you suggest the change occurs at? Or do you disagree there’s even an issue? |
Reframing. I can understand that changing these numbers with no API change
has an effect on users. That’s exactly the scenario I hope to “fix” and the
one which runc “broke” with the existing calculation.
What I don’t understand, is what then runc’s contract actually is. Is it
implementation defined by “whatever we did first”? That leaves little room
for bug fixes.
Is it formally defined — “usage is rss + cache”? Then this PR makes sense
and repairs an existing API violation, regardless of existing behavior. I
don’t see a formal definition of this API.
Cadvisor and Kubernetes strike me as rather large 3p users as you put it. I
would be curious where runc puts the balance of practical usage and
behavioral consistency against “bug compatibility” and what you think this
"correct" here ignoring potential compat breakage (and what that means
we should do here?).
There is plenty of opportunity up the stack for fudging. However IMO it’s
extremely misleading machines with comparable usage will report differently
whether they are v1 or v2. If runc doesn’t consider than a bug, okay I
suppose, but I’d like to understand clearly for future reference when we
can or cannot make corrective behavioral changes like this and what runc’s
actual guarantees are. It’s a bit disappointing to me something like this
would differ from v1 since day 1 and may remain baked in — future 3p users
would have to know about this and fudge their logic too if they care about
getting the ”correct” data, otherwise they will perceive they lose memory
when they upgrade to v2.
…On Sun, Jul 16, 2023 at 9:38 AM Akihiro Suda ***@***.***> wrote:
@AkihiroSuda <https://github.com/AkihiroSuda> could you elaborate on the
problem?
Current code causes major pain as a behavioral change to downstream
consumers for v1 vs v2. I would argue compatibility has *already* been
broken, and this restores previous behavior.
Can you clarify what you mean? Who breaks here with this change and in
what way?
There might be third party programs that are already depending on the
existing output for v2.
—
Reply to this email directly, view it on GitHub
<#3933 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABT4LWLCWMUCJPNQUKX4HNLXQPVF5ANCNFSM6AAAAAA2ESQEVI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
I agree. I am not seeing how this would be breaking if we are presenting a more accurate view of usage in line with v1. |
I'm ok to merge this, but if something breaks we may have to revert this and maybe re-add it as another field like "Usage2". |
Could you update https://github.com/opencontainers/runc/blob/main/CHANGELOG.md to explain the change? |
Wasn't sure, if I should mark this as "unreleased" or create a new unreleased 1.1.9 tag? marked it as "unreleased - change" |
f1b45c7
to
e92b917
Compare
I agree with the general direction of this PR; thank you for your contribution @alexeldeib. Just a few notes:
While at it, it might make sense to rename |
@alexeldeib needs a rebase |
b86d3f9
to
92f1e3a
Compare
file
+ anon
for usage19f95b1
to
884a7aa
Compare
I think that might be better? swap confused me a bit, but I think I get it. I do not think we should count swapcached. v1 has In this case we use which would mean we actually call which makes sense semantically for us anyway - if usage were already usage with swap, then summing swap + usage for swap would be silly. v2 has so v2 is consistent at least there |
This aligns v2 usage calculations more closely with v1. Current node-level reporting for v1 vs v2 on the same machine under similar load may differ by ~250-750Mi. Also return usage as combined swap + memory usage, aligned with v1 and non-root v2 cgroups. `mem_cgroup_usage` in the kernel counts NR_FILE_PAGES + NR_ANON_MAPPED + `nr_swap_pages` (if swap enabled) [^0]. Using total - free results in higher "usage" numbers. This is likely due to various types of reclaimable memory technically counted as in use (e.g. inactive anon). See also kubernetes/kubernetes#118916 for more context [^0]: https://github.com/torvalds/linux/blob/06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5/mm/memcontrol.c#L3673-L3680 Signed-off-by: Alexander Eldeib <alexeldeib@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@lifubang it looks like your comments were addressed, so I dismissed your old "request changes" review. |
I think this may have brought in inconsistency issue for cgroupv2. AIUI, there is memory account difference by design between cgroup v1 and cgroup v2 (please correct me if I'm wrong):
So it's WAI that v2 shows higher memory usage for non-root cgroups, which should be applied to the root as well. But this PR made the root account memory usage as v1, which appears inconsistency to me. |
cc @kolyshkin @haircommander @mrunalp it would be great to get your thoughts on #3933 (comment). I'm a little concerned that after this change we are no longer properly accounting kernel memory in cgroupv2 as @linxiulei mentioned. |
what would the right math be in your mind? do you have a source for the specific v2 non-root calculation? I'd browsed but didn't find anything that would indicate this inconsistency (not at all a kernel expert, had to learn some things to send this PR). the original issue was It's not clear to me what the right math would be if we need to add something other than rss + cache. I tried a few approaches but the numbers didn't quite add up (see the comments in k/k). there's active debate over what should be counted, but in practice this PR seemed to do the right thing when I tested it with some load. Have you found a counterexample? is there a specific type of memory you suspect is not properly accounted? inactive_anon/file were where I'd looked previously, not sure what you're thinking. |
I assume "what we called usage in k8s-land" is what we collect from I don't think |
I think the key here is to emulate what was previously done in cgroupv1. I think the initial investigation done by @alexeldeib is correct: if we look at the calculate of memory usage, we can see v1 uses memory.usage_in_bytes, even for the root cgroup. In the root cgroup case, we can see this is built from anon + file, and so the best approximation for cgroupv1 usage_in_bytes of the root cgroup on v2 is to just use |
This aligns v2 usage calculations more closely with v1. Current node-level reporting for v1 vs v2 on the same machine under similar load may differ by several hundred megabytes.
mem_cgroup_usage
in the kernel countsNR_FILE_PAGES + NR_ANON_MAPPED + nr_swap_pages (if swap enabled) 1.
Using total - free results in higher "usage" numbers. This is likely due to various types of reclaimable memory technically counted as in use (e.g. inactive anon).
See also kubernetes/kubernetes#118916 for more context
Footnotes
https://github.com/torvalds/linux/blob/06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5/mm/memcontrol.c#L3673-L3680 ↩