Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/cg/fs2: use file + anon + swap for usage #3933

Merged
merged 1 commit into from
Aug 3, 2023

Conversation

alexeldeib
Copy link
Contributor

@alexeldeib alexeldeib commented Jul 10, 2023

This aligns v2 usage calculations more closely with v1. Current node-level reporting for v1 vs v2 on the same machine under similar load may differ by several hundred megabytes.

mem_cgroup_usage in the kernel counts
NR_FILE_PAGES + NR_ANON_MAPPED + nr_swap_pages (if swap enabled) 1.

Using total - free results in higher "usage" numbers. This is likely due to various types of reclaimable memory technically counted as in use (e.g. inactive anon).

See also kubernetes/kubernetes#118916 for more context

Footnotes

  1. https://github.com/torvalds/linux/blob/06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5/mm/memcontrol.c#L3673-L3680

@@ -78,7 +78,7 @@ func setMemory(dirPath string, r *configs.Resources) error {
return nil
}

func statMemory(dirPath string, stats *cgroups.Stats) error {
func statMemory(rootCgroupPath, dirPath string, stats *cgroups.Stats) error {
Copy link
Contributor Author

@alexeldeib alexeldeib Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this, but it made unit testing simpler and keeps all exposed APIs unchanged.

Welcome better testing suggestions, especially some way to diff v1/v2 (either mock or e2e on real machines?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make UnifiedMountpoint a variable that you can override in tests. We do this for some other stuff in the cgroup tests. The only issue is that it is publicly exposed and we probably don't want to expose allowing this to be changed by libcontainer users...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you prefer that?

I think this signature is awkward, but could probably be better resolved by more deliberate refactoring separate from this issue, while avoiding an exported/mutable var like you mentioned.

I was digging into the FS calls to see if I could drop in a mock, but quickly realized that wouldn't be so simple 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the only point of adding this extra argument is to make the function call statsFromMeminfo. If that's the case, we can test statsFromMeminfo directly in the unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good call. I left a test for pod cgroup as an "anti-test" on the root cgroup, and added a simple test for v2 root, although I hoped to avoid mocking swap calculations.

@alexeldeib alexeldeib force-pushed the ace/v2root branch 3 times, most recently from 80b6fb0 to ff695e9 Compare July 10, 2023 14:11
// cgroup v1 `usage_in_bytes` reports memory usage as rss (NR_ANON_MAPPED) + cache (NR_FILE_PAGES).
// cgroup v2 reports the same values as anon and file.
// sum those to report the same value as `usage_in_bytes` in v1 for consistency.
stats.MemoryStats.Usage.Usage = stats.MemoryStats.Stats["anon"] + stats.MemoryStats.Stats["file"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially we should add swapcached here for consistency with v1 as well...

Copy link
Member

@cyphar cyphar Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should emulate the what cgroupv1 gives you for the cgroupv2 (and only for the root case too) -- in general in OCI we just give you what the kernel provides. I suspect emulating cgroupv2 more correctly (which is what you seem to be doing here? I need to look into how the kernel does the accounting...).

We have to do cgroupv1 -> cgroupv2 mapping for writing to cgroupfs (because otherwise old configs wouldn't work on newer systems) but faking cgroupv1 stats output for cgroupv2 seems particularly odd to me.

@kolyshkin wdyt?

Copy link
Contributor Author

@alexeldeib alexeldeib Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. my original intention was to match usage_in_bytes which I think is NR_FILE_PAGES + NR_ANON_MAPPED (+ NR_SWAPCACHE) for the root cgroup. I'm not sure what the equivalent of memory.current would be for the root cgroup in v2 if not something like the v1 estimate? I admit I don't know specifics of how the per-cg usage counter charges.

I guess part of the problem is "usage" is poorly defined vs. specific zones/memory types? part of my read was that runc -> cadvisor -> k8s basically aligned on "usage" == NR_FILE_PAGES + NR_ANON_MAPPED, and "working set" == "usage" - "inactive_file". which has issues, but is at least consistent in v1 + v2 short of renamed statistics.

I can definitely say total - free hasn't been received well, but welcome alternatives 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolyshkin any thoughts on this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v1/memory.rst?plain=1#L625-L633 , I think maybe we should add swapcached here.
So if we want to merge this, we should make a decision for this question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with adding swapcached here. One reservation I had -- total - free for swap also does not equal swapcached for the root, but if you look at specific slices, memory.swap.current for a given cgroup doesn't equal memory.stat.swapcached anyway. I suppose it makes sense -- swap area can be used or not free in different ways (?), part of the swap space may not be writable (not the case for my demo however).

we also deliberately add memory + swap usage to match v1 later -

// As cgroup v1 reports SwapUsage values as mem+swap combined,
	// while in cgroup v2 swap values do not include memory,
	// report combined mem+swap for v1 compatibility.
	swapUsage.Usage += memoryUsage.Usage
	if swapUsage.Limit != math.MaxUint64 {
		swapUsage.Limit += memoryUsage.Limit
	}
	stats.MemoryStats.SwapUsage = swapUsage

should potentially make the same adjustment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added swapcached, but what do you think about the swap adjustment?

I think we should add memory usage for consistency, e.g. you can see consumers expect it today https://github.com/google/cadvisor/blob/8164b38067246b36c773204f154604e2a1c962dc/container/libcontainer/handler.go#L808

but i'm not sure if we should stick with total - free for swap usage, use root swapcached, or something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should potentially make the same adjustment

I think so, maybe this is a bug before.
When we fall though to use statsFromMeminfo to calc node's memory data, we return in here, so report combined mem+swap for v1 compatibility is missing.
Please see:
https://github.com/opencontainers/runc/blob/main/libcontainer/cgroups/fs2/memory.go#L105-L126

@kolyshkin Am I right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think that might be better?

notably, I think we want nr_swap_pages from the kernel, but we only get proxies to that -
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/memcontrol.c#L3677-L3680
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/swapfile.c#L3257-L3258

wonder if we have the same "overcounting" for swap or not

@mrunalp
Copy link
Contributor

mrunalp commented Jul 11, 2023

@kolyshkin @cyphar ptal

@AkihiroSuda
Copy link
Member

This is likely to break compatibility, so I'm not sure we can merge this.

Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeldeib
Copy link
Contributor Author

alexeldeib commented Jul 16, 2023

@AkihiroSuda could you elaborate on the problem?

Current code causes major pain as a behavioral change to downstream consumers for v1 vs v2. I would argue compatibility has already been broken, and this restores previous behavior.

Can you clarify what you mean? Who breaks here with this change and in what way?

@AkihiroSuda
Copy link
Member

@AkihiroSuda could you elaborate on the problem?

Current code causes major pain as a behavioral change to downstream consumers for v1 vs v2. I would argue compatibility has already been broken, and this restores previous behavior.

Can you clarify what you mean? Who breaks here with this change and in what way?

There might be third party programs that are already depending on the existing output for v2.

@alexeldeib
Copy link
Contributor Author

alexeldeib commented Jul 16, 2023

I can understand that this is a change to v2 consumers. I am not sure I see how it leads to breakage. You’d simply see a lower number for usage? The contract is unchanged.

Alternative changes up the stack would lead to runc producing different stats for identical memory usage under v1 and v2, but cadvisor and Kubernetes would fudge it using the same calculation as this PR.

It seems to me “more correct” to fix the actual meaning of usage. But if there is concern about this approach, then I would start the discussion at cadvisor level or potentially purely in k/k.

The same argument you’re suggesting applies to cadvisor and k/k too, but at least Kubernetes folks seem to agree this is a bug for v1/v2. So I hope folks at some layer are willing to adapt and recognize the calculation is wildly different.

Ps: what layer would you suggest the change occurs at? Or do you disagree there’s even an issue?

@alexeldeib
Copy link
Contributor Author

alexeldeib commented Jul 16, 2023 via email

@mrunalp
Copy link
Contributor

mrunalp commented Jul 25, 2023

I can understand that this is a change to v2 consumers. I am not sure I see how it leads to breakage. You’d simply see a lower number for usage? The contract is unchanged.

I agree. I am not seeing how this would be breaking if we are presenting a more accurate view of usage in line with v1.

@AkihiroSuda
Copy link
Member

AkihiroSuda commented Jul 25, 2023

I'm ok to merge this, but if something breaks we may have to revert this and maybe re-add it as another field like "Usage2".

@AkihiroSuda
Copy link
Member

Could you update https://github.com/opencontainers/runc/blob/main/CHANGELOG.md to explain the change?

@alexeldeib
Copy link
Contributor Author

Could you update https://github.com/opencontainers/runc/blob/main/CHANGELOG.md to explain the change?

Wasn't sure, if I should mark this as "unreleased" or create a new unreleased 1.1.9 tag? marked it as "unreleased - change"

CHANGELOG.md Outdated Show resolved Hide resolved
@alexeldeib alexeldeib force-pushed the ace/v2root branch 2 times, most recently from f1b45c7 to e92b917 Compare July 26, 2023 15:41
@kolyshkin
Copy link
Contributor

I agree with the general direction of this PR; thank you for your contribution @alexeldeib.

Just a few notes:

  1. The logic of getting the root cgroup stats should better be contained in statsFromMeminfo, not spread between the two as it is now.
  2. Maybe instead adding an extra argument to statMemory for the sake of unit test, we can test statsFromMeminfo directly? This will make the commit less invasive.

While at it, it might make sense to rename statsFromMeminfo to e.g. rootStatsFromMeminfo so it is more clear it is only used for root cgroup.

@kolyshkin
Copy link
Contributor

@alexeldeib needs a rebase

@alexeldeib alexeldeib force-pushed the ace/v2root branch 3 times, most recently from b86d3f9 to 92f1e3a Compare August 2, 2023 11:38
@alexeldeib alexeldeib changed the title libct/cg/fs2: use file + anon for usage libct/cg/fs2: use file + anon + swap for usage Aug 2, 2023
@alexeldeib alexeldeib force-pushed the ace/v2root branch 4 times, most recently from 19f95b1 to 884a7aa Compare August 2, 2023 12:55
@alexeldeib
Copy link
Contributor Author

alexeldeib commented Aug 2, 2023

I think that might be better?

swap confused me a bit, but I think I get it. I do not think we should count swapcached.

v1 has usage_in_bytes from mem_cgroup_read_u64 with MEMFILE_PRIVATE(_MEM, RES_USAGE)
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/memcontrol.c#L5016-L5020

In this case we use memcg->memory not memcg->memsw (which is v1 only? vs memcg->swap for v2 only?)
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/memcontrol.c#L3705-L3706

which would mean we actually call mem_cgroup_usage with swap = false
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/memcontrol.c#L3722-L3724

which makes sense semantically for us anyway - if usage were already usage with swap, then summing swap + usage for swap would be silly.

v2 has current with memory_current_read which seems to just read memcg->memory
https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/mm/memcontrol.c#L6714-L6715

so v2 is consistent at least there

This aligns v2 usage calculations more closely with v1.
Current node-level reporting for v1 vs v2 on the same
machine under similar load may differ by ~250-750Mi.

Also return usage as combined swap + memory usage, aligned
with v1 and non-root v2 cgroups.

`mem_cgroup_usage` in the kernel counts NR_FILE_PAGES
+ NR_ANON_MAPPED + `nr_swap_pages` (if swap enabled) [^0].

Using total - free results in higher "usage" numbers.
This is likely due to various types of reclaimable
memory technically counted as in use (e.g. inactive anon).

See also kubernetes/kubernetes#118916 for more context

[^0]: https://github.com/torvalds/linux/blob/06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5/mm/memcontrol.c#L3673-L3680

Signed-off-by: Alexander Eldeib <alexeldeib@gmail.com>
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin
Copy link
Contributor

@lifubang it looks like your comments were addressed, so I dismissed your old "request changes" review.

@linxiulei
Copy link

I think this may have brought in inconsistency issue for cgroupv2. AIUI, there is memory account difference by design between cgroup v1 and cgroup v2 (please correct me if I'm wrong):

  • v1 accounts file + anon + swap
  • v2 accounts file + anon + swap + in-kernel data

So it's WAI that v2 shows higher memory usage for non-root cgroups, which should be applied to the root as well. But this PR made the root account memory usage as v1, which appears inconsistency to me.

@bobbypage
Copy link

cc @kolyshkin @haircommander @mrunalp it would be great to get your thoughts on #3933 (comment).

I'm a little concerned that after this change we are no longer properly accounting kernel memory in cgroupv2 as @linxiulei mentioned.

@ace-cohere
Copy link

it's WAI that v2 shows higher memory usage for non-root cgroups, which should be applied to the root as well.

what would the right math be in your mind? do you have a source for the specific v2 non-root calculation? I'd browsed but didn't find anything that would indicate this inconsistency (not at all a kernel expert, had to learn some things to send this PR).

the original issue was total - free was not reflective of what we called "usage" in k8s-land. you can see several examples of the cgroup and proc info stats in the linked k/k issue to get a sense what the right numbers should be and whether it's a pure calculation or OS issue.

It's not clear to me what the right math would be if we need to add something other than rss + cache. I tried a few approaches but the numbers didn't quite add up (see the comments in k/k). there's active debate over what should be counted, but in practice this PR seemed to do the right thing when I tested it with some load. Have you found a counterexample?

is there a specific type of memory you suspect is not properly accounted? inactive_anon/file were where I'd looked previously, not sure what you're thinking.

@linxiulei
Copy link

I assume "what we called usage in k8s-land" is what we collect from memory.current in v2, which I believe is more than rss + cache. One specific example is TCP memory (see https://lpc.events/event/16/contributions/1212/attachments/1079/2052/LPC%202022%20-%20TCP%20memory%20isolation.pdf). The kernel doc suggests the kmem is accumulated into memory.current (main counter).

I don't think memory.current accounts all types of kernel memory such as memory page tables, device drivers etc. But I do believe a visible amount of memory is not accounted in rss + cache, which is essentially userspace memory. Other example I could think of, but without solid evidence, is dcache. Maybe ithis](https://android.googlesource.com/kernel/msm/+/android-msm-3.9-usb-and-mmc-hacks/Documentation/cgroups/memory.txt#298) is a good list of kernel memory types we miss despite the doc is for v1

@haircommander
Copy link
Contributor

I think the key here is to emulate what was previously done in cgroupv1. I think the initial investigation done by @alexeldeib is correct: if we look at the calculate of memory usage, we can see v1 uses memory.usage_in_bytes, even for the root cgroup. In the root cgroup case, we can see this is built from anon + file, and so the best approximation for cgroupv1 usage_in_bytes of the root cgroup on v2 is to just use file + anon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cgroupv2 backport/1.1-done A PR in main branch which has been backported to release-1.1 impact/changelog status/4-merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants