
Memory stats and freezer management with cgroupv2 #10251

Closed
notnoop opened this issue Mar 29, 2021 · 10 comments
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/cgroups (cgroups issues), type/bug


notnoop commented Mar 29, 2021

Nomad's cgroup-v2 integration is incomplete, as it retains some cgroup-v1-isms. Cgroups v2 changed the filesystem layout and the memory metrics that Nomad has relied on, so Nomad reports a memory summary of 0 across nearly all drivers.

First, Nomad's memory reporting relies on cgroup-v1 metrics. Nomad defaults to RSS as the top-line memory summary value and also reports Kernel Max Usage, Kernel Usage, and Max Usage, none of which exist in cgroup v2. You can see the difference by comparing libcontainer's cgroup v1 memory stats with its cgroup v2 stats. This is pretty confusing.
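To make the key difference concrete, here is a small sketch (the function name `parseMemoryStat` is mine, not Nomad's) that parses the flat key/value format both cgroup versions use for memory.stat. On a v2 hierarchy the v1 keys Nomad reads (`rss`, `cache`, etc.) are simply absent; `anon` is the closest analogue:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryStat parses the flat "key value" lines used by both
// cgroup v1 and v2 memory.stat files into a map.
func parseMemoryStat(contents string) map[string]uint64 {
	stats := make(map[string]uint64)
	for _, line := range strings.Split(strings.TrimSpace(contents), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats
}

func main() {
	// Abbreviated cgroup v2 memory.stat (values from the dump below):
	// there is no "rss" key, only "anon".
	v2 := parseMemoryStat("anon 1622016\nfile 0\nkernel_stack 73728\n")
	_, hasRSS := v2["rss"]
	fmt.Println(hasRSS, v2["anon"]) // false 1622016
}
```

Any consumer that looks up `stats["rss"]` on a v2 host gets the zero value, which matches the `0 B` columns in the alloc status output below.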

Also, the executor's DestroyCgroup method uses the libcontainer cgroup v1 implementation. This needs to be updated to account for v2 and, ideally, to select the relevant cgroup backend.

It's not clear what the state of cgroup-v2 adoption is. Fedora and Arch Linux appear to ship it by default; other distros, like RHEL and Ubuntu, provide it as an option but not as the default.

Sample metrics of cgroup v2

Running on Fedora 33, I see the following stats info:

ID                  = 1e2bdcc2-983d-1e0c-d226-95577bffc188
Eval ID             = dae9b0ab-d31a-446b-a9df-5f2cbf37dc53
Name                = memory.cache[0]
Node ID             = f7bf24d9-d3c0-c34e-0b80-1c6a5de7eddf
Node Name           = ip-172-31-74-56.ec2.internal
Job ID              = memory
Job Version         = 1
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2021-03-28T17:52:15-04:00
Modified            = 2021-03-28T17:52:33-04:00
Deployment ID       = ff079acc-8f67-41bb-dc67-e5c506e9a795
Deployment Health   = healthy
Evaluated Nodes     = 1
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 88.646µs
Failures            = 0

Task "redis" is "running"
Task Resources
CPU           Memory        Disk     Addresses
2465/500 MHz  0 B/1000 MiB  300 MiB

Memory Stats
Cache  Kernel Max Usage  Kernel Usage  Max Usage  RSS  Swap  Usage
0 B    0 B               0 B           0 B        0 B  0 B   261 MiB

CPU Stats
Percent  System Mode  Throttled Periods  Throttled Time  User Mode
98.64%   0.00%        0                  0               98.64%

Task Events:
Started At     = 2021-03-28T21:52:22Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-03-28T17:52:22-04:00  Started     Task started by client
2021-03-28T17:52:20-04:00  Task Setup  Building Task Directory
2021-03-28T17:52:15-04:00  Received    Task received by client

Placement Metrics
Node                                  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
f7bf24d9-d3c0-c34e-0b80-1c6a5de7eddf  0.635    0                  0              0                        0.635

Also, here are Docker memory stats for cgroup v1 and v2:

Cgroup v2

{
  "usage": 2744320,
  "stats": {
    "active_anon": 1757184,
    "active_file": 0,
    "anon": 1622016,
    "anon_thp": 0,
    "file": 0,
    "file_dirty": 0,
    "file_mapped": 0,
    "file_writeback": 0,
    "inactive_anon": 0,
    "inactive_file": 0,
    "kernel_stack": 73728,
    "pgactivate": 0,
    "pgdeactivate": 0,
    "pgfault": 3531,
    "pglazyfree": 0,
    "pglazyfreed": 0,
    "pgmajfault": 0,
    "pgrefill": 0,
    "pgscan": 0,
    "pgsteal": 0,
    "shmem": 0,
    "slab": 573440,
    "slab_reclaimable": 0,
    "slab_unreclaimable": 573440,
    "sock": 0,
    "thp_collapse_alloc": 0,
    "thp_fault_alloc": 0,
    "unevictable": 0,
    "workingset_activate": 0,
    "workingset_nodereclaim": 0,
    "workingset_refault": 0
  },
  "limit": 2036068352
}

Cgroup v1

{
  "usage": 6778880,
  "max_usage": 9478144,
  "stats": {
    "active_anon": 1622016,
    "active_file": 2297856,
    "cache": 4055040,
    "dirty": 0,
    "hierarchical_memory_limit": 9223372036854772000,
    "hierarchical_memsw_limit": 0,
    "inactive_anon": 0,
    "inactive_file": 1757184,
    "mapped_file": 2027520,
    "pgfault": 5049,
    "pgmajfault": 33,
    "pgpgin": 5016,
    "pgpgout": 3591,
    "rss": 1626112,
    "rss_huge": 0,
    "total_active_anon": 1622016,
    "total_active_file": 2297856,
    "total_cache": 4055040,
    "total_dirty": 0,
    "total_inactive_anon": 0,
    "total_inactive_file": 1757184,
    "total_mapped_file": 2027520,
    "total_pgfault": 5049,
    "total_pgmajfault": 33,
    "total_pgpgin": 5016,
    "total_pgpgout": 3591,
    "total_rss": 1626112,
    "total_rss_huge": 0,
    "total_unevictable": 0,
    "total_writeback": 0,
    "unevictable": 0,
    "writeback": 0
  },
  "limit": 1026154496
}
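Comparing the two dumps suggests one plausible mapping from v2 keys onto the v1-era names Nomad reports: `anon` (1622016) tracks v1's `rss` (1626112) closely, and `file` plays the role of `cache`. This sketch is purely illustrative (the function name and mapping are my assumptions, not Nomad's actual code); the kernel and max-usage counters have no direct v2 equivalent and stay zero:

```go
package main

import "fmt"

// v1StyleSummary maps cgroup v2 memory.stat keys onto the v1-era
// names: "anon" stands in for "rss" and "file" for "cache". Kernel
// Usage, Kernel Max Usage, and Max Usage have no v2 counterpart.
func v1StyleSummary(usage uint64, v2 map[string]uint64) map[string]uint64 {
	return map[string]uint64{
		"usage": usage,
		"rss":   v2["anon"],
		"cache": v2["file"],
	}
}

func main() {
	// Numbers taken from the cgroup v2 docker stats dump above.
	s := v1StyleSummary(2744320, map[string]uint64{"anon": 1622016, "file": 0})
	fmt.Println(s["rss"], s["cache"]) // 1622016 0
}
```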


@mircea-c

Just upgraded to Debian "bullseye", which uses cgroup v2 as the default, and it is now causing this issue. Running Nomad version 1.1.3.

@MorphBonehunter

Ubuntu will also switch to cgroup v2 with the upcoming 21.10 release.


Himura2la commented Sep 28, 2021

As a workaround, you can switch the kernel to "hybrid" cgroup hierarchy. It fixes the issue. For debian-based distros, use the following script:

sudo sed -i \
    '/^GRUB_CMDLINE_LINUX=/ s/"$/ systemd.unified_cgroup_hierarchy=false systemd.legacy_systemd_cgroup_controller=false"/' \
    /etc/default/grub
sudo update-grub
sudo reboot



keslerm commented Dec 1, 2021

The workaround works for this, but I noticed that the memory usage reported in the Nomad dashboard represents only RSS, whereas Docker considers both RSS and cache. As a result, allocations that look fine memory-wise in Nomad can be OOM-killed by the kernel, and the cause isn't obvious without looking at dmesg.

Running docker stats shows the combined memory usage
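The discrepancy can be sketched as follows: `docker stats` displays total usage minus easily reclaimable page cache (the `inactive_file` counter), rather than bare RSS. The function name `displayedUsage` is mine, and this is an approximation of Docker's calculation, not its actual code; the field names match the API dumps above (`inactive_file` on v2, `total_inactive_file` on v1):

```go
package main

import "fmt"

// displayedUsage approximates the memory figure `docker stats` shows:
// total usage minus reclaimable page cache (inactive_file), so the
// number reflects memory that actually pressures the limit.
func displayedUsage(usage uint64, stats map[string]uint64) uint64 {
	if v, ok := stats["inactive_file"]; ok && v < usage {
		return usage - v
	}
	if v, ok := stats["total_inactive_file"]; ok && v < usage {
		return usage - v
	}
	return usage
}

func main() {
	// v1 sample from this thread: usage 6778880, total_inactive_file 1757184.
	fmt.Println(displayedUsage(6778880, map[string]uint64{"total_inactive_file": 1757184})) // 5021696
}
```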

@mircea-c

Any update on this issue? There's a PR open for this (timdaman/check_docker#82).


aep commented Jan 20, 2022

Is there a workaround to disable cgroups for everything except the PID list? The raw executor is currently nonfunctional due to #10551 unless you disable cgroups entirely; unfortunately, it doesn't use sessions, so you leak processes without cgroups.


m1kc commented Mar 31, 2022

Any updates on this one?


tgross commented Mar 31, 2022

Hi @m1kc, improved cgroup v2 support is planned to ship in Nomad 1.3.0. Much of the work that @shoenig has done has already landed in main.

@shoenig shoenig added the theme/cgroups cgroups issues label Mar 31, 2022
@shoenig shoenig added this to the 1.3.0 milestone Apr 5, 2022
@shoenig shoenig self-assigned this Apr 5, 2022

shoenig commented Apr 5, 2022

Between #11289 and #12419 (shipping in Nomad 1.3), I think we're now reporting what's available in cgroups v2.

exec:

Memory Stats
Cache    Max Usage  RSS      Swap  Usage
6.5 MiB  8.6 MiB    988 KiB  0 B   8.2 MiB

raw_exec:

Memory Stats
RSS     Swap
62 MiB  0 B


github-actions bot commented Oct 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2022