
Nomad exec driver leaks cgroups, causing host system to run out of memory #6823

Closed
fho opened this issue Dec 9, 2019 · 4 comments · Fixed by #6839
Comments

@fho
Contributor

fho commented Dec 9, 2019

Nomad version

Reproduced with:

Operating system and Environment details

Reproduced with Linux kernels:

  • Ubuntu 4.15.0-1050-gcp
  • ArchLinux 4.19.87-1-lts

Issue

Nomad does not remove cgroups for terminated exec tasks.
As a result, more and more memory is consumed on the host system by the kernfs_node_cache and task_struct SLAB caches.
Eventually the host becomes unstable: it runs out of memory, starts swapping, and page allocation failures occur.

Reproduction steps

1.) Start a batch job via Nomad that:

  • runs a command that is available in the exec chroot and finishes fast, e.g. /bin/ls
  • runs periodically every 1 second (optionally with prohibit_overlap = true)

2.) Observe:

  • Monitor the number of cgroups created by Nomad on the system,
    e.g. via watch -n 1 'find $(ls -d /sys/fs/cgroup/*/nomad) -type d | wc -l'; the number grows continuously
  • Monitor slab caches via slabtop -s c -d1; the kernfs_node_cache and task_struct caches grow continuously

Eventually the system runs out of available memory, swaps, and page allocation failures occur.
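The counting pipeline from step 2 can be exercised against a scratch directory instead of /sys/fs/cgroup, so it runs on any machine without a live Nomad client. This is a minimal sketch; the "alloc-N" directory names are made up for illustration and only mimic the shape of leaked per-task cgroup directories.

```shell
#!/bin/sh
# Sketch of the cgroup-counting pipeline above, run against a scratch
# directory so it works anywhere.  Directory names are illustrative only.
set -eu

root=$(mktemp -d)
# Simulate leaked per-task cgroup directories under "nomad" parents in
# two controllers, as the leak produces on a real host.
mkdir -p "$root/memory/nomad" "$root/cpu/nomad"
for i in 1 2 3; do
  mkdir "$root/memory/nomad/alloc-$i" "$root/cpu/nomad/alloc-$i"
done

# Same shape as: find $(ls -d /sys/fs/cgroup/*/nomad) -type d | wc -l
count=$(find "$root"/*/nomad -type d | wc -l)
echo "cgroup dirs: $count"

rm -rf "$root"
```

On a leaking host the equivalent count keeps climbing with every task run, which is the signal to watch for.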

Fix: Remove cgroups when an exec task terminates
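To illustrate what such a cleanup amounts to (this is a hedged sketch, not Nomad's actual implementation): once a task has terminated its cgroup directories are empty, and empty cgroup directories can be removed with plain rmdir. Again exercised against a scratch directory with made-up names.

```shell
#!/bin/sh
# Sketch of a post-termination cleanup pass: remove the now-empty
# per-task cgroup directories depth-first with rmdir.  Run against a
# scratch directory; "alloc-1" and the layout are illustrative only.
set -eu

root=$(mktemp -d)
mkdir -p "$root/memory/nomad/alloc-1" "$root/cpu/nomad/alloc-1"

# Depth-first removal of everything below the "nomad" parents, as a
# cleanup pass would do after the task exits.
find "$root"/*/nomad -depth -mindepth 1 -type d -exec rmdir {} +

leftover=$(find "$root"/*/nomad -mindepth 1 -type d | wc -l)
echo "leftover cgroup dirs: $leftover"

rm -rf "$root"
```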

Job file (if appropriate)

job "example" {
  periodic {
    cron = "*/1 * * * * * *"
    prohibit_overlap = true
  }
  datacenters = ["sandbox"]
  type = "batch"
  group "cache" {
    count = 1

    task "cgroupleak" {
      driver = "exec"
      config {
        command = "/bin/ls"
      }
      resources {
        cpu    = 20 # MHz
        memory = 10 # MB
      }
      service {
        name = "cgroupleak"
      }
    }
  }
}
@notnoop notnoop self-assigned this Dec 9, 2019
@notnoop notnoop added this to the 0.10.3 milestone Dec 9, 2019
@notnoop
Contributor

notnoop commented Dec 9, 2019

Thanks @fho. I'll investigate this and update you very soon!

@notnoop notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Dec 9, 2019
@notnoop notnoop moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Dec 9, 2019
Nomad - Community Issues Triage automation moved this from In Progress to Done Dec 13, 2019
@fho
Contributor Author

fho commented Dec 13, 2019

Thanks a lot for the fast response and fix!

@notnoop
Contributor

notnoop commented Dec 13, 2019

@fho anytime! It'll go out in 0.10.3. Thank you so much for reporting it.

For context, this cgroup leak was a regression introduced in 0.9.0 :(. If an exec task exited with a zero exit code, Nomad 0.9 didn't clean up its cgroups. Nomad 0.10.2 fixed that in #6722, but the systemd cgroup was a special case that still wasn't cleaned up properly; that was addressed in #6839.

Let us know if you have any questions or further observations!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2022