Nomad CPU pinning is moving the container after it's created. #11705

gitrgoliveira · 2021-12-17T16:57:18Z

Nomad version

1.1.6 - Enterprise

Issue

When the number of cores is defined in the resources stanza, using the docker driver, this is only applied after the container has already started and is not being set for all of the PIDs of the container.

This is a problem, because containers that have more than one process may get spawned before Nomad moves PID 1 to `/sys/fs/cgroup/cpuset/nomad/reserved/`, which means the sub-processes may not get the assigned cores.

In this use case, one of the sub-processes pins threads to the available cores, which ends up getting “out-of-sync” because the assigned cores are changed after start-up.

Reproduction steps

Start a Nomad task with
A docker container having more than one process
Set the number of reserved cores in the resources stanza

Expected Result

Task should start directly under /nomad/reserved

Actual Result

Task starts under /nomad/shared/<alloc ID>/<task name> and then gets moved to /nomad/reserved/<alloc ID>/<task name>


gitrgoliveira · 2022-01-05T15:36:29Z

adding an example Nomad job file

job "service-1" {
  region      = "europe"
  datacenters = ["europe-west1"]
  type        = "service"
  group "service-1" {
    update {
      auto_revert       = false
      auto_promote      = false
      canary            = 0
      max_parallel      = 1
      stagger           = "30s"
      health_check      = "checks"
      min_healthy_time  = "10s"
      healthy_deadline  = "5m"
      progress_deadline = "10m"
    }
    restart {
      attempts = "60"
      delay    = "1s"
      interval = "60s"
      mode     = "fail"
    }
    network {
      port "http" {
        to = 8080
      }
    }
    service {
      name = "service-1"
      task = "service-1"
      port = "http"

      check {
        interval = "10s"
        path     = "/healthz"
        timeout  = "2s"
        type     = "http"
      }
    }
    task "service-1" {
      resources {
        cores  = 8
        memory = 3000
      }

      meta {
        DEPLOY_API_INJECT_ENV = true
      }

      driver = "docker"
      config {
        image    = "image_name"
        hostname = "service-${NOMAD_ALLOC_INDEX}"
        labels = {
          service_name = "service-1"
        }

        logging {
          type = "fluentd"
          config {
            fluentd-address = "127.0.0.1:24224"
            tag             = "docker.service-1"
          }
        }
        dns_servers = ["172.17.0.1"]
        ports = [
          "http",
        ]
      }
    }
  }
}

tgross · 2022-01-05T16:21:51Z

@gitrgoliveira the behavior of the application inside the image is what's challenging about debugging this one. That's the information we'd really need to dig into this.

aholyoake-bc · 2022-01-13T19:57:26Z

I've done some more investigating with this in this gist:
https://gist.github.com/aholyoake-bc/1ba35f8fe578108632f0cfd6781b3815.

Note: the issue report above

This is a problem, because containers that have more than one process may get spawned before Nomad moves PID 1 to `/sys/fs/cgroup/cpuset/nomad/reserved/`, which means the sub-processes may not get the assigned cores.

isn't quite right. This should be modified to

Any processes started before nomad has managed to modify the cpuset will remain on the old cpuset and use cores that were not intended for them

Test job (See gist)

This is a simple job with the cores resource set to 4 and run on an 8 core machine.

There is a bash script in the docker container which reports whenever the current cpu set changes.

#!/usr/bin/env bash

name=${1:-test} #Logging identifier
max=${2:-100000} #Number of iterations to run
previous=""
index=0

while [[ "$max" == "" ]] || [[ $index -lt $max ]]
do
	index=$((index + 1))
	# Note: this file never gets updated(!), i.e. it always reports 0-7 even when the cpuset as reported by taskset is 0-3
	#cpuset="$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)"
	cpuset="$(taskset -c -p $$)"
	if [[ "$cpuset" != "$previous" ]]
	then
		# Log a timestamped line each time the affinity list changes
		echo "$(date -u +'%Y-%m-%dT%H:%M:%S.%NZ') $name $cpuset"
		previous="$cpuset"
	fi
	sleep 0.01
done

In the entrypoint to the container I call this script a number of times.

Test Cases

Entrypoint with 3 tasks before the long running task

i.e. the entrypoint is set to this

#!/usr/bin/env bash
cpusetchanges background 50000 &
cpusetchanges setup1 1
cpusetchanges setup2 1
cpusetchanges longrunning 50000

The first call is a long-running background process, followed by a couple of very short-lived foreground processes, then a long-lived foreground process.

The output of running the container is thus:

make logs
2022-01-13T19:38:29.849263280Z background pid 7's current affinity list: 0-7
2022-01-13T19:38:29.851655411Z setup1 pid 8's current affinity list: 0-7
2022-01-13T19:38:29.870367635Z setup2 pid 17's current affinity list: 0-7
2022-01-13T19:38:29.888552418Z longrunning pid 23's current affinity list: 0-3

i.e. we see the first three processes are run without the cpuset that nomad should have given them. The background task running in the container never gets its cpuset changed.

The host cgroups for the long running container processes look like this

make show-cgroups
15143 /docker/86b28e83e4ce102b150a1ddf7525550dcffe6e3dc5a1a35d4bd3f5b85289b903 0-7 CPUID 5
15199 /nomad/reserved/a69e416e-ea96-bca5-39d7-7f38499b3b54-test 0-3 CPUID 3
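
For anyone wanting to reproduce a listing like the one above without the gist's Makefile, here is a rough sketch of one way to print the same four columns for a PID on the host. The function name `show_cgroup` is made up for illustration, and this is only a guess at how such output could be produced, not the gist's actual target. It assumes a Linux `/proc` with cpuset support.

```bash
#!/usr/bin/env bash
# show_cgroup PID: print "<pid> <cpuset cgroup path> <allowed cpus> CPUID <last cpu>".
# Hypothetical helper, not taken from the gist.
show_cgroup() {
	local pid=$1 cg cpus stat psr
	local -a fields
	# cpuset cgroup path of the process, or "-" if the file is unavailable
	cg=$(cat "/proc/$pid/cpuset" 2>/dev/null || echo "-")
	# CPUs the process is allowed to run on, e.g. "0-3"
	cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$pid/status")
	# Field 39 of /proc/PID/stat is the CPU last executed on. Strip everything
	# up to the closing ")" of the command name first, so field 3 lands at
	# index 0 and field 39 at index 36.
	stat=$(<"/proc/$pid/stat")
	read -r -a fields <<< "${stat##*) }"
	psr=${fields[36]}
	echo "$pid $cg $cpus CPUID $psr"
}

show_cgroup $$
```

Run against the container's host PIDs, this should make it easy to see which processes are still under the `/docker/...` cpuset and which were moved to `/nomad/reserved/...`.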

Entrypoint with only a background and long running task

i.e. the entrypoint is set to this

#!/usr/bin/env bash
cpusetchanges background &
cpusetchanges longrunning 50000

Logs

2022-01-13T19:46:17.017405753Z background pid 7's current affinity list: 0-7
2022-01-13T19:46:17.017552533Z longrunning pid 8's current affinity list: 0-7

Host Cgroups (never changes)

27558 /docker/5e9721282ca56cf82e19a0b0e35128b5b082bae7291408e370ac9ea9f0156a82 0-7 CPUID 5
27559 /docker/5e9721282ca56cf82e19a0b0e35128b5b082bae7291408e370ac9ea9f0156a82 0-7 CPUID 5

I.e. if we are too quick in starting processes they have the incorrect cpuset and that never gets rectified.

Entrypoint of only one task

i.e. removing the wrapper script and setting the entrypoint to

 ENTRYPOINT ["/usr/local/bin/cpusetchanges", "start", "100000"]

With cpusetchanges now running as PID 1, the logs show that Nomad changes the cpuset of PID 1 around 50ms after the process has started:

2022-01-14T11:17:05.343787883Z start pid 1's current affinity list: 0-7
2022-01-14T11:17:05.385178070Z start pid 1's current affinity list: 0-3
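
The gap between the two log lines above can be worked out directly from the timestamps. This is a small sketch in plain bash. The helper names `ns_of_day` and `delta_ms` are made up for illustration, and it assumes both timestamps fall on the same day and carry nine fractional digits, as `date +%N` produces.

```bash
#!/usr/bin/env bash
# ns_of_day TS: nanoseconds since midnight for a timestamp such as
# 2022-01-14T11:17:05.343787883Z (assumes a 9-digit fractional part).
ns_of_day() {
	local t=${1#*T}   # drop the date part -> 11:17:05.343787883Z
	t=${t%Z}          # drop the trailing Z
	local h=${t:0:2} m=${t:3:2} s=${t:6:2} frac=${t:9:9}
	# 10# forces base 10 so fields like "05" or "08" are not read as octal
	echo $(( ((10#$h * 60 + 10#$m) * 60 + 10#$s) * 1000000000 + 10#$frac ))
}

# delta_ms START END: elapsed time in milliseconds, truncated to 3 decimals.
delta_ms() {
	local d=$(( $(ns_of_day "$2") - $(ns_of_day "$1") ))
	printf '%d.%03d\n' $(( d / 1000000 )) $(( (d % 1000000) / 1000 ))
}

delta_ms 2022-01-14T11:17:05.343787883Z 2022-01-14T11:17:05.385178070Z   # prints 41.390
```

Since the script only samples every 10ms and the first sample is taken some time after the process starts, the measured 41ms is a lower bound on the window. This is consistent with the "around 50ms" estimate.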

Findings

Nomad / Docker is starting processes with the incorrect cpuset.
Nomad changes the cpuset to the correct version only for PID 1 and not its child processes
Anything started after around 50ms has the correct cpuset

Workarounds

One hacky solution is to ensure there is a sleep of a few hundred milliseconds at the start of the docker container's entrypoint, but clearly this isn't great!

Expected behaviour

I'd expect the cpuset to be correct before any processes have been started

Potential Fixes / Mitigations

The cgroup should be set at container initialization time (maybe by adding Cgroup: task.Resources.LinuxResources.CpusetCgroupPath here), although I'm sure there's a good reason why this wasn't done initially.
All child processes of the PID here should be included.

Note: This behaviour is consistent across the versions 1.1.6, 1.1.6+ent, 1.2.3.

tgross · 2022-01-14T13:38:45Z

Hi @aholyoake-bc!

Any processes started before nomad has managed to modify the cpuset will remain on the old cpuset and use cores that were not intended for them

Ok, that was my suspicion on originally reading this issue, so thanks for verifying that for me. I seem to recall this was intentional in the original design, so I went back to our internal design document for this and found the following (ref NMD-098 for any HashiCorp-internal folks who might be reading this):

The docker driver poses a slight problem where we cannot set the cgroup directly through docker’s API. This means that for a very short duration during task startup, the process will not be in the correct cgroup. After starting the container the docker driver will then manually move the process to the correct cgroup. For a majority of users this will have no impact, but for users using reserved CPUs the container will not initially be constrained to the reserved CPUs.

But I took a look at the current Docker Remote API documentation for ContainerCreate and that doesn't seem to be true (see HostConfig.CgroupParent and HostConfig.Cgroup). And it looks like our Docker client library supports it as well (ref https://pkg.go.dev/github.com/fsouza/go-dockerclient@v1.6.5#HostConfig).

So there are two takeaways here:

We should be able to update the docker driver to pass the expected parent cgroup and cgroup (this will probably help get us ready to do cpuset support with cgroupsv2 #11289 too).
In the meantime, you can work around the problem by slightly delaying forking. A shell script that runs as PID 1, waits for the cpuset to be correct, and then execs into the "real" PID 1 you want would do the trick.
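
The second takeaway could look something like the wrapper below. This is only a sketch under assumptions: `EXPECTED_CPUS` is a hypothetical variable you would set yourself (for example `0-3`), the function names are made up, and the 10ms poll interval and retry count are arbitrary. It reads the affinity from `/proc` rather than calling `taskset`, so the image needs nothing beyond bash and awk.

```bash
#!/usr/bin/env bash
# Hypothetical PID 1 wrapper: block until this shell's CPU affinity equals
# EXPECTED_CPUS (an assumed variable, e.g. "0-3"), then exec the real command.

# Print the affinity list of this shell as the kernel reports it, e.g. "0-7".
current_cpus() {
	awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$$/status"
}

# wait_for_cpus EXPECTED [TRIES]: poll every 10ms; return 1 if TRIES run out.
wait_for_cpus() {
	local expected=$1 tries=${2:-500}
	while [[ "$(current_cpus)" != "$expected" ]]; do
		tries=$((tries - 1))
		[[ $tries -gt 0 ]] || return 1
		sleep 0.01
	done
}

# Only act when a command was passed, so the file can also be sourced.
if [[ $# -gt 0 ]]; then
	if ! wait_for_cpus "${EXPECTED_CPUS:?set EXPECTED_CPUS, e.g. 0-3}"; then
		echo "cpuset did not become $EXPECTED_CPUS in time" >&2
		exit 1
	fi
	# Replace the wrapper with the real process; it stays PID 1 and anything
	# it forks from here on starts with the corrected cpuset.
	exec "$@"
fi
```

Used as the image entrypoint with the real command as its arguments, this avoids a fixed sleep: it proceeds as soon as the affinity changes and fails loudly if it never does.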

aholyoake-bc · 2022-01-14T14:10:59Z

That's great - glad our fixes match up!

This PR introduces support for using Nomad on systems with cgroups v2 [1] enabled as the cgroups controller mounted on /sys/fs/cgroup. Newer Linux distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer, but not so for managing cpuset cgroups. Before, Nomad made use of a feature in v1 where a PID could be a member of more than one cgroup. In v2 this is no longer possible, and so the logic around computing cpuset values must be modified. When Nomad detects v2, it manages cpuset values in-process, rather than making use of cgroup hierarchy inheritance via shared/reserved parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at /sys/fs/cgroup. This means on systems running in hybrid mode with cgroups2 mounted at /sys/fs/cgroup/unified (as is typical) Nomad will continue to use the v1 logic, and should operate as before. Systems that do not support cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless otherwise configured in Client config), and create cgroups for tasks using the naming convention <allocID>-<task>.scope. These follow the naming convention set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version, which will be set to "v1" or "v2" to indicate the cgroups regime in use by Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that spawned processes on startup would "leak". In cgroups v2, the PIDs are started in the cgroup they will always live in, and thus the cause of the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
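
To check which side of that rule a given client falls on, the filesystem type at the cgroup mount point is usually enough. The snippet below is a sketch that mirrors the rule as described in the PR text, not Nomad's actual detection code. The function name `cgroups_version` is made up, and it assumes GNU `stat` and the conventional mount point `/sys/fs/cgroup`.

```bash
#!/usr/bin/env bash
# cgroups_version FSTYPE: map the filesystem type found at the cgroup mount
# point to the regime described above. Only a pure cgroup2 mount counts as
# v2; a tmpfs (v1 or hybrid mode) or anything else is treated as v1.
cgroups_version() {
	case "$1" in
		cgroup2fs) echo v2 ;;
		*) echo v1 ;;
	esac
}

# GNU stat prints the filesystem type name with -f -c %T.
cgroups_version "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```

On a hybrid host the mount point itself is a tmpfs with cgroup2 mounted further down at `unified`, so this reports v1, which matches the behaviour the PR describes.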


github-actions · 2022-10-10T02:44:10Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

gitrgoliveira added the type/bug label Dec 17, 2021

tgross added the hcc/cst Admin - internal label Jan 10, 2022

mikenomitch added the theme/cgroups cgroups issues label Jan 18, 2022

shoenig mentioned this issue Mar 21, 2022

client: enable cpuset support for cgroups.v2 #12274

Merged

shoenig closed this as completed in #12274 Mar 24, 2022

github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022
