cgroups: consider pre-setting cpuset in docker driver #12374

Closed
shoenig opened this issue Mar 24, 2022 · 7 comments
shoenig (Member) commented Mar 24, 2022

With the cgroups v2 implementation, we could potentially set the cpuset value in the TaskConfig to the reserved cores. The tradeoff is that I don't think we have access to the shared cores at that point, meaning the task would be limited to only its reserved cores at launch time.
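
To make the tradeoff concrete, a minimal sketch (not the driver's actual plumbing) of what pre-setting the cpuset would look like using the Docker Go SDK types; the hostConfigForReservedCores helper and the "2-3,6" value are hypothetical.

package main

import (
	"fmt"

	"github.com/docker/docker/api/types/container"
)

// hostConfigForReservedCores builds a Docker HostConfig whose cpuset is fixed
// to the task's reserved cores only. reservedCores is a cpuset-notation
// string such as "2-3,6" (hypothetical value).
func hostConfigForReservedCores(reservedCores string) *container.HostConfig {
	return &container.HostConfig{
		Resources: container.Resources{
			// CpusetCpus is Docker's standard cpuset knob. Pre-set at create
			// time, the task only ever sees its reserved cores and never
			// gains access to the shared cores, which is the tradeoff above.
			CpusetCpus: reservedCores,
		},
	}
}

func main() {
	hc := hostConfigForReservedCores("2-3,6")
	fmt.Println("cpuset at launch:", hc.Resources.CpusetCpus)
}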

aholyoake-bc (Contributor) commented

For better visibility, from #12274:

Our trading infrastructure reads the CPUs in the cgroup on startup and indexes into the cpuset to pin threads to specific cores. If the cgroup gets modified after we've done this thread-to-core pinning, I'm not entirely sure what happens, but it's bad: either threads stay pinned to CPUs outside the allocated cgroup, or it just breaks altogether.
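
For context, a rough sketch of that startup pattern (read the cgroup's cpuset, then pin the calling thread), assuming a cgroup v2 unified hierarchy and golang.org/x/sys/unix; the cgroup path and the choice of core are illustrative, not the actual trading code.

package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// parseCpuset expands cpuset notation like "0-3,7" into a list of core IDs.
func parseCpuset(s string) ([]int, error) {
	var cores []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			start, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for c := start; c <= end; c++ {
				cores = append(cores, c)
			}
		} else {
			c, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cores = append(cores, c)
		}
	}
	return cores, nil
}

func main() {
	// Under cgroups v2 this file contains reserved *and* shared cores, which
	// is exactly the behavior change being discussed here (path illustrative).
	raw, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective")
	if err != nil {
		panic(err)
	}
	cores, err := parseCpuset(string(raw))
	if err != nil || len(cores) == 0 {
		panic(fmt.Sprintf("unusable cpuset: %v", err))
	}

	// Pin the current OS thread to the first core in the set; a real system
	// would do this once per worker thread.
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(cores[0])
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		panic(err)
	}
	fmt.Println("pinned to core", cores[0])
}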

shoenig (Member, Author) commented Mar 29, 2022

@aholyoake-bc thanks for pointing out your use case; that is a case where there is a detectable change in behavior. With the cgroups v1 controller you could read just the reserved cgroup's cpuset, but with v2 there is only one cpuset, and it contains both the reserved and shared cores.

The immediate workaround would be either to avoid using a system with only cgroup v2 support (the v2 behavior is disabled on v1 and hybrid systems) or to avoid upgrading Nomad. We could also consider adding a resource option flag for disabling shared cores for a task, or perhaps a simple environment variable exposed to the task, like NOMAD_RESERVED_CORES, that could be read instead of the cpuset.

While working on cgroups v2 support I mapped out the plumbing necessary for proper NUMA-aware scheduling. An important part of that will be enabling strict CPU core pinning based on system topology (i.e. memory and PCIe device locality). I think that is going to be the long-term solution for use cases like yours.

aholyoake-bc (Contributor) commented
FWIW, we would be very supportive of having this information exposed in an environment variable, as it would make the migration to having Nomad manage the cores significantly easier.

I might be pushing my luck here, but exposing it as something like

NOMAD_RESERVED_CORE_0=
NOMAD_RESERVED_CORE_1= ...

would be even dreamier

aholyoake-bc (Contributor) commented
Out of interest, what are your current ideas for how NUMA-aware scheduling would be defined (basically, what it would look like in the job file)?

shoenig (Member, Author) commented Apr 1, 2022

This is off the top of my head and 1000% subject to change, but something like

job "models" {
  datacenters = ["dc1"]

  group "build-models" {
    task "model1" {
      driver = "exec"
      config {
        command = "/bin/builder"
      }
      resources {
        cores  = 8
        memory = 8192
        numa {
          mode    = "<disable|preferred|required>"
          devices = ["nvidia/gpu"]
        }
        device "nvidia/gpu" {
          # ... 
        }
        device "other" {
          # ... 
        }
      }
    }
  }
}

Where disable is the current (NUMA-unaware) behavior; preferred means the scheduler will place the allocation on a Nomad client where the core, memory, and device requirements fit on a single NUMA node if possible, or elsewhere if not; and required means scheduling the alloc will block until those requirements can be satisfied on a single node.

As an aside, I think the client configuration could also take isolcpus into account by setting the (currently undocumented) reservable_cores to "isolcpus", with Nomad looking up and deferring to those values.

I realize this only scratches the surface of NUMA topology complexity, so by all means I'm open to feedback!
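
Regarding the isolcpus aside above, the kernel already exposes the isolated set through sysfs, so the client-side lookup could be as simple as the sketch below; the isolatedCores helper is a hypothetical name, not Nomad code.

package main

import (
	"fmt"
	"os"
	"strings"
)

// isolatedCores returns the kernel's isolcpus set in cpuset notation, as
// exposed by sysfs. An empty string means no cores are isolated.
func isolatedCores() (string, error) {
	raw, err := os.ReadFile("/sys/devices/system/cpu/isolated")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

func main() {
	set, err := isolatedCores()
	if err != nil {
		panic(err)
	}
	// e.g. "2-5" when the host was booted with isolcpus=2-5
	fmt.Printf("isolcpus: %q\n", set)
}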

shoenig added a commit that referenced this issue Apr 7, 2022
This PR injects the 'NOMAD_CPU_CORES' environment variable into
tasks that have been allocated reserved cpu cores. The value uses
normal cpuset notation, as found in cpuset.cpus cgroup interface files.

Note this value is not necessarily the same as the content of the actual
cpuset.cpus interface file, which will also include shared cpu cores when
using cgroups v2. This variable is a workaround for users who used to be
able to read the reserved cgroup cpuset file, but lose the information
about distinct reserved cores when using cgroups v2.

Side discussion in: #12374
shoenig (Member, Author) commented Apr 7, 2022

#12496 sets a NOMAD_CPU_CORES environment variable containing the cpuset-style list of cores reserved for the task (not including shared cores). It missed the beta, but will be in Nomad 1.3.

I'm going to go ahead and close this ticket. There is no way to achieve the exactly correct behavior of setting the Docker task config's cpuset to reserved + shared cores before launching the task, because that value is dynamic and unknowable until the task is actually launching. Doing something different would just trade one incorrect behavior for another, so it seems sensible to leave the existing behavior as-is.
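
For completeness, a small sketch of the task-side usage NOMAD_CPU_CORES enables (assuming Nomad 1.3+, where the variable is populated for tasks with reserved cores); the cgroup fallback path is illustrative only.

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Prefer the distinct reserved-core list injected by Nomad; fall back to
	// the cgroup cpuset file, which on cgroups v2 also contains shared cores.
	cores := os.Getenv("NOMAD_CPU_CORES")
	if cores == "" {
		if raw, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective"); err == nil {
			cores = strings.TrimSpace(string(raw))
		}
	}
	fmt.Println("cores to pin against:", cores)
}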

shoenig closed this as completed Apr 7, 2022
github-actions bot commented Oct 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2022