cgroups: consider pre-setting cpuset in docker driver #12374

Closed
shoenig opened this issue Mar 24, 2022 · 7 comments
shoenig (Member) commented Mar 24, 2022

With the cgroups v2 implementation, we could potentially set the cpuset value in the TaskConfig to the reserved cores. The tradeoff is that I don't think we have access to the shared cores at that point, meaning the task would be limited to only its reserved cores at launch time.
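
To make the tradeoff concrete, a minimal sketch (not the driver's actual plumbing) of what pre-setting the cpuset would look like using the Docker Go SDK types; the hostConfigForReservedCores helper and the "2-3,6" value are hypothetical.

package main

import (
	"fmt"

	"github.com/docker/docker/api/types/container"
)

// hostConfigForReservedCores builds a Docker HostConfig whose cpuset is fixed
// to the task's reserved cores only. reservedCores is a cpuset-notation
// string such as "2-3,6" (hypothetical value).
func hostConfigForReservedCores(reservedCores string) *container.HostConfig {
	return &container.HostConfig{
		Resources: container.Resources{
			// CpusetCpus is Docker's standard cpuset knob. Pre-set at create
			// time, the task only ever sees its reserved cores and never
			// gains access to the shared cores, which is the tradeoff above.
			CpusetCpus: reservedCores,
		},
	}
}

func main() {
	hc := hostConfigForReservedCores("2-3,6")
	fmt.Println("cpuset at launch:", hc.Resources.CpusetCpus)
}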

aholyoake-bc (Contributor) commented

For better visibility, from #12274:

Our trading infrastructure reads the CPUs in the cgroup on startup and indexes into the cpuset to pin threads to specific cores. If the cgroup gets modified after we've done this thread-to-core pinning, I'm not entirely sure what happens, but it's bad: either threads stay pinned to CPUs outside the allocated cgroup, or it just breaks altogether.
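
For context, a rough sketch of that startup pattern (read the cgroup's cpuset, then pin the calling thread), assuming a cgroup v2 unified hierarchy and golang.org/x/sys/unix; the cgroup path and the choice of core are illustrative, not the actual trading code.

package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// parseCpuset expands cpuset notation like "0-3,7" into a list of core IDs.
func parseCpuset(s string) ([]int, error) {
	var cores []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			start, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for c := start; c <= end; c++ {
				cores = append(cores, c)
			}
		} else {
			c, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cores = append(cores, c)
		}
	}
	return cores, nil
}

func main() {
	// Under cgroups v2 this file contains reserved *and* shared cores, which
	// is exactly the behavior change being discussed here (path illustrative).
	raw, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective")
	if err != nil {
		panic(err)
	}
	cores, err := parseCpuset(string(raw))
	if err != nil || len(cores) == 0 {
		panic(fmt.Sprintf("unusable cpuset: %v", err))
	}

	// Pin the current OS thread to the first core in the set; a real system
	// would do this once per worker thread.
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(cores[0])
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		panic(err)
	}
	fmt.Println("pinned to core", cores[0])
}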

shoenig (Member, Author) commented Mar 29, 2022

@aholyoake-bc thanks for pointing out your use case; that is a case where there is a detectable change in behavior. With the cgroups v1 controller you could read just the reserved cgroup's cpuset, but with v2 there is only one cpuset, and it contains both the reserved and shared cores.

The immediate workaround would be either to avoid using a system with only cgroup v2 support (the v2 behavior is disabled on v1 and hybrid systems) or to avoid upgrading Nomad. We could also consider adding a resource option flag for disabling shared cores for a task, or perhaps a simple environment variable exposed to the task, like NOMAD_RESERVED_CORES, that could be read instead of the cpuset.

While working on cgroups v2 support I mapped out the plumbing necessary for proper NUMA-aware scheduling. An important part of that will be enabling strict CPU core pinning based on system topology (i.e. memory and PCIe device locality). I think that is going to be the long-term solution for use cases like yours.

aholyoake-bc (Contributor) commented
FWIW, we would be very supportive of having this information exposed in an environment variable, as it would make the migration to having Nomad manage the cores significantly easier.

I might be pushing my luck here, but exposing it as something like

NOMAD_RESERVED_CORE_0=
NOMAD_RESERVED_CORE_1= ...

would be even dreamier

aholyoake-bc (Contributor) commented
Out of interest, what are your current ideas for how NUMA-aware scheduling would be defined (basically, what it would look like in the job file)?

shoenig (Member, Author) commented Apr 1, 2022

This is off the top of my head and 1000% subject to change, but something like

job "models" {
  datacenters = ["dc1"]

  group "build-models" {
    task "model1" {
      driver = "exec"
      config {
        command = "/bin/builder"
      }
      resources {
        cores  = 8
        memory = 8192
        numa {
          mode    = "<disable|preferred|required>"
          devices = ["nvidia/gpu"]
        }
        device "nvidia/gpu" {
          # ... 
        }
        device "other" {
          # ... 
        }
      }
    }
  }
}

Where disable is the current (NUMA-unaware) behavior; preferred means the scheduler will place the allocation on a Nomad client where the core, memory, and device requirements fit on a single NUMA node if possible, or elsewhere if not; and required means scheduling the alloc will block until those requirements can be satisfied on a single node.

As an aside, I think the client configuration could also take isolcpus into account by setting the (currently undocumented) reservable_cores to "isolcpus", with Nomad looking up and deferring to those values.

I realize this only scratches the surface of NUMA topology complexity, so by all means I'm open to feedback!
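
Regarding the isolcpus aside above, the kernel already exposes the isolated set through sysfs, so the client-side lookup could be as simple as the sketch below; the isolatedCores helper is a hypothetical name, not Nomad code.

package main

import (
	"fmt"
	"os"
	"strings"
)

// isolatedCores returns the kernel's isolcpus set in cpuset notation, as
// exposed by sysfs. An empty string means no cores are isolated.
func isolatedCores() (string, error) {
	raw, err := os.ReadFile("/sys/devices/system/cpu/isolated")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

func main() {
	set, err := isolatedCores()
	if err != nil {
		panic(err)
	}
	// e.g. "2-5" when the host was booted with isolcpus=2-5
	fmt.Printf("isolcpus: %q\n", set)
}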

shoenig added a commit that referenced this issue Apr 7, 2022
This PR injects the 'NOMAD_CPU_CORES' environment variable into
tasks that have been allocated reserved cpu cores. The value uses
normal cpuset notation, as found in cpuset.cpus cgroup interface files.

Note this value is not necessarily the same as the content of the actual
cpuset.cpus interface file, which will also include shared cpu cores when
using cgroups v2. This variable is a workaround for users who used to be
able to read the reserved cgroup cpuset file, but lose the information
about distinct reserved cores when using cgroups v2.

Side discussion in: #12374
shoenig (Member, Author) commented Apr 7, 2022

#12496 sets a NOMAD_CPU_CORES environment variable containing the cpuset-style list of cores reserved for the task (not including shared cores). It missed the beta, but will be in Nomad 1.3.

I'm going to go ahead and close this ticket. There is no way to achieve the exactly correct behavior of setting the Docker task config's cpuset to reserved + shared cores before launching the task, because that value is dynamic and unknowable until the task is actually launching. Doing something different would just trade one incorrect behavior for another, so it seems sensible to leave the existing behavior as-is.
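
For completeness, a small sketch of the task-side usage NOMAD_CPU_CORES enables (assuming Nomad 1.3+, where the variable is populated for tasks with reserved cores); the cgroup fallback path is illustrative only.

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Prefer the distinct reserved-core list injected by Nomad; fall back to
	// the cgroup cpuset file, which on cgroups v2 also contains shared cores.
	cores := os.Getenv("NOMAD_CPU_CORES")
	if cores == "" {
		if raw, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective"); err == nil {
			cores = strings.TrimSpace(string(raw))
		}
	}
	fmt.Println("cores to pin against:", cores)
}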

shoenig closed this as completed Apr 7, 2022
github-actions bot commented Oct 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2022