
CPU Manager for Nomad #8473

Closed · shishir-a412ed opened this issue Jul 20, 2020 · 10 comments

shishir-a412ed (Contributor) commented Jul 20, 2020

CPU Manager for Nomad

Overview

The completely fair scheduler, or CFS (also referred to as the kernel task scheduler), is, as the name suggests, completely fair :), meaning it treats all available CPUs equally and assigns process threads to any available CPU. However, CFS is preemptive: if other process threads have been starving for a long time, it will preempt the currently running threads to make room for the waiting ones.

[Figure: CFS scheduling processes across a 4-core system]

E.g. in the 4-core system above, CFS schedules processes A, B, C, and D on the four available cores. After some time, processes {E, F, G, and H} start starving, and CFS preempts the currently running processes to schedule {E, F, G, and H}.

This is great for multitasking and for achieving high CPU utilization, but it is not great for latency-sensitive workloads: such a workload gets kicked off its CPU in favor of a starving workload, and its performance suffers. We need a way to run these latency-sensitive workloads on a dedicated CPU set that CFS does not control.

CPU as a resource

What is a CPU?

In most Linux distributions, the CPU is exposed as a collection of resource controls.

  • CFS shares: This treats the CPU in the notion of time. It answers the question: what is my weighted fair share of CPU time on the system?

    E.g. if 1 core = 1024 shares on a 4-core system, a container or process requesting 512 shares gets half a core's worth of time, i.e. if a scheduling period is 500 microseconds, it gets 250 microseconds of execution time per period. (Shares are relative weights, so they only take effect when the CPU is contended.)

  • CFS quota: This also treats the CPU in the notion of time. It answers the question: what is my hard cap of CPU time over a period? To understand CFS quota we need two knobs.

    • cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
    • cpu.cfs_period_us: the length of a period (in microseconds)

E.g. if cpu.cfs_quota_us = 250 and cpu.cfs_period_us = 250, the process gets one full CPU, i.e. it can consume an entire CPU's worth of run-time every period.

Another example: if cpu.cfs_quota_us = 10 and cpu.cfs_period_us = 50, the process gets 20% of a CPU every period. Once the process hits its quota, it is throttled until the next period begins.

Both knobs are applied at the cgroup level.

  • CPU affinity: Restricts which logical CPUs a process is allowed to execute on. Kubernetes (k8s) uses this (via the cpuset cgroup controller) for exclusive CPU assignment. A short shell sketch of all three controls follows this list.
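
To make these controls concrete, here is a rough cgroup v1 shell sketch (assuming cgroup v1 is mounted at /sys/fs/cgroup; the demo cgroup name, the PID 1234, and the numbers are purely illustrative):

# CFS shares: a relative weight of 512, i.e. half of the 1024 that represents one core
$ mkdir -p /sys/fs/cgroup/cpu/demo
$ echo "512" > /sys/fs/cgroup/cpu/demo/cpu.shares

# CFS quota: 10000us of run-time per 50000us period = a hard cap of 20% of one CPU
$ echo "50000" > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
$ echo "10000" > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us

# CPU affinity: restrict an already-running process (PID 1234) to logical CPUs 0-3
$ taskset -cp 0-3 1234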

How does Kubernetes do it today?

Kubernetes (k8s) uses CFS quota (explained above under “CPU as a resource”) as a resource control to manage CPUs.

The k8s operator first sets --cpu-manager-policy=static as a kubelet option. This isolates a set of CPUs from the CFS view so they can be allocated for dedicated usage. Exclusivity is enforced using the cpuset cgroup controller.
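
For reference, a kubelet invocation along these lines enables the static policy (a sketch only; the reserved-CPU values here are illustrative, and in practice the static policy expects an integer CPU reservation via --kube-reserved/--system-reserved or --reserved-cpus):

$ kubelet --cpu-manager-policy=static \
          --kube-reserved=cpu=1 \
          --system-reserved=cpu=1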

A user can then request CPU units under three classes. The user has to specify requests and limits, and based on these values the pod's class is determined.

  • Guaranteed (requests == limits): You get exclusive access to a set of CPUs in this class. E.g. if requests=4 and limits=4, the user gets guaranteed access to 4 CPU units. This is good for latency-sensitive applications that require dedicated CPU access.

  • Burstable (requests < limits): You get guaranteed access up to requests and can burst up to limits if resources are available in the system. E.g. if requests=4 and limits=10, the user gets guaranteed access to 4 CPU units, and the application can burst up to 10 CPUs if resources are available. The extra 6 CPU units can be preempted by the system if a higher-priority job needs them. This is good for jobs that can set a lower requests value, which increases their probability of being placed quickly, and can then burst later if resources are available.

  • Best effort (requests == 0): This is the bottom of the barrel; the system makes no guarantees and will make a best effort to allocate whatever is possible to the application.

Here a CPU unit is:

  • 1 AWS vCPU
  • 1 GCP Core
  • 1 Azure vCore
  • 1 Hyperthread on a bare-metal Intel processor with Hyperthreading

Example Guaranteed QoS job

apiVersion: v1
kind: Pod
metadata:
  name: exclusive-2
spec:
  containers:
  - image: quay.io/connordoyle/cpuset-visualizer
    name: exclusive-2
    resources:
      # Pod is in the Guaranteed QoS class because requests == limits
      requests:
        # CPU request is an integer
        cpu: 2
        memory: "256M"
      limits:
        cpu: 2
        memory: "256M"

How should Nomad do it?

Key takeaway from k8s: Kubernetes primarily uses cgroups (the cpu and cpuset subsystems, or resource controllers) to isolate and control CPUs.

Let's take the example of an 8-core Intel system with hyperthreading enabled. Here 8 physical cores = 16 virtual cores.

Cores = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}

Nomad client: When the Nomad client daemon comes up, it should reserve some CPUs for exclusive access and remove them from the CFS view, so that workloads assigned to those CPUs do not get preempted.

CPUs for exclusive access = number of cores (0-15) - system reserved cores (cores needed for system work) - Nomad reserved cores (cores needed for the Nomad client)

Let’s say both system and nomad need two cores each.

System reserved cores = 14,15
Nomad reserved cores = 12,13

CPUs for exclusive access = {0,1,2,3,4,5,6,7,8,9,10,11} [6 physical cores]

The Nomad client should create a cgroup under the cpuset subsystem (resource controller), assign {0-11} to cpuset.cpus, and enable (set to 1) the cpuset.cpu_exclusive flag for exclusive access.

$ echo "0-11" > /sys/fs/cgroup/cpuset/nomad/cpuset.cpus
$ echo "1" > /sys/fs/cgroup/cpuset/nomad/cpuset.cpu_exclusive
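
One practical note (an assumption about cgroup v1 cpuset behavior, not something from the original write-up): the cpuset controller also requires cpuset.mems to be populated before any task can be attached, e.g.:

$ echo "0" > /sys/fs/cgroup/cpuset/nomad/cpuset.mems    # assumes memory node 0 on a single-NUMA-node host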

At this point, the Nomad client has exclusive access to CPU units 0-11.
Now, the user launches a Nomad job with the following spec:


job "example" {
  datacenters = ["dc1"]

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      resources {
        # cpu    = 500
        cpu-cores = 2
        memory = 256

        network {
          mbits = 10
          port  "db"  {}
        }
      }
    }
  }
}

For the above job, the Nomad client should create a cgroup example under the nomad parent cgroup and assign two cores to it.

$ echo "0-1" > /sys/fs/cgroup/cpuset/nomad/example/cpuset.cpus
$ echo "1" > /sys/fs/cgroup/cpuset/nomad/example/cpuset.cpu_exclusive

When the Nomad client launches the job (example), it should attach the job's PIDs to the example cgroup. This is done by writing the job's PIDs to the /sys/fs/cgroup/cpuset/nomad/example/cgroup.procs file.
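
A minimal sketch of that step (the PID is just a placeholder for whatever the task driver reports):

$ echo "4321" > /sys/fs/cgroup/cpuset/nomad/example/cgroup.procs    # 4321 = hypothetical PID of the redis task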

Design consideration

We can keep cpu (e.g. 500 MHz) and cpu-cores (e.g. 2 cores) as mutually exclusive options, i.e. if a user requests cpu (CPU in MHz, in a shared setting), they cannot request cpu-cores (for exclusive access) at the same time.

Nomad should return an error to the user if both are set.

Error: cpu and cpu-cores are mutually exclusive options, and only one of them should be set.

This also maintains backward compatibility for all the jobs that have been using cpu.


llchan commented Jul 21, 2020

A couple misc questions/comments:

  • It wasn't explicitly stated that cpu-cores is integral, but perhaps fractional cpu cores could be useful, e.g. if two jobs want a dedicated half-core each
  • It may also be good to start thinking about NUMA-related specification, e.g. if the job needs multiple cores on the same socket.

chuckyz (Contributor) commented Jul 21, 2020

@llchan

  • cgroup cpusets are presently whole-"cpu" only; to my current understanding, fractional CPU usage is accomplished in cgroups today through CFS, which can be used with Nomad right now. There's a different discussion to be had about the UX of using MHz (e.g.: cpu = 500) and whether you could specify cpu = 1. My current opinion is that if you need that, write/use a wrapper (ala Levant) that can consume a template and do the math for the user (e.g.: cpu = [[multiply .cpu 3700]] for a 3.7GHz processor).

  • This spec does not cover NUMA, and without dragging this into a huge discussion about NUMA-awareness I agree it is very important. That said, something shipped and usable is better than something that's never finished. I think it's a perfect "v2" feature.

llchan commented Jul 22, 2020

  • Yeah, I would expect fractional cpu allocations to be share-based, but iiuc the current cpu sharing is host-wide and not cpuset-bound. For fractional cpu-cores we could create an exclusive cgroup cpuset for the allocated core(s), and allow multiple children with the appropriate sharing weights to be placed inside that cgroup. This way they are bound to a core and have more predictable neighbors, but can still be "multi-tenant" on that core (a rough sketch of this layout follows after this list).
  • Very much agreed that we can wait to implement the NUMA stuff, was just mentioning it as something to keep in the back of our minds as we plan out config specs.
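
A rough cgroup v1 sketch of that layout (everything here is hypothetical: the core3/task_a/task_b cgroup names, the PID_A/PID_B variables, and the choice of core 3; in cgroup v1 the cpuset and cpu controllers live in separate hierarchies, so the pinning and the weighting are set up separately for the same tasks):

# pin both tenants to the same core (core 3) in the cpuset hierarchy
$ mkdir -p /sys/fs/cgroup/cpuset/nomad/core3
$ echo "3" > /sys/fs/cgroup/cpuset/nomad/core3/cpuset.cpus
$ echo "0" > /sys/fs/cgroup/cpuset/nomad/core3/cpuset.mems
$ echo "$PID_A" > /sys/fs/cgroup/cpuset/nomad/core3/cgroup.procs
$ echo "$PID_B" > /sys/fs/cgroup/cpuset/nomad/core3/cgroup.procs

# give each tenant half of that core via relative weights in the cpu hierarchy
$ mkdir -p /sys/fs/cgroup/cpu/nomad/task_a /sys/fs/cgroup/cpu/nomad/task_b
$ echo "512" > /sys/fs/cgroup/cpu/nomad/task_a/cpu.shares
$ echo "512" > /sys/fs/cgroup/cpu/nomad/task_b/cpu.shares
$ echo "$PID_A" > /sys/fs/cgroup/cpu/nomad/task_a/cgroup.procs
$ echo "$PID_B" > /sys/fs/cgroup/cpu/nomad/task_b/cgroup.procs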

robloxrob commented

This would be helpful for running game workloads. It can be an issue when the process switches among cores or across NUMA boundaries.

james-masson commented Nov 6, 2020

How do you see this interacting with the usual tunings in this space, e.g. isolcpus and systemd's CPUAffinity?

Generally if you set things like these, you're expecting to allocate your processes in certain CPU ranges - having Nomad choose CPUs outside these ranges is counterproductive.

I think there are three possible approaches:

  1. allowing the user to select particular CPUs through Nomad - rather than just "allocate me 2 cpus please"
  2. allowing the user to select particular pre-created cgroups
  3. having Nomad understand isolcpus and CPUAffinity

shishir-a412ed (Contributor, Author) commented

@james-masson isolcpus is just another way to achieve CPU exclusivity: you set a kernel parameter to exclude a set of CPUs from the kernel task scheduler (CFS). Those CPUs can then be dedicated to running latency-sensitive workloads, since they won't be subject to preemption by CFS.
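
For context, isolcpus is set on the kernel command line at boot; a sketch of what that looks like (the CPU list 12-15 is just an example):

# appended to the existing kernel command line, e.g. in /etc/default/grub
GRUB_CMDLINE_LINUX="isolcpus=12-15"
# regenerate the bootloader config and reboot for the change to take effect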

We are solving this exact same problem using the cgroups cpuset resource controller. I guess it's just a different approach to the same problem.

Regarding systemd's CPUAffinity: IIUC, that controls the CPU affinity of processes started by systemd. That should be orthogonal to what we are doing here, since we are not using systemd to launch our workloads; rather, the workloads are launched by Nomad (the orchestration system) using a task driver, e.g. docker.

  1. I don't think the user should care about which CPUs ("0-2" or "3-5") they get assigned, as long as the workload has dedicated access to three (3) CPUs. There are use cases, e.g. NUMA-aware applications, where you would want to pin the CPUs, but that is not the problem I am trying to solve in this proposal.

If you are interested in (1), I have an open PR #8291 for CPU pinning using the docker driver.

  2. I don't think the user should ever care about the underlying cgroups. That is a really low-level construct to expose to the end user of an orchestration system.

  3. This proposal is for achieving CPU exclusivity using the cgroups cpuset subsystem. I think if you are interested in making Nomad isolcpus- or systemd-CPUAffinity-aware, a separate proposal would be better.

Having said that, this is proposed as an optional parameter, so if you (hypothetically) have some way to isolate your CPUs using isolcpus, you can choose not to use this.

james-masson commented

> @james-masson isolcpus is just another way to achieve CPU exclusivity where you set a kernel parameter to exclude a set of CPU from the kernel task scheduler (CFS). Those CPUs can then be dedicated for running latency-sensitive workloads as they won't be subjected to preemption by CFS.

I think it also has an effect on the kernel thread scheduling, not just user-space. You tend to use it when you want control above and beyond userspace. Commonly used with manual IRQ pinning too. It's a go-to tuning for minimising jitter when you really don't want a context switch.

> Regarding systemd's CPUAffinity IIUC, that is to control the CPU Affinity of a systemd process. That should be orthogonal to what we are doing here, since we are not using systemd to launch our workloads, rather the workloads are launched by nomad (the orchestration system) using a task driver e.g. docker.

Yes - systemd's CPUAffinity is all about pulling the rest of the OS - including Nomad itself - away from the cores you want to use for your high-performance/low-jitter workloads.

The combined effect of isolcpus and systemd's CPUAffinity should be to leave a large set of cores running nothing - not even kernel threads - ready for your sensitive workloads.

My point is - my customers in this space generally already have systems with isolcpus and systemd CPUAffinity ( and optimal IRQ affinity, nohz_full and more) - large multi-socket systems tuned to the hilt for performance.

While I've used Nomad before for this sort of workload, it's always involved a custom layer to manage the CPU allocations.
I was hoping that this feature would make this custom layer unnecessary - at its simplest, it could be a Nomad agent config that says: use cores 6-11 and 18-23 for the CPU manager feature.

shishir-a412ed (Contributor, Author) commented

@james-masson Apart from NUMA awareness, what is it that this proposal doesn't address for you?
You can still use the cgroups cpuset resource controller for exclusive access to CPUs and run your workloads on dedicated CPUs with no context switches.

What this proposal doesn't guarantee is which CPU you will get allocated, which is similar to Kubernetes CPU manager as it also doesn't offer this guarantee: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/#limitations

After reading your comments, it looks like your customers have high-performance/low-jitter workloads that need NUMA-aware CPU placement, e.g. running the workload on a CPU close to the bus connecting a high-performance NIC so that it can avoid cross-socket traffic.

I'm not saying NUMA is not important, but we are intentionally keeping it out of this proposal to make the initial pass easier to implement and more in line with the k8s CPU manager.

Also, there are some internal discussions going on within Hashicorp (I am not fully aware, but maybe someone from Hashicorp can chime in) on how they want to roll out this feature. They might already have NUMA on their roadmap.

PS:

> I was hoping that this feature would make this custom layer unnecessary - at its simplest, it could be a Nomad agent config that says - use cores 6-11 and 18-23 for the CPU manager feature.

This is already covered in this proposal: under "How should Nomad do it?" ---> Nomad client.

tgross (Member) commented May 7, 2021

Shipped in Nomad 1.1.0-beta

tgross closed this as completed May 7, 2021

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022