
CPU Manager for Nomad #8473

Closed · shishir-a412ed opened this issue Jul 20, 2020 · 10 comments

shishir-a412ed (Contributor) commented Jul 20, 2020

CPU Manager for Nomad

Overview

The completely fair scheduler, or CFS (also referred to as the kernel task scheduler), is, as the name suggests, completely fair :), meaning it treats all available CPUs equally and assigns process threads to any available CPU. However, CFS is preemptive: if other process threads have been starving for a long time, it will preempt the currently running threads to make room for the waiting ones.

[Figure: CFS scheduling processes across a 4-core system]

E.g. in the 4-core system above, CFS schedules processes A, B, C, and D on the four available cores. After some time, processes {E, F, G, and H} start starving, and CFS preempts the currently running processes to schedule {E, F, G, and H}.

This is great for multitasking and for achieving high CPU utilization, but it is not great for latency-sensitive workloads: such a workload gets kicked off its CPU in favor of a starving workload, and its performance suffers. We need a way to run these latency-sensitive workloads on a dedicated CPU set that CFS does not control.

CPU as a resource

What is a CPU?

In most Linux distributions, the CPU is exposed as a collection of resource controls.

  • CFS shares: This treats the CPU in the notion of time. It answers the question: what is my weighted fair share of CPU time on the system?

    E.g. if 1 core = 1024 shares on a 4-core system, a container or process requesting 512 shares gets half a core's worth of time, i.e. if a scheduling period is 500 microseconds, it gets 250 microseconds of execution time per period. (Shares are relative weights, so they only take effect when the CPU is contended.)

  • CFS quota: This also treats the CPU in the notion of time. It answers the question: what is my hard cap of CPU time over a period? To understand CFS quota we need two knobs.

    • cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
    • cpu.cfs_period_us: the length of a period (in microseconds)

E.g. if cpu.cfs_quota_us = 250 and cpu.cfs_period_us = 250, the process gets one full CPU, i.e. it can consume an entire CPU's worth of run-time every period.

Another example: if cpu.cfs_quota_us = 10 and cpu.cfs_period_us = 50, the process gets 20% of a CPU every period. Once the process hits its quota, it is throttled until the next period begins.

Both knobs are applied at the cgroup level.

  • CPU affinity: Restricts which logical CPUs a process is allowed to execute on. Kubernetes (k8s) uses this (via the cpuset cgroup controller) for exclusive CPU assignment. A short shell sketch of all three controls follows this list.
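
To make these controls concrete, here is a rough cgroup v1 shell sketch (assuming cgroup v1 is mounted at /sys/fs/cgroup; the demo cgroup name, the PID 1234, and the numbers are purely illustrative):

# CFS shares: a relative weight of 512, i.e. half of the 1024 that represents one core
$ mkdir -p /sys/fs/cgroup/cpu/demo
$ echo "512" > /sys/fs/cgroup/cpu/demo/cpu.shares

# CFS quota: 10000us of run-time per 50000us period = a hard cap of 20% of one CPU
$ echo "50000" > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
$ echo "10000" > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us

# CPU affinity: restrict an already-running process (PID 1234) to logical CPUs 0-3
$ taskset -cp 0-3 1234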

How does Kubernetes do it today?

Kubernetes (k8s) uses CFS quota (explained above under “CPU as a resource”) as a resource control to manage CPUs.

The k8s operator first sets --cpu-manager-policy=static as a kubelet option. This isolates a set of CPUs from the CFS view so they can be allocated for dedicated usage. Exclusivity is enforced using the cpuset cgroup controller.
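
For reference, a kubelet invocation along these lines enables the static policy (a sketch only; the reserved-CPU values here are illustrative, and in practice the static policy expects an integer CPU reservation via --kube-reserved/--system-reserved or --reserved-cpus):

$ kubelet --cpu-manager-policy=static \
          --kube-reserved=cpu=1 \
          --system-reserved=cpu=1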

A user can then request CPU units under three classes. The user has to specify requests and limits, and based on these values the pod's class is determined.

  • Guaranteed (requests == limits): You get exclusive access to a set of CPUs in this class. E.g. if requests=4 and limits=4, the user gets guaranteed access to 4 CPU units. This is good for latency-sensitive applications that require dedicated CPU access.

  • Burstable (requests < limits): You get guaranteed access up to requests and can burst up to limits if resources are available in the system. E.g. if requests=4 and limits=10, the user gets guaranteed access to 4 CPU units, and the application can burst up to 10 CPUs if resources are available. The extra 6 CPU units can be preempted by the system if a higher-priority job needs them. This is good for jobs that can set a lower requests value, which increases their probability of being placed quickly, and can then burst later if resources are available.

  • Best effort (requests == 0): This is the bottom of the barrel; the system makes no guarantees and will make a best effort to allocate whatever is possible to the application.

Here a CPU unit is:

  • 1 AWS vCPU
  • 1 GCP Core
  • 1 Azure vCore
  • 1 Hyperthread on a bare-metal Intel processor with Hyperthreading

Example Guaranteed QoS job

apiVersion: v1
kind: Pod
metadata:
  name: exclusive-2
spec:
  containers:
  - image: quay.io/connordoyle/cpuset-visualizer
    name: exclusive-2
    resources:
      # Pod is in the Guaranteed QoS class because requests == limits
      requests:
        # CPU request is an integer
        cpu: 2
        memory: "256M"
      limits:
        cpu: 2
        memory: "256M"

How should Nomad do it?

Key takeaway from k8s: Kubernetes primarily uses cgroups (the cpu and cpuset subsystems, or resource controllers) to isolate and control CPUs.

Let's take the example of an 8-core Intel system with hyperthreading enabled. Here 8 physical cores = 16 virtual cores.

Cores = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}

Nomad client: When the Nomad client daemon comes up, it should reserve some CPUs for exclusive access and remove them from the CFS view, so that workloads assigned to those CPUs do not get preempted.

CPUs for exclusive access = number of cores (0-15) - system reserved cores (cores needed for system work) - Nomad reserved cores (cores needed for the Nomad client)

Let’s say both system and nomad need two cores each.

System reserved cores = 14,15
Nomad reserved cores = 12,13

CPUs for exclusive access = {0,1,2,3,4,5,6,7,8,9,10,11} [6 physical cores]

The Nomad client should create a cgroup under the cpuset subsystem (resource controller), assign {0-11} to cpuset.cpus, and enable (set to 1) the cpuset.cpu_exclusive flag for exclusive access.

$ echo "0-11" > /sys/fs/cgroup/cpuset/nomad/cpuset.cpus
$ echo "1" > /sys/fs/cgroup/cpuset/nomad/cpuset.cpu_exclusive
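
One practical note (an assumption about cgroup v1 cpuset behavior, not something from the original write-up): the cpuset controller also requires cpuset.mems to be populated before any task can be attached, e.g.:

$ echo "0" > /sys/fs/cgroup/cpuset/nomad/cpuset.mems    # assumes memory node 0 on a single-NUMA-node host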

At this point, the Nomad client has exclusive access to CPU units 0-11.
Now, the user launches a Nomad job with the following spec:


job "example" {
  datacenters = ["dc1"]

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      resources {
        # cpu    = 500
        cpu-cores = 2
        memory = 256

        network {
          mbits = 10
          port  "db"  {}
        }
      }
    }
  }
}

For the above job, the Nomad client should create a cgroup example under the nomad parent cgroup and assign two cores to it.

$ echo "0-1" > /sys/fs/cgroup/cpuset/nomad/example/cpuset.cpus
$ echo "1" > /sys/fs/cgroup/cpuset/nomad/example/cpuset.cpu_exclusive

When the Nomad client launches the job (example), it should attach the job's PIDs to the example cgroup. This is done by writing the job's PIDs to the /sys/fs/cgroup/cpuset/nomad/example/cgroup.procs file.
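
A minimal sketch of that step (the PID is just a placeholder for whatever the task driver reports):

$ echo "4321" > /sys/fs/cgroup/cpuset/nomad/example/cgroup.procs    # 4321 = hypothetical PID of the redis task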

Design consideration

We can keep cpu (e.g. 500 MHz) and cpu-cores (e.g. 2 cores) as mutually exclusive options, i.e. if a user requests cpu (CPU in MHz, in a shared setting), they cannot request cpu-cores (for exclusive access) at the same time.

Nomad should return an error to the user if both are set.

Error: cpu and cpu-cores are mutually exclusive options, and only one of them should be set.

This also maintains backward compatibility for all the jobs that have been using cpu.


llchan commented Jul 21, 2020

A couple misc questions/comments:

  • It wasn't explicitly stated that cpu-cores is integral, but perhaps fractional cpu cores could be useful, e.g. if two jobs want a dedicated half-core each
  • It may also be good to start thinking about NUMA-related specification, e.g. if the job needs multiple cores on the same socket.

chuckyz (Contributor) commented Jul 21, 2020

@llchan

  • cgroup cpusets are presently whole-"cpu" only; to my current understanding, fractional CPU usage is accomplished in cgroups today through CFS, which can be used with Nomad right now. There's a different discussion to be had about the UX of using MHz (e.g.: cpu = 500) and whether you could specify cpu = 1. My current opinion is that if you need that, write/use a wrapper (ala Levant) that can consume a template and do the math for the user (e.g.: cpu = [[multiply .cpu 3700]] for a 3.7GHz processor).

  • This spec does not cover NUMA, and without dragging this into a huge discussion about NUMA-awareness I agree it is very important. That said, something shipped and usable is better than something that's never finished. I think it's a perfect "v2" feature.

llchan commented Jul 22, 2020

  • Yeah, I would expect fractional cpu allocations to be share-based, but iiuc the current cpu sharing is host-wide and not cpuset-bound. For fractional cpu-cores we could create an exclusive cgroup cpuset for the allocated core(s), and allow multiple children with the appropriate sharing weights to be placed inside that cgroup. This way they are bound to a core and have more predictable neighbors, but can still be "multi-tenant" on that core (a rough sketch of this layout follows after this list).
  • Very much agreed that we can wait to implement the NUMA stuff, was just mentioning it as something to keep in the back of our minds as we plan out config specs.
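
A rough cgroup v1 sketch of that layout (everything here is hypothetical: the core3/task_a/task_b cgroup names, the PID_A/PID_B variables, and the choice of core 3; in cgroup v1 the cpuset and cpu controllers live in separate hierarchies, so the pinning and the weighting are set up separately for the same tasks):

# pin both tenants to the same core (core 3) in the cpuset hierarchy
$ mkdir -p /sys/fs/cgroup/cpuset/nomad/core3
$ echo "3" > /sys/fs/cgroup/cpuset/nomad/core3/cpuset.cpus
$ echo "0" > /sys/fs/cgroup/cpuset/nomad/core3/cpuset.mems
$ echo "$PID_A" > /sys/fs/cgroup/cpuset/nomad/core3/cgroup.procs
$ echo "$PID_B" > /sys/fs/cgroup/cpuset/nomad/core3/cgroup.procs

# give each tenant half of that core via relative weights in the cpu hierarchy
$ mkdir -p /sys/fs/cgroup/cpu/nomad/task_a /sys/fs/cgroup/cpu/nomad/task_b
$ echo "512" > /sys/fs/cgroup/cpu/nomad/task_a/cpu.shares
$ echo "512" > /sys/fs/cgroup/cpu/nomad/task_b/cpu.shares
$ echo "$PID_A" > /sys/fs/cgroup/cpu/nomad/task_a/cgroup.procs
$ echo "$PID_B" > /sys/fs/cgroup/cpu/nomad/task_b/cgroup.procs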

robloxrob commented

This would be helpful for running game workloads. It can be an issue when the process switches among cores or across NUMA boundaries.

james-masson commented Nov 6, 2020

How do you see this interacting with the usual tunings in this space, e.g. isolcpus and systemd's CPUAffinity?

Generally if you set things like these, you're expecting to allocate your processes in certain CPU ranges - having Nomad choose CPUs outside these ranges is counterproductive.

I think there are three possible approaches:

  1. allowing the user to select particular CPUs through Nomad - rather than just "allocate me 2 cpus please"
  2. allowing the user to select particular pre-created cgroups
  3. having Nomad understand isolcpus and CPUAffinity

shishir-a412ed (Contributor, Author) commented

@james-masson isolcpus is just another way to achieve CPU exclusivity: you set a kernel parameter to exclude a set of CPUs from the kernel task scheduler (CFS). Those CPUs can then be dedicated to running latency-sensitive workloads, since they won't be subject to preemption by CFS.
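
For context, isolcpus is set on the kernel command line at boot; a sketch of what that looks like (the CPU list 12-15 is just an example):

# appended to the existing kernel command line, e.g. in /etc/default/grub
GRUB_CMDLINE_LINUX="isolcpus=12-15"
# regenerate the bootloader config and reboot for the change to take effect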

We are solving this exact same problem using the cgroups cpuset resource controller. I guess it's just a different approach to the same problem.

Regarding systemd's CPUAffinity: IIUC, that controls the CPU affinity of processes started by systemd. That should be orthogonal to what we are doing here, since we are not using systemd to launch our workloads; rather, the workloads are launched by Nomad (the orchestration system) using a task driver, e.g. docker.

  1. I don't think the user should care about which CPUs ("0-2" or "3-5") they get assigned, as long as the workload has dedicated access to three (3) CPUs. There are use cases, e.g. NUMA-aware applications, where you would want to pin the CPUs, but that is not the problem I am trying to solve in this proposal.

If you are interested in (1), I have an open PR #8291 for CPU pinning using the docker driver.

  2. I don't think the user should ever care about the underlying cgroups. That is a really low-level construct to expose to the end user of an orchestration system.

  3. This proposal is for achieving CPU exclusivity using the cgroups cpuset subsystem. I think if you are interested in making Nomad isolcpus- or systemd-CPUAffinity-aware, a separate proposal would be better.

Having said that, this is proposed as an optional parameter, so if you (hypothetically) have some way to isolate your CPUs using isolcpus, you can choose not to use this.

james-masson commented

> @james-masson isolcpus is just another way to achieve CPU exclusivity where you set a kernel parameter to exclude a set of CPU from the kernel task scheduler (CFS). Those CPUs can then be dedicated for running latency-sensitive workloads as they won't be subjected to preemption by CFS.

I think it also has an effect on the kernel thread scheduling, not just user-space. You tend to use it when you want control above and beyond userspace. Commonly used with manual IRQ pinning too. It's a go-to tuning for minimising jitter when you really don't want a context switch.

> Regarding systemd's CPUAffinity IIUC, that is to control the CPU Affinity of a systemd process. That should be orthogonal to what we are doing here, since we are not using systemd to launch our workloads, rather the workloads are launched by nomad (the orchestration system) using a task driver e.g. docker.

Yes - systemd's CPUAffinity is all about pulling the rest of the OS - including Nomad itself - away from the cores you want to use for your high-performance/low-jitter workloads.

The combined effect of isolcpus and systemd's CPUAffinity should be to leave a large set of cores running nothing - not even kernel threads - ready for your sensitive workloads.

My point is - my customers in this space generally already have systems with isolcpus and systemd CPUAffinity ( and optimal IRQ affinity, nohz_full and more) - large multi-socket systems tuned to the hilt for performance.

While I've used Nomad before for this sort of workload, it's always involved a custom layer to manage the CPU allocations.
I was hoping that this feature would make this custom layer unnecessary - at its simplest, it could be a Nomad agent config that says: use cores 6-11 and 18-23 for the CPU manager feature.

shishir-a412ed (Contributor, Author) commented

@james-masson Apart from NUMA awareness, what is it that this proposal doesn't address for you?
You can still use the cgroups cpuset resource controller for exclusive access to CPUs and run your workloads on dedicated CPUs with no context switches.

What this proposal doesn't guarantee is which CPU you will get allocated, which is similar to Kubernetes CPU manager as it also doesn't offer this guarantee: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/#limitations

After reading your comments, it looks like your customers have high-performance/low-jitter workloads that need NUMA-aware CPU placement, e.g. running the workload on a CPU close to the bus connecting a high-performance NIC so that it can avoid cross-socket traffic.

I'm not saying NUMA is not important, but we are intentionally keeping it out of this proposal to make the initial pass easier to implement and more in line with the k8s CPU manager.

Also, there are some internal discussions going on within Hashicorp (I am not fully aware, but maybe someone from Hashicorp can chime in) on how they want to roll out this feature. They might already have NUMA on their roadmap.

PS:

> I was hoping that this feature would make this custom layer unnecessary - at its simplest, it could be a Nomad agent config that says - use cores 6-11 and 18-23 for the CPU manager feature.

This is already covered in this proposal: under "How should Nomad do it?" ---> Nomad client.

tgross (Member) commented May 7, 2021

Shipped in Nomad 1.1.0-beta

tgross closed this as completed May 7, 2021

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022