
Add cpuset_cpus to docker driver. #8291

Merged: 3 commits into hashicorp:master, Nov 11, 2020

Conversation

shishir-a412ed
Contributor

Fixes #2303

We have an internal HPC customer who could also benefit from the ability to pin a docker container to specific CPUs.

It would be ideal to have:

  1. --cpuset-cpus, which allows pinning a docker container to specific CPUs.
  2. If the user tries to pin the same CPU (e.g. CPU 0) to another docker container, it should either error out (if on the same node) OR
    schedule the container on CPU 0 of a different node.

Currently (2) is not supported by docker. This PR only addresses (1).
As a follow-up, we can extend the nomad docker driver to do some bookkeeping and achieve (2).

@shishir-a412ed
Contributor Author

Website changes

cpuset_cpus

@shishir-a412ed
Contributor Author

ping @tgross @notnoop

tgross
tgross previously requested changes Jun 26, 2020
Member

@tgross tgross left a comment


👍 I've left a couple of requests for docs and logging tweaks, but other than that this LGTM. Thanks so much for this @shishir-a412ed!

I've tested this out and can confirm the cpu set configuration is getting set for Docker containers:

$ docker inspect 753 | jq '.[0].HostConfig.CpusetCpus'
"0"

website/pages/docs/drivers/docker.mdx (review comments, resolved)
drivers/docker/driver.go (review comments, resolved)
@notnoop
Contributor

notnoop commented Jun 26, 2020

@shishir-a412ed Thank you so much for the contributions. I wonder how useful this is without exclusivity support from the client or scheduler.

One concern is interference. Let's say a node is running two allocations: allocation A with cpuset set to 0-1, while B is unrestricted. B can use all CPUs while A only uses two, so A is artificially constrained, which seems not so ideal. Does the benefit of NUMA locality in this case override the interference from other jobs, in your experience?

The second concern is usability without exclusivity support in the scheduler. Assume you have two jobs that you want to configure with cpusets: how would you ensure they don't end up using the same cpusets on a host? I assume operators will need to statically partition jobs onto nodes to avoid conflicts. Is that something you are considering in your HPC setup?

I'd be in support of adding cpuset support; it's very useful indeed, especially with NUMA-aware apps. We plan to add support for specifying the number of CPUs instead of MHz "soon"; when we do, it'll be easier to add cpusets with exclusivity.

One possible alternative is to have the docker driver manage cpusets on the client, i.e. the operator specifies cpus=2 (number of cores) and the docker driver allocates cpusets exclusively for the task (e.g. 1-2) and passes them to docker. This would address the second concern, but not the first. Would that work better for your use case?
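A rough Go sketch of that driver-managed idea, assuming a hypothetical per-client pool (none of these names are from Nomad): reserve n free cores and emit the cpuset-cpus string Docker expects.

```go
package main

import (
	"fmt"
	"strconv"
)

// pool tracks which CPUs on this client are already pinned.
type pool struct {
	pinned map[int]bool // true means the CPU is taken
	total  int          // number of CPUs on the machine
}

// allocate reserves n free CPUs and returns them as a Docker
// cpuset-cpus string such as "1,2".
func (p *pool) allocate(n int) (string, error) {
	var free []int
	for c := 0; c < p.total && len(free) < n; c++ {
		if !p.pinned[c] {
			free = append(free, c)
		}
	}
	if len(free) < n {
		return "", fmt.Errorf("only %d of %d requested CPUs free", len(free), n)
	}
	s := ""
	for i, c := range free {
		if i > 0 {
			s += ","
		}
		s += strconv.Itoa(c)
		p.pinned[c] = true // mark the CPU as taken
	}
	return s, nil
}

func main() {
	p := &pool{pinned: map[int]bool{0: true}, total: 4}
	cpuset, err := p.allocate(2)
	fmt.Println(cpuset, err) // 1,2 <nil>
}
```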

@shishir-a412ed
Contributor Author

shishir-a412ed commented Jun 26, 2020

@notnoop Yeah, interference is a legit concern. That's why I presented it as a two-step solution.

  1. We expose the docker --cpuset-cpus option to nomad.

After merging (1), the feature is not completely useful on its own: as you mentioned, if allocation A is running with cpuset 0-1 while allocation B is unrestricted, there will be interference.

  2. We build additional logic on top of (1) to achieve exclusivity. I can think of two ways to do that.

    a) Hacky solution (not so ideal): Have a local state (e.g. an in-memory map in the docker driver on each nomad client)
    which tracks which CPUs are still available.

         map[string]bool{
                 "0": false,
                 "1": false,
                 "2": true,
                 "3": true,
         }

Here {0,1} are not available since they are already pinned.

So if, for example, allocation B lands on the same node and asks for CPUs 0-1, the docker driver will see that they are already allocated and error out the allocation. My understanding is that when that happens, the scheduler will detect the failed allocation and try to place it on another node. This repeats until the allocation is either placed on a node that fits, or reaches its maximum number of failed attempts (at which point the job should fail).
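The bookkeeping in (a) could be sketched in Go like this (illustrative names only), with the map value meaning "still available" and a mutex since multiple allocations may land concurrently:

```go
package main

import (
	"fmt"
	"sync"
)

// cpuTracker is the in-memory bookkeeping sketched above:
// a map value of true means the CPU is still available.
type cpuTracker struct {
	mu        sync.Mutex
	available map[string]bool
}

// reserve fails the whole request if any requested CPU is taken,
// mirroring "error out the allocation" above.
func (t *cpuTracker) reserve(cpus []string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range cpus {
		if !t.available[c] {
			return fmt.Errorf("cpu %s already pinned", c)
		}
	}
	for _, c := range cpus {
		t.available[c] = false // mark as pinned
	}
	return nil
}

func main() {
	t := &cpuTracker{available: map[string]bool{
		"0": false, "1": false, "2": true, "3": true,
	}}
	fmt.Println(t.reserve([]string{"2", "3"})) // <nil>
	fmt.Println(t.reserve([]string{"2"}))      // cpu 2 already pinned
}
```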

b) Ideal solution: Have a global state at the scheduler level, e.g. map[string][]string (key = node ID, value = CPUs already pinned, such as {0,1,2,4}).

With this global state, the scheduler will launch the job only on a node that has enough CPUs available to be pinned.
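A toy Go illustration of that scheduler-side placement (all names hypothetical, not Nomad's scheduler API):

```go
package main

import "fmt"

// pickNode sketches the scheduler-side idea: given per-node pinned
// CPU lists and each node's core count, find a node with enough
// free CPUs for the request.
func pickNode(pinned map[string][]string, cores map[string]int, want int) (string, bool) {
	for node, used := range pinned {
		if cores[node]-len(used) >= want {
			return node, true
		}
	}
	return "", false // no node can satisfy the request
}

func main() {
	pinned := map[string][]string{
		"node1": {"0", "1", "2", "3"}, // fully pinned
		"node2": {"0"},                // three CPUs free
	}
	cores := map[string]int{"node1": 4, "node2": 4}
	node, ok := pickNode(pinned, cores, 2)
	fmt.Println(node, ok) // node2 true
}
```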

@chuckyz
Contributor

chuckyz commented Jun 26, 2020

To clarify what I want: I've discussed this with @shishir-a412ed internally a bit. From my perspective, there are a few layers of value here.

  • Expose cpuset-cpus. This allows total DIY use cases, e.g. anyone building tooling on top of Nomad whose tooling may be NUMA-aware itself. For example, given a 4-socket server with 16 cores per CPU: 0-15 for CPU1, 16-31 for CPU2, 32-47 for CPU3, and 48-63 for CPU4.

That would allow 4 large containers per node, which could then be enforced through node pinning. This could also perhaps provide exclusive access per "datacenter" (like having "dc-hpc1," "dc-hpc2," "dc-hpcn"). We run ~20+ "datacenters" today without any trouble, so I think this should fit quite a few specific large use cases.

This does not prevent this case: #2303 (comment)

  • Expose the concept of "cpus." I'm imagining this is implemented as a 5th resource, so we'd have mhz, memory, disk, bandwidth, and cpus. The user experience I'm expecting here is that cpus is synonymous with the number of threads (don't care about ht/non-ht cores; enforce that to be controlled by the operator), with each core taking the appropriate slice out of mhz.

From a user perspective this would be something like "I want 2 CPUs," and that could be 0,1 or 2,3 or 15,35. This would not be NUMA-aware from my perspective, at least not starting as NUMA-aware.

This covers this case: #2303 (comment)

This covers the first bullet point here: #2303 (comment)
and not the second one.

Reading back through #2303, though, I think the second point (even ignoring NUMA) gives enough of a community-level benefit that it'd be great.

Also, this PR specifically allows users to write some tooling on top of Nomad to cover the very specific use-cases of NUMA-neighbors + NUMA-exclusivity.

@shishir-a412ed
Contributor Author

@notnoop @tgross Any updates on this?

@shishir-a412ed
Contributor Author

shishir-a412ed commented Jul 20, 2020

@notnoop @tgross We discussed this a bit more internally. We felt that while this patch is useful for someone who wants CPU pinning in docker via the --cpuset-cpus flag (with the caveat that they don't get exclusive access to those CPUs, which can either be documented or later enhanced with some bookkeeping to make it exclusive), CPU isolation is a bigger problem and needs to be solved at the orchestration layer.

The ideas are very similar to the Kubernetes CPU Manager.

So I have opened another issue, #8473, which has an initial spec on how I would approach this problem.
I would suggest taking CPU-manager-related discussions there and keeping this PR scoped to the docker driver --cpuset-cpus flag for pinning CPUs.

Let me know what you all think.

@shishir-a412ed
Contributor Author

@notnoop @tgross Any updates on this PR?

@notnoop
Contributor

notnoop commented Aug 20, 2020

Hi! Just wanted to reach out and apologize for the slow response. This is on my plate and I intend to follow up shortly.

@shishir-a412ed
Contributor Author

@notnoop Thank you for the update!

@Asara

Asara commented Sep 8, 2020

Very eager for this PR. Thanks for the awesome work @shishir-a412ed

@shishir-a412ed
Contributor Author

@notnoop Any updates on this one?

Contributor

@notnoop notnoop left a comment


Sorry for taking so long. Thank you so much for your patience here. This looks good to me, considering the small objective.

I'm inclined to add a beta marker to the config, just in case we need to modify the semantics when we introduce global CPU tracking and have to make a backward-incompatible change.

Thank you again!

@notnoop notnoop dismissed tgross’s stale review November 11, 2020 22:13

The requested changes were addressed.

@notnoop notnoop merged commit de5a21f into hashicorp:master Nov 11, 2020
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2022
Development

Successfully merging this pull request may close these issues.

Nomad should pin tasks to CPUs underneath the hood