Support for swarm mode in Docker 1.12 #141

alantrrs · 2016-07-18T19:48:18Z

I'm trying to use nvidia-docker with the swarm functionality introduced in the new Docker 1.12.

I tried to create a service (docker service create ..) with nvidia-docker (nvidia-docker service create ..) and it didn't work. I haven't seen any way to pass devices to docker service create, so I'm wondering if it's even supported on Docker's side.

Any thoughts?

flx42 · 2016-07-18T21:13:14Z

This question was asked on the docker GitHub a few hours ago:
moby/moby#24750

Currently, nvidia-docker doesn't support Docker Swarm, so service create is simply passed through to the docker CLI.
You are right that there doesn't seem to be a way to pass devices to service create. So far we can only mount the volume:

docker service create --mount type=volume,source=nvidia_driver_367.35,target=/usr/local/nvidia,volume-driver=nvidia-docker [...]

But that's not enough: we can't get around the device cgroup.

Even if we could, in a cluster environment with Swarm there will also be a problem if different machines have a different number of GPUs.

@3XX0 thoughts?

Josca · 2016-12-06T15:46:28Z

+1 to add nvidia-docker support for docker swarm.

davad · 2017-01-29T04:07:32Z

@flx42 @3XX0 any movement on this? I looked over the related issues and don't see anything recent. I'm itching to orchestrate CUDA jobs via docker across multiple machines 😄

el3ment · 2017-03-31T18:43:29Z

Any progress on this?

anaderi · 2017-03-31T21:49:08Z

would be cool, no?

3XX0 · 2017-04-04T20:32:17Z

This is basically what we need for basic GPU support:

moby/swarmkit#2090

mjp0 · 2017-05-27T03:13:55Z

@3XX0 It got merged today! I've been managing nvidia-docker containers manually via docker-compose so having swarmkit and all the v3 deploy things work would be absolutely fantastic.

Are there any potential merge conflicts that have to be dealt with before this can be merged into nvidia-docker?

cheyang · 2017-05-30T03:20:54Z

@0fork, can you share any docs about how you played with it, so we can also try this cool feature? Thanks.

3XX0 · 2017-05-30T19:46:10Z

Yes, this is a big step forward toward getting GPUs working within Swarm. However, we're not quite there yet: we still need to add support in Docker itself, and we are still missing some pieces which should come with nvidia-docker 2.0.

Stay tuned ;)

luiscborbon · 2017-05-31T01:36:11Z

+1

erbas · 2017-06-09T18:46:12Z

@3XX0 What's the timeframe for this to come together?

thommiano · 2017-06-14T23:06:14Z

My team is working on a machine with several GPUs, and we're using Docker to containerize all of our projects. I'm trying to figure out the best way to schedule GPU jobs that are running in Docker containers so that users don't accidentally interfere with existing jobs or have to sit around until one of the other team members frees up a GPU. Would swarm functionality solve this problem? Our current approach is to use NV_GPU=n in our nvidia-docker run command to isolate a GPU to that container, as referenced here, and I'm hoping that we can do away with this with job scheduling.

omerh · 2017-06-19T18:12:41Z

This is a great feature and a must for docker swarm.
I am going to solve it with a pre-baked AMI and an autoscaling group,
but only because that fits my use case.
Waiting for updates on both the moby project and nvidia-docker.

fvillarr · 2017-06-25T09:49:08Z

@0fork, can I also get any information about how you played with it?
Thanks.

mjp0 · 2017-06-25T10:03:18Z

@fvillarr & @cheyang I'm sorry, I don't understand what you want to know :) We've been using nvidia-docker via nvidia-docker-compose, not this swarmkit feature we're all anxiously waiting for. Word of caution: using nvidia-docker at scale is a PITA right now. You have to manage each server separately because nvidia-docker-compose needs to generate specific mount points for the NVIDIA drivers to work via compose. There's nothing available to automate this, and I don't think we can scale this much further with the current setup of scripts and manual effort. I don't have any docs because it's all in the nvidia-docker-compose repo; we just took it to scale.

88plug · 2017-08-26T05:43:14Z

+1

3XX0 · 2017-11-14T06:23:39Z

Closing, most of the issues remaining are on the Docker side. You can track our progress here:
moby/moby#33439

nikoargo · 2018-01-09T19:43:43Z

Any update on this now that all the PRs in moby/moby#33439 have been merged? They allow placing services according to generic resources, but I'm not sure how to actually mount the GPU inside the service's container.

3XX0 · 2018-01-10T00:23:46Z

@nikoargo with 17.12.0-ce you can configure the docker daemon to expose your GPUs to swarm:

Create an override for the dockerd configuration, changing your default runtime and adding GPU resources. You can generate the resource flags like this:
nvidia-smi -a | grep UUID | awk '{print "--node-generic-resource gpu="substr($4,0,12)}' | paste -d' ' -s

sudo systemctl edit docker

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia <resource output from the above>

Uncomment the swarm-resource line in /etc/nvidia-container-runtime/config.toml
Restart the docker daemon, create your swarm and create a new service requesting GPUs:
docker service create -t --generic-resource "gpu=1" ubuntu bash

Note: there is currently a bug (moby/moby#35970). The flag should normally be --node-generic-resources; this will be fixed in the future.
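
The flag-generation step above can be tried without a GPU by feeding canned nvidia-smi -a output into the same pipeline. This is only a sketch under stated assumptions: gpu_resource_flags is a hypothetical helper name, the sample UUIDs are made up, and the "GPU UUID : ..." line layout is assumed to match what nvidia-smi -a prints. As far as I can tell, awk implementations do not all treat a substr start index of 0 the same way, so the UUID prefix may come out 11 or 12 characters long; check the generated flags before pasting them into the override.

```shell
#!/bin/sh
# Turn `nvidia-smi -a` output (read from stdin) into dockerd flags.
# gpu_resource_flags is a hypothetical helper name, not part of any tool.
gpu_resource_flags() {
    # Same pipeline as in the comment above, minus the nvidia-smi call.
    # The trailing "-" makes paste read stdin on non-GNU systems too.
    grep UUID \
        | awk '{print "--node-generic-resource gpu="substr($4,0,12)}' \
        | paste -d' ' -s -
}

# On a GPU node you would run: nvidia-smi -a | gpu_resource_flags
# Canned input with made-up UUIDs, so the sketch runs without a GPU:
sample='    GPU UUID                        : GPU-11111111-aaaa-bbbb-cccc-000000000001
    GPU UUID                        : GPU-22222222-aaaa-bbbb-cccc-000000000002'
printf '%s\n' "$sample" | gpu_resource_flags
```

The output is a single line with one --node-generic-resource flag per GPU, ready to append to the ExecStart line.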

nikoargo · 2018-01-10T01:58:27Z

This is incredible. Thank you so much!

romilbhardwaj · 2018-02-19T10:26:10Z

@3XX0 This is fantastic, thanks a lot!

An observation: this also seems to enforce exclusive allocation of the GPUs at the orchestration layer. For example, if I have a machine with two physical GPUs, I cannot create more than two services (each of which requests one GPU). Adding a third service results in a no suitable node (insufficient resources) message and docker swarm waits for a running service to end before scheduling the new one.

Is there any way to overcome this and allow sharing of GPUs across services while maintaining isolation? For instance, adding a third service in the above example should create a service and have it share a GPU with one of the existing services.

This can be achieved by using node labels (put a GPU count label on the node, and deploy any service that requires fewer GPUs than the count on that node), but this approach is unaware of the service's resource requirements and does not enforce isolation: all GPUs on the machine will be visible to a service that may require only one GPU.
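
A rough sketch of the label-based workaround described above, under some assumptions: label_placement_cmds, the label name gpu, and the node, service, and image names are all hypothetical. The helper only prints the commands rather than running them. To my knowledge placement constraints only compare with == and !=, so the sketch uses a fixed label value instead of a real GPU count.

```shell
#!/bin/sh
# Print (not run) the commands for label-based placement.
# label_placement_cmds, the label name "gpu" and the arguments are
# hypothetical and only serve the illustration.
label_placement_cmds() {
    node="$1"
    service="$2"
    image="$3"
    # Mark the node as having GPUs.
    echo "docker node update --label-add gpu=true $node"
    # Pin the service to labelled nodes. Nothing is reserved here, so any
    # number of services can land on the node and each sees every GPU.
    echo "docker service create --name $service --constraint node.labels.gpu==true $image"
}

label_placement_cmds worker-1 train ubuntu
```

Because the scheduler does no GPU accounting in this setup, it shows both sides of the trade-off: services can share a node freely, but there is no isolation between them.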

3XX0 · 2018-02-20T09:00:59Z

Unfortunately we do not support sharing GPUs. We have the same limitation with Kubernetes and we're looking into relaxing this constraint.
Having said that, the hardware doesn't support true multitenancy, so doing this can be quite costly. We usually recommend writing your application with this in mind instead, and implementing your own scheduling/batching that takes full advantage of the whole GPU.

CharlesJQuarra · 2018-04-21T11:16:29Z

Must one change the default runtime for a given node in order to use the GPU for swarm services? Can the gpu generic resource be added to a swarm node while leaving runc as the default runtime?

hholst80 · 2019-03-04T10:51:00Z

The runtime has to be specified at the dockerd level, as shown above, since services do not support the runtime directive (yet).
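
For reference, the same daemon-level settings can also be expressed in daemon.json instead of ExecStart flags. A minimal sketch, assuming the runtime binary is installed at /usr/bin/nvidia-container-runtime; daemon_json is a hypothetical helper and the GPU IDs are made up. It only prints the file contents, and it still makes nvidia the default runtime. Setting the same option both as a flag and in daemon.json should prevent dockerd from starting, so use one mechanism or the other.

```shell
#!/bin/sh
# Print a daemon.json that sets the default runtime and advertises GPUs.
# daemon_json is a hypothetical helper; its arguments are GPU resource
# values such as the UUID prefixes generated earlier in the thread.
daemon_json() {
    resources=""
    for id in "$@"; do
        # Build a comma-separated JSON string list: "gpu=<id>", ...
        resources="${resources:+$resources, }\"gpu=$id\""
    done
    cat <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "node-generic-resources": [$resources]
}
EOF
}

# Made-up IDs for illustration; review the output before installing it
# as /etc/docker/daemon.json and restarting the daemon.
daemon_json GPU-1111111 GPU-2222222
```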

flx42 added upstream issue unsupported labels Jul 18, 2016

flx42 mentioned this issue Jul 18, 2016

NVIDIA GPU support moby/moby#23917

Closed

anaderi mentioned this issue Mar 31, 2017

support for running containers on nodes with GPU everware/everware#194

Open

3 tasks

flx42 mentioned this issue Aug 9, 2017

Does Nvidia-docker support "nvidia-docker service create"? #447

Closed

3XX0 closed this as completed Nov 14, 2017

mwilliammyers mentioned this issue Jan 28, 2018

Nvidia support? docker/docker-py#1877

Closed

galp mentioned this issue Mar 22, 2018

Add node-generic-resource support in docker-compose docker/compose#5814

Closed

RenaudWasTaken removed the unsupported label May 7, 2019

david-gwa mentioned this issue Nov 22, 2019

run vulkan app in docker without display #1132

Closed
