
Add support for connecting a CUDAWorker to a cluster object #428

Merged
merged 5 commits into rapidsai:branch-0.18 from jacobtomlinson:cluster-worker on Dec 8, 2020

Conversation

jacobtomlinson (Member)

While working with remote clusters, such as those from Dask Cloudprovider or Dask Kubernetes, it became apparent that you may end up wasting the local GPU.

For instance, if I launch a GPU cluster on AWS from a GPU session in SageMaker, the SageMaker GPU is not part of the cluster.

```python
from dask_cloudprovider import EC2Cluster
from dask.distributed import Client

cluster = EC2Cluster(**gpu_kwargs)
client = Client(cluster)  # Cluster with remote GPUs
```

This PR makes it possible to include the local GPU in the cluster in the same way you would connect a client.

```python
from dask_cloudprovider import EC2Cluster
from dask.distributed import Client
from dask_cuda import CUDAWorker

cluster = EC2Cluster(**gpu_kwargs)
local_worker = CUDAWorker(cluster)
client = Client(cluster)  # Cluster with remote GPUs and local GPU
```

@jacobtomlinson jacobtomlinson requested a review from a team as a code owner October 27, 2020 15:52
```diff
@@ -163,7 +163,7 @@ def __init__(
         if n_workers is None:
             n_workers = len(CUDA_VISIBLE_DEVICES)
         self.host_memory_limit = parse_memory_limit(
-            memory_limit, threads_per_worker, n_workers
+            memory_limit, threads_per_worker, n_workers or 1
```
@jacobtomlinson (Member Author), Oct 27, 2020

This is just to avoid a divide-by-zero error when creating LocalCUDACluster with n_workers=0 in my test, which is probably a rare situation.
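
For context, a minimal, purely illustrative sketch of the failure mode being guarded against; the helper below is a stand-in and is not distributed's actual parse_memory_limit, which has a different signature and more logic.

```python
# Illustrative only: a simplified stand-in for deriving a per-worker
# memory share. With n_workers=0 the division raises ZeroDivisionError,
# which is exactly what the `n_workers or 1` fallback avoids.
def split_memory_limit(total_memory: int, n_workers: int) -> int:
    return total_memory // (n_workers or 1)


print(split_memory_limit(16 * 2**30, 0))  # falls back to a single-worker share
print(split_memory_limit(16 * 2**30, 2))  # normal case: half of total memory
```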

@pentschev (Member)

Makes sense to me, though I still don't understand why anyone would start up a LocalCUDACluster with 0 workers. Is there more to it besides your test case? If you're only creating the cluster for the scheduler, wouldn't it make more sense to test creating a scheduler and then a worker, instead of a LocalCUDACluster with 0 workers?
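
For reference, a minimal sketch of the scheduler-plus-worker approach mentioned above. It assumes CUDAWorker accepts a plain scheduler address and can be awaited to start it; a real test would likely use dask-cuda's existing test utilities instead.

```python
# Illustrative sketch only: start a bare scheduler and attach a CUDAWorker
# to its address, rather than spinning up LocalCUDACluster with 0 workers.
import asyncio

from distributed import Scheduler
from dask_cuda import CUDAWorker


async def main():
    async with Scheduler(port=0) as scheduler:
        # Address-based connection (the pre-existing CUDAWorker behavior).
        worker = await CUDAWorker(scheduler.address)
        try:
            ...  # assertions against the scheduler/worker would go here
        finally:
            await worker.close()


asyncio.run(main())
```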

@jacobtomlinson (Member Author)

Yeah, I don't see this happening except for my test case.

@pentschev (Member)

Sorry @jacobtomlinson, probably my fault this has stalled, but after taking another look at it, maybe we should change the test to n_workers=1 and then await client.wait_for_workers(2), rather than leaving room for users to rely on an unsupported behavior. As a bonus, we could check for and raise an error when n_workers < 1. What do you think?
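
For illustration, a minimal sketch of the test shape being suggested; the test name, decorator, and asynchronous handling are assumptions, and the actual test in dask-cuda may be structured differently.

```python
# Hypothetical test sketch: one worker from LocalCUDACluster plus one
# extra CUDAWorker connected to the cluster object, then wait for both.
import pytest

from dask.distributed import Client
from dask_cuda import CUDAWorker, LocalCUDACluster


@pytest.mark.asyncio
async def test_cuda_worker_joins_cluster_object():
    async with LocalCUDACluster(n_workers=1, asynchronous=True) as cluster:
        worker = await CUDAWorker(cluster)  # connect to the cluster object itself
        try:
            async with Client(cluster, asynchronous=True) as client:
                # Both the cluster's own worker and the extra CUDAWorker
                # should register with the scheduler.
                await client.wait_for_workers(2)
        finally:
            await worker.close()
```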

@jacobtomlinson (Member Author)

Thanks @pentschev, that sounds reasonable.

Updated to reflect this.

@jacobtomlinson (Member Author)

I wonder if we should also detect some of the arguments from the cluster object and pass them on to the worker. Perhaps @pentschev has thoughts?

@pentschev (Member)

> I wonder if we should also detect some of the arguments from the cluster object and pass them on to the worker. Perhaps @pentschev has thoughts?

I'm not sure about this either. To be fair, I also don't see anybody instantiating CUDAWorker directly, so I'm not sure whether something would break, although the specialized use cases are often set up through constructor arguments, such as CPU affinity, RMM, UCX configs, and DeviceHostFile for device spilling.
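
As an aside, a hedged sketch of passing such specialized arguments explicitly when connecting a CUDAWorker to a cluster object; the keyword names below mirror the dask-cuda-worker CLI options and are illustrative rather than a confirmed CUDAWorker signature.

```python
from dask_cloudprovider import EC2Cluster
from dask_cuda import CUDAWorker

gpu_kwargs = {}  # placeholder: the GPU instance configuration from the PR description
cluster = EC2Cluster(**gpu_kwargs)

# Hypothetical: forwarding specialized options explicitly rather than
# detecting them from the cluster object. The keyword names follow the
# dask-cuda-worker CLI flags and may not match CUDAWorker exactly.
local_worker = CUDAWorker(
    cluster,
    rmm_pool_size="1GB",         # pre-allocate an RMM memory pool on the GPU
    device_memory_limit="10GB",  # spill device memory beyond this threshold
)
```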

@pentschev (Member) left a review comment

Thanks @jacobtomlinson, I posted a couple of comments; one of them will probably resolve the failing test.

jacobtomlinson and others added 2 commits December 8, 2020 16:05
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
@jacobtomlinson jacobtomlinson added the 3 - Ready for Review and proposal labels Dec 8, 2020
@pentschev pentschev added the non-breaking and improvement labels Dec 8, 2020
@codecov-io commented Dec 8, 2020

Codecov Report

Merging #428 (2ba8a94) into branch-0.18 (fee2eee) will increase coverage by 4.74%.
The diff coverage is 100.00%.


```diff
@@               Coverage Diff               @@
##           branch-0.18     #428      +/-   ##
===============================================
+ Coverage        57.02%   61.77%   +4.74%     
===============================================
  Files               19       20       +1     
  Lines             1473     1847     +374     
===============================================
+ Hits               840     1141     +301     
- Misses             633      706      +73     
```
| Impacted Files | Coverage Δ |
| --- | --- |
| dask_cuda/local_cuda_cluster.py | 80.48% <ø> (-2.47%) ⬇️ |
| dask_cuda/cuda_worker.py | 77.01% <100.00%> (+4.72%) ⬆️ |
| dask_cuda/benchmarks/utils.py | 0.00% <0.00%> (ø) |
| dask_cuda/benchmarks/local_cudf_merge.py | 0.00% <0.00%> (ø) |
| dask_cuda/benchmarks/local_cupy_transpose_sum.py | 0.00% <0.00%> (ø) |
| dask_cuda/proxy_object.py | 87.11% <0.00%> (ø) |
| dask_cuda/cli/dask_cuda_worker.py | 96.82% <0.00%> (+0.05%) ⬆️ |
| dask_cuda/device_host_file.py | 98.86% <0.00%> (+0.21%) ⬆️ |
| dask_cuda/utils.py | 90.65% <0.00%> (+1.69%) ⬆️ |

... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pentschev (Member)

Thanks @jacobtomlinson, does this need to be in 0.17, or can we retarget it to 0.18?

@jacobtomlinson (Member Author)

No rush here. Happy to retarget.

@pentschev pentschev changed the base branch from branch-0.17 to branch-0.18 December 8, 2020 20:51
@pentschev (Member) left a review comment

Thanks for confirming, I retargeted it and all looks good to me, thanks Jacob!

@pentschev pentschev added the 6 - Okay to Auto-Merge label and removed the 3 - Ready for Review label Dec 8, 2020
@rapids-bot rapids-bot bot merged commit 40f3b40 into rapidsai:branch-0.18 Dec 8, 2020
@jakirkham (Member)

Thanks Jacob for working on this and Peter for reviewing! 😄

@jacobtomlinson jacobtomlinson deleted the cluster-worker branch December 9, 2020 11:40
madsbk pushed a commit to madsbk/dask-cuda that referenced this pull request Dec 10, 2020


Authors:
  - Jacob Tomlinson <jtomlinson@nvidia.com>
  - Jacob Tomlinson <jacobtomlinson@users.noreply.github.com>

Approvers:
  - Peter Andreas Entschev

URL: rapidsai#428