
Add support for connecting a CUDAWorker to a cluster object #428

Merged
merged 5 commits into rapidsai:branch-0.18 from jacobtomlinson:cluster-worker on Dec 8, 2020

Conversation

jacobtomlinson (Member)

While working with remote clusters, such as those from Dask Cloudprovider or Dask Kubernetes, it became apparent that you may end up wasting the local GPU.

For instance, if I launch a GPU cluster on AWS from a GPU session in SageMaker, the SageMaker GPU is not part of the cluster.

```python
from dask_cloudprovider import EC2Cluster
from dask.distributed import Client

cluster = EC2Cluster(**gpu_kwargs)
client = Client(cluster)  # Cluster with remote GPUs
```

This PR makes it possible to include the local GPU in the cluster in the same way you would connect a client.

```python
from dask_cloudprovider import EC2Cluster
from dask.distributed import Client
from dask_cuda import CUDAWorker

cluster = EC2Cluster(**gpu_kwargs)
local_worker = CUDAWorker(cluster)
client = Client(cluster)  # Cluster with remote GPUs and local GPU
```

@jacobtomlinson jacobtomlinson requested a review from a team as a code owner October 27, 2020 15:52
```diff
@@ -163,7 +163,7 @@ def __init__(
         if n_workers is None:
             n_workers = len(CUDA_VISIBLE_DEVICES)
         self.host_memory_limit = parse_memory_limit(
-            memory_limit, threads_per_worker, n_workers
+            memory_limit, threads_per_worker, n_workers or 1
```
@jacobtomlinson (Member Author), Oct 27, 2020

This is just to avoid a divide-by-zero error when creating LocalCUDACluster with n_workers=0 in my test, which is probably a rare situation.
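
For context, a minimal, purely illustrative sketch of the failure mode being guarded against; the helper below is a stand-in and is not distributed's actual parse_memory_limit, which has a different signature and more logic.

```python
# Illustrative only: a simplified stand-in for deriving a per-worker
# memory share. With n_workers=0 the division raises ZeroDivisionError,
# which is exactly what the `n_workers or 1` fallback avoids.
def split_memory_limit(total_memory: int, n_workers: int) -> int:
    return total_memory // (n_workers or 1)


print(split_memory_limit(16 * 2**30, 0))  # falls back to a single-worker share
print(split_memory_limit(16 * 2**30, 2))  # normal case: half of total memory
```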

@pentschev (Member)

Makes sense to me, though I still don't understand why anyone would start up a LocalCUDACluster with 0 workers. Is there more to it besides your test case? If you're only creating the cluster for the scheduler, wouldn't it make more sense to test creating a scheduler and then a worker, instead of a LocalCUDACluster with 0 workers?
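
For reference, a minimal sketch of the scheduler-plus-worker approach mentioned above. It assumes CUDAWorker accepts a plain scheduler address and can be awaited to start it; a real test would likely use dask-cuda's existing test utilities instead.

```python
# Illustrative sketch only: start a bare scheduler and attach a CUDAWorker
# to its address, rather than spinning up LocalCUDACluster with 0 workers.
import asyncio

from distributed import Scheduler
from dask_cuda import CUDAWorker


async def main():
    async with Scheduler(port=0) as scheduler:
        # Address-based connection (the pre-existing CUDAWorker behavior).
        worker = await CUDAWorker(scheduler.address)
        try:
            ...  # assertions against the scheduler/worker would go here
        finally:
            await worker.close()


asyncio.run(main())
```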

@jacobtomlinson (Member Author)

Yeah, I don't see this happening except for my test case.

@pentschev (Member)

Sorry @jacobtomlinson, probably my fault this has stalled, but after taking another look at it, maybe we should change the test to n_workers=1 and then await client.wait_for_workers(2), rather than leaving room for users to rely on an unsupported behavior. As a bonus, we could check for and raise an error when n_workers < 1. What do you think?
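
For illustration, a minimal sketch of the test shape being suggested; the test name, decorator, and asynchronous handling are assumptions, and the actual test in dask-cuda may be structured differently.

```python
# Hypothetical test sketch: one worker from LocalCUDACluster plus one
# extra CUDAWorker connected to the cluster object, then wait for both.
import pytest

from dask.distributed import Client
from dask_cuda import CUDAWorker, LocalCUDACluster


@pytest.mark.asyncio
async def test_cuda_worker_joins_cluster_object():
    async with LocalCUDACluster(n_workers=1, asynchronous=True) as cluster:
        worker = await CUDAWorker(cluster)  # connect to the cluster object itself
        try:
            async with Client(cluster, asynchronous=True) as client:
                # Both the cluster's own worker and the extra CUDAWorker
                # should register with the scheduler.
                await client.wait_for_workers(2)
        finally:
            await worker.close()
```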

@jacobtomlinson (Member Author)

Thanks @pentschev, that sounds reasonable.

Updated to reflect this.

@jacobtomlinson (Member Author)

I wonder if we should also detect some of the arguments from the cluster object and pass them on to the worker. Perhaps @pentschev has thoughts?

@pentschev (Member)

> I wonder if we should also detect some of the arguments from the cluster object and pass them on to the worker. Perhaps @pentschev has thoughts?

I'm not sure about this either. To be fair, I also don't see anybody instantiating CUDAWorker directly, so I'm not sure whether something would break, although the specialized use cases are often set up through constructor arguments, such as CPU affinity, RMM, UCX configs, and DeviceHostFile for device spilling.
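
As an aside, a hedged sketch of passing such specialized arguments explicitly when connecting a CUDAWorker to a cluster object; the keyword names below mirror the dask-cuda-worker CLI options and are illustrative rather than a confirmed CUDAWorker signature.

```python
from dask_cloudprovider import EC2Cluster
from dask_cuda import CUDAWorker

gpu_kwargs = {}  # placeholder: the GPU instance configuration from the PR description
cluster = EC2Cluster(**gpu_kwargs)

# Hypothetical: forwarding specialized options explicitly rather than
# detecting them from the cluster object. The keyword names follow the
# dask-cuda-worker CLI flags and may not match CUDAWorker exactly.
local_worker = CUDAWorker(
    cluster,
    rmm_pool_size="1GB",         # pre-allocate an RMM memory pool on the GPU
    device_memory_limit="10GB",  # spill device memory beyond this threshold
)
```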

@pentschev (Member) left a review comment

Thanks @jacobtomlinson, I posted a couple of comments; one of them will probably resolve the failing test.

jacobtomlinson and others added 2 commits December 8, 2020 16:05
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
@jacobtomlinson jacobtomlinson added the 3 - Ready for Review and proposal labels Dec 8, 2020
@pentschev pentschev added the non-breaking and improvement labels Dec 8, 2020
@codecov-io commented Dec 8, 2020

Codecov Report

Merging #428 (2ba8a94) into branch-0.18 (fee2eee) will increase coverage by 4.74%.
The diff coverage is 100.00%.


```diff
@@               Coverage Diff               @@
##           branch-0.18     #428      +/-   ##
===============================================
+ Coverage        57.02%   61.77%   +4.74%     
===============================================
  Files               19       20       +1     
  Lines             1473     1847     +374     
===============================================
+ Hits               840     1141     +301     
- Misses             633      706      +73     
```
| Impacted Files | Coverage Δ |
| --- | --- |
| dask_cuda/local_cuda_cluster.py | 80.48% <ø> (-2.47%) ⬇️ |
| dask_cuda/cuda_worker.py | 77.01% <100.00%> (+4.72%) ⬆️ |
| dask_cuda/benchmarks/utils.py | 0.00% <0.00%> (ø) |
| dask_cuda/benchmarks/local_cudf_merge.py | 0.00% <0.00%> (ø) |
| dask_cuda/benchmarks/local_cupy_transpose_sum.py | 0.00% <0.00%> (ø) |
| dask_cuda/proxy_object.py | 87.11% <0.00%> (ø) |
| dask_cuda/cli/dask_cuda_worker.py | 96.82% <0.00%> (+0.05%) ⬆️ |
| dask_cuda/device_host_file.py | 98.86% <0.00%> (+0.21%) ⬆️ |
| dask_cuda/utils.py | 90.65% <0.00%> (+1.69%) ⬆️ |

... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pentschev (Member)

Thanks @jacobtomlinson, does this need to be in 0.17, or can we retarget it to 0.18?

@jacobtomlinson (Member Author)

No rush here. Happy to retarget.

@pentschev pentschev changed the base branch from branch-0.17 to branch-0.18 December 8, 2020 20:51
@pentschev (Member) left a review comment

Thanks for confirming, I retargeted it and all looks good to me, thanks Jacob!

@pentschev pentschev added the 6 - Okay to Auto-Merge label and removed the 3 - Ready for Review label Dec 8, 2020
@rapids-bot rapids-bot bot merged commit 40f3b40 into rapidsai:branch-0.18 Dec 8, 2020
@jakirkham (Member)

Thanks Jacob for working on this and Peter for reviewing! 😄

@jacobtomlinson jacobtomlinson deleted the cluster-worker branch December 9, 2020 11:40
madsbk pushed a commit to madsbk/dask-cuda that referenced this pull request Dec 10, 2020


Authors:
  - Jacob Tomlinson <jtomlinson@nvidia.com>
  - Jacob Tomlinson <jacobtomlinson@users.noreply.github.com>

Approvers:
  - Peter Andreas Entschev

URL: rapidsai#428