Add support for connecting a CUDAWorker to a cluster object #428
Conversation
dask_cuda/local_cuda_cluster.py (outdated diff)

```diff
@@ -163,7 +163,7 @@ def __init__(
         if n_workers is None:
             n_workers = len(CUDA_VISIBLE_DEVICES)
         self.host_memory_limit = parse_memory_limit(
-            memory_limit, threads_per_worker, n_workers
+            memory_limit, threads_per_worker, n_workers or 1
```
This is just to avoid a divide-by-zero error when creating `LocalCUDACluster` with `n_workers=0` in my test, which is probably a rare situation.
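For context, here is a minimal sketch of the failure being guarded against; `toy_parse_memory_limit` is a hypothetical, simplified stand-in for distributed's `parse_memory_limit`, not its real implementation:

```python
# Hypothetical, simplified stand-in for distributed's parse_memory_limit,
# only to show why dividing by n_workers breaks when n_workers == 0.
def toy_parse_memory_limit(memory_limit, threads_per_worker, n_workers):
    # The real helper splits the host memory budget across workers;
    # with n_workers == 0 this division raises ZeroDivisionError.
    return memory_limit / n_workers


total_memory = 64e9  # hypothetical 64 GB of host memory
n_workers = 0

# toy_parse_memory_limit(total_memory, 1, n_workers)             # ZeroDivisionError
print(toy_parse_memory_limit(total_memory, 1, n_workers or 1))   # falls back to 1 worker
```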
Makes sense to me. I still don't understand why anyone would start up a `LocalCUDACluster` with 0 workers though; is there more to it besides your test case? If you're only creating the cluster for the scheduler, wouldn't it make more sense to test creating a scheduler and then a worker, instead of a `LocalCUDACluster` with 0 workers?
Yeah I don't see this happening except for my test case.
Sorry @jacobtomlinson, probably my fault this has stalled, but after taking another look at it, maybe we should change to `n_workers=1` in the test and then `await client.wait_for_workers(2)`, rather than leaving room for users to rely on unsupported behavior. As a bonus, we could check and raise an error when `n_workers < 1`. What do you think?
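For illustration, a rough sketch of the test shape being proposed, assuming the `CUDAWorker(cluster)` support added in this PR; the missing test decorator, timeouts, and exact async startup/teardown of `CUDAWorker` are guesses rather than the merged test:

```python
# Rough sketch only: CUDAWorker startup/teardown details are assumptions.
from dask.distributed import Client
from dask_cuda import CUDAWorker, LocalCUDACluster


async def test_worker_from_cluster_object():
    # Start a cluster with one worker instead of zero ...
    async with LocalCUDACluster(n_workers=1, asynchronous=True) as cluster:
        # ... then attach an extra CUDA worker to the cluster object itself.
        worker = await CUDAWorker(cluster)
        try:
            async with Client(cluster, asynchronous=True) as client:
                # One worker from the cluster plus the extra CUDAWorker.
                await client.wait_for_workers(2)
        finally:
            await worker.close()
```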
Thanks @pentschev, that sounds reasonable.
Updated to reflect this.
I wonder if we should also detect some of the arguments from the cluster object and pass them on to the worker. Perhaps @pentschev has thoughts?
I'm not sure about this either, and to be fair, I also don't see anybody instantiating
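For what it's worth, a purely illustrative sketch of what "detecting arguments from the cluster object" could look like; the attributes read below (`scheduler_address`, `security`) come from distributed's `Cluster` API, but the returned keyword names, and whether dask-cuda should do this at all, are open questions rather than an agreed design:

```python
# Illustrative only: the kwarg names returned here are hypothetical and do not
# necessarily match CUDAWorker's actual signature.
def worker_kwargs_from_cluster(cluster):
    kwargs = {}
    # distributed Cluster objects expose the scheduler's address ...
    if getattr(cluster, "scheduler_address", None):
        kwargs["scheduler"] = cluster.scheduler_address
    # ... and typically a Security object carrying TLS settings.
    if getattr(cluster, "security", None) is not None:
        kwargs["security"] = cluster.security
    return kwargs
```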
Thanks @jacobtomlinson, I posted a couple of comments; one of them will probably resolve the failing test.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Codecov Report

```diff
@@             Coverage Diff              @@
##           branch-0.18     #428      +/- ##
=============================================
+ Coverage        57.02%   61.77%   +4.74%
=============================================
  Files               19       20       +1
  Lines             1473     1847     +374
=============================================
+ Hits               840     1141     +301
- Misses             633      706      +73
```

Continue to review the full report at Codecov.
Thanks @jacobtomlinson, does this need to be in 0.17 or can we retarget it to 0.18?
No rush here. Happy to retarget.
Thanks for confirming, I retargeted it and all looks good to me, thanks Jacob!
Thanks Jacob for working on this and Peter for reviewing! 😄
While working on remote clusters like in Dask Cloudprovider or Dask Kubernetes it became apparent that you may end up wasting the local GPU. For instance, if I launch a GPU cluster on AWS from a GPU session in SageMaker, the SageMaker GPU is not part of the cluster.

```python
from dask_cloudprovider import EC2Cluster
from dask.distributed import Client

cluster = EC2Cluster(**gpu_kwargs)
client = Client(cluster)  # Cluster with remote GPUs
```

This PR makes it possible to include the local GPU in the cluster in the same way you would connect a client.

```python
from dask_cloudprovider import EC2Cluster
from dask.distributed import Client
from dask_cuda import CUDAWorker

cluster = EC2Cluster(**gpu_kwargs)
local_worker = CUDAWorker(cluster)
client = Client(cluster)  # Cluster with remote GPUs and local GPU
```

Authors:
- Jacob Tomlinson <jtomlinson@nvidia.com>
- Jacob Tomlinson <jacobtomlinson@users.noreply.github.com>

Approvers:
- Peter Andreas Entschev

URL: rapidsai#428