-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify UCX configs, permitting UCX_TLS=all #792
Simplify UCX configs, permitting UCX_TLS=all #792
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #792 +/- ##
===============================================
Coverage ? 89.30%
===============================================
Files ? 16
Lines ? 2057
Branches ? 0
===============================================
Hits ? 1837
Misses ? 220
Partials ? 0 Continue to review full report at Codecov.
|
rerun tests |
Since this is a somewhat large change, I'm targeting 22.02 to avoid any major breakage at this time. If anyone feels differently and think this is a great feature to have in 21.12 and we shouldn't wait, please let me know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @pentschev
This is necessary to fill the missing create_cuda_context configuration option recently added to Distributed. In rapidsai#792 it will be used to simplify UCX configuration.
This is necessary to fill the missing create_cuda_context configuration option recently added in Distributed. In rapidsai#792 it will be used to simplify UCX configuration.
This is necessary to fill the missing create_cuda_context configuration option recently added in Distributed. In #792 it will be used to simplify UCX configuration. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) URL: #801
rerun tests |
Thanks @madsbk for the review here! |
@gpucibot merge |
Thanks Peter! 😄 |
Up until now, we require users to specify what transports should be used by UCX, pushing the configuration burden onto the user, being also error-prone. We can now reduce this configuration burden with just one configuration being added in dask/distributed#5526:
DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT
/distributed.comm.ucx.create_cuda_context
, which creates the CUDA context before UCX is initialized.This is an example of how to setup a cluster with
dask-cuda-worker
after this change:Similarly, one can setup:
LocalCUDACluster(protocol="ucx", interface="ib0")
.Note above how
ib0
is intentionally specified. That is mandatory to use RDMACM, as it is necessary to have listeners bind to an InfiniBand interface, but can be left unspecified when using systems without InfiniBand or if RDMACM isn't required (discouraged on systems that have InfiniBand connectivity). TheUCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda
option is specified for optimal InfiniBand performance with CUDA and will be default in UCX 1.12, when specifying it won't be necessary anymore.Changes introduced here are backwards-compatible, meaning the old options such as
--enable-nvlink
/enable_nvlink=True
are still valid. However, if any of those options is specified, the user is responsible to enable/disable all desired transports, which can also be useful for benchmarking specific transports.Finally, creating a CUDA context may not be necessary by UCX in the future, at a point where it will be possible to remove
DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True
from scheduler/client processes entirely.