Deprecation warning for ucx_net_devices='auto' on UCX 1.11+ #681

11 changes: 11 additions & 0 deletions dask_cuda/cuda_worker.py
@@ -25,6 +25,7 @@
 from .utils import (
     CPUAffinity,
     RMMSetup,
+    _ucx_111,
     cuda_visible_devices,
     get_cpu_affinity,
     get_n_gpus,
@@ -164,6 +165,16 @@ def del_pid_file():
                 "RMM managed memory and NVLink are currently incompatible."
             )
 
+        if _ucx_111 and net_devices == "auto":
+            warnings.warn(
+                "Starting with UCX 1.11, `ucx_net_devices='auto'` is deprecated; "
+                "it should now be left unspecified for the same behavior. "
+                "Please make sure to read the updated UCX Configuration section in "
+                "https://docs.rapids.ai/api/dask-cuda/nightly/ucx.html, "
+                "where significant performance considerations for InfiniBand with "
+                "UCX 1.11 and above are documented.",
+            )
+
         # Ensure this parent dask-cuda-worker process uses the same UCX
         # configuration as child worker processes created by it.
         initialize(
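The check added to `cuda_worker.py` is a version-gated deprecation warning. A minimal, self-contained sketch of that pattern follows. The function name `warn_if_auto_deprecated` and its `ucx_version` tuple argument are hypothetical stand-ins for the `_ucx_111` flag imported from `.utils`, and the message is abbreviated.

```python
import warnings


def warn_if_auto_deprecated(net_devices, ucx_version):
    """Hypothetical helper mirroring the check in this hunk.

    `ucx_version` is a (major, minor, patch) tuple; in the PR the same
    information is carried by the `_ucx_111` flag.
    """
    # Tuples compare element by element, so (1, 9, 0) < (1, 11, 0).
    ucx_111 = ucx_version >= (1, 11, 0)
    if ucx_111 and net_devices == "auto":
        # No category is passed, so this is emitted as a UserWarning.
        warnings.warn(
            "Starting with UCX 1.11, `ucx_net_devices='auto'` is deprecated; "
            "leave it unspecified for the same behavior."
        )
        return True
    return False
```

The return value exists only to make the sketch easy to check; the code in the hunk just emits the warning and continues.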
25 changes: 19 additions & 6 deletions dask_cuda/local_cuda_cluster.py
@@ -13,6 +13,7 @@
 from .utils import (
     CPUAffinity,
     RMMSetup,
+    _ucx_111,
     cuda_visible_devices,
     get_cpu_affinity,
     get_ucx_config,
@@ -286,13 +287,25 @@ def __init__(
             raise TypeError("Enabling InfiniBand or NVLink requires protocol='ucx'")
 
         if ucx_net_devices == "auto":
-            try:
-                from ucp._libs.topological_distance import TopologicalDistance  # NOQA
-            except ImportError:
-                raise ValueError(
-                    "ucx_net_devices set to 'auto' but UCX-Py is not "
-                    "installed or it's compiled without hwloc support"
+            if _ucx_111:
+                warnings.warn(
+                    "Starting with UCX 1.11, `ucx_net_devices='auto'` is deprecated; "
+                    "it should now be left unspecified for the same behavior. "
+                    "Please make sure to read the updated UCX Configuration section in "
+                    "https://docs.rapids.ai/api/dask-cuda/nightly/ucx.html, "
+                    "where significant performance considerations for InfiniBand with "
+                    "UCX 1.11 and above are documented.",
                 )
+            else:
+                try:
+                    from ucp._libs.topological_distance import (  # NOQA
+                        TopologicalDistance,
+                    )
+                except ImportError:
+                    raise ValueError(
+                        "ucx_net_devices set to 'auto' but UCX-Py is not "
+                        "installed or it's compiled without hwloc support"
+                    )
         elif ucx_net_devices == "":
             raise ValueError("ucx_net_devices can not be an empty string")
         self.ucx_net_devices = ucx_net_devices
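The branching in the `local_cuda_cluster.py` hunk can be restated without UCX-Py installed. The sketch below is illustrative only: `validate_ucx_net_devices`, `ucx_111`, and `hwloc_available` are hypothetical names, with `hwloc_available` standing in for whether the `TopologicalDistance` import succeeds.

```python
import warnings


def validate_ucx_net_devices(ucx_net_devices, ucx_111, hwloc_available):
    """Hypothetical, dependency-free restatement of the branch in this hunk.

    ucx_111:         stand-in for the `_ucx_111` flag (True on UCX 1.11+).
    hwloc_available: stand-in for whether importing `TopologicalDistance`
                     from UCX-Py succeeds.
    """
    if ucx_net_devices == "auto":
        if ucx_111:
            # UCX 1.11+: 'auto' is accepted but deprecated, so only warn.
            warnings.warn(
                "`ucx_net_devices='auto'` is deprecated on UCX 1.11+; "
                "leave it unspecified instead."
            )
        elif not hwloc_available:
            # Older UCX: 'auto' relies on UCX-Py's topology support.
            raise ValueError(
                "ucx_net_devices set to 'auto' but UCX-Py is not "
                "installed or it's compiled without hwloc support"
            )
    elif ucx_net_devices == "":
        raise ValueError("ucx_net_devices can not be an empty string")
    return ucx_net_devices
```

Note the ordering: on UCX 1.11 and above the import check is skipped entirely, so a missing hwloc build no longer raises for `"auto"`.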
3 changes: 3 additions & 0 deletions dask_cuda/utils.py
@@ -192,6 +192,9 @@ def get_device_total_memory(index=0):
 def get_ucx_net_devices(
     cuda_device_index, ucx_net_devices, get_openfabrics=True, get_network=False
 ):
+    if ucp.get_ucx_version() >= (1, 11, 0) and ucx_net_devices == "auto":
+        return None
+
     if cuda_device_index is None and (
         callable(ucx_net_devices) or ucx_net_devices == "auto"
     ):
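The early return added to `get_ucx_net_devices` can be illustrated with a small sketch. Everything below the first `if` is an assumption: the hunk is cut off before the body of the second check, so the error message and the handling of callables and plain strings are guesses, and `resolve_net_devices` with its `ucx_version` argument is a hypothetical replacement for the `ucp.get_ucx_version()` call.

```python
def resolve_net_devices(cuda_device_index, ucx_net_devices, ucx_version):
    """Hypothetical sketch of the control flow at the top of get_ucx_net_devices.

    `ucx_version` replaces the `ucp.get_ucx_version()` call so that the
    sketch runs without UCX-Py installed.
    """
    # New early return: on UCX 1.11+ nothing is set for 'auto'.
    if ucx_version >= (1, 11, 0) and ucx_net_devices == "auto":
        return None

    # Guard shown in the hunk; the error raised here is an assumption.
    if cuda_device_index is None and (
        callable(ucx_net_devices) or ucx_net_devices == "auto"
    ):
        raise ValueError("a CUDA device index is required")

    # Assumed remainder: callables are invoked per device.
    if callable(ucx_net_devices):
        return ucx_net_devices(int(cuda_device_index))
    if ucx_net_devices == "auto":
        # The topology-based lookup for older UCX is not reproduced here.
        raise NotImplementedError("topology lookup omitted from this sketch")
    # Assumed remainder: explicit interface names pass through unchanged.
    return ucx_net_devices
```

Because the early return comes first, `"auto"` on UCX 1.11 and above no longer requires a CUDA device index.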
6 changes: 4 additions & 2 deletions docs/source/ucx.rst
@@ -53,14 +53,16 @@ However, some will affect related libraries, such as RMM:
 - ``ucx.net-devices: <str>`` -- **recommended for UCX 1.9 and older.**
 
   Explicitly sets ``UCX_NET_DEVICES`` instead of defaulting to ``"all"``, which can result in suboptimal performance.
-  If using InfiniBand, set to ``"auto"`` to automatically detect the InfiniBand interface closest to each GPU.
+  If using InfiniBand, set to ``"auto"`` to automatically detect the InfiniBand interface closest to each GPU on UCX 1.9 and below.
   If InfiniBand is disabled, set to a UCX-compatible ethernet interface, e.g. ``"enp1s0f0"`` on a DGX-1.
   All available UCX-compatible interfaces can be listed by running ``ucx_info -d``.
 
-  UCX 1.11 and above is capable of identifying closest interfaces without setting ``"auto"``, it is recommended not to set ``ucx.net-devices``, but some recommendations for optimal performance apply, see the documentation on ``ucx.infiniband`` above for details.
+  UCX 1.11 and above can identify the closest interfaces without setting ``"auto"`` (**deprecated for UCX 1.11 and above**), so it is recommended not to set ``ucx.net-devices`` in most cases. However, some recommendations for optimal performance apply; see the documentation on ``ucx.infiniband`` above for details.
 
   .. warning::
       Setting ``ucx.net-devices: "auto"`` assumes that all InfiniBand interfaces on the system are connected and properly configured; undefined behavior may occur otherwise.
+      ``ucx.net-devices: "auto"`` **is deprecated for UCX 1.11 and above.**
+
 
 
 - ``rmm.pool-size: <str|int>`` -- **recommended.**
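The ``ucx.net-devices`` guidance in the docs change above can be summarized as a small decision helper. This is a sketch under stated assumptions: `ucx_cluster_kwargs` is a hypothetical function, the `enable_infiniband` keyword is assumed to match the installed dask-cuda version, and the version gate mirrors the `(1, 11, 0)` check used in the code hunks rather than the "1.9 and below" wording of the docs.

```python
def ucx_cluster_kwargs(ucx_version, enable_infiniband=True, interface=None):
    """Hypothetical helper: build keyword arguments following the docs text.

    interface: optional explicit UCX-compatible interface name,
               e.g. "enp1s0f0" when InfiniBand is disabled.
    """
    kwargs = {"protocol": "ucx", "enable_infiniband": enable_infiniband}
    if interface is not None:
        # An explicit interface always wins.
        kwargs["ucx_net_devices"] = interface
    elif enable_infiniband and ucx_version < (1, 11, 0):
        # Older UCX: ask for the InfiniBand interface closest to each GPU.
        kwargs["ucx_net_devices"] = "auto"
    # UCX 1.11+: leave ucx_net_devices unset; 'auto' is deprecated there.
    return kwargs
```

The resulting dictionary could then be passed on as `LocalCUDACluster(**kwargs)`, assuming those keyword names are accepted by the dask-cuda release in use.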