Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

javabrett · 2019-05-17T00:55:41Z

Discussion

Best to also review the notes in #23.

Currently when starting XGBoost (which has its own cluster/tracker/worker network), dask-xgboost is feeding the hostname of the Client scheduler - e.g. dask-scheduler. The IP/adapter for this hostname is not always available in the container that is actually running the scheduler. This is true in cases where there is a service reverse-proxy, such as when deploying in k8s using the current stable/dask Helm chart, when dask-scheduler and its address point to service/dask-scheduler not pod/dask-scheduler....

The simplest approach to fix is to just allow the Rabit tracker code to choose the local adapter/IP to bind the tracker to (in the container/host running scheduler), which is then advertised to XGBoost Rabit workers via env.

Downsides:

In k8s case, the XGBoost Rabit tracker is accessed directly between Dask scheduler/worker nodes, i.e. not via the longer-running service/dask-scheduler. Probably not a big concern given the Rabit network should be short-lived, and restartable on any new scheduler/worker pods.
Bad host networking configuration could conceivably choose a bad adapter for the Rabit tracker to bind to - but it should choose an adapter which matches the Client scheduler hostname anyway.

Changes

start_tracker now accepts host=None and in that case calls Rabit code get_host_ip('auto'), which attempts to find the best local adapter address
client._run_on_scheduler(start_tracker passes host=None to trigger this logic
Added some logging

Testing

To perform a manual test of the bug/fix, you will need:

Docker
A k8s cluster - probably any will do, certainly either Docker Desktop for Mac 2.0.0.3 with local k8s, or AWS EKS will work. minicube will also probably work but not tested.
Helm client
Tiller installed, helm --init.

During testing I found EXTRA_PIP_PACKAGES a two-edged sword - convenient, but the pip installs are long-running and repetitive on each node, and the service doesn't detect when they complete, and the Helm chart doesn't have readiness probes, so the service looks dead until this completes on the Jupyter node. I preferred to build and tag a couple of pairs of local images with dask-xgboost and deps pre-installed - daskdev/dask-notebook and daskdev/dask for pre/post-fix versions of dask-xgboost. To do that, create the Dockerfile below and build/tag four images:

Dockerfile:

ARG BASE_IMAGE
FROM ${BASE_IMAGE}
ARG DASK_XGBOOST_VERSION=master
RUN pip install -U pip && \
    pip install dask-ml git+https://github.com/javabrett/dask-xgboost@${DASK_XGBOOST_VERSION} --upgrade

Run:

docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" -t daskdev/dask:xgboost .
docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask:xgboost-fixed .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" -t daskdev/dask-notebook:xgboost-fixed .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask-notebook:xgboost-fixed .

You can now deploy the Helm chart to test pre/post fix:

Pre-fix:

export DASK_TAG=xgboost
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Once the cluster is up, go to http://localhost, start a new notebook and run:

from distributed import Client
import dask
from dask_xgboost import XGBClassifier
from dask_ml.datasets import make_classification


client = Client()
X, y = make_classification(chunks=20)
X, y = dask.persist(X, y)
XGBClassifier().fit(X, y)

This will fail with:

/opt/conda/lib/python3.7/site-packages/dask_xgboost/tracker.py in __init__()
    166         for port in range(port, port_end):
    167             try:
--> 168                 logging.error('sock.bind %s:%d', hostIP, port)
    169                 sock.bind((hostIP, port))
    170                 self.port = port

OSError: [Errno 99] Cannot assign requested address

Post-fix:

export DASK_TAG=xgboost-fixed
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Allow all pods to restart, then repeat the notebook test above, which will now pass and return a classifier result.

Fixed #23.

mrocklin · 2019-05-17T01:00:05Z

cc @RAMitchell you may want to be aware of issues / feature requests like this

@javabrett note that there is a move to push some of this code into XGBoost itself. Your engagement comes at a good time to influence that effort.

Fixed dask#23. Signed-off-by: Brett Randall <javabrett@gmail.com>

gforsyth · 2020-06-09T20:48:44Z

@TomAugspurger -- any objections to merging this in? I've tested it on a few deployments and it fixes network connectivity issues for the rabit network.
I've opened an issue on dmlc/xgboost referenced above to raise this with them and they seem to be aware of it, with possibly a fix slated for next release, but I think this would be a good stopgap.

TomAugspurger · 2020-06-09T20:52:46Z

Sure, thanks for checking. Sorry I missed this originally!

This adopts the solution used in dask/dask-xgboost#40 which employs the get_host_ip from dmlc-core tracker.

mrocklin mentioned this pull request May 17, 2019

Run the central Rabit process on a worker #41

Open

javabrett force-pushed the 23-rabit-tracker-bind-address branch 2 times, most recently from e229058 to bca8fe8 Compare December 14, 2019 23:23

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address.

d7703a5

Fixed dask#23. Signed-off-by: Brett Randall <javabrett@gmail.com>

javabrett force-pushed the 23-rabit-tracker-bind-address branch from bca8fe8 to d7703a5 Compare December 14, 2019 23:40

gforsyth mentioned this pull request Jun 8, 2020

Host IP resolution problem w/ dask on kubernetes or dask-gateway dmlc/xgboost#5765

Closed

TomAugspurger merged commit 9fd6362 into dask:master Jun 9, 2020

jameslamb mentioned this pull request Jul 29, 2020

new release? #75

Closed

trivialfis mentioned this pull request Dec 7, 2020

Fix dask ip resolution. dmlc/xgboost#6475

Merged

1 task

hcho3 pushed a commit to dmlc/xgboost that referenced this pull request Dec 8, 2020

Fix dask ip resolution. (#6475)

0ffaf0f

This adopts the solution used in dask/dask-xgboost#40 which employs the get_host_ip from dmlc-core tracker.

hcho3 pushed a commit to dmlc/xgboost that referenced this pull request Dec 8, 2020

Fix dask ip resolution. (#6475)

1bf3899

This adopts the solution used in dask/dask-xgboost#40 which employs the get_host_ip from dmlc-core tracker.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

javabrett commented May 17, 2019

mrocklin commented May 17, 2019

gforsyth commented Jun 9, 2020

TomAugspurger commented Jun 9, 2020

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

Conversation

javabrett commented May 17, 2019

Discussion

Changes

Testing

Pre-fix:

Post-fix:

mrocklin commented May 17, 2019

gforsyth commented Jun 9, 2020

TomAugspurger commented Jun 9, 2020