Skip to content
This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Use Rabit tracker get_host_ip('auto') to pick best tracker IP address #40

Merged
merged 1 commit into from
Jun 9, 2020

Conversation

javabrett
Copy link
Contributor

Discussion

Best to also review the notes in #23.

Currently when starting XGBoost (which has its own cluster/tracker/worker network), dask-xgboost is feeding the hostname of the Client scheduler - e.g. dask-scheduler. The IP/adapter for this hostname is not always available in the container that is actually running the scheduler. This is true in cases where there is a service reverse-proxy, such as when deploying in k8s using the current stable/dask Helm chart, when dask-scheduler and its address point to service/dask-scheduler not pod/dask-scheduler....

The simplest approach to fix is to just allow the Rabit tracker code to choose the local adapter/IP to bind the tracker to (in the container/host running scheduler), which is then advertised to XGBoost Rabit workers via env.

Downsides:

  • In k8s case, the XGBoost Rabit tracker is accessed directly between Dask scheduler/worker nodes, i.e. not via the longer-running service/dask-scheduler. Probably not a big concern given the Rabit network should be short-lived, and restartable on any new scheduler/worker pods.
  • Bad host networking configuration could conceivably choose a bad adapter for the Rabit tracker to bind to - but it should choose an adapter which matches the Client scheduler hostname anyway.

Changes

  • start_tracker now accepts host=None and in that case calls Rabit code get_host_ip('auto'), which attempts to find the best local adapter address
  • client._run_on_scheduler(start_tracker passes host=None to trigger this logic
  • Added some logging

Testing

To perform a manual test of the bug/fix, you will need:

  • Docker
  • A k8s cluster - probably any will do, certainly either Docker Desktop for Mac 2.0.0.3 with local k8s, or AWS EKS will work. minicube will also probably work but not tested.
  • Helm client
  • Tiller installed, helm --init.

During testing I found EXTRA_PIP_PACKAGES a two-edged sword - convenient, but the pip installs are long-running and repetitive on each node, and the service doesn't detect when they complete, and the Helm chart doesn't have readiness probes, so the service looks dead until this completes on the Jupyter node. I preferred to build and tag a couple of pairs of local images with dask-xgboost and deps pre-installed - daskdev/dask-notebook and daskdev/dask for pre/post-fix versions of dask-xgboost. To do that, create the Dockerfile below and build/tag four images:

Dockerfile:

ARG BASE_IMAGE
FROM ${BASE_IMAGE}
ARG DASK_XGBOOST_VERSION=master
RUN pip install -U pip && \
    pip install dask-ml git+https://github.com/javabrett/dask-xgboost@${DASK_XGBOOST_VERSION} --upgrade

Run:

  • docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" -t daskdev/dask:xgboost .
  • docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask:xgboost-fixed .
  • docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" -t daskdev/dask-notebook:xgboost-fixed .
  • docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask-notebook:xgboost-fixed .

You can now deploy the Helm chart to test pre/post fix:

Pre-fix:

export DASK_TAG=xgboost
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Once the cluster is up, go to http://localhost, start a new notebook and run:

from distributed import Client
import dask
from dask_xgboost import XGBClassifier
from dask_ml.datasets import make_classification


client = Client()
X, y = make_classification(chunks=20)
X, y = dask.persist(X, y)
XGBClassifier().fit(X, y)

This will fail with:

/opt/conda/lib/python3.7/site-packages/dask_xgboost/tracker.py in __init__()
    166         for port in range(port, port_end):
    167             try:
--> 168                 logging.error('sock.bind %s:%d', hostIP, port)
    169                 sock.bind((hostIP, port))
    170                 self.port = port

OSError: [Errno 99] Cannot assign requested address

Post-fix:

export DASK_TAG=xgboost-fixed
helm upgrade --install dask stable/dask --set scheduler.image.tag=${DASK_TAG} --set worker.image.tag=${DASK_TAG} --set jupyter.image.tag=${DASK_TAG} --recreate-pods

Allow all pods to restart, then repeat the notebook test above, which will now pass and return a classifier result.

Fixed #23.

@mrocklin
Copy link
Member

cc @RAMitchell you may want to be aware of issues / feature requests like this

@javabrett note that there is a move to push some of this code into XGBoost itself. Your engagement comes at a good time to influence that effort.

@javabrett javabrett force-pushed the 23-rabit-tracker-bind-address branch 2 times, most recently from e229058 to bca8fe8 Compare December 14, 2019 23:23
Fixed dask#23.

Signed-off-by: Brett Randall <javabrett@gmail.com>
@gforsyth
Copy link

gforsyth commented Jun 9, 2020

@TomAugspurger -- any objections to merging this in? I've tested it on a few deployments and it fixes network connectivity issues for the rabit network.
I've opened an issue on dmlc/xgboost referenced above to raise this with them and they seem to be aware of it, with possibly a fix slated for next release, but I think this would be a good stopgap.

@TomAugspurger
Copy link
Member

Sure, thanks for checking. Sorry I missed this originally!

@TomAugspurger TomAugspurger merged commit 9fd6362 into dask:master Jun 9, 2020
@jameslamb jameslamb mentioned this pull request Jul 29, 2020
hcho3 pushed a commit to dmlc/xgboost that referenced this pull request Dec 8, 2020
This adopts the solution used in dask/dask-xgboost#40 which employs the get_host_ip from dmlc-core tracker.
hcho3 pushed a commit to dmlc/xgboost that referenced this pull request Dec 8, 2020
This adopts the solution used in dask/dask-xgboost#40 which employs the get_host_ip from dmlc-core tracker.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot assign requested address
4 participants