# Use Rabit tracker `get_host_ip('auto')` to pick best tracker IP address #40
## Discussion
Best to also review the notes in #23.
Currently when starting XGBoost (which has its own cluster/tracker/worker network), `dask-xgboost` feeds it the hostname of the `Client` scheduler, e.g. `dask-scheduler`. The IP/adapter for this hostname is not always available in the container that is actually running the scheduler. This is true in cases where there is a service reverse-proxy, such as when deploying in k8s using the current `stable/dask` Helm chart, where `dask-scheduler` and its address point to `service/dask-scheduler`, not `pod/dask-scheduler...`.

The simplest approach to fix this is to just allow the Rabit tracker code to choose the local adapter/IP to bind the tracker to (in the container/host running the scheduler), which is then advertised to the XGBoost Rabit workers via `env`; a sketch of this address detection follows the list below.

Downsides:

- The tracker is no longer bound to the stable `service/dask-scheduler` address. Probably not a big concern given the Rabit network should be short-lived, and restartable on any new scheduler/worker pods.
- Workers are no longer pointed at the `Client` scheduler hostname, but that hostname was the source of the problem anyway.
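For context, here is a minimal sketch of the kind of address detection `get_host_ip('auto')` performs (an illustration of the technique, not the vendored implementation):

```python
import socket

def pick_local_ip():
    """Find a local adapter IP that routes externally."""
    ip = socket.gethostbyname(socket.getfqdn())
    if ip.startswith('127.'):
        # The hostname resolves to loopback, so consult the routing table
        # instead: "connect" a UDP socket to a public address (no packet is
        # actually sent) and read back the local address the OS would use.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(('8.8.8.8', 53))
        ip = s.getsockname()[0]
        s.close()
    return ip
```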
## Changes

- `start_tracker` now accepts `host=None` and in that case calls the Rabit code `get_host_ip('auto')`, which attempts to find the best local adapter address (see the sketch after this list)
- `client._run_on_scheduler(start_tracker, ...)` passes `host=None` to trigger this logic
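Roughly, this gives `start_tracker` the following shape (a sketch rather than the exact diff, assuming the vendored tracker module exposes `RabitTracker` and `get_host_ip`):

```python
from threading import Thread

from dask_xgboost.tracker import RabitTracker, get_host_ip

def start_tracker(host=None, n_workers=0):
    """Start a Rabit tracker; with host=None, bind to the best local IP."""
    if host is None:
        host = get_host_ip('auto')  # pick a local adapter address
    env = {'DMLC_NUM_WORKER': n_workers}
    rabit = RabitTracker(hostIP=host, nslave=n_workers)
    env.update(rabit.slave_envs())  # tracker URI/port advertised to workers

    rabit.start(n_workers)

    # Let the tracker wait for workers in the background.
    thread = Thread(target=rabit.join)
    thread.daemon = True
    thread.start()

    return env
```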
## Testing

To perform a manual test of the bug/fix, you will need a k8s cluster with Helm initialized (`helm init`).
During testing I found `EXTRA_PIP_PACKAGES` a two-edged sword: convenient, but the `pip` installs are long-running and repeated on each node, the service doesn't detect when they complete, and the Helm chart has no readiness probes, so the service looks dead until the install finishes on the Jupyter node. I preferred to build and tag a couple of pairs of local images with `dask-xgboost` and its deps pre-installed: `daskdev/dask-notebook` and `daskdev/dask` for the pre- and post-fix versions of `dask-xgboost`. To do that, create the `Dockerfile` below and build/tag four images.
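A minimal `Dockerfile` consistent with the build commands below might look like this (the git URL and the install-from-branch logic are assumptions):

```dockerfile
ARG BASE_IMAGE
FROM $BASE_IMAGE

# dask-xgboost ref to install: empty installs the released (pre-fix) package,
# a branch name installs the version under test.
ARG DASK_XGBOOST_VERSION=""

RUN if [ -z "$DASK_XGBOOST_VERSION" ]; then \
        pip install xgboost dask-xgboost; \
    else \
        pip install xgboost \
            "git+https://github.com/dask/dask-xgboost.git@${DASK_XGBOOST_VERSION}"; \
    fi
```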
Run:

```
docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" -t daskdev/dask:xgboost .
docker build --build-arg BASE_IMAGE="daskdev/dask:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask:xgboost-fixed .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" -t daskdev/dask-notebook:xgboost .
docker build --build-arg BASE_IMAGE="daskdev/dask-notebook:1.2.0" --build-arg DASK_XGBOOST_VERSION="23-rabit-tracker-bind-address" -t daskdev/dask-notebook:xgboost-fixed .
```
You can now deploy the Helm chart to test the pre- and post-fix behaviour:
Pre-fix:
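For example, with Helm 2 (the release name and chart value keys here are assumptions about the `stable/dask` defaults; only the image tags matter):

```
helm install stable/dask --name dask-test \
  --set scheduler.image.tag=xgboost \
  --set worker.image.tag=xgboost \
  --set jupyter.image.tag=xgboost
```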
Once the cluster is up, go to http://localhost, start a new notebook and run:
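Something like the following hypothetical smoke test (the dataset and parameters are arbitrary):

```python
from dask.distributed import Client
import dask.array as da
import dask_xgboost as dxgb

client = Client('dask-scheduler:8786')  # the chart's scheduler service

# Tiny random binary classification problem.
X = da.random.random((1000, 10), chunks=(100, 10))
y = (da.random.random(1000, chunks=100) > 0.5).astype(int)

params = {'objective': 'binary:logistic'}
bst = dxgb.train(client, params, X, y)  # starts the Rabit tracker on the scheduler
print(bst)
```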
This will fail, as the tracker cannot bind to the service hostname's IP inside the scheduler container.
Post-fix:
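Upgrade the release to the fixed images (again, value keys assumed):

```
helm upgrade dask-test stable/dask \
  --set scheduler.image.tag=xgboost-fixed \
  --set worker.image.tag=xgboost-fixed \
  --set jupyter.image.tag=xgboost-fixed
```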
Allow all pods to restart, then repeat the notebook test above, which will now pass and return a classifier result.
Fixed #23.