Dask XGBoost hangs during training with multiple GPU workers #6649

elaineejiang · 2021-01-27T22:41:58Z

Hi, I am using XGBoost (v1.1.1) with Dask (v2020.12.0). I have a Dask cluster that connects to remote GPU workers using k8s (v1.14). I've noticed that if I train on multiple GPU workers, the dispatched_train tasks hang on Rabit initialization. For example, this is what the call stack looks like in one of the workers:

Key: dispatched_train-60c6f36b-c5df-4871-9ff1-bf5fc547ccdf-0
File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 890, in _bootstrap self._bootstrap_inner()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/threadpoolexecutor.py", line 55, in _worker task.run()

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/_concurrent_futures_thread.py", line 65, in run result = self.fn(*self.args, **self.kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/worker.py", line 3425, in apply_function result = function(*args, **kwargs)

File "[...]/ext/public/python/xgboost/1/1/1/dist/lib/python3.7/xgboost/dask.py", line 418, in dispatched_train with RabitContext(rabit_args):

File "[...]/ext/public/python/xgboost/1/1/1/dist/lib/python3.7/xgboost/dask.py", line 82, in __enter__ rabit.init(self.args)

File "[...]/ext/public/python/xgboost/1/1/1/dist/lib/python3.7/xgboost/rabit.py", line 27, in init _LIB.RabitInit(len(arr), arr)

Here is a reproducible example:

# Prereq: Set up Dask cluster and client with >1 GPU workers

import pandas as pd
import numpy as np
import xgboost as xgb
import dask.dataframe as dd

seed = 42
random_state = np.random.RandomState(seed)

df = pd.DataFrame(random_state.random_sample((1000, 4)), columns=['x1', 'x2', 'x3', 'y'])
xcols = ['x1', 'x2', 'x3']
ycol = ['y']

X, y = (dd.from_pandas(df[xcols], npartitions=4), dd.from_pandas(df[ycol], npartitions=4))

learner = xgb.dask.DaskXGBRegressor(objective='reg:squarederror',
    n_estimators=16,
    max_depth=8,
    learning_rate=0.1,
    verbosity=3,
    tree_method="gpu_hist")

learner.fit(X, y)

I've noticed this happening with multiple CPU workers as well, but it occurs less frequently. It seems like the issue could be related to #6604 and #6469, although I tried the patch provided in #6469, and the workers were still hanging during training. Any ideas on how to resolve this?


trivialfis · 2021-01-27T23:26:00Z

Could you please try 1.3.3?

elaineejiang · 2021-01-28T17:12:42Z

@trivialfis Thanks for the quick response! I'm trying to build 1.3.3, but I keep hitting this error:

[...]/xgboost/1/3/3/build/python3.7/include/xgboost/parameter.h(92): error: no instance of function template "dmlc::Parameter<PType>::UpdateAllowUnknown [with PType=xgboost::tree::TrainParam]" matches the argument list
            argument types are: (const xgboost::Args, __nv_bool *)
          detected during:
            instantiation of "xgboost::Args xgboost::XGBoostParameter<Type>::UpdateAllowUnknown(const Container &, __nv_bool *) [with Type=xgboost::tree::TrainParam, Container=xgboost::Args]"

hcho3 · 2021-01-28T17:14:52Z

Try running git submodule update --init --recursive.

elaineejiang · 2021-01-28T19:13:14Z

Thanks @hcho3! I was able to build 1.3.3, but I'm still seeing the issue where the workers hang on Rabit initialization:

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 890, in _bootstrap self._bootstrap_inner()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/threadpoolexecutor.py", line 55, in _worker task.run()

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/_concurrent_futures_thread.py", line 65, in run result = self.fn(*self.args, **self.kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/worker.py", line 3425, in apply_function result = function(*args, **kwargs)

File "[...]/ext/public/python/xgboost/1/3/3/dist/lib/python3.7/xgboost/dask.py", line 648, in dispatched_train with RabitContext(rabit_args):

File "[...]/ext/public/python/xgboost/1/3/3/dist/lib/python3.7/xgboost/dask.py", line 106, in __enter__ rabit.init(self.args)

File "[...]/ext/public/python/xgboost/1/3/3/dist/lib/python3.7/xgboost/rabit.py", line 27, in init _LIB.RabitInit(len(arr), arr)
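For readers wondering why a thread can sit in an init call like this indefinitely: a collective rendezvous only returns once every expected participant has arrived. The sketch below is an analogy built on Python's standard threading.Barrier, not Rabit's actual implementation; the function name rendezvous and its parameters are made up for illustration, and the timeout exists only so the demo terminates instead of hanging.

```python
import threading


def rendezvous(world_size, arriving_workers, timeout=1.0):
    """Simulate a collective init step (illustrative analogy only).

    `world_size` participants are expected, but only `arriving_workers`
    threads actually reach the barrier. Returns {rank: outcome}.
    """
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        try:
            # Blocks until `world_size` threads have called wait().
            barrier.wait(timeout=timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised when the wait times out because a peer never arrived.
            results[rank] = "timed out"

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(arriving_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(rendezvous(world_size=2, arriving_workers=2))
    print(rendezvous(world_size=2, arriving_workers=1, timeout=0.2))
```

If one of the expected workers never enters the collective step (for example because its thread is busy with another task), the workers that did arrive keep waiting, which would look like the stack above.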

elaineejiang · 2021-01-29T23:24:32Z

I was looking for a workaround and saw that the tests for the Dask API always contain these lines:

# Always call this before using distributed module
xgb.rabit.init()
rank = xgb.rabit.get_rank()
world = xgb.rabit.get_world_size()

(from https://github.com/dmlc/xgboost/blob/master/tests/distributed/distributed_gpu.py)
I added a call to rabit.init() at the top of dispatched_train (https://github.com/dmlc/xgboost/blob/v1.3.3/python-package/xgboost/dask.py#L646) and now multi-GPU training works sometimes. I wanted to give this update in case it gives more insight into possible solutions, cc: @hcho3 @trivialfis. Much appreciated!

trivialfis · 2021-02-05T14:21:39Z

Opened an issue in dask/distributed#4485

elaineejiang mentioned this issue Feb 4, 2021

Fix Dask XGBoost hanging on rabit initialization during training with multi-GPU multi-nodes #6677

Closed

trivialfis mentioned this issue Mar 5, 2021

[dask] Use distributed.MultiLock #6743

Merged

trivialfis closed this as completed in #6743 Mar 16, 2021
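Judging only from its title, the fix in #6743 relies on distributed.MultiLock. One plausible reading is that each training job reserves every worker it needs in a single step before entering the collective phase, so two jobs can never each hold part of the worker set while waiting for the rest. The sketch below illustrates that all-or-nothing idea with the standard library only; AllOrNothingLock, run_jobs, and the worker names are hypothetical, and this is not the API or implementation of distributed.MultiLock.

```python
import threading


class AllOrNothingLock:
    """Hypothetical stand-in: grants a whole set of names atomically."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = set()

    def acquire(self, names):
        names = set(names)
        with self._cond:
            # Wait until none of the requested names is held,
            # then take all of them in one step.
            self._cond.wait_for(lambda: self._held.isdisjoint(names))
            self._held |= names

    def release(self, names):
        with self._cond:
            self._held -= set(names)
            self._cond.notify_all()


def run_jobs():
    """Run two jobs that need the same workers, requested in different orders."""
    lock = AllOrNothingLock()
    log = []
    log_guard = threading.Lock()

    def train(job, workers):
        lock.acquire(workers)
        try:
            with log_guard:
                log.append((job, "start"))
            # The collective step (e.g. the init rendezvous) would run here.
            with log_guard:
                log.append((job, "end"))
        finally:
            lock.release(workers)

    threads = [
        threading.Thread(target=train, args=("A", ["w0", "w1"])),
        threading.Thread(target=train, args=("B", ["w1", "w0"])),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_jobs())
```

Because a job holds either all of its workers or none, the log always shows one job starting and ending before the other starts, regardless of the order in which the workers were requested.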
