[dask] [gpu] Distributed training is VERY slow #4761
Comments
`client = Client(address='xxx.xxx.xxx.xxx:12345')  # this is the scheduler IP`
@chixujohnny thanks for using LightGBM! Right now the Dask interface doesn't directly support distributed training using the GPU; you can subscribe to #3776 if you're interested in that. Are you getting any warnings about this? I think it probably isn't using the GPU at all. Furthermore, if your data fits on a single machine, then it's probably best not to use distributed training at all. The Dask interface is there to help you train a model on data that doesn't fit on a single machine by having partitions of the data on different machines that communicate with each other, which adds some overhead compared to single-node training. If you want to use multiple GPUs on a single machine, you can try the CUDA version and set num_gpu to a value greater than 1.
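For reference, here is a minimal sketch of the single-machine multi-GPU setup suggested above. It assumes a CUDA-enabled LightGBM build; the synthetic data and the exact parameter values are placeholders, not a recommendation.

```python
# Minimal sketch (not an official recipe): single-machine multi-GPU training
# with a CUDA-enabled LightGBM build. Data and parameter values are placeholders.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
X = rng.random((100_000, 50))
y = rng.random(100_000)

params = {
    "objective": "regression",
    "device_type": "cuda",  # requires the CUDA build, not the OpenCL "gpu" build
    "num_gpu": 2,           # use more than one GPU on this machine
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```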
@jmoralez Thanks for your reply. The GPU really is being used, not CPU mode. To solve this problem, just use the dask-cuda package. Please see this doc to deploy your workers: https://docs.rapids.ai/api/dask-cuda/nightly/api.html and just use the command: `dask-cuda-worker xxx.xxx.xxx.xxx:9876`. That said, compared with dask-cuda XGBoost, XGBoost is more usable than LightGBM, although the difference is not particularly obvious.
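For anyone following along, here is a rough sketch of that workflow. The scheduler address, data shapes, and estimator settings are placeholders; it assumes the scheduler was started with `dask-scheduler` and each worker with `dask-cuda-worker xxx.xxx.xxx.xxx:9876` from the dask-cuda package.

```python
# Rough sketch: connect to a scheduler whose workers were launched with
# dask-cuda-worker, then train through LightGBM's Dask interface.
import dask.array as da
from dask.distributed import Client
from lightgbm import DaskLGBMRegressor

client = Client("xxx.xxx.xxx.xxx:9876")  # scheduler address (placeholder)

# Placeholder data: one partition per worker is a common starting point.
X = da.random.random((1_000_000, 50), chunks=(250_000, 50))
y = da.random.random((1_000_000,), chunks=(250_000,))

model = DaskLGBMRegressor(client=client, n_estimators=100)
model.fit(X, y)
```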
Thanks for the follow-up, @chixujohnny. Were you able to train faster using dask-cuda workers?
Faster, but still slower than single-GPU mode.
By the way, in a normal situation, when I train a regression job on only one GPU, what should the GPU usage rate be? On an NVIDIA A100 or V100 GPU, the usage is only about 35%. It's not like XGBoost, which reaches 100% usage. Is this normal?
LightGBM's existing CUDA-based implementation does some work on the GPU and some on the CPU (#4082 (comment)), which is why you might not see high GPU utilization. This is a known issue, and @shiyu1994 and others are working on it. I recommend subscribing to updates on the following PRs to track the progress of a new implementation that should better utilize the GPU. The reviews on those pull requests are going to get quite large, so if you have questions about the plans, please open new issues here that reference them instead of commenting on the PRs directly.
Thank you very much~ |
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. |
Description
I have many Linux machines; each one has 8 A100 GPUs and 128 CPU threads.
I recently found a problem with Dask LightGBM:
I used the Dask command-line interface to build a distributed cluster (doc: https://docs.dask.org/en/latest/how-to/deploy-dask/cli.html; a rough sketch of this setup is included below).
So I'm quite confused about what the point of distributed Dask LightGBM is:
Faster? No.
Saving GPU memory? No.
Have you encountered this problem when using multi-machine distributed LightGBM?
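For context, here is a rough sketch of the deployment described above. The addresses are placeholders, and the scheduler and workers are started from the command line as in the linked doc.

```python
# Illustrative only: the cluster is deployed with the Dask CLI, e.g.
#   dask-scheduler                        # on one machine
#   dask-worker xxx.xxx.xxx.xxx:8786      # on each of the other machines
# The client then connects to the scheduler and can list the attached workers.
from dask.distributed import Client

client = Client("xxx.xxx.xxx.xxx:8786")      # scheduler address (placeholder)
print(client.scheduler_info()["workers"])    # one entry per connected worker
```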
Environment info
LightGBM version or commit hash: 3.2.1 (GPU build)