[dask] [gpu] Distributed training is VERY slow #4761
Comments
`client = Client(address='xxx.xxx.xxx.xxx:12345')  # this is the scheduler IP`
@chixujohnny thanks for using LightGBM! Right now the Dask interface doesn't directly support distributed training using the GPU; you can subscribe to #3776 if you're interested in that. Are you getting any warnings about this? I think it probably isn't using the GPU at all. Furthermore, if your data fits on a single machine, then it's probably best not to use distributed training at all. The Dask interface is there to help you train a model on data that doesn't fit on a single machine by having partitions of the data on different machines that communicate with each other, which adds some overhead compared to single-node training. If you want to use multiple GPUs on a single machine, you can try the CUDA version and set num_gpu to a value greater than 1.
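For reference, here is a minimal sketch of the single-machine multi-GPU setup suggested above. It assumes a CUDA-enabled LightGBM build; the synthetic data and the exact parameter values are placeholders, not a recommendation.

```python
# Minimal sketch (not an official recipe): single-machine multi-GPU training
# with a CUDA-enabled LightGBM build. Data and parameter values are placeholders.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
X = rng.random((100_000, 50))
y = rng.random(100_000)

params = {
    "objective": "regression",
    "device_type": "cuda",  # requires the CUDA build, not the OpenCL "gpu" build
    "num_gpu": 2,           # use more than one GPU on this machine
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```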
@jmoralez Thanks for your reply. The GPU really is being used, not CPU mode. To solve this problem, just use the dask-cuda package. Please see this doc to deploy your workers: https://docs.rapids.ai/api/dask-cuda/nightly/api.html and just use the command: `dask-cuda-worker xxx.xxx.xxx.xxx:9876`. That said, compared with dask-cuda XGBoost, XGBoost is more usable than LightGBM, although the difference is not particularly obvious.
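For anyone following along, here is a rough sketch of that workflow. The scheduler address, data shapes, and estimator settings are placeholders; it assumes the scheduler was started with `dask-scheduler` and each worker with `dask-cuda-worker xxx.xxx.xxx.xxx:9876` from the dask-cuda package.

```python
# Rough sketch: connect to a scheduler whose workers were launched with
# dask-cuda-worker, then train through LightGBM's Dask interface.
import dask.array as da
from dask.distributed import Client
from lightgbm import DaskLGBMRegressor

client = Client("xxx.xxx.xxx.xxx:9876")  # scheduler address (placeholder)

# Placeholder data: one partition per worker is a common starting point.
X = da.random.random((1_000_000, 50), chunks=(250_000, 50))
y = da.random.random((1_000_000,), chunks=(250_000,))

model = DaskLGBMRegressor(client=client, n_estimators=100)
model.fit(X, y)
```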
Thanks for the follow-up, @chixujohnny. Were you able to train faster using dask-cuda workers?
Faster, but still slower than single-GPU mode.
By the way, in a normal situation, when I train a regression job on only one GPU, what should the GPU usage rate be? On an NVIDIA A100 or V100 GPU, the usage is only about 35%. It's not like XGBoost, which reaches 100% usage. Is this normal?
LightGBM's existing CUDA-based implementation does some work on the GPU and some on the CPU (#4082 (comment)), which is why you might not see high GPU utilization. This is a known issue, and @shiyu1994 and others are working on it. I recommend subscribing to updates on the following PRs to track the progress of a new implementation that should better utilize the GPU. The reviews on those pull requests are going to get quite large, so if you have questions about the plans, please open new issues here that reference them instead of commenting on the PRs directly.
Thank you very much~ |
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. |
Description
I have many Linux machines; each one has 8 A100 GPUs and 128 CPU threads.
I recently found a problem with Dask LightGBM:
I used the Dask command-line interface to build a distributed cluster (doc: https://docs.dask.org/en/latest/how-to/deploy-dask/cli.html; a rough sketch of this setup is included below).
So I'm quite confused about what the point of distributed Dask LightGBM is:
Faster? No.
Saving GPU memory? No.
Have you encountered this problem when using multi-machine distributed LightGBM?
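For context, here is a rough sketch of the deployment described above. The addresses are placeholders, and the scheduler and workers are started from the command line as in the linked doc.

```python
# Illustrative only: the cluster is deployed with the Dask CLI, e.g.
#   dask-scheduler                        # on one machine
#   dask-worker xxx.xxx.xxx.xxx:8786      # on each of the other machines
# The client then connects to the scheduler and can list the attached workers.
from dask.distributed import Client

client = Client("xxx.xxx.xxx.xxx:8786")      # scheduler address (placeholder)
print(client.scheduler_info()["workers"])    # one entry per connected worker
```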
Environment info
LightGBM version or commit hash: 3.2.1 (GPU build)