[GPU] further improving GPU performance #768
@huanzhang12 another question: can moving the GPU code to CUDA improve the speed?
Yes, the LightGBM GPU implementation can still be improved in many ways. Currently it only uses about 30%-50% of the GPU's full potential.

The major reason the GPU is slow for small data is that we need to transfer the histograms from GPU to CPU to find the best split after the feature histograms are built. This is not ideal. The data transfer overhead is significant on datasets with many features or on small datasets; it also requires the CPU to do too much work, so the CPU can become a bottleneck. For better efficiency, we should find the best split on the GPU, preferably in GPU local memory. This needs some work, because we have to re-implement the histogram pool and some other functions (like histogram fixup and split finding for numerical/categorical features) on the GPU, which is non-trivial. After these are implemented, I expect about a 2X speedup on large datasets and a significant speedup on smaller datasets (since there is no data transfer overhead). GPU training could become the standard for GBDTs, as it already is for deep learning.

Also, we need to work on enabling multi-GPU training. This does not seem very hard, as it can be viewed as a special case of distributed learning.

I have rough ideas on how to implement these things, and I really want to continue working on improving the GPU algorithm. Unfortunately, right now I am quite busy with my internship and can only work on this project in my limited spare time :'( I don't think I can finish it any time soon. Let me know if you have any better ideas on this issue.
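To make the "find the best split on the GPU" idea concrete, here is a minimal CUDA sketch, not LightGBM's actual code: one block per feature, a prefix scan over the histogram bins, and a shared-memory argmax reduction so that only the winning (gain, bin) pair ever leaves the device. The kernel name, bin count, regularization constant, and the simplified gain formula are all illustrative assumptions.

```cuda
// Minimal CUDA sketch: per-feature best-split search done entirely on the GPU.
// One block per feature, NUM_BINS candidate thresholds per feature.
// Names, sizes, and the simplified gain formula are illustrative, not LightGBM internals.
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;   // histogram bins per feature (assumption)
constexpr double LAMBDA = 1.0;  // L2 regularization term (assumption)

struct SplitResult {
  double gain;
  int bin;  // best threshold: left child takes bins [0, bin]
};

__global__ void best_split_kernel(const double* __restrict__ grad_hist,
                                  const double* __restrict__ hess_hist,
                                  SplitResult* __restrict__ out) {
  __shared__ double prefix_g[NUM_BINS];
  __shared__ double prefix_h[NUM_BINS];
  __shared__ double gains[NUM_BINS];
  __shared__ int bins[NUM_BINS];

  const int feature = blockIdx.x;
  const int tid = threadIdx.x;
  const double* g = grad_hist + feature * NUM_BINS;
  const double* h = hess_hist + feature * NUM_BINS;

  // Load this feature's histogram into shared (local) memory.
  prefix_g[tid] = g[tid];
  prefix_h[tid] = h[tid];
  __syncthreads();

  // Naive inclusive scan by one thread (a real kernel would use a parallel scan).
  if (tid == 0) {
    for (int i = 1; i < NUM_BINS; ++i) {
      prefix_g[i] += prefix_g[i - 1];
      prefix_h[i] += prefix_h[i - 1];
    }
  }
  __syncthreads();

  const double total_g = prefix_g[NUM_BINS - 1];
  const double total_h = prefix_h[NUM_BINS - 1];

  // Each thread scores one candidate threshold with a simplified gain:
  // gain = G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - G^2/(H+lambda)
  const double gl = prefix_g[tid], hl = prefix_h[tid];
  const double gr = total_g - gl, hr = total_h - hl;
  gains[tid] = gl * gl / (hl + LAMBDA) + gr * gr / (hr + LAMBDA)
               - total_g * total_g / (total_h + LAMBDA);
  bins[tid] = tid;
  __syncthreads();

  // Shared-memory argmax reduction: only the winner is written to global memory.
  for (int stride = NUM_BINS / 2; stride > 0; stride >>= 1) {
    if (tid < stride && gains[tid + stride] > gains[tid]) {
      gains[tid] = gains[tid + stride];
      bins[tid] = bins[tid + stride];
    }
    __syncthreads();
  }
  if (tid == 0) { out[feature].gain = gains[0]; out[feature].bin = bins[0]; }
}

int main() {
  const int num_features = 32;  // illustrative
  std::vector<double> hg(num_features * NUM_BINS), hh(num_features * NUM_BINS);
  for (size_t i = 0; i < hg.size(); ++i) { hg[i] = std::sin((double)i); hh[i] = 1.0; }

  double *dg, *dh; SplitResult* dres;
  cudaMalloc(&dg, hg.size() * sizeof(double));
  cudaMalloc(&dh, hh.size() * sizeof(double));
  cudaMalloc(&dres, num_features * sizeof(SplitResult));
  cudaMemcpy(dg, hg.data(), hg.size() * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dh, hh.data(), hh.size() * sizeof(double), cudaMemcpyHostToDevice);

  best_split_kernel<<<num_features, NUM_BINS>>>(dg, dh, dres);

  std::vector<SplitResult> res(num_features);
  cudaMemcpy(res.data(), dres, num_features * sizeof(SplitResult), cudaMemcpyDeviceToHost);
  printf("feature 0: best bin %d, gain %.3f\n", res[0].bin, res[0].gain);
  cudaFree(dg); cudaFree(dh); cudaFree(dres);
  return 0;
}
```

With this kind of kernel, the host only receives one small SplitResult per feature instead of the full histograms, which is exactly the data-transfer reduction described above.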
Multi-GPU and some of these other features can be harder in OpenCL due to library availability. We have NCCL for peer-to-peer multi-GPU allreduce, which helps a lot. I actually started off in OpenCL and switched to CUDA due to frustrations with Boost.Compute.
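As a sketch of the NCCL route (a single process driving all visible GPUs on one node; buffer names and sizes are made up for illustration, and error checking is omitted), each device could keep its own partial histograms and a grouped ncclAllReduce would sum them in place:

```cuda
// Single-process, multi-GPU sketch: sum per-GPU partial histograms with NCCL
// so every device ends up with the global histogram before split finding.
// Buffer layout and sizes are illustrative, not LightGBM's actual ones.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  const size_t hist_len = 1024 * 256;  // e.g. 1024 features * 256 bins (assumption)

  std::vector<ncclComm_t> comms(ndev);
  std::vector<cudaStream_t> streams(ndev);
  std::vector<double*> hist(ndev);

  // One communicator per visible GPU within this process.
  ncclCommInitAll(comms.data(), ndev, nullptr);

  for (int d = 0; d < ndev; ++d) {
    cudaSetDevice(d);
    cudaStreamCreate(&streams[d]);
    cudaMalloc(&hist[d], hist_len * sizeof(double));
    // A real trainer would fill hist[d] from this GPU's data partition here.
    cudaMemset(hist[d], 0, hist_len * sizeof(double));
  }

  // Sum the partial histograms in place across all GPUs.
  ncclGroupStart();
  for (int d = 0; d < ndev; ++d) {
    ncclAllReduce(hist[d], hist[d], hist_len, ncclDouble, ncclSum,
                  comms[d], streams[d]);
  }
  ncclGroupEnd();

  for (int d = 0; d < ndev; ++d) {
    cudaSetDevice(d);
    cudaStreamSynchronize(streams[d]);
    cudaFree(hist[d]);
    cudaStreamDestroy(streams[d]);
    ncclCommDestroy(comms[d]);
  }
  printf("All %d GPUs now hold the globally summed histograms.\n", ndev);
  return 0;
}
```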
@RAMitchell Thank you for sharing your valuable experience and suggestions. I will certainly consider using NCCL. I hope the parallel learners in LightGBM do not rely too much on the interconnect, because they have shown good scalability in distributed settings.
A recent paper about XGBoost GPU:
More details: https://github.com/RAMitchell/GBM-Benchmarks
The new XGBoost GPU implementation supports multi-GPU training. The experiments were done using 8 GPUs for XGBoost and 1 GPU for LightGBM.
Closed in favor of #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
GPU is much slower than CPU, more than 10 times. Why? OS: Windows 10 x64 Home edition. Train on GPU: Train on CPU:
Great!
I installed the latest LightGBM from GitHub and followed the Higgs tutorial on the LightGBM website. For AUC calculation: For L2: Not very impressive results. I need some guidance before I can submit a large job using LightGBM.
This was locked accidentally. I just unlocked it. We'd still welcome contributions related to this feature!
Refer to a benchmark here: https://blogs.technet.microsoft.com/machinelearning/2017/07/25/lessons-learned-benchmarking-fast-machine-learning-algorithms/ and #620 (comment).
It seems the LightGBM GPU implementation can still be further improved. The current GPU implementation has an overhead: the additional memory copy cost between GPU and CPU.
As a result, when #data is small, using the GPU may be slower than the CPU.
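To put a rough number on that overhead, here is a small CUDA timing sketch (the sizes are illustrative, not LightGBM's actual histogram layout) that measures the device-to-host copy of one histogram buffer. This cost is roughly constant per split search, which is why it dominates when #data is small:

```cuda
// Rough sketch: time a device-to-host transfer of one histogram buffer with
// CUDA events. Sizes assume ~1000 features * 256 bins * (gradient, hessian).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t num_features = 1000, num_bins = 256;
  const size_t bytes = num_features * num_bins * 2 * sizeof(float);

  float* d_hist = nullptr;
  float* h_hist = nullptr;
  cudaMalloc(&d_hist, bytes);
  cudaMallocHost(&h_hist, bytes);  // pinned host memory for a fair measurement

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  cudaMemcpy(h_hist, d_hist, bytes, cudaMemcpyDeviceToHost);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  // This copy is paid on every split search; on small data it can exceed the
  // time the GPU spent building the histograms in the first place.
  printf("Copied %.1f MB of histograms in %.3f ms\n", bytes / 1e6, ms);

  cudaEventDestroy(start); cudaEventDestroy(stop);
  cudaFree(d_hist); cudaFreeHost(h_hist);
  return 0;
}
```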
@huanzhang12 any updates on this?