[GPU] further improving GPU performance #768

Closed

guolinke opened this issue Aug 2, 2017 · 14 comments

guolinke commented Aug 2, 2017

Refer to a benchmark here: https://blogs.technet.microsoft.com/machinelearning/2017/07/25/lessons-learned-benchmarking-fast-machine-learning-algorithms/ and #620 (comment).

It seems the LightGBM GPU version can still be further improved. The current GPU implementation has an overhead: the additional cost of memory copies between GPU and CPU.
As a result, when #data is small, using the GPU may be slower than the CPU.

@huanzhang12 any updates on this?


guolinke commented Aug 2, 2017

@huanzhang12 another question: can moving the GPU code to CUDA improve the speed?

guolinke added this to the v3.0 milestone Aug 3, 2017
huanzhang12 (Contributor) commented:

Yes, the LightGBM GPU implementation can still be improved in many ways. Currently it only uses around 30%-50% of the full GPU potential.

The major reason the GPU is slow for small data is that we need to transfer the histograms from GPU to CPU to find the best split after the feature histograms are built. This is not ideal. The overhead of data transfer is significant on datasets with many features or little data; it also requires the CPU to do too much work, and the CPU can become a bottleneck. For better efficiency, we should find the best split on the GPU, preferably in GPU local memory. But this needs some work, because we have to re-implement the histogram pool and some other functions (like histogram fixup, and split finding for numerical/categorical features) on the GPU, which is non-trivial.
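
To illustrate what on-GPU split finding could look like, here is a minimal CUDA sketch that scans one feature histogram and reduces the best split in shared (local) memory, so only the winning bin and gain would ever need to cross the PCIe bus. It assumes a simplified gain formula, and all names (`BinEntry`, `FindBestSplit`) are hypothetical, not LightGBM's actual implementation:

```cuda
// Hypothetical sketch: find the best split of one feature histogram entirely on
// the GPU, so only the final decision (a few bytes) is copied back to the CPU.
#include <cuda_runtime.h>

struct BinEntry {
  double sum_gradients;
  double sum_hessians;
};

// Simplified variance gain of a left/right partition (regularization omitted).
__device__ double SplitGain(double gl, double hl, double gr, double hr) {
  return gl * gl / (hl + 1e-15) + gr * gr / (hr + 1e-15);
}

// Launch with 256 threads per block, one block per feature histogram.
__global__ void FindBestSplit(const BinEntry* hist, int num_bins,
                              double sum_g, double sum_h,
                              int* best_bin, double* best_gain) {
  __shared__ double s_gain[256];
  __shared__ int s_bin[256];
  const int tid = threadIdx.x;
  double gain = -1e300;
  int bin = -1;
  // Each thread evaluates a subset of the candidate split points. Prefix sums
  // are recomputed naively here; a real kernel would compute them once.
  for (int b = tid; b < num_bins - 1; b += blockDim.x) {
    double gl = 0.0, hl = 0.0;
    for (int i = 0; i <= b; ++i) {
      gl += hist[i].sum_gradients;
      hl += hist[i].sum_hessians;
    }
    const double g = SplitGain(gl, hl, sum_g - gl, sum_h - hl);
    if (g > gain) { gain = g; bin = b; }
  }
  s_gain[tid] = gain;
  s_bin[tid] = bin;
  __syncthreads();
  // Tree reduction in shared (local) memory to pick the block's best split.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride && s_gain[tid + stride] > s_gain[tid]) {
      s_gain[tid] = s_gain[tid + stride];
      s_bin[tid] = s_bin[tid + stride];
    }
    __syncthreads();
  }
  if (tid == 0) { *best_gain = s_gain[0]; *best_bin = s_bin[0]; }
}

int main() {
  // Tiny host driver with a zero-filled histogram, just to show the launch shape.
  const int num_bins = 256;
  BinEntry* d_hist; int* d_bin; double* d_gain;
  cudaMalloc(&d_hist, num_bins * sizeof(BinEntry));
  cudaMemset(d_hist, 0, num_bins * sizeof(BinEntry));  // built from gradients in practice
  cudaMalloc(&d_bin, sizeof(int));
  cudaMalloc(&d_gain, sizeof(double));
  FindBestSplit<<<1, 256>>>(d_hist, num_bins, /*sum_g=*/0.0, /*sum_h=*/0.0, d_bin, d_gain);
  cudaDeviceSynchronize();
  cudaFree(d_hist); cudaFree(d_bin); cudaFree(d_gain);
  return 0;
}
```

In a real trainer the left prefix sums would be computed once per histogram and the per-feature winners reduced again across blocks; the point here is only that the histograms stay on the device.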

After these are implemented, I expect about a 2X speedup on large datasets, and a significant speedup on smaller datasets (since there is no data transfer overhead). GPU training could then become the standard for GBDTs, as it already is for deep learning.

Also, we need to work on enabling multi-GPU training. This does not seem very hard, as it can be viewed as a special case of distributed learning.

I have rough ideas on how to implement these things, and I really want to continue working on improving the GPU algorithm. But unfortunately I am quite busy with my internship right now, and can only work on this project in my limited spare time :'( I don't think I can finish it any time soon.

Let me know if you have any better ideas on this issue.


RAMitchell commented Aug 7, 2017

Multi-GPU and some of these other features can be harder in OpenCL due to library availability. We have NCCL for p2p multi-GPU allreduce, which helps a lot. I actually started off in OpenCL and switched to CUDA due to frustrations with Boost.Compute.
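
For reference, here is a minimal single-process sketch of summing per-device histograms with NCCL's allreduce, so every GPU ends up holding the global histogram without staging it through the CPU. The buffer sizes and names are illustrative, not LightGBM code:

```cpp
// Hypothetical sketch: in-place allreduce of per-GPU feature histograms via NCCL.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

int main() {
  int n_gpus = 0;
  cudaGetDeviceCount(&n_gpus);
  std::vector<ncclComm_t> comms(n_gpus);
  ncclCommInitAll(comms.data(), n_gpus, nullptr);  // one communicator per visible GPU

  const size_t hist_len = 28 * 256 * 2;  // e.g. 28 features x 256 bins x (grad, hess)
  std::vector<double*> hist(n_gpus);
  std::vector<cudaStream_t> streams(n_gpus);
  for (int i = 0; i < n_gpus; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&hist[i], hist_len * sizeof(double));
    cudaMemset(hist[i], 0, hist_len * sizeof(double));  // local histograms built here
    cudaStreamCreate(&streams[i]);
  }

  // In-place allreduce: afterwards every GPU holds the sum of all local histograms.
  ncclGroupStart();
  for (int i = 0; i < n_gpus; ++i) {
    ncclAllReduce(hist[i], hist[i], hist_len, ncclDouble, ncclSum, comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < n_gpus; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(hist[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```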

huanzhang12 (Contributor) commented:

@RAMitchell Thank you for sharing your valuable experience and suggestions. I will certainly consider using NCCL. I hope the parallel learners in LightGBM do not rely too much on the interconnect, because they have shown good scalability in distributed settings.

StrikerRUS (Collaborator) commented:

A recent paper about XGBoost on GPU:
https://arxiv.org/abs/1806.11248


chivee commented Jul 5, 2018

huanzhang12 (Contributor) commented:

The new XGBoost GPU implementation supports multi-GPU training. The experiments were done using 8 GPUs for XGBoost and 1 GPU for LightGBM.

StrikerRUS (Collaborator) commented:

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.


icejean commented Sep 23, 2021

The GPU is much slower than the CPU, more than 10 times slower. Why?
I've read the following article already.

OS: Windows 10 x64 Home edition
GPU: GeForce RTX 2060 with Max-Q Design, 4 GB RAM, CUDA 10.1
CPU: Intel Core i7-10875H @ 2.30 GHz, 16 cores, 24 GB RAM

Train on GPU:
(base) D:\Github\LightGBM\examples\binary_classification>"../../Release/lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
[LightGBM] [Warning] objective is set=binary, objective=binary will be ignored. Current value: objective=binary
[LightGBM] [Warning] data is set=binary.train, data=binary.train will be ignored. Current value: data=binary.train
[LightGBM] [Warning] valid is set=binary.test, valid_data=binary.test will be ignored. Current value: valid=binary.test
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Using column number 0 as label
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Construct bin mappers from text data time 0.01 seconds
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 0.047841 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6132
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
[LightGBM] [Info] Using GPU Device: GeForce RTX 2060 with Max-Q Design, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 27 dense feature groups (0.19 MB) transferred to GPU in 0.002940 secs. 1 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.530877 -> initscore=0.123666
[LightGBM] [Info] Start training from score 0.123666
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 27 dense feature groups (0.15 MB) transferred to GPU in 0.002903 secs. 1 sparse feature groups
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.670374
[LightGBM] [Info] Iteration:1, training auc : 0.761949
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.672837
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.714528
[LightGBM] [Info] 0.025277 seconds elapsed, finished iteration 1
......
[LightGBM] [Info] Iteration:100, training binary_logloss : 0.220836
[LightGBM] [Info] Iteration:100, training auc : 0.99766
[LightGBM] [Info] Iteration:100, valid_1 binary_logloss : 0.492312
[LightGBM] [Info] Iteration:100, valid_1 auc : 0.839994
[LightGBM] [Info] 1.601896 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training

Train on CPU:
(base) D:\Github\LightGBM\examples\binary_classification>"../../Release/lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=cpu
[LightGBM] [Warning] objective is set=binary, objective=binary will be ignored. Current value: objective=binary
[LightGBM] [Warning] data is set=binary.train, data=binary.train will be ignored. Current value: data=binary.train
[LightGBM] [Warning] valid is set=binary.test, valid_data=binary.test will be ignored. Current value: valid=binary.test
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Using column number 0 as label
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Construct bin mappers from text data time 0.01 seconds
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 0.044767 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000864 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 6132
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.530877 -> initscore=0.123666
[LightGBM] [Info] Start training from score 0.123666
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.670374
[LightGBM] [Info] Iteration:1, training auc : 0.761954
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.672837
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.714544
[LightGBM] [Info] 0.002136 seconds elapsed, finished iteration 1
......
[LightGBM] [Info] Iteration:100, training binary_logloss : 0.221658
[LightGBM] [Info] Iteration:100, training auc : 0.997396
[LightGBM] [Info] Iteration:100, valid_1 binary_logloss : 0.503137
[LightGBM] [Info] Iteration:100, valid_1 auc : 0.831562
[LightGBM] [Info] 0.314613 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training

shiyu1994 (Collaborator) commented:

Hi @icejean, we will release a new CUDA version for LightGBM soon. It is expected to largely improve the current GPU performance. You can refer to #4528 to check the progress.


icejean commented Sep 24, 2021

Great!

asheetal commented:

I installed the latest LightGBM from GitHub and followed the HIGGS tutorial on the LightGBM website.
GPU = GTX 1070, CUDA 11
CPU = AMD Ryzen Threadripper 2950X

For AUC calculation
GPU -> [LightGBM] [Info] 27.605478 seconds elapsed, finished iteration 50
CPU -> [LightGBM] [Info] 24.136331 seconds elapsed, finished iteration 50

For L2
GPU -> [LightGBM] [Info] 22.157089 seconds elapsed, finished iteration 50
CPU -> [LightGBM] [Info] 23.382834 seconds elapsed, finished iteration 50

Not very impressive results. I need some guidance before I can submit a large job using LightGBM.

@github-actions

This comment was marked as off-topic.

github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
jameslamb (Collaborator) commented:

This was locked accidentally. I just unlocked it. We'd still welcome contributions related to this feature!

microsoft unlocked this conversation Aug 18, 2023