[GPU] further improving GPU performance #768

guolinke · 2017-08-02T14:23:18Z

Refer to a benchmark here: https://blogs.technet.microsoft.com/machinelearning/2017/07/25/lessons-learned-benchmarking-fast-machine-learning-algorithms/ and #620 (comment) .

It seems the LightGBM GPU implementation can still be improved. The current GPU implementation has an overhead: the additional cost of copying memory between the GPU and the CPU.
As a result, when #data is small, using the GPU may be slower than using the CPU.

@huanzhang12 any updates on this?

guolinke · 2017-08-02T14:52:31Z

@huanzhang12 another question: could moving the GPU code to CUDA improve the speed?

huanzhang12 · 2017-08-07T05:44:05Z

Yes, LightGBM GPU can still be improved in many ways. Currently the GPU implementation uses only about 30%-50% of the GPU's full potential.

The major reason the GPU is slow on small data is that, after the feature histograms are built, we need to transfer them from the GPU to the CPU to find the best split. This is not ideal. The data transfer overhead is significant on datasets with many features or on small datasets; it also leaves too much work to the CPU, which can become a bottleneck. For better efficiency, we should find the best split on the GPU, preferably in GPU local memory. But this needs some work, because we would have to re-implement the histogram pool and some other functions (like histogram fixup and split finding for numerical/categorical features) on the GPU, which is non-trivial.
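To make the two stages concrete, here is a minimal pure-Python sketch of histogram-based split finding for a single numerical feature. It is not LightGBM's actual code: the function names, the toy data, and the regularization constant `lam` are assumptions chosen for illustration, and the gain formula is the textbook second-order GBDT gain. `build_histogram` stands in for the work done on the GPU, and `find_best_split` for the scan that currently runs on the CPU after the histograms are copied back.

```python
# Simplified sketch of histogram-based split finding for one numerical feature.
# Illustrative only: names, data, and `lam` are assumptions, not LightGBM internals.

def build_histogram(bin_ids, grads, hessians, num_bins):
    """Accumulate per-bin gradient and hessian sums (the GPU-side stage)."""
    hist = [[0.0, 0.0] for _ in range(num_bins)]
    for b, g, h in zip(bin_ids, grads, hessians):
        hist[b][0] += g
        hist[b][1] += h
    return hist


def find_best_split(hist, lam=1.0):
    """Scan bin boundaries left to right and return (best_bin, best_gain).

    A split at bin b sends bins 0..b to the left child and the rest to the right.
    """
    total_g = sum(g for g, _ in hist)
    total_h = sum(h for _, h in hist)
    parent_score = total_g ** 2 / (total_h + lam)
    best_gain, best_bin = 0.0, None
    left_g = left_h = 0.0
    for b in range(len(hist) - 1):
        left_g += hist[b][0]
        left_h += hist[b][1]
        right_g = total_g - left_g
        right_h = total_h - left_h
        gain = (left_g ** 2 / (left_h + lam)
                + right_g ** 2 / (right_h + lam)
                - parent_score)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain


# Toy data: 8 rows already mapped to 4 bins.
bin_ids = [0, 0, 1, 1, 2, 2, 3, 3]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hessians = [1.0] * 8

hist = build_histogram(bin_ids, grads, hessians, num_bins=4)
best_bin, best_gain = find_best_split(hist)
print(best_bin, best_gain)
```

The histogram is small compared with the raw data, but it has to cross the GPU-CPU boundary for every leaf and every feature, which is why the transfer matters most when there are many features or few rows.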

After these are implemented, I expect about a 2X speedup on large datasets, and a significant speedup on smaller datasets (since there would be no data transfer overhead). GPU training could become a standard for GBDTs, as it is for deep learning.

Also, we need to work on enabling multi-GPU training. This does not seem very hard, as it can be viewed as a special case of distributed learning.
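One way to read "a special case of distributed learning" is data-parallel training: each device builds histograms over its own shard of the rows, and the per-device histograms are summed before split finding. The sketch below simulates that in plain Python under that assumption. The helper names and the two-way shard layout are hypothetical, and `allreduce_sum` is only a stand-in for a real collective operation across devices.

```python
# Data-parallel histogram construction, simulated on the host.
# Illustrative only: helper names and the shard layout are assumptions.

def build_histogram(bin_ids, grads, num_bins):
    """Per-bin gradient sums over one shard of rows."""
    hist = [0.0] * num_bins
    for b, g in zip(bin_ids, grads):
        hist[b] += g
    return hist


def allreduce_sum(histograms):
    """Element-wise sum of per-device histograms (stand-in for a collective)."""
    return [sum(values) for values in zip(*histograms)]


num_bins = 4
bin_ids = [0, 1, 1, 2, 3, 3, 0, 2]
grads = [0.5, -1.0, 0.25, 2.0, -0.5, 1.5, -0.25, 1.0]

# Split the rows across two simulated devices.
shards = [(bin_ids[:4], grads[:4]), (bin_ids[4:], grads[4:])]
per_device = [build_histogram(b, g, num_bins) for b, g in shards]

merged = allreduce_sum(per_device)
full = build_histogram(bin_ids, grads, num_bins)
print(merged == full)
```

Because histogram accumulation is a plain sum, the merged result matches the single-device histogram, so the split-finding step does not need to know how many devices contributed.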

I have rough ideas on how to implement these things, and I really want to continue working on improving the GPU algorithm. But unfortunately right now I am quite busy with my internship, and can only work on this project during my limited spare time :'( I don't think I can finish it any time soon.

Let me know if you have any better ideas on this issue.

RAMitchell · 2017-08-07T07:35:41Z

Multi-GPU support and some of these other features can be harder in OpenCL due to library availability. We have NCCL for P2P multi-GPU allreduce, which helps a lot. I actually started off in OpenCL and switched to CUDA due to frustrations with Boost.Compute.

huanzhang12 · 2017-08-16T15:45:27Z

@RAMitchell Thank you for sharing your valuable experience and suggestions. I will certainly consider using NCCL. I hope the parallel learners in LightGBM do not rely too much on the interconnect, because they have shown good scalability in distributed settings.

StrikerRUS · 2018-07-04T22:44:09Z

A fresh paper about XGBoost GPU:
https://arxiv.org/abs/1806.11248

chivee · 2018-07-05T06:44:20Z

more details: https://github.com/RAMitchell/GBM-Benchmarks

huanzhang12 · 2018-07-06T19:00:06Z

The new XGBoost GPU implementation supports multi-GPU training. The experiments were done using 8 GPUs for XGBoost and 1 GPU for LightGBM.

StrikerRUS · 2019-08-01T16:44:01Z

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

icejean · 2021-09-23T07:29:14Z

The GPU is much slower than the CPU, by more than 10 times. Why?
I've already read the following article.

OS: Windows 10 x64 home edition
GPU: GeForce RTX 2060 with Max-Q Design, 4G RAM, Cuda 10.1
CPU: Intel Core i7 10875H @ 2.30 GHz, 16 core, 24G RAM

Train on GPU:
(base) D:\Github\LightGBM\examples\binary_classification>"../../Release/lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
[LightGBM] [Warning] objective is set=binary, objective=binary will be ignored. Current value: objective=binary
[LightGBM] [Warning] data is set=binary.train, data=binary.train will be ignored. Current value: data=binary.train
[LightGBM] [Warning] valid is set=binary.test, valid_data=binary.test will be ignored. Current value: valid=binary.test
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Using column number 0 as label
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Construct bin mappers from text data time 0.01 seconds
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 0.047841 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6132
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
[LightGBM] [Info] Using GPU Device: GeForce RTX 2060 with Max-Q Design, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 27 dense feature groups (0.19 MB) transferred to GPU in 0.002940 secs. 1 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.530877 -> initscore=0.123666
[LightGBM] [Info] Start training from score 0.123666
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 27 dense feature groups (0.15 MB) transferred to GPU in 0.002903 secs. 1 sparse feature groups
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.670374
[LightGBM] [Info] Iteration:1, training auc : 0.761949
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.672837
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.714528
[LightGBM] [Info] 0.025277 seconds elapsed, finished iteration 1
......
[LightGBM] [Info] Iteration:100, training binary_logloss : 0.220836
[LightGBM] [Info] Iteration:100, training auc : 0.99766
[LightGBM] [Info] Iteration:100, valid_1 binary_logloss : 0.492312
[LightGBM] [Info] Iteration:100, valid_1 auc : 0.839994
[LightGBM] [Info] 1.601896 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training

Train on CPU:
(base) D:\Github\LightGBM\examples\binary_classification>"../../Release/lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=cpu
[LightGBM] [Warning] objective is set=binary, objective=binary will be ignored. Current value: objective=binary
[LightGBM] [Warning] data is set=binary.train, data=binary.train will be ignored. Current value: data=binary.train
[LightGBM] [Warning] valid is set=binary.test, valid_data=binary.test will be ignored. Current value: valid=binary.test
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Using column number 0 as label
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Construct bin mappers from text data time 0.01 seconds
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 0.044767 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000864 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 6132
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.530877 -> initscore=0.123666
[LightGBM] [Info] Start training from score 0.123666
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.670374
[LightGBM] [Info] Iteration:1, training auc : 0.761954
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.672837
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.714544
[LightGBM] [Info] 0.002136 seconds elapsed, finished iteration 1
......
[LightGBM] [Info] Iteration:100, training binary_logloss : 0.221658
[LightGBM] [Info] Iteration:100, training auc : 0.997396
[LightGBM] [Info] Iteration:100, valid_1 binary_logloss : 0.503137
[LightGBM] [Info] Iteration:100, valid_1 auc : 0.831562
[LightGBM] [Info] 0.314613 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training
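
For a quick comparison of the two runs above, the cumulative "seconds elapsed" lines can be parsed and divided. This is a small sketch, not part of LightGBM: the regular expression and helper name are assumptions, and the log strings are copied from the output above.

```python
import re

# Matches lines such as:
# [LightGBM] [Info] 1.601896 seconds elapsed, finished iteration 100
ELAPSED = re.compile(
    r"\[LightGBM\] \[Info\] ([0-9.]+) seconds elapsed, finished iteration (\d+)"
)


def elapsed_by_iteration(log_text):
    """Map iteration number -> cumulative elapsed seconds found in a log."""
    return {int(m.group(2)): float(m.group(1)) for m in ELAPSED.finditer(log_text)}


gpu_log = """[LightGBM] [Info] 0.025277 seconds elapsed, finished iteration 1
[LightGBM] [Info] 1.601896 seconds elapsed, finished iteration 100"""

cpu_log = """[LightGBM] [Info] 0.002136 seconds elapsed, finished iteration 1
[LightGBM] [Info] 0.314613 seconds elapsed, finished iteration 100"""

gpu = elapsed_by_iteration(gpu_log)
cpu = elapsed_by_iteration(cpu_log)

# Ratio of GPU to CPU elapsed time at each iteration present in both logs.
ratios = {it: gpu[it] / cpu[it] for it in sorted(gpu) if it in cpu}
for it, ratio in ratios.items():
    print(f"iteration {it}: GPU/CPU elapsed ratio = {ratio:.1f}")
```

On these logs the ratio is about 11.8 after the first iteration and about 5.1 after iteration 100, on a training set of 7000 rows and 28 features.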

shiyu1994 · 2021-09-23T15:53:02Z

Hi @icejean, we will release a new CUDA version of LightGBM soon. It is expected to greatly improve on current GPU performance. You can refer to #4528 to check the progress.

icejean · 2021-09-24T01:15:45Z

Great!

asheetal · 2022-07-16T00:04:29Z

I installed the latest LightGBM from GitHub and followed the Higgs tutorial on the LightGBM website.
GPU = GTX 1070 Cuda 11
CPU = AMD Ryzen Threadripper 2950X

For AUC calculation
GPU -> [LightGBM] [Info] 27.605478 seconds elapsed, finished iteration 50
CPU -> [LightGBM] [Info] 24.136331 seconds elapsed, finished iteration 50

For L2
GPU -> [LightGBM] [Info] 22.157089 seconds elapsed, finished iteration 50
CPU -> [LightGBM] [Info] 23.382834 seconds elapsed, finished iteration 50

Not very impressive results. I need some guidance before I can submit a large job using LightGBM.

jameslamb · 2023-08-18T03:01:04Z

This was locked accidentally. I just unlocked it. We'd still welcome contributions related to this feature!

guolinke added this to the v3.0 milestone Aug 3, 2017

guolinke mentioned this issue Aug 7, 2017

10 warnings for "C4267" in compile gpu build #788

Closed

guolinke mentioned this issue Aug 16, 2017

Training on GPU does not utilise GPU properly #836

Closed

Laurae2 assigned huanzhang12 Oct 1, 2017

guolinke added the feature request label Jun 13, 2018

guolinke mentioned this issue Jun 28, 2018

LightGBM doesn't use full power of GPU #1476

Closed

StrikerRUS mentioned this issue Jun 7, 2019

LightGBMError: GPU Tree Learner was not enabled in this build. #2222

Closed

guolinke mentioned this issue Aug 1, 2019

Feature Requests & Voting Hub #2302

Open

guolinke closed this as completed Aug 1, 2019

StrikerRUS mentioned this issue Aug 11, 2019

Python install - Misleading exception raised in Setup.py on GPU install #1121

Closed

StrikerRUS mentioned this issue Oct 25, 2019

How to reduce the GPU memory usage? #2509

Closed

huanzhang12 mentioned this issue Aug 11, 2020

Add support for CUDA-based GPU build #3160

Merged

StrikerRUS mentioned this issue Sep 21, 2020

LightGBMError: GPU Tree Learner was not enabled in this build (ubuntu 18.04 - anaconda - jupyter notebook env) #3310

Closed

This was referenced Nov 29, 2020

Optimisations for Apple Silicon #3606

Closed

why running GPU version but the GPU-Util is 0% #3619

Closed

jmoralez mentioned this issue Oct 31, 2022

[gpu] LightGBM GPU Trainer is stuck at Compiling OpenCL Kernel with 64 bins #5536

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023

microsoft unlocked this conversation Aug 18, 2023
