
Initial GPU acceleration support for LightGBM #368

Merged: 97 commits merged into microsoft:master on Apr 9, 2017

Conversation

huanzhang12 (Contributor)

This set of patches adds initial GPU acceleration to LightGBM by accelerating histogram construction on GPUs. The implementation is highly modular and does not affect existing LightGBM features; the GPU kernel code is mostly independent of the rest of LightGBM, which makes long-term maintenance easier.

I add a new type of tree learner in gpu_tree_learner.h and gpu_tree_learner.cpp; all GPU interfacing code is inside these two files. Changes to other parts of LightGBM are kept minimal: I only make some small interface changes to make the necessary data structures available to the GPU tree learner. The GPU code (in folder src/treelearner/ocl/) is implemented in OpenCL and has been tested on both AMD and NVIDIA GPUs.

For the build and testing procedure, and an initial performance comparison, please see:
https://github.com/huanzhang12/lightgbm-gpu
(We probably want to move these instructions to the Wiki after merging.)

for (data_size_t i = 0; i < num_data; ++i) {
  ordered_gradients[i] = gradients[data_indices[i]];
}
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, num_data * sizeof(score_t), ptr_pinned_gradients_);
Collaborator

Why do both BeforeFindBestSplit and ConstructGPUHistogramsAsync need to copy ordered_grad and ordered_hess to the GPU?

Contributor Author

BeforeFindBestSplit just copies ordered_grad and ordered_hess for the smaller leaf. For the larger leaf, if subtraction is not possible, ordered_grad and ordered_hess have to be copied before launching the GPU kernel.
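For context, a minimal sketch of the histogram subtraction referred to here (the struct and function names are illustrative placeholders, not the PR's actual HistogramBinEntry code): when the parent's histogram is available, the larger leaf's histogram can be derived bin by bin from the parent and the smaller leaf, so the larger leaf's ordered_grad/ordered_hess never need to be copied.

  struct HistogramBinSketch {
    double sum_gradients;
    double sum_hessians;
    int count;
  };

  // larger[b] = parent[b] - smaller[b] for every bin b
  void SubtractHistogram(const HistogramBinSketch* parent, const HistogramBinSketch* smaller,
                         HistogramBinSketch* larger, int num_bins) {
    for (int b = 0; b < num_bins; ++b) {
      larger[b].sum_gradients = parent[b].sum_gradients - smaller[b].sum_gradients;
      larger[b].sum_hessians = parent[b].sum_hessians - smaller[b].sum_hessians;
      larger[b].count = parent[b].count - smaller[b].count;
    }
  }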

Collaborator

OK. Why not do all the copying in GPUTreeLearner::ConstructGPUHistogramsAsync?

huanzhang12 (Contributor Author), Apr 6, 2017

Because I want to start the data copy as early as possible, hopefully overlapping it with other work on the CPU and hiding the data transfer cost. Currently, in most cases the copy entirely overlaps with SerialTreeLearner::BeforeFindBestSplit. We only wait for these event futures (indices_future_, gradients_future_ and hessians_future_) immediately before launching the GPU kernel, and hopefully by then all data has been transferred to the GPU in the background, so the data transfer overhead is close to zero.
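A minimal self-contained sketch of that ordering, using std::async as a stand-in for the Boost.Compute enqueue_write_buffer_async/future machinery in the PR (the function names below are placeholders, not LightGBM APIs):

  #include <functional>
  #include <future>
  #include <vector>

  void CopyGradientsToDevice(const std::vector<float>& ordered_gradients) {
    // stand-in for the asynchronous host-to-device write
  }

  void CpuSidePreparation() {
    // stand-in for the CPU work done in BeforeFindBestSplit
  }

  void LaunchGpuHistogramKernel() {
    // stand-in for the OpenCL histogram kernel launch
  }

  void TrainOneStep(const std::vector<float>& ordered_gradients) {
    // 1. Start the transfer as early as possible.
    std::future<void> gradients_future =
        std::async(std::launch::async, CopyGradientsToDevice, std::cref(ordered_gradients));
    // 2. Do unrelated CPU work while the transfer runs in the background.
    CpuSidePreparation();
    // 3. Block only right before the kernel needs the data; if the overlap worked,
    //    this wait returns immediately and the transfer cost is hidden.
    gradients_future.wait();
    LaunchGpuHistogramKernel();
  }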

void GPUTreeLearner::ConstructHistograms(const std::vector<int8_t>& is_feature_used, bool use_subtract) {
  std::vector<int8_t> is_sparse_feature_used(num_features_, 0);
  std::vector<int8_t> is_dense_feature_used(num_features_, 0);
  for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
Collaborator

Could you change all loops over num_features_ and num_group (like https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L450-L459) to multi-threading? Otherwise it will be slow when the number of features is large.

Contributor Author

OK

  }
  // convert indices in is_feature_used to feature-group indices
  std::vector<int8_t> is_feature_group_used(num_feature_groups_, 0);
  for (int i = 0; i < num_features_; ++i) {
Collaborator

Contributor Author

OK

  }
  // construct smaller leaf
  HistogramBinEntry* ptr_smaller_leaf_hist_data = smaller_leaf_histogram_array_[0].RawData() - 1;
  bool use_gpu = ConstructGPUHistogramsAsync(is_feature_used,
Collaborator

Will it return false only when num_data == 0? The return value name use_gpu is confusing.

Contributor Author

There are some cases where the GPU will not actually be used, like num_data == 0 or when all dense features are disabled (I will add that check). In these cases ConstructGPUHistogramsAsync returns false, indicating that we don't need to call WaitAndGetHistograms; otherwise it would deadlock. I can change the variable name and add some comments to the code.
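In other words, the return value is a "was GPU work actually queued" flag rather than a general GPU-capability flag. A caller-side sketch of that contract (the names mirror the description above, the bodies are placeholders, not the merged implementation):

  bool ConstructGPUHistogramsAsyncSketch(int num_data, bool any_dense_feature_used) {
    if (num_data == 0 || !any_dense_feature_used) {
      return false;  // no kernel launched, so no results will ever arrive
    }
    // ... enqueue the data copies and the GPU histogram kernel here ...
    return true;
  }

  void ConstructHistogramsSketch(int num_data, bool any_dense_feature_used) {
    const bool gpu_work_queued = ConstructGPUHistogramsAsyncSketch(num_data, any_dense_feature_used);
    // ... build sparse-feature histograms on the CPU in the meantime ...
    if (gpu_work_queued) {
      // Safe to block here: a kernel was actually launched.
      // WaitAndGetHistogramsSketch(...);
    }
    // If gpu_work_queued is false, skipping the wait avoids blocking forever
    // on results that will never be produced.
  }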

    leaf = larger_leaf_splits_->LeafIndex();
    auto larger_best_idx = ArrayArgs<SplitInfo>::ArgMax(larger_best);
    best_split_per_leaf_[leaf] = larger_best[larger_best_idx];
  }
Collaborator

Is this duplicated code only for debugging purposes?

Collaborator

If so, can you just call SerialTreeLearner::FindBestThresholds() and only write the additional code needed for debugging?

huanzhang12 (Contributor Author), Apr 6, 2017

Yes, for now we can actually just use SerialTreeLearner::FindBestThresholds(). But we should keep in mind that later on we also want to do FixHistograms and FindBestThreshold on the GPU as well, without copying the histograms back to the CPU.

guolinke (Collaborator) commented Apr 6, 2017

@huanzhang12
I am OK with this if you fix these multi-threading issues.

@chivee Do you have any comments?

huanzhang12 (Contributor Author)

@guolinke I have added "omp parallel for" to the loops you mentioned, as well as in some other places.

I found that for the two loops in ConstructGPUHistogramsAsync, the per-iteration workload is too small, and multithreading actually harms performance there. For yahoo and epsilon, the entire training can be 20% slower (reproducibly). So I added an if clause to the OpenMP pragma, enabling parallelization only when the loop count is large enough.
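For illustration, a sketch of that pattern (the 2048 threshold below is made up for the example, not the value used in the PR):

  #include <vector>

  void SplitFeaturesByDensity(const std::vector<int8_t>& is_feature_used,
                              const std::vector<int8_t>& is_sparse,
                              std::vector<int8_t>* is_sparse_feature_used,
                              std::vector<int8_t>* is_dense_feature_used) {
    const int num_features = static_cast<int>(is_feature_used.size());
    // The if() clause keeps the loop serial when the trip count is small, where
    // OpenMP fork/join overhead would outweigh the tiny per-iteration workload.
    #pragma omp parallel for schedule(static) if (num_features >= 2048)
    for (int i = 0; i < num_features; ++i) {
      if (!is_feature_used[i]) continue;
      if (is_sparse[i]) {
        (*is_sparse_feature_used)[i] = 1;
      } else {
        (*is_dense_feature_used)[i] = 1;
      }
    }
  }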

I also need to add constant hessian optimization to the GPU kernel before merging. But the GPUTreeLearner class will mostly remain unchanged.

guolinke (Collaborator) commented Apr 7, 2017

@huanzhang12
Thanks. I also fixed some other "light" omp loops by adding an if clause.

huanzhang12 (Contributor Author)

@guolinke I have added the constant hessian optimization to the GPU kernels. I changed the TreeLearner interface a little bit: I now also pass the is_constant_hessian flag to Init, because GPUTreeLearner needs to compile different GPU code based on this flag during initialization and then prepare a lot of things before training starts. If a different is_constant_hessian value is later passed to Train, it will recompile the GPU kernels accordingly. For the other tree learners this does not matter. Do you think this is okay?
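A hypothetical sketch of that contract (class and method names are illustrative placeholders, not the exact code in this PR):

  class GPUTreeLearnerSketch {
   public:
    void Init(bool is_constant_hessian) {
      is_constant_hessian_ = is_constant_hessian;
      BuildGPUKernels(is_constant_hessian_);  // compile the matching OpenCL variants up front
    }
    void Train(bool is_constant_hessian) {
      if (is_constant_hessian != is_constant_hessian_) {
        // The flag changed after Init: rebuild the kernels before using them.
        is_constant_hessian_ = is_constant_hessian;
        BuildGPUKernels(is_constant_hessian_);
      }
      // ... run histogram construction with the currently compiled kernel set ...
    }

   private:
    void BuildGPUKernels(bool /*constant_hessian*/) {
      // placeholder for compiling the OpenCL programs
    }
    bool is_constant_hessian_ = false;
  };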

huanzhang12 (Contributor Author) commented Apr 8, 2017

@guolinke I am also considering adding a TravisCI test for the GPU code, in case further updates to other parts break GPU functionality. The testing environment has no real GPU, but thanks to the good portability of OpenCL, the GPU code can be tested on a CPU device after installing a CPU OpenCL environment (like the AMD APP SDK or pocl). Ideally, all tests should pass with device set to both cpu and gpu. We don't need to change other test options.

What is the best way to globally set a parameter for LightGBM? For example, I want to set device=gpu and then run all existing testing scripts, without heavily modifying those scripts or adding device=xxx manually to every test. Theano takes an environment variable, THEANO_FLAGS, which can easily override a config option. Is there something similar in LightGBM?

guolinke (Collaborator) commented Apr 8, 2017

@huanzhang12 I think it doesn't have this feature yet, but it should be easy to add one.
The is_constant_hessian change is OK.

@wxchan Can you add this to the python package?

guolinke (Collaborator) commented Apr 8, 2017

@huanzhang12
I think the test can be done in another PR after this one is finished.

huanzhang12 (Contributor Author)

@guolinke Sure. Currently, when I run the tests manually, there is one failure when the GPU is enabled (test_continue_train_multiclass). I will fix it; after that this PR should be ready ;)

wxchan (Contributor) commented Apr 8, 2017

@guolinke OK. Do you mean adding an is_constant_hessian argument to the sklearn __init__?

guolinke (Collaborator) commented Apr 8, 2017

@wxchan No, I mean adding something like THEANO_FLAGS (an environment variable) to choose the cpu/gpu device when calling lightgbm.dll. I think it can be solved by detecting the environment variable and changing the related parameters passed to lightgbm.dll.
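A hypothetical sketch of that idea (LightGBM did not have this at the time; the LIGHTGBM_DEVICE variable name is made up for the example): read an environment variable and let it override the device parameter before the rest of the configuration is applied. The same detection could equally live in the Python wrapper before it calls lightgbm.dll.

  #include <cstdlib>
  #include <string>
  #include <unordered_map>

  void ApplyEnvOverrides(std::unordered_map<std::string, std::string>* params) {
    // e.g. LIGHTGBM_DEVICE=gpu python test_sklearn.py
    if (const char* device = std::getenv("LIGHTGBM_DEVICE")) {
      (*params)["device"] = device;  // overrides any value from the config file or API call
    }
  }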

huanzhang12 (Contributor Author)

@guolinke The filter_cnt variable passed to BinMapper::FindBin seems to be calculated in different ways in the two places it appears in dataset_loader.cpp:

At https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset_loader.cpp#L491-L492

  const data_size_t filter_cnt = static_cast<data_size_t>(
    static_cast<double>(io_config_.min_data_in_leaf * total_sample_size) / num_data);

Where total_sample_size seems to be the total number of examples;

But at https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset_loader.cpp#L709-L710

  const data_size_t filter_cnt = static_cast<data_size_t>(
    static_cast<double>(io_config_.min_data_in_leaf* sample_values.size()) / dataset->num_data_);

sample_values.size() is the number of features.

I found this issue when debugging the failing test case. I guess one of them is a typo?

guolinke (Collaborator) commented Apr 9, 2017

Oh, it is a typo. I will fix it.

huanzhang12 (Contributor Author) commented Apr 9, 2017

@guolinke I have fixed the failing test case, and I also added TravisCI tests for the GPU learner. You can do a final review and finish this PR now.

guolinke merged commit 0bb4a82 into microsoft:master on Apr 9, 2017
guolinke (Collaborator)

@huanzhang12
It seems XGBoost also supports the histogram algorithm on GPU now.
Have you had a chance to compare against it?

huanzhang12 changed the title from "[WIP] Initial GPU acceleration support for LightGBM" to "Initial GPU acceleration support for LightGBM" on Apr 26, 2017
huanzhang12 (Contributor Author)

@guolinke I haven't had a chance to compare with it carefully yet.
I briefly took a look at the code. The good thing is that in XGBoost everything runs on the GPU (histogram, split, update, etc.), so it could be faster than LightGBM (which only builds histograms on the GPU) when the dataset is small. However, I am not sure whether the GPU tree learner in XGBoost behaves exactly the same as the CPU one, because it basically re-implements everything.

The histogram building procedure in XGBoost is not as simple as the one in LightGBM and needs more bookkeeping, so I guess LightGBM could outperform it on large datasets.

tt83 commented Apr 26, 2017

@huanzhang12 Is it possible to parallelize prediction on GPU?

huanzhang12 (Contributor Author)

@tt83 Yes, I think it is possible. In fact, we could generate OpenCL code (similar to what #469 does) and run it on the GPU for prediction.
