
Initial GPU acceleration support for LightGBM #368

Merged: 97 commits merged into microsoft:master on Apr 9, 2017

Conversation

huanzhang12 (Contributor)

This set of patches adds initial GPU acceleration to LightGBM by accelerating histogram construction on GPUs. The implementation is highly modular and does not affect existing LightGBM features; the GPU kernel code is mostly independent of the rest of LightGBM, which makes long-term maintenance easier.

I add a new type of tree learner in gpu_tree_learner.h and gpu_tree_learner.cpp; all GPU interfacing code is inside these two files. Changes to other parts of LightGBM are kept minimal: I only make some small interface changes to make the necessary data structures available to the GPU tree learner. The GPU code (in folder src/treelearner/ocl/) is implemented in OpenCL and has been tested on both AMD and NVIDIA GPUs.

For the build and testing procedure, and an initial performance comparison, please see:
https://github.com/huanzhang12/lightgbm-gpu
(We probably want to move these instructions to the Wiki after merging.)

for (data_size_t i = 0; i < num_data; ++i) {
  ordered_gradients[i] = gradients[data_indices[i]];
}
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, num_data * sizeof(score_t), ptr_pinned_gradients_);
Collaborator

Why do both BeforeFindBestSplit and ConstructGPUHistogramsAsync need to copy ordered_grad and ordered_hess to the GPU?

Contributor Author

BeforeFindBestSplit just copies ordered_grad and ordered_hess for the smaller leaf. For the larger leaf, if subtraction is not possible, ordered_grad and ordered_hess have to be copied before launching the GPU kernel.
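For context, a minimal sketch of the histogram subtraction referred to here (the struct and function names are illustrative placeholders, not the PR's actual HistogramBinEntry code): when the parent's histogram is available, the larger leaf's histogram can be derived bin by bin from the parent and the smaller leaf, so the larger leaf's ordered_grad/ordered_hess never need to be copied.

  struct HistogramBinSketch {
    double sum_gradients;
    double sum_hessians;
    int count;
  };

  // larger[b] = parent[b] - smaller[b] for every bin b
  void SubtractHistogram(const HistogramBinSketch* parent, const HistogramBinSketch* smaller,
                         HistogramBinSketch* larger, int num_bins) {
    for (int b = 0; b < num_bins; ++b) {
      larger[b].sum_gradients = parent[b].sum_gradients - smaller[b].sum_gradients;
      larger[b].sum_hessians = parent[b].sum_hessians - smaller[b].sum_hessians;
      larger[b].count = parent[b].count - smaller[b].count;
    }
  }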

Collaborator

OK. Why not do all the copying in GPUTreeLearner::ConstructGPUHistogramsAsync?

huanzhang12 (Contributor Author), Apr 6, 2017

Because I want to start the data copy as early as possible, hopefully overlapping it with other work on the CPU and hiding the data transfer cost. Currently, in most cases the copy entirely overlaps with SerialTreeLearner::BeforeFindBestSplit. We only wait for these event futures (indices_future_, gradients_future_ and hessians_future_) immediately before launching the GPU kernel, and hopefully by then all data has been transferred to the GPU in the background, so the data transfer overhead is close to zero.
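A minimal self-contained sketch of that ordering, using std::async as a stand-in for the Boost.Compute enqueue_write_buffer_async/future machinery in the PR (the function names below are placeholders, not LightGBM APIs):

  #include <functional>
  #include <future>
  #include <vector>

  void CopyGradientsToDevice(const std::vector<float>& ordered_gradients) {
    // stand-in for the asynchronous host-to-device write
  }

  void CpuSidePreparation() {
    // stand-in for the CPU work done in BeforeFindBestSplit
  }

  void LaunchGpuHistogramKernel() {
    // stand-in for the OpenCL histogram kernel launch
  }

  void TrainOneStep(const std::vector<float>& ordered_gradients) {
    // 1. Start the transfer as early as possible.
    std::future<void> gradients_future =
        std::async(std::launch::async, CopyGradientsToDevice, std::cref(ordered_gradients));
    // 2. Do unrelated CPU work while the transfer runs in the background.
    CpuSidePreparation();
    // 3. Block only right before the kernel needs the data; if the overlap worked,
    //    this wait returns immediately and the transfer cost is hidden.
    gradients_future.wait();
    LaunchGpuHistogramKernel();
  }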

void GPUTreeLearner::ConstructHistograms(const std::vector<int8_t>& is_feature_used, bool use_subtract) {
  std::vector<int8_t> is_sparse_feature_used(num_features_, 0);
  std::vector<int8_t> is_dense_feature_used(num_features_, 0);
  for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
Collaborator

Could you change all loops over num_features_ and num_group (like https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L450-L459) to multi-threading? Otherwise it will be slow when the number of features is large.

Contributor Author

OK

  }
  // convert indices in is_feature_used to feature-group indices
  std::vector<int8_t> is_feature_group_used(num_feature_groups_, 0);
  for (int i = 0; i < num_features_; ++i) {
Collaborator

Contributor Author

OK

  }
  // construct smaller leaf
  HistogramBinEntry* ptr_smaller_leaf_hist_data = smaller_leaf_histogram_array_[0].RawData() - 1;
  bool use_gpu = ConstructGPUHistogramsAsync(is_feature_used,
Collaborator

Will it return false only when num_data == 0? The return value name use_gpu is confusing.

Contributor Author

There are some cases where the GPU will not actually be used, like num_data == 0 or when all dense features are disabled (I will add that check). In these cases ConstructGPUHistogramsAsync returns false, indicating that we don't need to call WaitAndGetHistograms; otherwise it would deadlock. I can change the variable name and add some comments to the code.
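In other words, the return value is a "was GPU work actually queued" flag rather than a general GPU-capability flag. A caller-side sketch of that contract (the names mirror the description above, the bodies are placeholders, not the merged implementation):

  bool ConstructGPUHistogramsAsyncSketch(int num_data, bool any_dense_feature_used) {
    if (num_data == 0 || !any_dense_feature_used) {
      return false;  // no kernel launched, so no results will ever arrive
    }
    // ... enqueue the data copies and the GPU histogram kernel here ...
    return true;
  }

  void ConstructHistogramsSketch(int num_data, bool any_dense_feature_used) {
    const bool gpu_work_queued = ConstructGPUHistogramsAsyncSketch(num_data, any_dense_feature_used);
    // ... build sparse-feature histograms on the CPU in the meantime ...
    if (gpu_work_queued) {
      // Safe to block here: a kernel was actually launched.
      // WaitAndGetHistogramsSketch(...);
    }
    // If gpu_work_queued is false, skipping the wait avoids blocking forever
    // on results that will never be produced.
  }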

    leaf = larger_leaf_splits_->LeafIndex();
    auto larger_best_idx = ArrayArgs<SplitInfo>::ArgMax(larger_best);
    best_split_per_leaf_[leaf] = larger_best[larger_best_idx];
  }
Collaborator

Is this duplicated code only for debugging purposes?

Collaborator

If so, can you just call SerialTreeLearner::FindBestThresholds() and only write the additional code needed for debugging?

huanzhang12 (Contributor Author), Apr 6, 2017

Yes, for now we can actually just use SerialTreeLearner::FindBestThresholds(). But we should keep in mind that later on we also want to do FixHistograms and FindBestThreshold on the GPU as well, without copying the histograms back to the CPU.

guolinke (Collaborator) commented Apr 6, 2017

@huanzhang12
I am OK with this if you fix these multi-threading issues.

@chivee Do you have any comments?

huanzhang12 (Contributor Author)

@guolinke I have added "omp parallel for" to the loops you mentioned, as well as in some other places.

I found that for the two loops in ConstructGPUHistogramsAsync, the per-iteration workload is too small, and multithreading actually harms performance there. For yahoo and epsilon, the entire training can be 20% slower (reproducibly). So I added an if clause to the OpenMP pragma, enabling parallelization only when the loop count is large enough.
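For illustration, a sketch of that pattern (the 2048 threshold below is made up for the example, not the value used in the PR):

  #include <vector>

  void SplitFeaturesByDensity(const std::vector<int8_t>& is_feature_used,
                              const std::vector<int8_t>& is_sparse,
                              std::vector<int8_t>* is_sparse_feature_used,
                              std::vector<int8_t>* is_dense_feature_used) {
    const int num_features = static_cast<int>(is_feature_used.size());
    // The if() clause keeps the loop serial when the trip count is small, where
    // OpenMP fork/join overhead would outweigh the tiny per-iteration workload.
    #pragma omp parallel for schedule(static) if (num_features >= 2048)
    for (int i = 0; i < num_features; ++i) {
      if (!is_feature_used[i]) continue;
      if (is_sparse[i]) {
        (*is_sparse_feature_used)[i] = 1;
      } else {
        (*is_dense_feature_used)[i] = 1;
      }
    }
  }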

I also need to add constant hessian optimization to the GPU kernel before merging. But the GPUTreeLearner class will mostly remain unchanged.

guolinke (Collaborator) commented Apr 7, 2017

@huanzhang12
Thanks. I also fixed some other "light" omp loops by adding an if clause.

huanzhang12 (Contributor Author)

@guolinke I have added the constant hessian optimization to the GPU kernels. I changed the TreeLearner interface a little bit: I now also pass the is_constant_hessian flag to Init, because GPUTreeLearner needs to compile different GPU code based on this flag during initialization and then prepare a lot of things before training starts. If a different is_constant_hessian value is later passed to Train, it will recompile the GPU kernels accordingly. For the other tree learners this does not matter. Do you think this is okay?
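A hypothetical sketch of that contract (class and method names are illustrative placeholders, not the exact code in this PR):

  class GPUTreeLearnerSketch {
   public:
    void Init(bool is_constant_hessian) {
      is_constant_hessian_ = is_constant_hessian;
      BuildGPUKernels(is_constant_hessian_);  // compile the matching OpenCL variants up front
    }
    void Train(bool is_constant_hessian) {
      if (is_constant_hessian != is_constant_hessian_) {
        // The flag changed after Init: rebuild the kernels before using them.
        is_constant_hessian_ = is_constant_hessian;
        BuildGPUKernels(is_constant_hessian_);
      }
      // ... run histogram construction with the currently compiled kernel set ...
    }

   private:
    void BuildGPUKernels(bool /*constant_hessian*/) {
      // placeholder for compiling the OpenCL programs
    }
    bool is_constant_hessian_ = false;
  };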

huanzhang12 (Contributor Author) commented Apr 8, 2017

@guolinke I am also considering adding a TravisCI test for the GPU code, in case further updates to other parts break GPU functionality. The testing environment has no real GPU, but thanks to the good portability of OpenCL, the GPU code can be tested on a CPU device after installing a CPU OpenCL environment (like the AMD APP SDK or pocl). Ideally, all tests should pass with device set to both cpu and gpu. We don't need to change other test options.

What is the best way to globally set a parameter for LightGBM? For example, I want to set device=gpu and then run all existing testing scripts, without heavily modifying those scripts or adding device=xxx manually to every test. Theano takes an environment variable, THEANO_FLAGS, which can easily override a config option. Is there something similar in LightGBM?

guolinke (Collaborator) commented Apr 8, 2017

@huanzhang12 I think it doesn't have this feature yet, but it should be easy to add one.
The is_constant_hessian change is OK.

@wxchan Can you add this to the python package?

guolinke (Collaborator) commented Apr 8, 2017

@huanzhang12
I think the test can be done in another PR after this one is finished.

huanzhang12 (Contributor Author)

@guolinke Sure. Currently, when I run the tests manually, there is one failure when the GPU is enabled (test_continue_train_multiclass). I will fix it; after that this PR should be ready ;)

wxchan (Contributor) commented Apr 8, 2017

@guolinke OK. Do you mean adding an is_constant_hessian argument to the sklearn __init__?

guolinke (Collaborator) commented Apr 8, 2017

@wxchan No, I mean adding something like THEANO_FLAGS (an environment variable) to choose the cpu/gpu device when calling lightgbm.dll. I think it can be solved by detecting the environment variable and changing the related parameters passed to lightgbm.dll.
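A hypothetical sketch of that idea (LightGBM did not have this at the time; the LIGHTGBM_DEVICE variable name is made up for the example): read an environment variable and let it override the device parameter before the rest of the configuration is applied. The same detection could equally live in the Python wrapper before it calls lightgbm.dll.

  #include <cstdlib>
  #include <string>
  #include <unordered_map>

  void ApplyEnvOverrides(std::unordered_map<std::string, std::string>* params) {
    // e.g. LIGHTGBM_DEVICE=gpu python test_sklearn.py
    if (const char* device = std::getenv("LIGHTGBM_DEVICE")) {
      (*params)["device"] = device;  // overrides any value from the config file or API call
    }
  }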

huanzhang12 (Contributor Author)

@guolinke The filter_cnt variable passed to BinMapper::FindBin seems to be calculated in different ways in the two places it appears in dataset_loader.cpp:

At https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset_loader.cpp#L491-L492

  const data_size_t filter_cnt = static_cast<data_size_t>(
    static_cast<double>(io_config_.min_data_in_leaf * total_sample_size) / num_data);

Where total_sample_size seems to be the total number of examples;

But at https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset_loader.cpp#L709-L710

  const data_size_t filter_cnt = static_cast<data_size_t>(
    static_cast<double>(io_config_.min_data_in_leaf* sample_values.size()) / dataset->num_data_);

sample_values.size() is the number of features.

I found this issue when debugging the failing test case. I guess one of them is a typo?

guolinke (Collaborator) commented Apr 9, 2017

Oh, it is a typo. I will fix it.

huanzhang12 (Contributor Author) commented Apr 9, 2017

@guolinke I have fixed the failing test case, and I also added TravisCI tests for the GPU learner. You can do a final review and finish this PR now.

guolinke merged commit 0bb4a82 into microsoft:master on Apr 9, 2017
guolinke (Collaborator)

@huanzhang12
It seems XGBoost also supports the histogram algorithm on GPU now.
Have you had a chance to compare against it?

huanzhang12 changed the title from "[WIP] Initial GPU acceleration support for LightGBM" to "Initial GPU acceleration support for LightGBM" on Apr 26, 2017
huanzhang12 (Contributor Author)

@guolinke I haven't had a chance to compare with it carefully yet.
I briefly took a look at the code. The good thing is that in XGBoost everything runs on the GPU (histogram, split, update, etc.), so it could be faster than LightGBM (which only builds histograms on the GPU) when the dataset is small. However, I am not sure whether the GPU tree learner in XGBoost behaves exactly the same as the CPU one, because it basically re-implements everything.

The histogram building procedure in XGBoost is not as simple as the one in LightGBM and needs more bookkeeping, so I guess LightGBM could outperform it on large datasets.

tt83 commented Apr 26, 2017

@huanzhang12 Is it possible to parallelize prediction on GPU?

huanzhang12 (Contributor Author)

@tt83 Yes, I think it is possible. In fact, we could generate OpenCL code (similar to what #469 does) and run it on the GPU for prediction.
