Initial GPU acceleration support for LightGBM #368
Conversation
for (data_size_t i = 0; i < num_data; ++i) {
  ordered_gradients[i] = gradients[data_indices[i]];
}
gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, num_data * sizeof(score_t), ptr_pinned_gradients_);
Why do BeforeFindBestSplit and ConstructGPUHistogramsAsync both need to copy ordered_grad and ordered_hess to the GPU?
BeforeFindBestSplit just copies ordered_grad and ordered_hess for the smaller leaf. For the larger leaf, if subtraction is not possible, ordered_grad and ordered_hess have to be copied before launching the GPU kernel.
OK. Why not do all the copies in GPUTreeLearner::ConstructGPUHistogramsAsync?
Because I want to start the data copying process as early as possible, hopefully overlapping the copy with other work on the CPU and hiding the data transfer cost. Currently, in most cases the copy overlaps entirely with SerialTreeLearner::BeforeFindBestSplit. We only wait for these event futures (indices_future_, gradients_future_ and hessians_future_) immediately before launching the GPU kernel, and hopefully by that time all data has been transferred to the GPU in the background, so the data transfer overhead is close to 0.
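For readers unfamiliar with this pattern, here is a minimal sketch of the idea in terms of the plain OpenCL C API (the diff above goes through a queue object with enqueue_write_buffer_async, so the actual types and names differ; everything below is illustrative only): enqueue a non-blocking host-to-device write, continue with CPU work, and wait on the returned event only right before the kernel launch.

```cpp
#include <CL/cl.h>

// Hypothetical illustration of overlapping a host-to-device copy with CPU work.
// `queue`, `device_gradients`, `host_gradients` and `num_data` stand in for the
// corresponding objects in the tree learner; names are not from the PR.
void CopyGradientsAndBuild(cl_command_queue queue, cl_mem device_gradients,
                           const float* host_gradients, size_t num_data) {
  cl_event copy_done;
  // Non-blocking write: returns immediately, the copy proceeds in the background.
  clEnqueueWriteBuffer(queue, device_gradients, CL_FALSE /* non-blocking */,
                       0, num_data * sizeof(float), host_gradients,
                       0, NULL, &copy_done);

  // ... do unrelated CPU work here (e.g. the rest of BeforeFindBestSplit),
  // which ideally hides the transfer latency ...

  // Block only right before the histogram kernel needs the data.
  clWaitForEvents(1, &copy_done);
  clReleaseEvent(copy_done);
  // ... enqueue the histogram kernel here ...
}
```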
src/treelearner/gpu_tree_learner.cpp
void GPUTreeLearner::ConstructHistograms(const std::vector<int8_t>& is_feature_used, bool use_subtract) {
  std::vector<int8_t> is_sparse_feature_used(num_features_, 0);
  std::vector<int8_t> is_dense_feature_used(num_features_, 0);
  for (int feature_index = 0; feature_index < num_features_; ++feature_index) {
Could you change all loops over num_features_ and num_group (like https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L450-L459 and ) to multi-threading? Otherwise it will be slow when the number of features is large.
OK
src/treelearner/gpu_tree_learner.cpp
}
// convert indices in is_feature_used to feature-group indices
std::vector<int8_t> is_feature_group_used(num_feature_groups_, 0);
for (int i = 0; i < num_features_; ++i) {
Multi-threading here as well? Refer to https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset.cpp#L437-L448
OK
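A minimal sketch of how such a per-feature loop might be parallelized with OpenMP (placeholder names, not the PR's actual code); it mirrors the dense/sparse classification loop shown earlier, where each iteration writes only to its own index and is therefore trivially parallel:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: classifying used features as dense or sparse in parallel.
void MarkFeatureKinds(const std::vector<int8_t>& is_feature_used,
                      const std::vector<bool>& is_sparse,
                      std::vector<int8_t>* is_dense_feature_used,
                      std::vector<int8_t>* is_sparse_feature_used) {
  const int num_features = static_cast<int>(is_feature_used.size());
  // Each iteration touches only index i, so no synchronization is needed.
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < num_features; ++i) {
    if (!is_feature_used[i]) continue;
    if (is_sparse[i]) {
      (*is_sparse_feature_used)[i] = 1;
    } else {
      (*is_dense_feature_used)[i] = 1;
    }
  }
}
```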
src/treelearner/gpu_tree_learner.cpp
}
// construct smaller leaf
HistogramBinEntry* ptr_smaller_leaf_hist_data = smaller_leaf_histogram_array_[0].RawData() - 1;
bool use_gpu = ConstructGPUHistogramsAsync(is_feature_used,
Does it return false only when num_data==0? The return variable name use_gpu is confusing.
There are some cases where the GPU will not actually be used, like num_data==0 or all dense features being disabled (I will add this). In these cases ConstructGPUHistogramsAsync returns false, indicating that we don't need to call WaitAndGetHistograms; otherwise it will deadlock. I can change the variable name and add some comments to the code.
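A minimal sketch of the caller pattern being described (trivial stubs, not the actual PR code): only wait for GPU results if work was actually enqueued.

```cpp
#include <cstdint>
#include <vector>

// Trivial stubs standing in for the real learner methods discussed above.
static bool ConstructGPUHistogramsAsync(const std::vector<int8_t>& is_feature_used) {
  return !is_feature_used.empty();  // pretend GPU work was enqueued if anything is used
}
static void ConstructSparseHistogramsOnCPU() {}
static void WaitAndGetHistograms() {}

void ConstructHistogramsSketch(const std::vector<int8_t>& is_feature_used) {
  // false means no GPU work was enqueued (e.g. num_data == 0, or no dense feature used).
  bool gpu_work_enqueued = ConstructGPUHistogramsAsync(is_feature_used);

  // CPU-side work (e.g. sparse features) proceeds while the GPU is busy.
  ConstructSparseHistogramsOnCPU();

  if (gpu_work_enqueued) {
    WaitAndGetHistograms();  // safe: a kernel was launched, so results will arrive
  }
  // If nothing was enqueued, waiting here would block forever (the deadlock mentioned above).
}
```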
src/treelearner/gpu_tree_learner.cpp
leaf = larger_leaf_splits_->LeafIndex();
auto larger_best_idx = ArrayArgs<SplitInfo>::ArgMax(larger_best);
best_split_per_leaf_[leaf] = larger_best[larger_best_idx];
}
Is this duplicated code only for debugging purposes?
If so, can you just call SerialTreeLearner::FindBestThresholds() and write only the additional code needed for debugging?
Yes, for now we can actually just use SerialTreeLearner::FindBestThresholds(). But we should keep in mind that later on we also want to do FixHistograms and FindBestThreshold on the GPU as well, without copying the histograms back to the CPU.
@huanzhang12 @chivee Do you have any comments?
@guolinke I have added "omp parallel for" to the loops you mentioned, as well as in some other places. I found that for the two loops in … I also need to add constant hessian optimization to the GPU kernel before merging, but the GPUTreeLearner class will mostly remain unchanged.
@huanzhang12
@guolinke I have added constant hessian optimization to the GPU kernels. I changed the interface of TreeLearner a little bit: I also pass the …
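To make the idea concrete, here is a small CPU-side sketch of what a constant-hessian histogram pass can look like (an illustration of the optimization only, not the PR's OpenCL kernel; all names are placeholders): when the hessian is a known constant, per-bin hessian sums can be derived from the bin counts instead of being accumulated per element.

```cpp
// Illustrative sketch only: with a constant hessian, each data point contributes
// the same hessian value, so the histogram pass only accumulates gradients and
// counts, and the hessian sums are derived once per bin afterwards.
struct BinEntry {
  double sum_gradients = 0.0;
  double sum_hessians = 0.0;
  int count = 0;
};

void ConstructHistogramConstantHessian(const int* bin_of, const float* gradients,
                                       int num_data, float constant_hessian,
                                       BinEntry* hist, int num_bins) {
  for (int i = 0; i < num_data; ++i) {
    hist[bin_of[i]].sum_gradients += gradients[i];
    ++hist[bin_of[i]].count;            // no per-element hessian read or add
  }
  for (int b = 0; b < num_bins; ++b) {
    hist[b].sum_hessians = hist[b].count * static_cast<double>(constant_hessian);
  }
}
```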
@guolinke I am also considering adding a TravisCI test for GPU, in case some further updates to other parts break the GPU functionality. In the testing environment there is no real GPU, but thanks to the good portability of OpenCL, it is possible to test the GPU code on a CPU device after installing a CPU OpenCL environment (like AMD APP SDK or … What is the best way to globally set a parameter for LightGBM? For example, I want to set …
@huanzhang12 I think it doesn't have this feature, but it should be easy to add one. @wxchan Can you add this to the python package?
@huanzhang12 |
@guolinke Sure. Currently, when I run the tests manually, there is one error when GPU is enabled (…
@guolinke OK. Do you mean adding an argument …
@wxchan No, I mean add something like …
@guolinke The variable … At https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset_loader.cpp#L491-L492 …
where … But at https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset_loader.cpp#L709-L710 …
I found this issue when debugging the failing test case. I guess one of them is a typo?
Oh, it is a typo. I will fix it.
@guolinke I have fixed the failing test case. I also added TravisCI tests for the GPU learner. You can make a final review and finish this PR now.
@huanzhang12
@guolinke I haven't had a chance to carefully compare with it yet. The histogram-building procedure in XGBoost is not as simple as the one in LightGBM and needs more bookkeeping, so I guess LightGBM could outperform it on large datasets.
@huanzhang12 Is it possible to parallelize prediction on GPU?
This set of patches adds initial GPU acceleration for LightGBM by accelerating histogram construction on GPUs. The implementation is highly modular and does not affect existing features of LightGBM; the GPU kernel code is mostly independent of other parts of LightGBM, so long-term maintenance is easier.
I add a new type of tree learner in gpu_tree_learner.h and gpu_tree_learner.cpp. All GPU interfacing code is inside these two files. Changes to other parts of LightGBM are kept minimal; I only make some small interface changes to make the necessary data structures available to the GPU tree learner. The GPU code (in folder src/treelearner/ocl/) is implemented in OpenCL and tested on both AMD and NVIDIA GPUs.
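As a rough orientation for reviewers, the sketch below shows the kind of state the GPU learner has to manage, using member names mentioned in this conversation (this is illustrative only; the real class in gpu_tree_learner.h appears to use Boost.Compute wrappers, judging by enqueue_write_buffer_async in the diff above, rather than the raw OpenCL handles shown here, and holds more members than this).

```cpp
#include <CL/cl.h>

// Simplified, hypothetical sketch of the GPU learner's device-side state.
class GPUTreeLearnerSketch {
 private:
  // OpenCL objects for the selected device.
  cl_context context_ = nullptr;
  cl_command_queue queue_ = nullptr;       // all copies and kernels go through this queue
  cl_kernel histogram_kernel_ = nullptr;   // built from the sources in src/treelearner/ocl/

  // Device-side buffers read by the histogram kernel.
  cl_mem device_features_ = nullptr;       // binned feature values, copied once per dataset
  cl_mem device_data_indices_ = nullptr;   // data indices of the current (smaller) leaf
  cl_mem device_gradients_ = nullptr;      // ordered gradients for the current leaf
  cl_mem device_hessians_ = nullptr;       // ordered hessians for the current leaf

  // Pinned host memory so host-to-device copies can run asynchronously.
  void* ptr_pinned_gradients_ = nullptr;
  void* ptr_pinned_hessians_ = nullptr;

  // Events for in-flight async copies; waited on just before the kernel launch.
  cl_event indices_future_ = nullptr;
  cl_event gradients_future_ = nullptr;
  cl_event hessians_future_ = nullptr;
};
```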
For the build and test procedure, and an initial performance comparison, please see:
https://github.com/huanzhang12/lightgbm-gpu
(We probably want to move these instructions to the Wiki after merging.)