
RF: memset and batch size optimization for computing splits #4001

Conversation

@venkywonka (Contributor) commented Jun 22, 2021

  • optimization 1: increase the default maximum number of nodes that can be processed per batch (the `max_batch_size` hyperparameter)
    • This increases GPU memory usage, but for practical workloads the increase rarely exceeds 200 MB.
  • optimization 2: reduce the amount of memory touched by the memset operations per kernel call (see the sketch after the benchmark plots below)

  • This PR drastically reduces the total number of kernel invocations (while increasing the work per invocation), as well as the memory each kernel invocation must memset. This can be seen in the following plot on the `year` dataset.
    • x-axis: (with/without optimization 1) × (with/without optimization 2); y-axis: time (s)

    • CSRK = computeSplitRegressionKernel

    • ![year-nsys-kernel-and-memset-times-lite_mode-max_bach_size](https://user-images.githubusercontent.com/23023424/122897144-5b319380-d367-11eb-995f-9c05a086fc0f.png)


  • With `n_estimators: 10`, `n_streams: 4`, `max_depth: 32` (rest default), the following are the gbm-bench plots:
    • (main: branch-21.08, devel: current PR, skl: scikit-learn RF)
    • scores are accuracy for classification and MSE for regression
    • Note: scikit-learn runs with `n_jobs=-1`, so it leverages all 24 CPUs on my machine

![memset-batch-opt](https://user-images.githubusercontent.com/23023424/122897816-f88cc780-d367-11eb-9b0f-6384d4ef8cbb.png)
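To make the combined effect concrete, here is a minimal, self-contained CUDA sketch of the idea (names and sizes are illustrative, not the actual cuML internals): the histogram buffer is allocated once for the worst case, while each batch's memset covers only the bins its large nodes will actually touch.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                             \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "CUDA error: %s\n",                       \
                   cudaGetErrorString(err_));                        \
      std::exit(1);                                                  \
    }                                                                \
  } while (0)

int main() {
  // Illustrative sizes (optimization 1 raises max_batch_size so that more
  // nodes are processed per kernel launch).
  const int max_batch_size = 4096;
  const int nbins = 128, colBlks = 4, nclasses = 2;

  // Allocate the histogram buffer once, for the worst case where every node
  // in a batch is "large" (i.e. needs global-memory histograms).
  const size_t maxHistBins =
      size_t(max_batch_size) * (1 + nbins) * colBlks * nclasses;
  int* hist = nullptr;
  CUDA_CHECK(cudaMalloc(&hist, sizeof(int) * maxHistBins));

  cudaStream_t s;
  CUDA_CHECK(cudaStreamCreate(&s));

  // Optimization 2: per batch, memset only the bins the current batch's
  // large nodes will touch, instead of the entire allocation.
  const int n_large_nodes_in_curr_batch = 37;  // example value
  const size_t nHistBins =
      size_t(n_large_nodes_in_curr_batch) * (1 + nbins) * colBlks * nclasses;
  CUDA_CHECK(cudaMemsetAsync(hist, 0, sizeof(int) * nHistBins, s));

  CUDA_CHECK(cudaStreamSynchronize(s));
  CUDA_CHECK(cudaFree(hist));
  CUDA_CHECK(cudaStreamDestroy(s));
  return 0;
}
```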

@venkywonka venkywonka requested review from a team as code owners June 22, 2021 09:18
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Jun 22, 2021
@venkywonka venkywonka added Perf Related to runtime performance of the underlying code breaking Breaking change labels Jun 22, 2021
@RAMitchell (Contributor) left a comment:

Looks very nice, simple changes and awesome results.

```diff
@@ -338,13 +338,16 @@ struct Builder {
   raft::update_device(curr_nodes, h_nodes.data() + node_start, batchSize, s);

   int total_samples_in_curr_batch = 0;
+  int n_large_nodes_in_curr_batch = 0;
```
@RAMitchell (Contributor) commented:

Comment what this variable is, e.g. nodes with a number of training instances larger than the block size. These nodes require global memory for histograms.

@venkywonka (Contributor, Author) replied:

done 👍🏾
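One plausible wording for the requested annotation (hypothetical; paraphrasing the reviewer's suggestion rather than quoting the merged code):

```cpp
// Nodes in this batch whose number of training instances exceeds the
// threadblock size; these nodes accumulate their histograms in global memory.
int n_large_nodes_in_curr_batch = 0;
```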

```diff
@@ -446,7 +452,9 @@ struct ClsTraits {
   // Pick the max of two
   size_t smemSize = std::max(smemSize1, smemSize2);
   dim3 grid(b.total_num_blocks, colBlks, 1);
-  CUDA_CHECK(cudaMemsetAsync(b.hist, 0, sizeof(int) * b.nHistBins, s));
+  int nHistBins = 0;
+  nHistBins = n_large_nodes_in_curr_batch * (1 + nbins) * colBlks * nclasses;
```
@RAMitchell (Contributor) commented:

Zero-initialising this variable is unnecessary.

The `1+` in `(1+nbins)` should disappear when you merge the objective function PR.

@venkywonka (Contributor, Author) replied:

ah, that was needed in my initial prototype but I somehow missed removing it.. changed it 👍🏾
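A minimal standalone sketch of the post-review form (a hypothetical assembly of the diff and the fix above, not the literal cuML source; the bin count becomes a `const` with no redundant zero-initialisation):

```cpp
#include <cuda_runtime.h>

// Clear only the histogram bins needed by the large nodes of this batch.
inline void clearBatchHistograms(int* hist, int n_large_nodes_in_curr_batch,
                                 int nbins, int colBlks, int nclasses,
                                 cudaStream_t s) {
  const int nHistBins =
      n_large_nodes_in_curr_batch * (1 + nbins) * colBlks * nclasses;
  cudaMemsetAsync(hist, 0, sizeof(int) * nHistBins, s);
}
```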

```diff
@@ -507,7 +515,7 @@ struct RegTraits {
   */
  static void computeSplit(Builder<RegTraits<DataT, IdxT>>& b, IdxT col,
                           IdxT batchSize, CRITERION splitType,
-                          cudaStream_t s) {
+                          int& n_large_nodes_in_curr_batch, cudaStream_t s) {
```
@RAMitchell (Contributor) commented:

Why is n_large_nodes_in_curr_batch passed by reference? Is it modified somewhere?

@venkywonka (Contributor, Author) replied:

it was in my initial version, but you're right, it's unnecessary here. I have changed it to `const int` 👍🏾
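For reference, the declaration implied by combining the hunk above with the `const int` fix (a sketch; `Builder`, `RegTraits`, and `CRITERION` are the cuML types from the surrounding diff):

```cpp
// The batch statistic is read-only inside computeSplit, so the mutable
// reference becomes a plain const int.
static void computeSplit(Builder<RegTraits<DataT, IdxT>>& b, IdxT col,
                         IdxT batchSize, CRITERION splitType,
                         const int n_large_nodes_in_curr_batch,
                         cudaStream_t s);
```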

```diff
@@ -35,6 +35,7 @@ namespace DecisionTree {
 template <typename IdxT>
 struct WorkloadInfo {
   IdxT nodeid;       // Node in the batch on which the threadblock needs to work
+  IdxT large_nodeid; // counts only large nodes
```
@RAMitchell (Contributor) commented:

Make sure you comment what large nodes means.

@venkywonka (Contributor, Author) replied:

done 👍🏾
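A sketch of the documented struct (the comment wording is hypothetical; the fields come from the diff above):

```cpp
namespace DecisionTree {

template <typename IdxT>
struct WorkloadInfo {
  IdxT nodeid;        // Node in the batch on which the threadblock needs to work
  IdxT large_nodeid;  // Index counting only the "large" nodes, i.e. nodes with
                      // more training instances than a threadblock can
                      // histogram in shared memory
};

}  // namespace DecisionTree
```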

@dantegd (Member) left a comment:

codeowner approval

@dantegd (Member) commented Jun 23, 2021:

rerun tests

@dantegd dantegd added the improvement Improvement / enhancement to an existing function label Jun 23, 2021
@RAMitchell (Contributor) commented twice:

rerun tests

@codecov-commenter commented:

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@f71d369).
The diff coverage is n/a.

```
@@               Coverage Diff               @@
##             branch-21.08    #4001   +/-   ##
===============================================
  Coverage                ?   85.44%
===============================================
  Files                   ?      230
  Lines                   ?    18088
  Branches                ?        0
===============================================
  Hits                    ?    15455
  Misses                  ?     2633
  Partials                ?        0
```

| Flag | Coverage Δ |
| --- | --- |
| dask | 48.04% <0.00%> (?) |
| non-dask | 77.79% <0.00%> (?) |

Flags with carried forward coverage won't be shown.

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f71d369...70665a3.

@dantegd (Member) commented Jun 29, 2021:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 705e0df into rapidsai:branch-21.08 Jun 29, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023