Use std::thread instead of OMP for GPUs. #4302

Closed
trivialfis wants to merge 3 commits into master from fix/nthreads-vs-ngpus

Conversation

trivialfis
Member

@trivialfis trivialfis commented Mar 27, 2019

Use std::thread in ExecuteIndexShards.

  • Span for atomic write symbol.
  • Use ExecutePerDevice instead of OpenMP loop.
  • Use SaveCudaContext to make sure nothing can change master thread's device.

This decouples the nthread parameter from n_gpus. See #4162.

I added some debugging utilities in device_helpers.cuh; please let me keep them ...
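For readers unfamiliar with the pattern, here is a minimal sketch of the per-device threading described above. The names SaveCudaContext and ExecutePerDevice come from this PR, but the bodies and signatures below are illustrative only, assuming the CUDA runtime API and C++11 threads; they are not the actual XGBoost implementation.

```cpp
#include <cuda_runtime.h>
#include <functional>
#include <thread>
#include <vector>

// RAII guard: remember the caller's CUDA device and restore it on scope exit,
// so nothing run inside the scope can permanently change the master thread's device.
class SaveCudaContext {
 public:
  SaveCudaContext() { cudaGetDevice(&saved_device_); }
  ~SaveCudaContext() { cudaSetDevice(saved_device_); }
 private:
  int saved_device_ {0};
};

// Run `func(device_id)` on one dedicated std::thread per device instead of an
// OpenMP loop, so the worker threads (not the master thread) switch devices.
void ExecutePerDevice(int n_devices, std::function<void(int)> const& func) {
  SaveCudaContext guard;  // master thread's device is restored when we return
  std::vector<std::thread> workers;
  for (int d = 0; d < n_devices; ++d) {
    workers.emplace_back([&func, d]() {
      cudaSetDevice(d);  // only this worker thread changes its current device
      func(d);
    });
  }
  for (auto& t : workers) { t.join(); }
}
```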

@trivialfis
Member Author

trivialfis commented Mar 27, 2019

@hcho3 Can we upgrade the compiler at some point ... What's our convention for supported compilers?

@hcho3
Collaborator

hcho3 commented Mar 28, 2019

@trivialfis Right now, we assume that C++11 is supported. Do you want something from C++14 or C++17?

@trivialfis
Member Author

@hcho3 No. Just a compiler that has complete support for C++11. For example, you found that <regex> is not available in g++ 4.8.2, and here variadic argument capture does not work in 4.8.2. See: https://stackoverflow.com/questions/20006574/workaround-for-variadic-lambda-capture

I might be able to work around this later, but if we were to claim C++11 support, 4.8.2 might be too old.
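To make the compiler issue concrete, here is a small illustration (not code from this PR) of the pattern the Stack Overflow link discusses: capturing a parameter pack in a lambda, which g++ 4.8.2 mishandles, and a C++11-friendly workaround using std::bind that stores the arguments without a pack capture. Names like MakeTask are hypothetical.

```cpp
#include <functional>
#include <iostream>

// Valid C++11, but some forms of pack capture are rejected or miscompiled
// by g++ 4.8.2 (see the linked Stack Overflow question).
template <typename F, typename... Args>
std::function<void()> MakeTask(F f, Args... args) {
  return [f, args...]() { f(args...); };
}

// Workaround: let std::bind copy and store the arguments instead of
// capturing the pack directly in a lambda.
template <typename F, typename... Args>
std::function<void()> MakeTaskBind(F f, Args... args) {
  return std::bind(f, args...);
}

int main() {
  auto task = MakeTaskBind([](int a, int b) { std::cout << a + b << '\n'; }, 1, 2);
  task();  // prints 3
  return 0;
}
```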

@hcho3
Collaborator

hcho3 commented Mar 28, 2019

@trivialfis Got it, we should probably bump the version. Any suggestions?

@trivialfis
Member Author

@hcho3 I'm not sure what the status is in commercial deployments. Can you find an example of a currently active distribution that uses GCC 4.x? If not, I would suggest sticking to one of those "popular stable distributions" like Debian.

@hcho3
Collaborator

hcho3 commented Mar 28, 2019

@trivialfis

I would suggest sticking to one of those "popular stable distributions" like Debian.

The reason for using CentOS 6 is to maximize the compatibility of the binary wheel across Linux distributions. Otherwise, the XGBoost wheel may depend on recent versions of GLIBC and other system libraries that other distributions may lack.

That said, we can certainly upgrade the GCC version. We just need to use a later version of the devtoolset package.

@trivialfis
Member Author

@hcho3 I see. Just checked CentOS; its EOL is at the end of 2020 ...

Let's hold off, and I will try to work around it for this PR. Thanks.

@trivialfis trivialfis force-pushed the fix/nthreads-vs-ngpus branch from 8d86077 to 3f1ad96 Compare March 28, 2019 02:06
@trivialfis trivialfis requested a review from RAMitchell March 28, 2019 02:08
@trivialfis trivialfis force-pushed the fix/nthreads-vs-ngpus branch from 42001bd to e23dbea Compare March 28, 2019 20:39
@trivialfis
Member Author

One lesson I learned from this PR is that under no circumstances should one change the device in the master thread. I wish I could add a test for that.
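A test along those lines could be fairly small. Below is only a sketch of the idea, assuming GoogleTest (which the C++ tests in this repository use) and some multi-GPU code path to exercise; the test name and the code under test are placeholders, not part of this PR.

```cpp
#include <gtest/gtest.h>
#include <cuda_runtime.h>

TEST(DeviceHelpers, MasterThreadDeviceIsUnchanged) {
  int before = -1;
  int after = -1;
  ASSERT_EQ(cudaGetDevice(&before), cudaSuccess);

  // ... run the multi-GPU code path under test here ...

  ASSERT_EQ(cudaGetDevice(&after), cudaSuccess);
  EXPECT_EQ(before, after);  // the master thread must still see its original device
}
```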

@codecov-io

codecov-io commented Mar 30, 2019

Codecov Report

Merging #4302 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #4302      +/-   ##
==========================================
+ Coverage   67.82%   67.82%   +<.01%     
==========================================
  Files         132      132              
  Lines       12201    12203       +2     
==========================================
+ Hits         8275     8277       +2     
  Misses       3926     3926
Impacted Files Coverage Δ
tests/cpp/common/test_transform_range.cc 77.27% <ø> (ø) ⬆️
src/metric/elementwise_metric.cu 74.69% <ø> (ø) ⬆️
src/common/compressed_iterator.h 97.22% <ø> (ø) ⬆️
src/common/transform.h 88.88% <ø> (ø) ⬆️
tests/cpp/test_main.cc 100% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@trivialfis
Member Author

@RAMitchell

Member

@RAMitchell RAMitchell left a comment


I am happy to merge after removing the formatting changes and span-related changes.

Resolved review comments (now outdated) on: src/common/device_helpers.cuh (two threads), src/linear/updater_gpu_coordinate.cu
@trivialfis trivialfis force-pushed the fix/nthreads-vs-ngpus branch from 7efc7ab to 6b44d5b Compare April 3, 2019 20:55
@hcho3
Collaborator

hcho3 commented Apr 3, 2019

@trivialfis For some reason, one of the GPU agents failed with error "Remote call on Linux GPU slave (i-0c8b6da83669c814e) failed". I restarted the agent.

@trivialfis
Member Author

@hcho3 Thanks for helping out.

@RAMitchell Done removing unrelated changes.

@RAMitchell
Member

Let me do a performance test; then I will merge it if everything is okay.

@RAMitchell
Member

@trivialfis there is a performance regression here. Using the following command with 8x Tesla V100-SXM2-32GB:

python xgboost/tests/benchmark/benchmark_tree.py --rows 10000000 --params "{'n_gpus':8, 'debug_verbose':0}"

Time appears to have gone from around 20s to 50s. According to my recent profiling, poor scaling on multi-GPU seems related to the way we are using threads.

Another way of achieving what you are trying to do while staying with OMP would be to store the OMP global number of threads (set according to the nthread parameter) in a temporary variable, change the number of threads to the number of GPUs just for ExecuteShards, and then change it back afterwards.
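A rough sketch of that suggestion, for illustration only (the function name and the per-device body are assumptions, not the PR's code): save the current OpenMP thread count, set it to the number of GPUs around the per-device loop, then restore it.

```cpp
#include <omp.h>

void RunShards(int n_gpus) {
  int const saved_nthread = omp_get_max_threads();  // value derived from `nthread`
  omp_set_num_threads(n_gpus);                      // one OMP thread per GPU

#pragma omp parallel for schedule(static, 1)
  for (int d = 0; d < n_gpus; ++d) {
    // cudaSetDevice(d); ExecuteShard(d);  // per-device work would go here
  }

  omp_set_num_threads(saved_nthread);               // restore the global setting
}
```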

@trivialfis
Member Author

@RAMitchell I wonder what happened ... Let me see if I can reproduce the slowdown. BTW, debug_verbose is deprecated.

@trivialfis
Member Author

With #4454 it's possible to get rid of the global OpenMP parameter, so I will close this one.

@trivialfis trivialfis closed this May 17, 2019
@trivialfis trivialfis deleted the fix/nthreads-vs-ngpus branch June 10, 2019 21:16
@lock lock bot locked as resolved and limited conversation to collaborators Sep 8, 2019