Use std::thread instead of OMP for GPUs. #4302

Closed
trivialfis wants to merge 3 commits into master from fix/nthreads-vs-ngpus

Conversation

trivialfis
Member

@trivialfis trivialfis commented Mar 27, 2019

Use std::thread in ExecuteIndexShards.

  • Span for atomic write symbol.
  • Use ExecutePerDevice instead of OpenMP loop.
  • Use SaveCudaContext to make sure nothing can change master thread's device.

This decouples the nthread parameter from n_gpus. See #4162.

I added some debugging utilities in device_helpers.cuh; please let me keep them ...
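For readers unfamiliar with the pattern, here is a minimal sketch of the per-device threading described above. The names SaveCudaContext and ExecutePerDevice come from this PR, but the bodies and signatures below are illustrative only, assuming the CUDA runtime API and C++11 threads; they are not the actual XGBoost implementation.

```cpp
#include <cuda_runtime.h>
#include <functional>
#include <thread>
#include <vector>

// RAII guard: remember the caller's CUDA device and restore it on scope exit,
// so nothing run inside the scope can permanently change the master thread's device.
class SaveCudaContext {
 public:
  SaveCudaContext() { cudaGetDevice(&saved_device_); }
  ~SaveCudaContext() { cudaSetDevice(saved_device_); }
 private:
  int saved_device_ {0};
};

// Run `func(device_id)` on one dedicated std::thread per device instead of an
// OpenMP loop, so the worker threads (not the master thread) switch devices.
void ExecutePerDevice(int n_devices, std::function<void(int)> const& func) {
  SaveCudaContext guard;  // master thread's device is restored when we return
  std::vector<std::thread> workers;
  for (int d = 0; d < n_devices; ++d) {
    workers.emplace_back([&func, d]() {
      cudaSetDevice(d);  // only this worker thread changes its current device
      func(d);
    });
  }
  for (auto& t : workers) { t.join(); }
}
```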

@trivialfis
Member Author

trivialfis commented Mar 27, 2019

@hcho3 Can we upgrade the compiler at some point ... What's our convention for supported compilers?

@hcho3
Collaborator

hcho3 commented Mar 28, 2019

@trivialfis Right now, we assume that C++11 is supported. Do you want something from C++14 or C++17?

@trivialfis
Member Author

@hcho3 No. Just a compiler that has complete support for C++11. For example, you found that <regex> is not available in g++ 4.8.2, and here variadic argument capture does not work in 4.8.2. See: https://stackoverflow.com/questions/20006574/workaround-for-variadic-lambda-capture

I might be able to work around this later, but if we were to claim C++11 support, 4.8.2 might be too old.
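To make the compiler issue concrete, here is a small illustration (not code from this PR) of the pattern the Stack Overflow link discusses: capturing a parameter pack in a lambda, which g++ 4.8.2 mishandles, and a C++11-friendly workaround using std::bind that stores the arguments without a pack capture. Names like MakeTask are hypothetical.

```cpp
#include <functional>
#include <iostream>

// Valid C++11, but some forms of pack capture are rejected or miscompiled
// by g++ 4.8.2 (see the linked Stack Overflow question).
template <typename F, typename... Args>
std::function<void()> MakeTask(F f, Args... args) {
  return [f, args...]() { f(args...); };
}

// Workaround: let std::bind copy and store the arguments instead of
// capturing the pack directly in a lambda.
template <typename F, typename... Args>
std::function<void()> MakeTaskBind(F f, Args... args) {
  return std::bind(f, args...);
}

int main() {
  auto task = MakeTaskBind([](int a, int b) { std::cout << a + b << '\n'; }, 1, 2);
  task();  // prints 3
  return 0;
}
```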

@hcho3
Collaborator

hcho3 commented Mar 28, 2019

@trivialfis Got it, we should probably bump the version. Any suggestions?

@trivialfis
Member Author

@hcho3 I'm not sure what the status is in commercial deployments. Can you find an example of a currently active distribution that uses GCC 4.x? If not, I would suggest sticking to one of those "popular stable distributions" like Debian.

@hcho3
Collaborator

hcho3 commented Mar 28, 2019

@trivialfis

I would suggest sticking to one of those "popular stable distributions" like Debian.

The reason for using CentOS 6 is to maximize the compatibility of the binary wheel across Linux distributions. Otherwise, the XGBoost wheel may depend on recent versions of GLIBC and other system libraries that other distributions may lack.

That said, we can certainly upgrade the GCC version. We just need to use a later version of the devtoolset package.

@trivialfis
Member Author

@hcho3 I see. Just checked CentOS; its EOL is at the end of 2020 ...

Let's hold off, and I will try to work around it for this PR. Thanks.

@trivialfis trivialfis force-pushed the fix/nthreads-vs-ngpus branch from 8d86077 to 3f1ad96 Compare March 28, 2019 02:06
@trivialfis trivialfis requested a review from RAMitchell March 28, 2019 02:08
@trivialfis trivialfis force-pushed the fix/nthreads-vs-ngpus branch from 42001bd to e23dbea Compare March 28, 2019 20:39
@trivialfis
Member Author

One lesson I learned from this PR is that under no circumstances should one change the device in the master thread. I wish I could add a test for that.
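A test along those lines could be fairly small. Below is only a sketch of the idea, assuming GoogleTest (which the C++ tests in this repository use) and some multi-GPU code path to exercise; the test name and the code under test are placeholders, not part of this PR.

```cpp
#include <gtest/gtest.h>
#include <cuda_runtime.h>

TEST(DeviceHelpers, MasterThreadDeviceIsUnchanged) {
  int before = -1;
  int after = -1;
  ASSERT_EQ(cudaGetDevice(&before), cudaSuccess);

  // ... run the multi-GPU code path under test here ...

  ASSERT_EQ(cudaGetDevice(&after), cudaSuccess);
  EXPECT_EQ(before, after);  // the master thread must still see its original device
}
```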

@codecov-io

codecov-io commented Mar 30, 2019

Codecov Report

Merging #4302 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #4302      +/-   ##
==========================================
+ Coverage   67.82%   67.82%   +<.01%     
==========================================
  Files         132      132              
  Lines       12201    12203       +2     
==========================================
+ Hits         8275     8277       +2     
  Misses       3926     3926
Impacted Files Coverage Δ
tests/cpp/common/test_transform_range.cc 77.27% <ø> (ø) ⬆️
src/metric/elementwise_metric.cu 74.69% <ø> (ø) ⬆️
src/common/compressed_iterator.h 97.22% <ø> (ø) ⬆️
src/common/transform.h 88.88% <ø> (ø) ⬆️
tests/cpp/test_main.cc 100% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@trivialfis
Member Author

@RAMitchell

Member

@RAMitchell RAMitchell left a comment


I am happy to merge after removing the formatting changes and span-related changes.

Resolved review comments (now outdated) on: src/common/device_helpers.cuh (two threads), src/linear/updater_gpu_coordinate.cu
@trivialfis trivialfis force-pushed the fix/nthreads-vs-ngpus branch from 7efc7ab to 6b44d5b Compare April 3, 2019 20:55
@hcho3
Collaborator

hcho3 commented Apr 3, 2019

@trivialfis For some reason, one of the GPU agents failed with error "Remote call on Linux GPU slave (i-0c8b6da83669c814e) failed". I restarted the agent.

@trivialfis
Member Author

@hcho3 Thanks for helping out.

@RAMitchell Done removing unrelated changes.

@RAMitchell
Member

Let me do a performance test; then I will merge it if everything is okay.

@RAMitchell
Member

@trivialfis there is a performance regression here. Using the following command with 8x Tesla V100-SXM2-32GB:

python xgboost/tests/benchmark/benchmark_tree.py --rows 10000000 --params "{'n_gpus':8, 'debug_verbose':0}"

Time appears to have gone from around 20s to 50s. According to my recent profiling, poor scaling on multi-GPU seems related to the way we are using threads.

Another way of achieving what you are trying to do while staying with OMP would be to store the OMP global number of threads (set according to the nthread parameter) in a temporary variable, change the number of threads to the number of GPUs just for ExecuteShards, and then change it back afterwards.
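A rough sketch of that suggestion, for illustration only (the function name and the per-device body are assumptions, not the PR's code): save the current OpenMP thread count, set it to the number of GPUs around the per-device loop, then restore it.

```cpp
#include <omp.h>

void RunShards(int n_gpus) {
  int const saved_nthread = omp_get_max_threads();  // value derived from `nthread`
  omp_set_num_threads(n_gpus);                      // one OMP thread per GPU

#pragma omp parallel for schedule(static, 1)
  for (int d = 0; d < n_gpus; ++d) {
    // cudaSetDevice(d); ExecuteShard(d);  // per-device work would go here
  }

  omp_set_num_threads(saved_nthread);               // restore the global setting
}
```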

@trivialfis
Member Author

@RAMitchell I wonder what happened ... Let me see if I can reproduce the slowdown. BTW, debug_verbose is deprecated.

@trivialfis
Member Author

With #4454 it's possible to get rid of the global OpenMP parameter, so I will close this one.

@trivialfis trivialfis closed this May 17, 2019
@trivialfis trivialfis deleted the fix/nthreads-vs-ngpus branch June 10, 2019 21:16
@lock lock bot locked as resolved and limited conversation to collaborators Sep 8, 2019