-
-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
integrate Intel TBB #1376
integrate Intel TBB #1376
Conversation
…ure/intel-tbb-init
…stable/2017-11-14)
stan/math/prim/scal.hpp
Outdated
@@ -78,6 +79,7 @@ | |||
#include <stan/math/prim/scal/fun/identity_free.hpp> | |||
#include <stan/math/prim/scal/fun/if_else.hpp> | |||
#include <stan/math/prim/scal/fun/inc_beta.hpp> | |||
#include <stan/math/prim/scal/fun/init_threadpool_tbb.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason these sit in prim/scal/fun? Maybe we should have a backend
folder where we put both TBB and OpenCL stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to suggestions. Another option could be stan/math/core
which is motivated by rev/core
- but I am really open to where to stick this; I thought prim/scal/fun
reflects the right dependency which is why I put it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think core is fine. I imagine there wont be much "TBB core" code outside of this which would justify making another folder. Other TBB code is going to live either inside of Stan code or specific stan math functions right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way this is a review-time detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The map_rect
which uses the TBB will, of course, stay where it is. So not much other TBB code (if any) will go into stan/math/core
. It just needs to be included always.
// 2. Consider to switch from std::thread::hardware_concurrency() to | ||
// tbb::task_scheduler_init::default_num_threads() once TBB is | ||
// mandatory. | ||
// 3. pull out of internal? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you need to use get_num_threads outside of this file? If so then yeah probs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future I can imagine that client code will want to find out about STAN_NUM_THREADS
and at that point it would be good to have this in the stan::math
namespace... but maybe that is not needed and we can keep it where it is.
inline int get_num_threads(int num_jobs) { | ||
int num_threads = 1; | ||
#ifdef STAN_THREADS | ||
const char* env_stan_num_threads = std::getenv("STAN_NUM_THREADS"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote: We should start thinking about how to add the number of threads and the OpenCL devices so they are flags at runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe... not sure. There are pros and cons for both ways (runtime environment vs interface configuration). I am continuing here the system we started for map_rect
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(moving all my comments up to here so that it's easier for others to read the code from the github page)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 50: personally, especially with multiple if elses, I like having brackets to show leaving and entering a scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the docs for hardware_concurrency
Number of concurrent threads supported. If the value is not well defined or not computable, returns 0
Should we have a check here and if the returned value is zero return 1?
Also does this return the number of logical or physical cores? I think previously we saw that hyperthreading etc was not that useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The get_num_jobs
function is right now used inside map_rect
, but its use will go away there once we switch over map_rect
to use the TBB.
When we switch over map_rect
to the TBB, we can then include this function inside the init_threadpool_tbb.hpp
file, sure. Then we should also remove the num_jobs
argument which is really not needed. The get_num_threads
should only parse the STAN_NUM_THREADS
and that's it.
The value -1
means "use all cores". I would change this to then return the TBB default for that. As I recall that is without hyper-threading as I saw that last time.
* | ||
* @param num_jobs number of jobs | ||
* @return number of threads to use | ||
* @throws std::runtime_error if the value of STAN_NUM_THREADS env. variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more of a final review-style comment than what you need right now, but this throws an std::invalid_argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, you are right... will fix.
* @throws std::runtime_error if the value of STAN_NUM_THREADS env. variable | ||
* is invalid | ||
*/ | ||
inline bool init_threadpool_tbb(tbb::stack_size_type stack_size = 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens in someone calls this more than once? Should this throw, return silently without changing anything? Is there a TBB API call you can make to check whether the threadpool was initialized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that what is_active does? Should we return immediately if is_active is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment 1: First note that the tbb::task_scheduler_init
is static. So you ever create a single instance of this thing. Any further call is basically ignored. This is in line with the TBB way of working in this case. Repeated initialisation is not possible. You do it once and that's it - no more reconfiguration after that. If you want to have differently sized working areas then you can use so-called task_arenas
.
Comment 2: Why should we return immediately? In case the static init object is active, then this one was used (and the parsing has worked) so that we will just return again true
if it was true before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple quick comments, I'll look over the tests more tonight or tomorrow
* @throws std::runtime_error if the value of STAN_NUM_THREADS env. variable | ||
* is invalid | ||
*/ | ||
inline bool init_threadpool_tbb(tbb::stack_size_type stack_size = 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I'm clear, stack_size
set the amount of memory each thread has at initialization right? Is there a reason to say 0 as the default and not something like 64 just to give them a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0 let's the TBB choose that initial size. We should leave the default here as is, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah totally missed that in the docs my bad
|
||
static tbb::task_scheduler_init tbb_scheduler(tbb_max_threads, stack_size); | ||
|
||
return tbb_scheduler.is_active(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From docs
By default, Intel TBB 2.2 automatically creates a task scheduler the first time that a thread uses task scheduling services and destroys it when the last such thread exits.
Does this function have a side effect somewhere that actually turns on some global scheduler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The side effect is initialisation of the threadpool with the scheduler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add that as a @note in the doxygen
namespace stan { | ||
namespace math { | ||
|
||
static std::mutex cout_mutex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For stuff like this I find it helpful to have a // FIXME(DELETE)
somewhere just so I remember to delete this stuff
static std::mutex cout_mutex; | ||
|
||
// initialize AD tape for TBB threads | ||
struct ad_tape_observer : public tbb::task_scheduler_observer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General docs would be nice. With this stuff especially I, Rok, pretty much any other reviewer will be reading a lot of these methods and schemes for the first time. So a little doc here is nice to lay down intent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure... I was in a hurry yesterday night... can add some URL refs to TBB doc, for example.
} | ||
|
||
private: | ||
std::unordered_map<std::thread::id, ChainableStack*> ad_tape_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
You can use
using
to reuse this type in the code for a little nicer readability -
Why not a vector pair? I've only really found unordered_map to actually be faster when you have a pretty large N
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unordered_map is just a convenient way to have an index set which is indexed with std::thread_id
keys. Speed is totally irrelevant here (this is called just once per created thread for the entire life-time of the threading in the pool). So I want to use std::thread_id
as a key.
From a design perspective I agree with how this is done. I dont think you can make a more clean solution here. There are minor review-time issues with braces, docs and such but I think Steve has already flagged most of those. Initializing the threadpool is straightforward now with another layer and the details of get_num_threads hidden away. The open question there is what to do if someone calls the init threadpool more than once. The chainable stack is also pretty straightforward, with the only question, I think, being whether unordered map is the way to go or not. Can we write any sort of tests for the get_num_threads function. That is going to be tricky probably. |
How much additional work is changing map_rect going to be? I am starting to think this is so close we need to merge the Cmdstan and Stan Math TBB PRs, the former does still need some work. Otherwise reviewing and preparing all these branches is going to be a huge pain for everyone. For the sake of complying with our agreement made at the Stan Math meeting, I am stating here that take care of reverting all PRs if we, for some reason, fail to meet the demands we set . |
There are three cases which will lead to no effect for a call to
This is the default behaviour of the TBB. I don't see why we should deviate from that. I can add some URLs to the TBB doc if that helps?
For the observer: I want to use I can certainly add the I will address these things later tonight. Let me know if my responses don't make sense / you think differently / I misunderstood. With this PR I would also like to update the wiki page which talks about threading. |
Also it looks like the design here differs from the design doc's ScopedChainableStack? |
What do you think differs? I am relatively sure that things are consistent with one another (maybe not the same... the parallel design RFC is months old). |
I just wanted to double check what would happen. A line in the doc stating something like "calling this after the threadpool has already been initialized will have no effect" would suffice in my opinion.
👍
I would do it here. If you say the map_rect changes are small, I think adding them here makes sense. The only non-trivial thing here is the observer code. |
Adding the
Performance testing? Ok... if you think its needed. Cool! This thing is landing! |
Yes, do that separately.
Yes, I think it would be great to have them here, this is the first thing with TBB. We have the scripts ready and it should not take that much time. I would take it more as an additional advertisement for TBB in Stan. I will prepare the scripts, I think that is reasonable as I am the one requesting the tests and to take some burden of of you.
🚀 |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.96) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.0) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.01) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.01) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.04) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.99) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.98) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.97) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 0.77) |
Why does this keep going off? |
We are running a cmdstan PR with a submodule reference to this PR. And that also sends the cmdstan perf results here. |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.03) |
I am restarting this again. It got stuck in the Stan upstream part with
and there was no other way than simply restarting... |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.01) |
Again the same error? Is there a Jenkins / Windows problem or a problem on this branch? |
Jenkins/Windows. I restarted the Always run part 2 now. |
Will this ever go through? I hope this time we go through the Windows stuff... |
ok let me merge develop into this one since the allow spaces went in. |
…math into feature/intel-tbb-init
Windows seems to be fixed now, shouldn't happen again! @wds15 It failed at the testing stage ( because of Jenkins ), restarted it here: https://jenkins.mc-stan.org/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1376/34/pipeline/ |
Yeah, but it still points to PR-744 that was merged and doesnt exist anymore. So it will def fail again. It seems that Windows Jenkins is stable now so hopefully this is all over in the morning CET time. |
thanks for restarting! I need to merge now develop into this... but looking at stan downstream tests it looks to me as if we need there possibly as well the tbb.dll copy thing? |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.01) |
(stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan, 1.01) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets get this merged! Thank you @wds15!!
Summary
This PR integrate the Intel TBB with stan-math. Specifically this PR ensures the TBB's threadpool initialization and it switches
map_rect
C++11 threads implementation over to use thetbb::parallel_for
(which has scheduling and a more efficient threadpool).In order to keep the Intel TBB optional in the next release, the TBB code is only used whenever
STAN_THREADS
is set. This makes the TBB only mandatory whenever threads are being used. The Intel TBB will become mandatory in the near future regardless of theSTAN_THREADS
use (which has licensing implications for GPL-2 projects using Stan-math).The initialisation of the TBB happens in two places:
General initialisation of the maximal concurrency level in the TBB threadpool. This is handled by the new
init_threadpool_tbb
function defined instan/math/prim/core/init_threadpool_tbb.hpp
. The number of threads used is defined in theSTAN_NUM_THREADS
environment variable.For all TBB worker threads, the AD tape is initialised through an observer which is registered with the Intel TBB. Every time a worker thread enters the threadpool, the observer is called and ensures proper initialisation of the AD tape resource automatically. See
stan/math/rev/core/init_chainablestack_tbb.hpp
.Note: This PR is a branch from the
feature/intel-tbb-lib
such that it includes these changes in addition to the actual changes introduced in addition. As reference branch thefeature/intel-tbb-lib
has been selected to make it easier to follow the actual changes introduced relative to the base branch.Performance Tests
All reported times are in seconds.
Tests
test/unit/math/prim/core/init_threadpool_tbb_test.cpp
tests initialisation of the threadpooltest/unit/math/prim/core/init_threadpool_tbb_late_test.cpp
tests initialisation of the threadpool in case a c++ user has already initialized the scheduler (in which case nothing happens)test/unit/math/rev/mat/functor/gradient_test.cpp
tests if the threaded gradient code runs fine when using the Intel TBB as backend. This tests if the effects of AD tape initialisation have taken place.Side Effects
The initialisation of the TBB threadpool is only optional. That is, should client code not call the
init_threadpool_tbb
function, then the first execution of any TBB code will trigger default initialisation of the TBB threadpool which is the default behaviour of the TBB. In case the client code initialises first the TBB through explicit instantiation of thetbb::task_scheduler_init
interface, then a subsequent call to theinit_threadpool_tbb
has no effect - this is again the default behaviour of the TBB (only the very first call is honoured).For interfaces it is recommended to make a call to
stan::math::init_threadpool_tbb()
prior to using Stan-math. This will ensure the proper initialisation of the threadpool and ensure that no moreSTAN_NUM_THREADS
threads are running in the threadpool.Todo
init_threadpool_tbb
stan::math::get_num_threads
(moved tostan/math/prim/scal/fun/get_num_threads.hpp
), see comments withinChecklist
Math issue #(issue number)
Copyright holder: Sebastian Weber
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
./runTests.py test/unit
)make test-headers
)make doxygen
)make cpplint
)the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested