Return the thread limit stored in TBB instead of the local _threadLimit value #1368
Conversation
Return the thread limit stored in TBB instead of the local _threadLimit value when checking the max concurrency value. These two should be equal except in cases like running within Houdini where the TBB thread limit is set without explicitly telling USD that the value has been set.
Filed as internal issue #USD-6433
Thanks for this, @marktucker. Unfortunately, I don't think we can quite take it as-is. The issue is that WorkGetConcurrencyLimit() is meant to return the global configuration value, and the code you have will potentially return a different value per caller depending on the limits in the calling arena (for example). One question is: why is this harming you, and could you just call tbb::this_task_arena::max_concurrency() yourself? In discussing this with Alex, we're guessing that the issue is probably code that we're calling internally where we erroneously call WorkGetConcurrencyLimit(). In searching for these spots, I found this call site: https://github.com/PixarAnimationStudios/USD/blob/release/pxr/base/work/arenaDispatcher.cpp#L51, which I suspect might be the culprit of your problems. Fixing this is unfortunately not as straightforward (not that it's particularly difficult either, it's just not trivial), but before we go any further, we wanted to double-check with you to see where your issues are coming from.
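To make the arena-relative behavior concrete, here is a minimal sketch (not from the PR; it assumes classic TBB with tbb::task_arena available) showing how the same query can report different values depending on the calling arena:

```cpp
// this_task_arena::max_concurrency() is arena-relative: two callers can see
// different values even though the global configuration has not changed.
#include <tbb/task_arena.h>
#include <cstdio>

int main()
{
    // Outside any explicit arena: reflects the default/global scheduler.
    std::printf("default arena: %d\n", tbb::this_task_arena::max_concurrency());

    // Inside a 2-slot arena, the same query reports 2.
    tbb::task_arena limited(2);
    limited.execute([] {
        std::printf("2-slot arena:  %d\n", tbb::this_task_arena::max_concurrency());
    });
    return 0;
}
```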
I may get some of this wrong, but the basic issue is that when the user starts Houdini with "-j2" on a 16-core machine, we never want Houdini to use more than 2 threads. We enforce this within Houdini by using tbb::task_scheduler_init(2), just like WorkSetConcurrencyLimit does. In order to prevent the USD library from using 16 threads, I believe we need all situations where USD currently calls WorkGetConcurrencyLimit to return 2, not 16 (work/loops.h, work/reduce.h, work/arenaDispatcher.cpp, usdAbc/alembicReader.cpp, hdxPrman/context.cpp, etc). Otherwise each of those bits of code has the opportunity to run on 16 threads, unless I'm misunderstanding something (which is certainly possible). Or some code may take the "multithreaded" path even with "-j1" specified. I accept your objection that tbb::this_task_arena::max_concurrency() may return different values in some scenarios, though I don't know TBB well enough to understand it (I filed this PR largely under the guidance of @e4lam, who can maybe chip in here). But I don't think that just changing arenaDispatcher.cpp will solve the problem we are attempting to solve with this PR. The other thing we considered doing in place of this PR was adding a way to set _threadLimit without doing all the other stuff that WorkSetConcurrencyLimit does that leads to trouble. But that would involve changing the API and really muddying the waters about what function an application should be calling to set USD's thread limit.
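A minimal sketch of the mismatch being described, assuming classic TBB (the kHostThreadLimit name and the cached hardware count are illustrative stand-ins, not Houdini's or USD's actual code):

```cpp
#include <tbb/task_scheduler_init.h>
#include <tbb/task_arena.h>
#include <thread>
#include <cstdio>

int main()
{
    const int kHostThreadLimit = 2;                  // e.g. launched with -j2
    tbb::task_scheduler_init init(kHostThreadLimit); // host limits TBB to 2 threads

    // A library that cached the hardware count at startup still believes "16".
    const unsigned cachedLimit = std::thread::hardware_concurrency();

    // The limit TBB will actually honor for parallel work on this thread.
    const int effectiveLimit = tbb::this_task_arena::max_concurrency();

    std::printf("cached: %u, effective: %d\n", cachedLimit, effectiveLimit);
    return 0;
}
```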
Okay, got it -- thanks @marktucker, that makes a lot of sense, but I think it potentially highlights a bigger problem that we have. When you pass -j2 to Houdini we would absolutely expect to be able to set up USD to respect that limit (in fact we have a handy WorkSetConcurrencyLimitArgument that has some smarts about how to parse those arguments so that our applications are all consistent -- but that's neither here nor there; the point is that we need to make this possible). In the case you describe we definitely want to set up USD globally, so the solution of just using tbb::this_task_arena::max_concurrency() in the areas you call out is not quite right either. WorkSetConcurrencyLimit is intended for this and should be the right thing to call. Now we have to make it work for you. Can you arrange to have the Houdini scheduler initialized before invoking USD? From the TBB documentation: […]
We struggled with this code a bit to get it right in the beginning when running in Maya and in general for our apps, so it's entirely possible we're not quite there yet with the ideal solution.
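As a rough sketch of the ordering being suggested, assuming the existing pxr/base/work API (the -j parsing is elided and the jobArg value is a placeholder):

```cpp
#include "pxr/pxr.h"
#include "pxr/base/work/threadLimits.h"

PXR_NAMESPACE_USING_DIRECTIVE

int main(int argc, char *argv[])
{
    // Suppose the host application parsed "-j2" into jobArg (parsing not shown).
    const int jobArg = 2;

    // Tell USD the global limit before any TBB work is spawned, so USD's own
    // scheduler setup and the host agree on the same process-wide limit.
    WorkSetConcurrencyLimitArgument(jobArg);

    // From here on, WorkGetConcurrencyLimit() reports the configured limit.
    // ... run the rest of the application ...
    return 0;
}
```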
@c64kernal Hey George, allow me to diverge for a bit here first. @marktucker alluded to this in his description, but allow me to expand. It's not enough even if Houdini is the first one to create a tbb::task_scheduler_init […]
The reason for shutting down the TBB worker threads prior to […]

Given the frailty of TBB on this aspect, I would really prefer if libraries like USD never created […]

The second issue that we ran into here is that it seems that even if Houdini created the first […]

I'm sure it's because I'm not seeing all the use cases, but why does USD need to use tbb::task_arena […]

Another less radical approach to consider (which we never tried) is to replace all appropriate calls to […]
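For context on the "first one wins" concern, a minimal sketch assuming classic (pre-oneTBB) TBB, where the first live tbb::task_scheduler_init fixes the worker count and a later request cannot raise it while the first is still alive:

```cpp
#include <tbb/task_scheduler_init.h>
#include <tbb/task_arena.h>
#include <cstdio>

int main()
{
    tbb::task_scheduler_init host(2);    // host app initializes first: 2 threads
    tbb::task_scheduler_init library(8); // a library's later request has no effect

    // Still reports the host's limit of 2 while both objects are alive.
    std::printf("effective: %d\n", tbb::this_task_arena::max_concurrency());
    return 0;
}
```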
Thanks very much for the great explanation, @e4lam -- I think we're going to need a bit (more) time to discuss and re-evaluate how we do this based on the scenarios you've given. To give a bit of context, part of the reason we end up pooling the creation of the […]

In some initial discussions I was thinking exactly along the lines of […]
@c64kernal I think we're on the same page here but to clarify some things.
I did not mean to imply this. Because […]

The other dimension here is whether we really need […]

EDIT: I've now raised a question regarding the TBB issue here.
Thanks again @e4lam -- I think we're totally on the same page. The […]
See #1368 (Internal change: 2160904)
See #1368 (Internal change: 2161374)
…ature that lets us build sensibly composable parallelism. See #1368 (Internal change: 2161405)
See #1368 (Internal change: 2161891)
See #1368 (Internal change: 2161961)
See #1368 (Internal change: 2161962)
… enqueued tbb task. See #1368 (Internal change: 2162128)
See #1368 (Internal change: 2162130)
preparation for removing WorkArenaDispatcher. See #1368 (Internal change: 2162199)
See #1368 (Internal change: 2162349)
…llelism & WorkDispatcher. See #1368 (Internal change: 2162390)
Hey guys, just want to summarize where we are on all of this.

First, USD will not create a task_scheduler_init unless something calls WorkSetConcurrencyLimit(). Exactly one thing in USD does that: the usdview application.

Second, I've removed usage of tbb::task_arenas entirely from USD.

Third, I've changed the WorkDetachedTask code to use its own single worker thread (or no threads if we have no concurrency) instead of calling tbb::task::enqueue().

The thing that I think we're still missing is that, as far as I know, we cannot determine what TBB's notion of the thread limit is. For example, if Houdini creates a task_scheduler_init with N workers, I haven't found a way for USD to determine that N. For the most part, USD doesn't care unless N is 1. Even then the effect will be that the WorkDetachedTask thing will create a single worker thread when it ideally wouldn't, and we will set the number of Ogawa streams to a default of 4 in the Alembic plugin. The other effects are just a couple of places where we avoid the overhead of creating tasks if we know we are meant to be single-threaded, but that's not a huge deal. But if there was a way to determine the thread limit set by another entity, we could tighten this up.
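The single-worker approach mentioned above can be illustrated with a generic sketch (this is not USD's actual WorkDetachedTask implementation; DetachedTaskRunner and its members are hypothetical names): background tasks go into a queue serviced by one dedicated std::thread rather than being enqueued into the shared TBB scheduler.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class DetachedTaskRunner {
public:
    DetachedTaskRunner() : _stop(false), _worker([this] { _Run(); }) {}

    ~DetachedTaskRunner() {
        { std::lock_guard<std::mutex> lock(_mutex); _stop = true; }
        _cv.notify_one();
        _worker.join();  // drain and finish outstanding tasks before destruction
    }

    // Queue a fire-and-forget task for the dedicated worker thread.
    void Post(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(_mutex); _tasks.push(std::move(task)); }
        _cv.notify_one();
    }

private:
    void _Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return _stop || !_tasks.empty(); });
                if (_stop && _tasks.empty()) {
                    return;
                }
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();  // run outside the lock so Post() never blocks on user code
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _tasks;
    bool _stop;
    std::thread _worker;
};
```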
@gitamohr thanks for working on this!
I'm unclear on what you mean here. Why does tbb::this_task_arena::max_concurrency() […]
PS. I should also mention that we're moving to […]
Oh geez, my brain kept reading that as the task_scheduler_init::default_num_threads() thing... Yes, I'll get things working with this_task_arena::max_concurrency(). Then I hope we'll be all set here!
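The distinction behind the mix-up, in a minimal sketch assuming classic TBB: default_num_threads() reports the hardware-based default regardless of any active limit, while this_task_arena::max_concurrency() reflects the limit actually in effect for the calling thread.

```cpp
#include <tbb/task_scheduler_init.h>
#include <tbb/task_arena.h>
#include <cstdio>

int main()
{
    tbb::task_scheduler_init init(2);  // e.g. the host's -j2 limit

    // Hardware-based default, unaffected by the explicit limit above (e.g. 16).
    std::printf("default_num_threads: %d\n",
                tbb::task_scheduler_init::default_num_threads());

    // The limit the scheduler is actually honoring for this thread (2).
    std::printf("max_concurrency:     %d\n",
                tbb::this_task_arena::max_concurrency());
    return 0;
}
```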
Thank you @e4lam and @gitamohr -- thanks to Alex's work, this should no longer be a problem in the next release. Going to close this out for now. Ed, if you're still running into any problems, please let us know! (You can check out the new work in dev sometime next week, too, to get an early peek.) Thanks all!
…. Our understanding is that this is not important anymore in newer alembic versions and we want to be conservative with default thread usage. See #1368 (Internal change: 2165472)
See #1368 (Internal change: 2165477)
…rrencyLimit. Fixes #1368 (Internal change: 2165478)
Return the thread limit stored in TBB instead of the local _threadLimit value when checking the max concurrency value. These two should be equal except in cases like running within Houdini where the TBB thread limit is set without explicitly telling USD that the value has been set.
Description of Change(s)
Houdini (and I suspect other DCCs that use TBB) has its own code for setting the TBB max concurrency in the global scheduler object. Houdini therefore avoids calling WorkSetConcurrencyLimit, so that the USD library does not also configure (and hold a pointer to) the global scheduler. Work_InitializeThreading does get called, which sets _threadLimit based on the available hardware.
This almost works fine, except that WorkGetConcurrencyLimit returns the _threadLimit variable when using a TBB task arena. So algorithms that use task arenas will still use the hardware-limited number of threads.
This change proposes to return tbb::this_task_arena::max_concurrency() from this function instead of _threadLimit. This function will return the limit from the global scheduler object (which is exactly what Houdini needs it to do). And for apps that do call WorkSetConcurrencyLimit to set the thread limit, the return value from this function should match _threadLimit, and so there should be no change in behavior.
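Concretely, a simplified sketch of the substitution described above (the real WorkGetConcurrencyLimit in pxr/base/work carries more context; this only shows the proposed return value):

```cpp
#include <tbb/task_arena.h>

unsigned
WorkGetConcurrencyLimit()
{
    // Report the limit TBB is actually honoring for the calling context,
    // rather than the locally cached _threadLimit value.
    return static_cast<unsigned>(tbb::this_task_arena::max_concurrency());
}
```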