
Return the thread limit stored in TBB instead of the local _threadLimit value #1368

Conversation

marktucker
Contributor

Return the thread limit stored in TBB instead of the local _threadLimit
value when checking the max concurrency value. These two should be equal
except in cases like running within Houdini where the TBB thread limit is
set without explicitly telling USD that the value has been set.

Description of Change(s)

Houdini (and, I suspect, other DCCs that use TBB) has its own code for setting the TBB max concurrency on the global scheduler object. Houdini therefore avoids calling WorkSetConcurrencyLimit, so that the USD library doesn't also configure (and hold a pointer to) the global scheduler. Work_InitializeThreading does get called, however, which sets _threadLimit based on the available hardware.

This almost works, except that WorkGetConcurrencyLimit returns the _threadLimit variable even when running under a TBB task arena. So algorithms that use task arenas will still use the hardware-limited number of threads.

This change proposes to return tbb::this_task_arena::max_concurrency() from this function instead of _threadLimit. That call returns the limit from the global scheduler object, which is exactly what Houdini needs it to do. And for apps that do call WorkSetConcurrencyLimit to set the thread limit, the return value should match _threadLimit, so there should be no change in behavior.
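For concreteness, a minimal sketch of what the proposed change amounts to (the real function lives in pxr/base/work/threadLimits.cpp; everything beyond the public WorkGetConcurrencyLimit signature is illustrative):

```cpp
#include <tbb/task_arena.h>

// Sketch only: return TBB's current view of the limit rather than the
// cached _threadLimit value.
unsigned WorkGetConcurrencyLimit()
{
    // When WorkSetConcurrencyLimit() was used, this agrees with
    // _threadLimit; when the host app configured TBB directly (as
    // Houdini does), it picks up the host's limit instead.
    return static_cast<unsigned>(tbb::this_task_arena::max_concurrency());
}
```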

@jilliene

Filed as internal issue #USD-6433

@c64kernal
Contributor

Thanks for this, @marktucker. Unfortunately, I don't think we can quite take it as-is. The issue is that WorkGetConcurrencyLimit() is meant to return the global configuration value, and the code you have could return a different value per caller, depending (for example) on the limits of the calling arena. One question is why this is hurting you, and whether you could just call tbb::this_task_arena::max_concurrency() yourself. In discussing this with Alex, we're guessing that the issue is probably code we call internally that erroneously uses WorkGetConcurrencyLimit(). In searching for these spots, I found this site:

https://github.com/PixarAnimationStudios/USD/blob/release/pxr/base/work/arenaDispatcher.cpp#L51

I suspect that might be the culprit of your problems. Fixing it is unfortunately not as straightforward (not that it's particularly difficult either; it's just not trivial), but before we go any further, we wanted to double-check with you to see where your issues are coming from.
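To illustrate the objection, here is an assumed, self-contained example (not USD code) of how the same query can answer differently per call site, under classic (pre-oneTBB) TBB:

```cpp
#include <iostream>
#include <tbb/task_arena.h>
#include <tbb/task_scheduler_init.h>

int main()
{
    tbb::task_scheduler_init init(8);  // global scheduler limit: 8 threads

    // Outside any explicit arena, the query reflects the global limit.
    std::cout << tbb::this_task_arena::max_concurrency() << "\n";  // 8

    tbb::task_arena constrained(2);    // an arena capped at 2 threads
    constrained.execute([] {
        // The same query inside the arena reports the arena's cap.
        std::cout << tbb::this_task_arena::max_concurrency() << "\n";  // 2
    });
}
```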

@marktucker
Contributor Author

I may get some of this wrong, but the basic issue is that when the user starts Houdini with "-j2" on a 16-core machine, we never want Houdini to use more than 2 threads. We enforce this within Houdini by using tbb::task_scheduler_init(2), just like WorkSetConcurrencyLimit does. To prevent the USD library from using 16 threads, I believe we need every place where USD currently calls WorkGetConcurrencyLimit to return 2, not 16 (work/loops.h, work/reduce.h, work/arenaDispatcher.cpp, usdAbc/alembicReader.cpp, hdxPrman/context.cpp, etc.). Otherwise each of those bits of code has the opportunity to run on 16 threads, unless I'm misunderstanding something (which is certainly possible). Or some code may take the "multithreaded" path even with "-j1" specified.

I take your point that tbb::this_task_arena::max_concurrency() may return different values in some scenarios, though I don't know TBB well enough to understand why (I filed this PR largely under the guidance of @e4lam, who can maybe chip in here). But I don't think that just changing arenaDispatcher.cpp will solve the problem we are attempting to solve with this PR.

The other thing we considered, in place of this PR, was adding a way to set _threadLimit without doing all the other stuff that WorkSetConcurrencyLimit does that leads to trouble. But that would involve changing the API, and it would really muddy the waters about which function an application should call to set USD's thread limit.

@c64kernal
Contributor

Okay, got it, thanks @marktucker, that makes a lot of sense, but I think it potentially highlights a bigger problem that we have. When you pass -j2 to Houdini, we would absolutely expect to be able to set up USD to respect that limit (in fact we have a handy WorkSetConcurrencyLimitArgument that has some smarts about parsing those arguments so that our applications are all consistent -- but that's neither here nor there; the point is that we need to make this possible). In the case you describe we definitely want to set up USD globally, so just using tbb::this_task_arena::max_concurrency() in the areas you call out is not quite right either. WorkSetConcurrencyLimit is intended for this and should be the right thing to call. Now we have to make it work for you.

Can you arrange to have the Houdini scheduler initialized before invoking USD? From the TBB documentation:

The number_of_threads is ignored if any other task_scheduler_inits currently exist. A thread may construct multiple task_scheduler_inits. Doing so does no harm because the underlying scheduler is reference counted.

We struggled with this code a bit to get it right in the beginning when running in Maya and in general for our apps, so it's entirely possible we're not quite there yet with the ideal solution.

@e4lam

e4lam commented Dec 17, 2020

@c64kernal Hey George, allow me to diverge a bit here first. @marktucker alluded to this in his description, but allow me to expand. It's not enough even if Houdini is the first to create a tbb::task_scheduler_init object in the process. In addition, we must also destroy all tbb::task_scheduler_init objects before main() returns or std::exit() is called, in order to exit the process properly. Right before we return from main() or call std::exit(), we must call tbb::task_scheduler_init::blocking_terminate() and have it successfully shut down all TBB worker threads. The conditions for successfully calling blocking_terminate() are (quoted from the docs):

  • No active (i.e. not yet completed) application thread initialized the task scheduler automatically;
  • No active (i.e. not yet terminated) instances of class task_arena exist in the whole program;
  • blocking_terminate is called for each active instance of class task_scheduler_init, possibly by different application threads

The reason for shutting down the TBB worker threads prior to exit() is that on Windows, the system call ExitProcess() is performed at exit time, which immediately and unconditionally terminates all non-main threads before any exit callbacks or static destructors run. When this happens, some thread might be inside TBB code holding a mutex (either in the scheduler or in the tbbmalloc allocator) right before it dies. Then, when TBB unloads, its destruction code tries to acquire one of those mutexes and deadlocks the entire process. We spent years debugging this issue, where our CI regression tests on Windows would hang. After a long time going back and forth with the Intel TBB team, we finally settled on blocking_terminate(). In our builds, it is an assertion failure if blocking_terminate() fails.
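For reference, the shutdown pattern described above looks roughly like this. blocking_terminate() is a preview feature in classic TBB, and the guard macro name below is our assumption based on the TBB releases we have seen; check your TBB version:

```cpp
#define TBB_PREVIEW_BLOCKING_TERMINATE 1  // assumed preview guard macro
#include <tbb/task_scheduler_init.h>
#include <cstdlib>
#include <new>  // std::nothrow

int main()
{
    // The application (not any library) owns the scheduler.
    tbb::task_scheduler_init init(2);

    // ... all TBB work happens in here; every task_arena must be
    // destroyed before we reach the shutdown below ...

    // Shut down all TBB worker threads before main() returns, so that
    // ExitProcess() on Windows can't kill a worker while it holds a
    // scheduler or tbbmalloc mutex.
    if (!init.blocking_terminate(std::nothrow)) {
        // In Houdini's builds this case is treated as an assertion failure.
        std::abort();
    }
    return 0;
}
```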

Given TBB's frailty in this area, I would really prefer that libraries like USD never create task_scheduler_init objects and leave this to the application instead. So requirement number 1 here is that USD never creates task_scheduler_init objects. A lesser solution might be to provide a mechanism to ensure that all such objects created by USD can be explicitly destroyed via some API the application calls, but this might introduce new destruction-order issues.

The second issue we ran into is that even when Houdini created the first task_scheduler_init with a thread limit of 1, we observed that Hydra was still somehow able to run using more threads. The task_arena creation in _ArenaManager::CheckOut() that you mentioned is precisely one place we suspected might be the cause, because WorkGetConcurrencyLimit() would return a larger number. However, I don't know if we ever pinned down exactly how this extra parallel work was done. As @marktucker mentioned, WorkGetConcurrencyLimit() is used in several places, any one of which could also be problematic. We can't call WorkSetConcurrencyLimit(), because that would create a task_scheduler_init, violating requirement 1. Then we said: okay, what if the application is always the one to set the max concurrency? In that case the library should simply obey the scheduler, which the application owns, not the library. And that's how we arrived at this patch.

I'm sure it's because I'm not seeing all the use cases, but why does USD need to use task_scheduler_init at all? In my mind, this_task_arena::max_concurrency() should be respected by the library without the application needing to call WorkSetConcurrencyLimit() in the first place. If the application needs to limit the number of threads, it can and should do so by controlling task_scheduler_init itself. This approach may seem like just a workaround for the unexpected fact that task_arenas can somehow use more threads than the first task_scheduler_init object dictated. However, it has the added benefit of eliminating the need for USD to store any state at all regarding thread limits, because it would simply use the current task scheduler settings.

Another less radical approach to consider (which we never tried) is to replace all appropriate calls to WorkGetConcurrencyLimit() with something like a WorkGetEffectiveConcurrencyLimit() that returns min(WorkGetConcurrencyLimit(), tbb::this_task_arena::max_concurrency()).
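A minimal sketch of that suggested helper (the name WorkGetEffectiveConcurrencyLimit and its placement are hypothetical; only WorkGetConcurrencyLimit is existing USD API):

```cpp
#include <algorithm>
#include <tbb/task_arena.h>
#include "pxr/base/work/threadLimits.h"  // WorkGetConcurrencyLimit()

// Hypothetical helper: never report more concurrency than either the
// library's configured limit or the calling arena/scheduler allows.
unsigned WorkGetEffectiveConcurrencyLimit()
{
    return std::min(
        WorkGetConcurrencyLimit(),
        static_cast<unsigned>(tbb::this_task_arena::max_concurrency()));
}
```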

@c64kernal
Contributor

Thanks very much for the great explanation, @e4lam -- I think we're going to need a bit (more) time to discuss and re-evaluate how we do this based on the scenarios you've given.

To give a bit of context, part of the reason we ended up pooling the creation of the task_scheduler_init object in libWork is that at the time we were transitioning from an in-house threading subsystem to TBB, and we needed the two to collaborate. Getting that to work properly (especially with older versions of TBB) required some acrobatics. Calling WorkSetConcurrencyLimit from USD shouldn't be happening... it'd honestly surprise me (if you know the call stack of how that's happening, let us know). As far as I know, USD doesn't create task_scheduler_init objects and leaves it to the application to call WorkSetConcurrencyLimit, so it's effectively already doing what you say it should be doing. The real trouble comes from the fact that libWork thinks it owns the world (which in our ecosystem it kind of does, with an asterisk) and doesn't play nice with someone else setting a threading limit in TBB outside of its knowledge. There is definitely a gap here that we need to investigate more -- thank you for raising it!

In some initial discussions I was thinking exactly along the lines of WorkGetEffectiveConcurrencyLimit -- which I think would address the problem, but I feel like we want to do something cleaner to make libWork itself more consistent here. If we can't come up with something more satisfying, we'd at least have that in our back pocket I think.

@e4lam

e4lam commented Jan 13, 2021

@c64kernal I think we're on the same page here, but let me clarify some things.

Calling WorkSetConcurrencyLimit from USD shouldn't be happening... it'd be surprising to me honestly (if you know the callstack of how that's happening, let us know).

I did not mean to imply this. Because WorkGetConcurrencyLimit() was returning a larger value than what we wanted when used in _ArenaManager::CheckOut(), the first thing we tried was to have Houdini call WorkSetConcurrencyLimit(), which is the thing that backfired on us. I was trying to explain why using WorkSetConcurrencyLimit() is not currently a solution for Houdini.

The other dimension here is whether we really need WorkSetConcurrencyLimit() to do anything with task_scheduler_init at all, and maybe this requires a conference call to understand. Again, this comes from the application-centric view that the task scheduler is a resource owned by the application, not the library. If we really want things to be clean, then USD should not have this "concurrency limit" concept at all; let the application control it. The only use I see for such a concept would be to FURTHER reduce the concurrency used by USD, which is where my view comes from that all uses of the concurrency limit must be bounded by the current arena's max concurrency. Arguably, I would have expected TBB to already be limited by the task_scheduler_init limit that Houdini had set, but somehow that is thwarted by USD's current use of it.

EDIT: I've now raised a question regarding the TBB issue here

@c64kernal
Contributor

Thanks again @e4lam -- I think we're totally on the same page. The WorkSetConcurrencyLimit call is our wrapper that all our applications call into in order to create task_scheduler_init -- it's not part of USD strictly speaking, but USD uses libWork, and libWork has this deficiency we're talking about. Also, thanks for following up on the TBB forums; that'll be interesting to keep an eye on.

pixar-oss pushed commits that referenced this pull request:

  • Apr 26, 2021: See #1368 (Internal change: 2160904)
  • Apr 26, 2021
  • Apr 26, 2021: …ature that lets us build sensibly composable parallelism. See #1368 (Internal change: 2161405)
  • Apr 27, 2021
  • Apr 27, 2021
  • Apr 27, 2021
  • May 4, 2021: … enqueued tbb task. See #1368 (Internal change: 2162128)
  • May 4, 2021
  • May 4, 2021: preparation for removing WorkArenaDispatcher. See #1368 (Internal change: 2162199)
  • May 4, 2021: See #1368 (Internal change: 2162349)
  • May 4, 2021: …llelsim & WorkDispatcher. See #1368 (Internal change: 2162390)
@gitamohr
Contributor

gitamohr commented May 5, 2021

Hey guys, just want to summarize where we are on all of this. First, USD will not create a task_scheduler_init unless something calls WorkSetConcurrencyLimit(). Exactly one thing in USD does that: the usdview application. Second, I've removed usage of tbb::task_arenas entirely from USD. Third, I've changed the WorkDetachedTask code to use its own single worker thread (or no threads if we have no concurrency) instead of calling tbb::task::enqueue().
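A rough sketch of that single-worker idea (purely illustrative; it assumes nothing about the real WorkDetachedTask code beyond the description above, and the no-threads fallback for the zero-concurrency case is omitted):

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Illustrative detached-task runner: one dedicated worker thread drains
// a queue of tasks, instead of handing them to the TBB scheduler.
class DetachedTaskQueue {
public:
    DetachedTaskQueue() : _worker([this] { _Run(); }) {}

    ~DetachedTaskQueue() {
        { std::lock_guard<std::mutex> lock(_mutex); _done = true; }
        _cv.notify_one();
        _worker.join();  // drains any remaining tasks, then exits
    }

    void Enqueue(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(_mutex); _tasks.push_back(std::move(task)); }
        _cv.notify_one();
    }

private:
    void _Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return _done || !_tasks.empty(); });
                if (_tasks.empty())
                    return;  // _done was set and the queue is drained
                task = std::move(_tasks.front());
                _tasks.pop_front();
            }
            task();  // run outside the lock
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    bool _done = false;
    std::thread _worker;  // declared last: started after the members above
};
```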

The thing I think we're still missing is that, as far as I know, we cannot determine what TBB's notion of the thread limit is. For example, if Houdini creates a task_scheduler_init with N workers, I haven't found a way for USD to determine that N.

For the most part, USD doesn't care unless N is 1. Even then, the effect will be that the WorkDetachedTask thing creates a single worker thread when it ideally wouldn't, and we set the number of Ogawa streams to a default of 4 in the Alembic plugin. The other effects are just a couple of places where we avoid the overhead of creating tasks if we know we are meant to be single-threaded, but that's not a huge deal. But if there were a way to determine the thread limit set by another entity, we could tighten this up.

@e4lam

e4lam commented May 5, 2021

@gitamohr thanks for working on this!

we cannot determine what tbb's notion of the thread limit is

I'm unclear on what you mean here. Why does tbb::this_task_arena::max_concurrency() (discussed in previous comments) not fit the bill? The oneTBB migration documentation also lists some other options. I'll note that OpenVDB has taken to using it in an upcoming PR.
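Concretely, the behavior being pointed at (assumed setup; classic TBB):

```cpp
#include <iostream>
#include <tbb/task_arena.h>
#include <tbb/task_scheduler_init.h>

int main()
{
    // The host application installs its limit, e.g. Houdini under "-j2".
    tbb::task_scheduler_init hostInit(2);

    // Library code on this thread can recover that N with no extra
    // bookkeeping on the USD side.
    std::cout << tbb::this_task_arena::max_concurrency() << "\n";  // prints 2
}
```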

@e4lam

e4lam commented May 5, 2021

PS. I should also mention that we're moving to tbb::global_control instead of tbb::task_scheduler_init as the latter class is deprecated.
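For anyone following along, a minimal sketch of that replacement (tbb::global_control is the supported mechanism once task_scheduler_init is gone):

```cpp
#include <tbb/global_control.h>

int main()
{
    // Cap the whole process at 2 threads for as long as `gc` is alive;
    // the cap is lifted when `gc` is destroyed.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 2);

    // ... run TBB work under the cap here ...
}
```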

@gitamohr
Contributor

gitamohr commented May 5, 2021

Oh geez, my brain kept reading that as the task_scheduler_init::default_num_threads() thing... yes I'll get things working with this_task_arena::max_concurrency(). Then I hope we'll be all set here!

@c64kernal
Contributor

Thank you @e4lam and @gitamohr -- thanks to Alex's work, this should no longer be a problem in the next release. Going to close this out for now. Ed, if you're still running into any problems, please let us know! (You can also check out the new work in dev sometime next week to get an early peek.) Thanks all!

c64kernal closed this May 7, 2021
pixar-oss pushed commits that referenced this pull request:

  • May 8, 2021: …. Our understanding is that this is not important anymore in newer alembic versions and we want to be conservative with default thread usage. See #1368 (Internal change: 2165472)
  • May 8, 2021
  • May 8, 2021