[UsdGeom] Speed up extent computation. #588
Conversation
Filed as internal issue #164193.
@superfunc - does setting grainSize really help here? We usually wind up regretting it when we set it, because TBB does a pretty good job in general; setting it high may wind up hurting you for small/moderate cases.
(And the count mentioned in the documentation is 10K instructions, not 10K elements :-)
Hey @spiffmon, I based that off some testing I did on assets we have; here is some more data for context.

This was tested via one of our file format plugins. This was the only change made to the code between testing:

```diff
233,234c233,234
< }//,
< ///*grainSize=*/ 10000
---
> },
> /*grainSize=*/ 10000
```

On a 140Mb animated asset:

On a 280k non-animated asset:
The heavier asset pulls on this computation much more, as we are precomputing extents for each mesh at each time point.
In general testing we have done with this, the grainsize is most often helpful when the amount of work done per task is very small. In this case we have sometimes found that the partitioner doesn't select a sufficiently large grainsize.
Yep, thanks Toby, I get that's the intent of the grainsize, but 10k seems excessively high. Josh volunteered to try a range of values on your datasets, results of which I eagerly anticipate!

--spiff
Edit: Updated charts with more information, see post below.
Spiff, I understand your comment now. 10,000 was probably my fault: when Josh was originally testing against it he noted that using the default grainsize of 1 is slow, and I suggested 10,000 because I know that is the upper limit of where you should go to see if it provides a benefit (I should have then suggested that he follow up with a thorough analysis like the one he just did). You were correct earlier about the 10k instructions (I took that directly from the Intel guidelines). Taking a quick look at the code https://godbolt.org/z/9l4ByQ (note: escape is a trick to keep the compiler from optimizing everything away), you can see that this code should generate about 20 instructions, which agrees with the data from Josh's tests. Yay, performance fun.
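(For context on the instruction-count arithmetic: the per-point body being measured is essentially a few float min/max updates, roughly like the stand-in below, which is not the exact snippet behind the godbolt link. At roughly 20 instructions per point, Intel's ~10K-instructions-per-task guideline works out to a grain on the order of a few hundred points rather than 10,000.)

```cpp
// Rough stand-in for the per-point loop body under discussion (not the exact
// code behind the godbolt link): six float min/max updates plus the loads,
// i.e. on the order of 20 instructions per point.
#include <algorithm>

struct Vec3 { float x, y, z; };

inline void ExtendExtent(Vec3 &mn, Vec3 &mx, const Vec3 &p)
{
    mn.x = std::min(mn.x, p.x);  mx.x = std::max(mx.x, p.x);
    mn.y = std::min(mn.y, p.y);  mx.y = std::max(mx.y, p.y);
    mn.z = std::min(mn.z, p.z);  mx.z = std::max(mx.z, p.z);
}
```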
Thanks, guys. Had a discussion with Florian, our TBB expert, and he reiterated that setting grainSize is unfortunate, in theory, because "the right" value will depend on more factors than just per-task instruction count, like architecture and current load, and in theory, TBB's auto partitioner is supposed to take all these factors into account. In *practice*, we've noted what seems like a bug or deficiency, at least in the ancient version of TBB that the VFX reference platform keeps us on, that keeps the auto-partitioner from living up to its claims.

So for now, the data-driven number you determined seems good, which we can revisit in the future. It would be super-duper awesome if the datasets (possibly already in USD form?) could be contributed to a performance test, so that when we can move TBB forward, we can confidently revisit this (and other) code sites that set grainSize.

We will definitely be creating a performance test based on the (IP) USD form of the Moana Island dataset for the more extensive multithreading of the PointInstancer extent computation that Shriram just did.

--spiff
@spiffmon Ok, I'll update the PR to set it to 500 for now. I will think about ways I could put together some generic assets which exercise this behavior in the future.
For posterity, and any onlookers, I also ran these tests under google/benchmark to confirm the results in a more isolated context: https://github.com/superfunc/perf/blob/master/usd/extentComputation/results_mac.md
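A hedged sketch of what such an isolated benchmark can look like under google/benchmark is below; the benchmark name, point counts, candidate grain sizes, and the simplified flat-array extent kernel are all illustrative assumptions, not the setup behind the linked results.

```cpp
// Illustrative google/benchmark harness for comparing grain sizes in
// isolation. Everything here (names, sizes, the simplified kernel) is made
// up for the sketch and is not the setup behind the linked results.
#include <benchmark/benchmark.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <algorithm>
#include <array>
#include <cstddef>
#include <random>
#include <vector>

using Extent = std::array<float, 6>;  // minX, minY, minZ, maxX, maxY, maxZ

static void BM_ExtentGrain(benchmark::State &state)
{
    const std::size_t n = static_cast<std::size_t>(state.range(0));
    const std::size_t grain = static_cast<std::size_t>(state.range(1));

    // Random flat xyz point data so the kernel has real work to do.
    std::mt19937 rng(0);
    std::uniform_real_distribution<float> dist(-1.f, 1.f);
    std::vector<float> pts(3 * n);
    for (float &v : pts) { v = dist(rng); }

    const Extent empty = { 1e30f, 1e30f, 1e30f, -1e30f, -1e30f, -1e30f };
    for (auto _ : state) {
        Extent e = tbb::parallel_reduce(
            tbb::blocked_range<std::size_t>(0, n, grain), empty,
            [&pts](const tbb::blocked_range<std::size_t> &r, Extent acc) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) {
                    for (int c = 0; c < 3; ++c) {
                        const float v = pts[3 * i + c];
                        acc[c] = std::min(acc[c], v);
                        acc[c + 3] = std::max(acc[c + 3], v);
                    }
                }
                return acc;
            },
            [](Extent a, const Extent &b) {
                for (int c = 0; c < 3; ++c) {
                    a[c] = std::min(a[c], b[c]);
                    a[c + 3] = std::max(a[c + 3], b[c + 3]);
                }
                return a;
            });
        benchmark::DoNotOptimize(e);  // keep the result live
    }
}

// Sweep a couple of point counts against candidate grain sizes.
BENCHMARK(BM_ExtentGrain)
    ->Args({1 << 16, 1})->Args({1 << 16, 500})->Args({1 << 16, 10000})
    ->Args({1 << 20, 1})->Args({1 << 20, 500})->Args({1 << 20, 10000});

BENCHMARK_MAIN();
```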
Parallelize extent computation using WorkParallelReduce. This proved to provide nontrivial speedups in one of our file format plugins, which needed to compute extents for every mesh it was creating (in the realm of 30% on a 150MB asset with meshes accounting for 30% of its total number of prims). For grain size, there was discussion regarding the value chosen here: PixarAnimationStudios#588
Force-pushed from 0ca6402 to 2405af8.
[UsdGeom] Speed up extent computation. (Internal change: 1891887)
This brings ComputeExtent for transforms in line with the changes made in PixarAnimationStudios#588. Given the similarity of the implementations, this also abstracts them into a template function.
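A hedged sketch of the kind of abstraction this describes, using illustrative names rather than the actual ones from the change: one templated implementation, parameterized on how each point is mapped before it extends the extent, shared by the untransformed and transformed ComputeExtent variants. The body is shown serially for brevity; the real change runs it as a parallel reduction, as sketched under the PR description below.

```cpp
// Illustrative sketch of factoring the untransformed and transformed extent
// computations through one template; names and types are stand-ins, not the
// actual code from this change.
#include <algorithm>
#include <array>
#include <limits>
#include <vector>

using Point  = std::array<float, 3>;   // stand-in for GfVec3f
using Extent = std::array<Point, 2>;   // [min, max]

template <class PointMap>
Extent ComputeExtentImpl(const std::vector<Point> &points, PointMap &&mapPoint)
{
    const float inf = std::numeric_limits<float>::infinity();
    Extent e = {{ { inf,  inf,  inf }, { -inf, -inf, -inf } }};
    // Serial loop for brevity; the real implementation runs this as a
    // parallel reduction over the points.
    for (const Point &raw : points) {
        const Point p = mapPoint(raw);
        for (int c = 0; c < 3; ++c) {
            e[0][c] = std::min(e[0][c], p[c]);
            e[1][c] = std::max(e[1][c], p[c]);
        }
    }
    return e;
}

// Untransformed variant: identity mapping.
inline Extent ComputeExtent(const std::vector<Point> &points)
{
    return ComputeExtentImpl(points, [](const Point &p) { return p; });
}

// Transformed variant: run each point through a caller-supplied transform
// (e.g. a 4x4 matrix applied to the point) before extending the extent.
template <class Transform>
Extent ComputeExtent(const std::vector<Point> &points, const Transform &xf)
{
    return ComputeExtentImpl(points, [&xf](const Point &p) { return xf(p); });
}
```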
Description of Change(s)
Parallelize extent computation using WorkParallelReduce. This proved to provide nontrivial speedups in one of our file format plugins, which needed to compute extents for every mesh it was creating (in the realm of 30% on a 150MB asset with meshes accounting for 30% of its total number of prims).
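For readers following along, here is a minimal sketch of the pattern described above: a parallel min/max reduction over a point array. It is written directly against TBB's parallel_reduce (which the Work library wraps) rather than against WorkParallelReduce itself, and the type names, the helper name, and the default grain size of 500 are illustrative assumptions taken from this thread, not the PR's actual code.

```cpp
// Illustrative sketch only -- not the PR's code. A parallel min/max extent
// reduction over a point array, written against raw TBB to show the pattern
// that WorkParallelReduce wraps. The default grain size of 500 mirrors the
// value settled on in this thread.
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };   // stand-in for GfVec3f

struct Range3 {                   // stand-in for GfRange3f
    Vec3 min{ std::numeric_limits<float>::max(),
              std::numeric_limits<float>::max(),
              std::numeric_limits<float>::max() };
    Vec3 max{ std::numeric_limits<float>::lowest(),
              std::numeric_limits<float>::lowest(),
              std::numeric_limits<float>::lowest() };

    bool IsEmpty() const { return min.x > max.x; }

    void ExtendBy(const Vec3 &p) {
        min.x = std::min(min.x, p.x);  max.x = std::max(max.x, p.x);
        min.y = std::min(min.y, p.y);  max.y = std::max(max.y, p.y);
        min.z = std::min(min.z, p.z);  max.z = std::max(max.z, p.z);
    }

    void UnionWith(const Range3 &r) {
        if (!r.IsEmpty()) { ExtendBy(r.min); ExtendBy(r.max); }
    }
};

Range3 ComputeExtentParallel(const std::vector<Vec3> &points,
                             std::size_t grainSize = 500)
{
    return tbb::parallel_reduce(
        // The explicit grain size caps how finely TBB may split the range.
        tbb::blocked_range<std::size_t>(0, points.size(), grainSize),
        Range3(),                                        // identity: empty extent
        [&points](const tbb::blocked_range<std::size_t> &r, Range3 acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                acc.ExtendBy(points[i]);                 // cheap per-point body
            }
            return acc;
        },
        [](Range3 a, const Range3 &b) {                  // merge partial extents
            a.UnionWith(b);
            return a;
        });
}
```

With the default grain of 1, each chunk may contain only a handful of points of this very cheap body, which is where the auto-partitioner shortfall discussed above shows up.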
Fixes Issue(s)