[UsdGeom] Speed up extent computation. #588

Merged
1 commit merged into PixarAnimationStudios:dev on Sep 18, 2018

Conversation

superfunc
Contributor

Description of Change(s)

Parallelize extent computation using WorkParallelReduce.

This provides a non-trivial speedup in one of our file format plugins, which
needs to compute extents for every mesh it creates (roughly 30% on a 150MB
asset whose meshes account for 30% of its total prim count).
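
For readers skimming the diff, here is a minimal sketch of the shape of such a reduction, assuming the WorkParallelReduceN overload in pxr/base/work/reduce.h that takes an explicit grain size; the helper name, the grain-size parameter, and the choice to reduce a GfRange3f over a raw points array are illustrative rather than lifted from the patch.

```cpp
// Minimal sketch (not the patch itself): a parallel bounding-range reduction
// over a points array, using the WorkParallelReduceN overload that accepts
// an explicit grain size.
#include "pxr/pxr.h"
#include "pxr/base/work/reduce.h"
#include "pxr/base/gf/range3f.h"
#include "pxr/base/gf/vec3f.h"
#include "pxr/base/vt/types.h"

PXR_NAMESPACE_USING_DIRECTIVE

static GfRange3f
_ComputePointsRange(const VtVec3fArray &points, size_t grainSize)
{
    return WorkParallelReduceN(
        GfRange3f(),            // identity: an empty range
        points.size(),
        [&points](size_t b, size_t e, GfRange3f range) {
            // Fold this chunk of points into the running range.
            for (size_t i = b; i < e; ++i) {
                range.UnionWith(points[i]);
            }
            return range;
        },
        [](GfRange3f lhs, const GfRange3f &rhs) {
            // Merge the per-chunk ranges.
            lhs.UnionWith(rhs);
            return lhs;
        },
        grainSize);
}
```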

Fixes Issue(s)

  • Need to use da cores

@jtran56

jtran56 commented Aug 25, 2018

Filed as internal issue #164193.

@spiffmon
Member

@superfunc - does setting grainSize really help here? We usually wind up regretting it when we set it, because TBB does a pretty good job in general; setting it high may wind up hurting you for small/moderate cases.

@spiffmon
Member

(and the count mentioned in the documentation is 10K instructions, not 10K elements :-)

@superfunc
Contributor Author

Hey @spiffmon,

I based that on some testing I did with assets we have; here is some more data for context:

This was tested through one of our file format plugins via:
$ time usdcat asset.oldFormat --out /dir/asset.usdc

This was the only change made to the code between tests:
$ diff ~/pb_noGrainProvided.cpp ~/pb_withGrain10k.cpp

233,234c233,234
<         }//,
<         ///*grainSize=*/ 10000
---
>         },
>         /*grainSize=*/ 10000

On a 140MB animated asset:

10k grain size: ~2.3994s
No grain size provided: ~4.1104s

On a 280KB non-animated asset:

10k grain size: ~0.2598s
No grain size provided: ~0.2788s

The heavier asset pulls on this computation much more, as we are precomputing extents for each mesh at each time point.

@tobymjones

In the general testing we have done with this, an explicit grain size is most often helpful when the amount of work done per task is very small; in those cases we have sometimes found that the partitioner doesn't select a sufficiently large grain size.

@spiffmon
Member

spiffmon commented Aug 30, 2018 via email

@superfunc
Contributor Author

superfunc commented Aug 30, 2018

Some rough data I put together; let's discuss (it seems to point to 500 being a reasonable starting place).

Edit: updated the charts with more information; see the post below.

@superfunc
Contributor Author

superfunc commented Aug 30, 2018

  • Note that this data was taken on a typical VFX workstation (24-core Xeon box, RHEL7, etc.), over six runs, removing the outlier and averaging the rest.

  • The first chart describes performance on a varied set of asset conversions.

Note that the 0 column on the x-axis represents the original, single-threaded version.

[Chart: conversion times for a varied set of asset conversions at different grain sizes]

  • The second chart describes the improvement relative to single threaded performance.

[Chart: improvement relative to single-threaded performance at different grain sizes]

@tobymjones

Spiff, I understand your comment now.

10,000 was probably my fault. When Josh was originally testing this, he noted that using the default grain size of 1 is slow, and I suggested 10,000 because I know that is the upper limit of where you should go to see whether it provides a benefit (I should have then suggested that he follow up with a thorough analysis like the one he just did).

You were correct earlier about the 10k instructions (I took that directly from the Intel guidelines). Taking a quick look at the code https://godbolt.org/z/9l4ByQ (note that escape is a trick to keep the compiler from optimizing everything away), you can see that this code should generate about 20 instructions, which agrees with the data from Josh's tests.
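
(Spelling out the arithmetic, and assuming those ~20 instructions are per point processed: sizing a grain to the ~10,000-instruction guideline works out to roughly 10,000 / 20 ≈ 500 points per grain, which lines up with the ~500 value the charts above point to.)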

Yay performance fun.

@spiffmon
Member

spiffmon commented Aug 30, 2018 via email

@superfunc
Contributor Author

@spiffmon Ok, I'll update the PR to set it to 500 for now.

I will think about ways to put together some generic assets that exercise this behavior in the future.

@superfunc
Contributor Author

For posterity, and any onlookers, I also ran these tests under google/benchmark to confirm the results in a more isolated context.

https://github.com/superfunc/perf/blob/master/usd/extentComputation/results_mac.md
https://github.com/superfunc/perf/blob/master/usd/extentComputation/results_workstation.md
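
For anyone who wants to reproduce something similar locally, a self-contained microbenchmark in that spirit might look like the sketch below, driving the static UsdGeomPointBased::ComputeExtent entry point with Google Benchmark; the point counts and fill pattern are invented for illustration and are not taken from the linked results.

```cpp
// Hypothetical microbenchmark sketch (not the linked harness): measures
// UsdGeomPointBased::ComputeExtent over synthetic point arrays of varying size.
#include <benchmark/benchmark.h>

#include "pxr/pxr.h"
#include "pxr/usd/usdGeom/pointBased.h"
#include "pxr/base/gf/vec3f.h"
#include "pxr/base/vt/types.h"

PXR_NAMESPACE_USING_DIRECTIVE

static void BM_ComputeExtent(benchmark::State &state)
{
    // Build a synthetic point cloud; the pattern only needs to be non-constant.
    const size_t n = static_cast<size_t>(state.range(0));
    VtVec3fArray points(n);
    for (size_t i = 0; i < n; ++i) {
        points[i] = GfVec3f(float(i), float(i % 7), float(i % 13));
    }

    for (auto _ : state) {
        VtVec3fArray extent;
        UsdGeomPointBased::ComputeExtent(points, &extent);
        benchmark::DoNotOptimize(extent);
    }
}
// Sweep a small and a large point count, echoing the light vs. heavy assets above.
BENCHMARK(BM_ComputeExtent)->Arg(1 << 12)->Arg(1 << 20);

BENCHMARK_MAIN();
```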

Parallelize extent computation using WorkParallelReduce.

This provides a non-trivial speedup in one of our file format plugins, which
needs to compute extents for every mesh it creates (roughly 30% on a 150MB
asset whose meshes account for 30% of its total prim count).

For grain size, there was discussion regarding the value chosen here:

PixarAnimationStudios#588
@sunyab added and later removed the "in review" label on Sep 7, 2018
@pixar-oss merged commit 2405af8 into PixarAnimationStudios:dev on Sep 18, 2018
pixar-oss added a commit that referenced this pull request Sep 18, 2018
[UsdGeom] Speed up extent computation.

(Internal change: 1891887)
superfunc added a commit to superfunc/USD that referenced this pull request Oct 1, 2018
This brings ComputeExtent for transforms in line with the changes made in
PixarAnimationStudios#588. Given the similarity of the implementations,
this also abstracts them into a template function.
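
As a rough illustration of what that kind of template abstraction can look like (the names and structure here are hypothetical, not taken from the follow-up change): a single reduction parameterized on a per-point functor can serve both the untransformed and the matrix-transformed extent paths.

```cpp
// Hypothetical sketch: one parallel reduction shared by the transformed and
// untransformed extent paths, parameterized on a per-point functor.
#include "pxr/pxr.h"
#include "pxr/base/work/reduce.h"
#include "pxr/base/gf/range3f.h"
#include "pxr/base/gf/matrix4d.h"
#include "pxr/base/gf/vec3f.h"
#include "pxr/base/vt/types.h"

PXR_NAMESPACE_USING_DIRECTIVE

template <typename PointFn>
static GfRange3f
_ReducePointsToRange(const VtVec3fArray &points, PointFn &&fn, size_t grainSize)
{
    return WorkParallelReduceN(
        GfRange3f(),
        points.size(),
        [&points, &fn](size_t b, size_t e, GfRange3f range) {
            // Apply the functor (identity or transform) before accumulating.
            for (size_t i = b; i < e; ++i) {
                range.UnionWith(fn(points[i]));
            }
            return range;
        },
        [](GfRange3f lhs, const GfRange3f &rhs) {
            lhs.UnionWith(rhs);
            return lhs;
        },
        grainSize);
}

// Untransformed path: identity functor. Transformed path: apply a GfMatrix4d.
// GfRange3f local = _ReducePointsToRange(
//     points, [](const GfVec3f &p) { return p; }, /*grainSize=*/500);
// GfRange3f xformed = _ReducePointsToRange(
//     points, [&m](const GfVec3f &p) { return m.Transform(p); }, /*grainSize=*/500);
```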
@superfunc deleted the extentCompPerf branch on Oct 9, 2018