[UsdGeom] Speed up extent computation. #588

Merged
1 commit merged into PixarAnimationStudios:dev on Sep 18, 2018

Conversation

superfunc
Contributor

Description of Change(s)

Parallelize extent computation using WorkParallelReduce.

This provides a non-trivial speedup in one of our file format plugins, which
needs to compute extents for every mesh it creates (roughly 30% on a 150MB
asset whose meshes account for 30% of its total prim count).
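
For readers skimming the diff, here is a minimal sketch of the shape of such a reduction, assuming the WorkParallelReduceN overload in pxr/base/work/reduce.h that takes an explicit grain size; the helper name, the grain-size parameter, and the choice to reduce a GfRange3f over a raw points array are illustrative rather than lifted from the patch.

```cpp
// Minimal sketch (not the patch itself): a parallel bounding-range reduction
// over a points array, using the WorkParallelReduceN overload that accepts
// an explicit grain size.
#include "pxr/pxr.h"
#include "pxr/base/work/reduce.h"
#include "pxr/base/gf/range3f.h"
#include "pxr/base/gf/vec3f.h"
#include "pxr/base/vt/types.h"

PXR_NAMESPACE_USING_DIRECTIVE

static GfRange3f
_ComputePointsRange(const VtVec3fArray &points, size_t grainSize)
{
    return WorkParallelReduceN(
        GfRange3f(),            // identity: an empty range
        points.size(),
        [&points](size_t b, size_t e, GfRange3f range) {
            // Fold this chunk of points into the running range.
            for (size_t i = b; i < e; ++i) {
                range.UnionWith(points[i]);
            }
            return range;
        },
        [](GfRange3f lhs, const GfRange3f &rhs) {
            // Merge the per-chunk ranges.
            lhs.UnionWith(rhs);
            return lhs;
        },
        grainSize);
}
```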

Fixes Issue(s)

  • Need to use da cores

@jtran56

jtran56 commented Aug 25, 2018

Filed as internal issue #164193.

@spiffmon
Member

@superfunc - does setting grainSize really help here? We usually wind up regretting it when we set it, because TBB does a pretty good job in general; setting it high may wind up hurting you for small/moderate cases.

@spiffmon
Member

(and the count mentioned in the documentation is 10K instructions, not 10K elements :-)

@superfunc
Contributor Author

Hey @spiffmon,

I based that on some testing I did with assets we have; here is some more data for context:

This was tested through one of our file format plugins via:
$ time usdcat asset.oldFormat --out /dir/asset.usdc

This was the only change made to the code between tests:
$ diff ~/pb_noGrainProvided.cpp ~/pb_withGrain10k.cpp

233,234c233,234
<         }//,
<         ///*grainSize=*/ 10000
---
>         },
>         /*grainSize=*/ 10000

On a 140MB animated asset:

10k grain size: ~2.3994s
No grain size provided: ~4.1104s

On a 280KB non-animated asset:

10k grain size: ~0.2598s
No grain size provided: ~0.2788s

The heavier asset pulls on this computation much more, as we are precomputing extents for each mesh at each time point.

@tobymjones

In the general testing we have done with this, an explicit grain size is most often helpful when the amount of work done per task is very small; in those cases we have sometimes found that the partitioner doesn't select a sufficiently large grain size.

@spiffmon
Member

spiffmon commented Aug 30, 2018 via email

@superfunc
Contributor Author

superfunc commented Aug 30, 2018

Some rough data I put together; let's discuss (it seems to point to 500 being a reasonable starting place).

Edit: updated the charts with more information; see the post below.

@superfunc
Contributor Author

superfunc commented Aug 30, 2018

  • Note that this data was taken on a typical VFX workstation (24-core Xeon box, RHEL7, etc.), over six runs, removing the outlier and averaging the rest.

  • The first chart describes performance on a varied set of asset conversions.

Note that the 0 column on the x-axis represents the original, single-threaded version.

[Chart: conversion times for a varied set of asset conversions at different grain sizes]

  • The second chart describes the improvement relative to single threaded performance.

[Chart: improvement relative to single-threaded performance at different grain sizes]

@tobymjones

Spiff, I understand your comment now.

10,000 was probably my fault. When Josh was originally testing this, he noted that using the default grain size of 1 is slow, and I suggested 10,000 because I know that is the upper limit of where you should go to see whether it provides a benefit (I should have then suggested that he follow up with a thorough analysis like the one he just did).

You were correct earlier about the 10k instructions (I took that directly from the Intel guidelines). Taking a quick look at the code https://godbolt.org/z/9l4ByQ (note that escape is a trick to keep the compiler from optimizing everything away), you can see that this code should generate about 20 instructions, which agrees with the data from Josh's tests.
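
(Spelling out the arithmetic, and assuming those ~20 instructions are per point processed: sizing a grain to the ~10,000-instruction guideline works out to roughly 10,000 / 20 ≈ 500 points per grain, which lines up with the ~500 value the charts above point to.)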

Yay performance fun.

@spiffmon
Member

spiffmon commented Aug 30, 2018 via email

@superfunc
Contributor Author

@spiffmon Ok, I'll update the PR to set it to 500 for now.

I will think about ways to put together some generic assets that exercise this behavior in the future.

@superfunc
Contributor Author

For posterity, and any onlookers, I also ran these tests under google/benchmark to confirm the results in a more isolated context.

https://github.com/superfunc/perf/blob/master/usd/extentComputation/results_mac.md
https://github.com/superfunc/perf/blob/master/usd/extentComputation/results_workstation.md
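
For anyone who wants to reproduce something similar locally, a self-contained microbenchmark in that spirit might look like the sketch below, driving the static UsdGeomPointBased::ComputeExtent entry point with Google Benchmark; the point counts and fill pattern are invented for illustration and are not taken from the linked results.

```cpp
// Hypothetical microbenchmark sketch (not the linked harness): measures
// UsdGeomPointBased::ComputeExtent over synthetic point arrays of varying size.
#include <benchmark/benchmark.h>

#include "pxr/pxr.h"
#include "pxr/usd/usdGeom/pointBased.h"
#include "pxr/base/gf/vec3f.h"
#include "pxr/base/vt/types.h"

PXR_NAMESPACE_USING_DIRECTIVE

static void BM_ComputeExtent(benchmark::State &state)
{
    // Build a synthetic point cloud; the pattern only needs to be non-constant.
    const size_t n = static_cast<size_t>(state.range(0));
    VtVec3fArray points(n);
    for (size_t i = 0; i < n; ++i) {
        points[i] = GfVec3f(float(i), float(i % 7), float(i % 13));
    }

    for (auto _ : state) {
        VtVec3fArray extent;
        UsdGeomPointBased::ComputeExtent(points, &extent);
        benchmark::DoNotOptimize(extent);
    }
}
// Sweep a small and a large point count, echoing the light vs. heavy assets above.
BENCHMARK(BM_ComputeExtent)->Arg(1 << 12)->Arg(1 << 20);

BENCHMARK_MAIN();
```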

Parallelize extent computation using WorkParallelReduce.

This provides a non-trivial speedup in one of our file format plugins, which
needs to compute extents for every mesh it creates (roughly 30% on a 150MB
asset whose meshes account for 30% of its total prim count).

For grain size, there was discussion regarding the value chosen here:

PixarAnimationStudios#588
@sunyab added and later removed the "in review" label on Sep 7, 2018
@pixar-oss merged commit 2405af8 into PixarAnimationStudios:dev on Sep 18, 2018
pixar-oss added a commit that referenced this pull request Sep 18, 2018
[UsdGeom] Speed up extent computation.

(Internal change: 1891887)
superfunc added a commit to superfunc/USD that referenced this pull request Oct 1, 2018
This brings ComputeExtent for transforms in line with the changes made in
PixarAnimationStudios#588. Given the similarity of the implementations,
this also abstracts them into a template function.
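
As a rough illustration of what that kind of template abstraction can look like (the names and structure here are hypothetical, not taken from the follow-up change): a single reduction parameterized on a per-point functor can serve both the untransformed and the matrix-transformed extent paths.

```cpp
// Hypothetical sketch: one parallel reduction shared by the transformed and
// untransformed extent paths, parameterized on a per-point functor.
#include "pxr/pxr.h"
#include "pxr/base/work/reduce.h"
#include "pxr/base/gf/range3f.h"
#include "pxr/base/gf/matrix4d.h"
#include "pxr/base/gf/vec3f.h"
#include "pxr/base/vt/types.h"

PXR_NAMESPACE_USING_DIRECTIVE

template <typename PointFn>
static GfRange3f
_ReducePointsToRange(const VtVec3fArray &points, PointFn &&fn, size_t grainSize)
{
    return WorkParallelReduceN(
        GfRange3f(),
        points.size(),
        [&points, &fn](size_t b, size_t e, GfRange3f range) {
            // Apply the functor (identity or transform) before accumulating.
            for (size_t i = b; i < e; ++i) {
                range.UnionWith(fn(points[i]));
            }
            return range;
        },
        [](GfRange3f lhs, const GfRange3f &rhs) {
            lhs.UnionWith(rhs);
            return lhs;
        },
        grainSize);
}

// Untransformed path: identity functor. Transformed path: apply a GfMatrix4d.
// GfRange3f local = _ReducePointsToRange(
//     points, [](const GfVec3f &p) { return p; }, /*grainSize=*/500);
// GfRange3f xformed = _ReducePointsToRange(
//     points, [&m](const GfVec3f &p) { return m.Transform(p); }, /*grainSize=*/500);
```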
@superfunc deleted the extentCompPerf branch on Oct 9, 2018