Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Skip empty archetypes and tables when iterating over queries #4724

Closed
wants to merge 15 commits into from

Conversation

james7132
Copy link
Member

Objective

Speed up queries that are fragmented over many empty archetypes and tables.

Solution

Add a early-out to check if the table or archetype is empty before iterating over it. This adds an extra branch for every archetype matched, but skips setting the archetype/table to the underlying state and any iteration over it.

This may not be worth it for the default Query::iter and maybe even the Query::for_each implementations, but this definitely avoids scheduling unnecessary tasks in the Query::par_for_each case.

Ideally, matched_archetypes should only contain archetypes where there's actually work to do, but this would add a O(n) flat cost to every call to update_archetypes that scales with the number of matched archetypes.

TODO: Benchmark

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times labels May 11, 2022
@alice-i-cecile
Copy link
Member

Code changes look fine to me. The proof, like always, is in the benchmark :) I'd like to see a new benchmark that looks at iteration performance vs the number of empty archetypes (0, 1k, 5k, 10k maybe?), to see where the break-even point is for this change.

@BoxyUwU
Copy link
Member

BoxyUwU commented May 11, 2022

QueryIterationCursor needs to be updated

@james7132
Copy link
Member Author

QueryIterationCursor needs to be updated

Done.

@james7132
Copy link
Member Author

james7132 commented May 12, 2022

Actual benchmark results here, and the differences here make plenty of sense. Tl;dr: it's really only beneficial to par_for_each and it's variants in the general case, otherwise the total number of empty archetypes aren't really realistic.

group                                       main                                    skip-empty
-----                                       ----                                    ----------
empty_archetypes/for_each/10                1.00      4.2±0.18µs        ? ?/sec     1.47      6.1±0.28µs        ? ?/sec
empty_archetypes/for_each/100               1.00      4.6±0.16µs        ? ?/sec     1.34      6.2±0.97µs        ? ?/sec
empty_archetypes/for_each/1000              1.00      5.6±0.12µs        ? ?/sec     1.19      6.6±0.13µs        ? ?/sec
empty_archetypes/for_each/10000             1.92     20.7±0.47µs        ? ?/sec     1.00     10.8±0.56µs        ? ?/sec
empty_archetypes/for_each/2000              1.18      8.8±0.40µs        ? ?/sec     1.00      7.5±0.55µs        ? ?/sec
empty_archetypes/for_each/500               1.00      5.0±0.19µs        ? ?/sec     1.27      6.3±0.14µs        ? ?/sec
empty_archetypes/for_each/5000              1.59     12.4±0.33µs        ? ?/sec     1.00      7.8±0.75µs        ? ?/sec
empty_archetypes/iter/10                    1.00      4.3±0.21µs        ? ?/sec     1.47      6.4±0.13µs        ? ?/sec
empty_archetypes/iter/100                   1.00      4.4±0.15µs        ? ?/sec     1.44      6.3±0.37µs        ? ?/sec
empty_archetypes/iter/1000                  1.00      5.6±0.18µs        ? ?/sec     1.19      6.6±0.07µs        ? ?/sec
empty_archetypes/iter/10000                 1.74     22.9±0.66µs        ? ?/sec     1.00     13.1±0.83µs        ? ?/sec
empty_archetypes/iter/2000                  1.22      9.1±0.40µs        ? ?/sec     1.00      7.5±0.09µs        ? ?/sec
empty_archetypes/iter/500                   1.00      5.4±0.25µs        ? ?/sec     1.20      6.5±0.13µs        ? ?/sec
empty_archetypes/iter/5000                  1.61     13.4±0.20µs        ? ?/sec     1.00      8.4±0.71µs        ? ?/sec
empty_archetypes/par_for_each/10            1.37      7.7±0.39µs        ? ?/sec     1.00      5.6±0.45µs        ? ?/sec
empty_archetypes/par_for_each/100           1.31      7.7±0.31µs        ? ?/sec     1.00      5.8±0.36µs        ? ?/sec
empty_archetypes/par_for_each/1000          1.29      8.0±0.38µs        ? ?/sec     1.00      6.2±0.71µs        ? ?/sec
empty_archetypes/par_for_each/10000         1.11     15.2±0.47µs        ? ?/sec     1.00     13.7±0.96µs        ? ?/sec
empty_archetypes/par_for_each/2000          1.38      8.5±0.37µs        ? ?/sec     1.00      6.2±0.38µs        ? ?/sec
empty_archetypes/par_for_each/500           1.39      8.4±0.33µs        ? ?/sec     1.00      6.0±0.31µs        ? ?/sec
empty_archetypes/par_for_each/5000          1.25     10.8±0.33µs        ? ?/sec     1.00      8.6±0.48µs        ? ?/sec

@alice-i-cecile
Copy link
Member

Solid work benchmarking. Those numbers make sense to me. Can we do a more targeted change here then?

@james7132
Copy link
Member Author

There were a few errors with how the prior benchmarks were run. There were 10, 100, 1000, 2000, 5000, and 10000 archetypes made, but not all of them matched the query used in the system. Once all of the created archetypes matched the given query, it seems to show that all of the iteration methods benefit from this in a tangible way. The exception here seems to be with par_for_each, where it seems like the overhead of scheduling the tasks dominates the benchmark, and are otherwise within the margin of error.

group                                  main                                   skip-empty
-----                                  ----                                   ----------
empty_archetypes/for_each/10           1.00      4.6±0.14µs        ? ?/sec    1.05      4.8±0.25µs        ? ?/sec
empty_archetypes/for_each/100          1.00      4.8±0.24µs        ? ?/sec    1.02      4.9±0.20µs        ? ?/sec
empty_archetypes/for_each/1000         1.08      5.7±0.26µs        ? ?/sec    1.00      5.3±0.21µs        ? ?/sec
empty_archetypes/for_each/10000        1.26     15.5±0.16µs        ? ?/sec    1.00     12.3±0.34µs        ? ?/sec
empty_archetypes/for_each/2000         1.20      7.3±0.33µs        ? ?/sec    1.00      6.1±0.27µs        ? ?/sec
empty_archetypes/for_each/500          1.05      5.4±0.20µs        ? ?/sec    1.00      5.1±0.20µs        ? ?/sec
empty_archetypes/for_each/5000         1.40     11.8±0.19µs        ? ?/sec    1.00      8.5±0.26µs        ? ?/sec
empty_archetypes/iter/10               1.08      4.9±0.76µs        ? ?/sec    1.00      4.5±0.18µs        ? ?/sec
empty_archetypes/iter/100              1.18      5.6±0.65µs        ? ?/sec    1.00      4.8±0.33µs        ? ?/sec
empty_archetypes/iter/1000             1.33      7.2±0.30µs        ? ?/sec    1.00      5.5±0.20µs        ? ?/sec
empty_archetypes/iter/10000            1.46     17.7±0.41µs        ? ?/sec    1.00     12.1±0.26µs        ? ?/sec
empty_archetypes/iter/2000             1.35      8.1±0.42µs        ? ?/sec    1.00      6.0±0.29µs        ? ?/sec
empty_archetypes/iter/500              1.26      6.4±0.28µs        ? ?/sec    1.00      5.1±0.22µs        ? ?/sec
empty_archetypes/iter/5000             1.42     11.9±0.15µs        ? ?/sec    1.00      8.4±0.24µs        ? ?/sec
empty_archetypes/par_for_each/10       1.00      8.0±0.38µs        ? ?/sec    1.04      8.3±0.37µs        ? ?/sec
empty_archetypes/par_for_each/100      1.00      8.0±0.32µs        ? ?/sec    1.02      8.2±0.38µs        ? ?/sec
empty_archetypes/par_for_each/1000     1.00      8.3±0.35µs        ? ?/sec    1.11      9.2±0.28µs        ? ?/sec
empty_archetypes/par_for_each/10000    1.00     14.1±0.50µs        ? ?/sec    1.00     14.1±0.47µs        ? ?/sec
empty_archetypes/par_for_each/2000     1.00      8.5±0.37µs        ? ?/sec    1.04      8.9±0.30µs        ? ?/sec

@BoxyUwU
Copy link
Member

BoxyUwU commented May 12, 2022

I dont think its right to conclude this has no effect on normal for_each. The benchmark uses Query<&A> which is going to be doing barely any work on empty archetypes. Something like Query<(&A, &A, &A, &A, &A, &A, &A, &A, &A, &A)> might make this show up in benches

@james7132
Copy link
Member Author

james7132 commented May 14, 2022

I dont think its right to conclude this has no effect on normal for_each. The benchmark uses Query<&A> which is going to be doing barely any work on empty archetypes. Something like Query<(&A, &A, &A, &A, &A, &A, &A, &A, &A, &A)> might make this show up in benches

Tried this and the results are much more pronounced. I also tried filtering during update_archetypes and saw the same overhead at the low levels, but it does scale a lot more easily with higher counts. Nevermind, that's unsound and we might need to find a better way to move the check out of the hot loop.

group                                  main                                   skip-empty
-----                                  ----                                   ----------
empty_archetypes/for_each/10           1.00      4.4±0.14µs        ? ?/sec    1.36      6.0±0.22µs        ? ?/sec
empty_archetypes/for_each/100          1.00      5.0±0.28µs        ? ?/sec    1.38      6.9±0.14µs        ? ?/sec
empty_archetypes/for_each/1000         1.45     10.6±0.16µs        ? ?/sec    1.00      7.3±0.07µs        ? ?/sec
empty_archetypes/for_each/10000        5.21     66.1±1.07µs        ? ?/sec    1.00     12.7±0.62µs        ? ?/sec
empty_archetypes/for_each/2000         2.03     15.4±0.25µs        ? ?/sec    1.00      7.6±0.11µs        ? ?/sec
empty_archetypes/for_each/500          1.21      7.6±0.27µs        ? ?/sec    1.00      6.3±0.17µs        ? ?/sec
empty_archetypes/for_each/5000         3.66     32.5±0.40µs        ? ?/sec    1.00      8.9±0.79µs        ? ?/sec
empty_archetypes/iter/10               1.00      4.8±0.21µs        ? ?/sec    1.32      6.3±0.12µs        ? ?/sec
empty_archetypes/iter/100              1.00      5.0±0.37µs        ? ?/sec    1.24      6.2±0.37µs        ? ?/sec
empty_archetypes/iter/1000             1.56     11.2±0.23µs        ? ?/sec    1.00      7.2±0.10µs        ? ?/sec
empty_archetypes/iter/10000            5.28     66.2±1.52µs        ? ?/sec    1.00     12.5±0.98µs        ? ?/sec
empty_archetypes/iter/2000             2.00     15.0±0.24µs        ? ?/sec    1.00      7.5±0.20µs        ? ?/sec
empty_archetypes/iter/500              1.17      7.7±0.29µs        ? ?/sec    1.00      6.6±0.12µs        ? ?/sec
empty_archetypes/iter/5000             3.87     31.9±0.24µs        ? ?/sec    1.00      8.3±0.41µs        ? ?/sec
empty_archetypes/par_for_each/10       1.36      7.9±0.52µs        ? ?/sec    1.00      5.8±0.86µs        ? ?/sec
empty_archetypes/par_for_each/100      1.31      8.0±0.43µs        ? ?/sec    1.00      6.1±0.33µs        ? ?/sec
empty_archetypes/par_for_each/1000     1.45      8.6±0.43µs        ? ?/sec    1.00      5.9±0.35µs        ? ?/sec
empty_archetypes/par_for_each/10000    1.07     15.2±0.44µs        ? ?/sec    1.00     14.2±0.89µs        ? ?/sec
empty_archetypes/par_for_each/2000     1.46      8.6±0.42µs        ? ?/sec    1.00      5.9±0.32µs        ? ?/sec
empty_archetypes/par_for_each/500      1.36      8.1±0.24µs        ? ?/sec    1.00      6.0±0.62µs        ? ?/sec
empty_archetypes/par_for_each/5000     1.25     11.5±0.32µs        ? ?/sec    1.00      9.1±0.68µs        ? ?/sec

@DJMcNab
Copy link
Member

DJMcNab commented May 14, 2022

Is it feasible to cull empty archetypes, or re-order archetypes so that the full ones are always at the end (i.e. so we can make the check for emptyness just be 'reached the end of the archetypes list'?

I imagine the challenge there is in ensuring that all the IDs are up-to-date, since they 'need' to be used for indexing - we'd probably need a re-mapping phase? I'm not sure how well it would even out.

@james7132
Copy link
Member Author

Is it feasible to cull empty archetypes, or re-order archetypes so that the full ones are always at the end (i.e. so we can make the check for emptyness just be 'reached the end of the archetypes list'?

This is what the "valid_archetype_ids" naively attempted to do. We'd need to find an opportune time to inject this, since it seems like only new_archetype is called on queries.

@alice-i-cecile alice-i-cecile added S-Needs-Design This issue requires design work to think about how it would best be accomplished and removed S-Needs-Design This issue requires design work to think about how it would best be accomplished labels May 18, 2022
Copy link
Contributor

@bjorn3 bjorn3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code itself seems to be correct. As for whether we should do this I will leave that to others.

@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label May 18, 2022
Copy link
Member

@BoxyUwU BoxyUwU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code seems correct to me.

I don't really think the benchmarks shown give good reason to make the change though. the benches are completely artificial and skip every single archetype never actually doing anything and even then the perf difference is a regression in a fair amount of cases!

@alice-i-cecile alice-i-cecile added the X-Controversial There is active debate or serious implications around merging this PR label May 26, 2022
@james7132
Copy link
Member Author

Revisiting this, it may make sense to add this only for the parallel case, where the work per non-empty archetype is definitively non-trivial since it needs to allocate for each task that is spawned and the overhead is significantly heavier when the task pools are under contention.

@alice-i-cecile alice-i-cecile removed the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Sep 14, 2022
@alice-i-cecile
Copy link
Member

Revisiting this, it may make sense to add this only for the parallel case

I think this should be done before merging, which would make this non-controversial IMO.

@james7132
Copy link
Member Author

I think this should be done before merging, which would make this non-controversial IMO.

This was done before this comment. Should be ready to go.

@james7132 james7132 added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed X-Controversial There is active debate or serious implications around merging this PR labels Oct 19, 2022
@alice-i-cecile
Copy link
Member

Description could use updating now, but I'm not going to block on that.

bors r+

bors bot pushed a commit that referenced this pull request Oct 24, 2022
# Objective
Speed up queries that are fragmented over many empty archetypes and tables.

## Solution
Add a early-out to check if the table or archetype is empty before iterating over it. This adds an extra branch for every archetype matched, but skips setting the archetype/table to the underlying state and any iteration over it.

This may not be worth it for the default `Query::iter` and maybe even the `Query::for_each` implementations, but this definitely avoids scheduling unnecessary tasks in the `Query::par_for_each` case.

Ideally, `matched_archetypes` should only contain archetypes where there's actually work to do, but this would add a `O(n)` flat cost to every call to `update_archetypes` that scales with the number of matched archetypes.

TODO: Benchmark
@bors bors bot changed the title Skip empty archetypes and tables when iterating over queries [Merged by Bors] - Skip empty archetypes and tables when iterating over queries Oct 24, 2022
@bors bors bot closed this Oct 24, 2022
james7132 added a commit to james7132/bevy that referenced this pull request Oct 28, 2022
…ine#4724)

# Objective
Speed up queries that are fragmented over many empty archetypes and tables.

## Solution
Add a early-out to check if the table or archetype is empty before iterating over it. This adds an extra branch for every archetype matched, but skips setting the archetype/table to the underlying state and any iteration over it.

This may not be worth it for the default `Query::iter` and maybe even the `Query::for_each` implementations, but this definitely avoids scheduling unnecessary tasks in the `Query::par_for_each` case.

Ideally, `matched_archetypes` should only contain archetypes where there's actually work to do, but this would add a `O(n)` flat cost to every call to `update_archetypes` that scales with the number of matched archetypes.

TODO: Benchmark
Pietrek14 pushed a commit to Pietrek14/bevy that referenced this pull request Dec 17, 2022
…ine#4724)

# Objective
Speed up queries that are fragmented over many empty archetypes and tables.

## Solution
Add a early-out to check if the table or archetype is empty before iterating over it. This adds an extra branch for every archetype matched, but skips setting the archetype/table to the underlying state and any iteration over it.

This may not be worth it for the default `Query::iter` and maybe even the `Query::for_each` implementations, but this definitely avoids scheduling unnecessary tasks in the `Query::par_for_each` case.

Ideally, `matched_archetypes` should only contain archetypes where there's actually work to do, but this would add a `O(n)` flat cost to every call to `update_archetypes` that scales with the number of matched archetypes.

TODO: Benchmark
ItsDoot pushed a commit to ItsDoot/bevy that referenced this pull request Feb 1, 2023
…ine#4724)

# Objective
Speed up queries that are fragmented over many empty archetypes and tables.

## Solution
Add a early-out to check if the table or archetype is empty before iterating over it. This adds an extra branch for every archetype matched, but skips setting the archetype/table to the underlying state and any iteration over it.

This may not be worth it for the default `Query::iter` and maybe even the `Query::for_each` implementations, but this definitely avoids scheduling unnecessary tasks in the `Query::par_for_each` case.

Ideally, `matched_archetypes` should only contain archetypes where there's actually work to do, but this would add a `O(n)` flat cost to every call to `update_archetypes` that scales with the number of matched archetypes.

TODO: Benchmark
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants