
Add parallel iteration benchmarks #2173

Closed
alice-i-cecile opened this issue May 15, 2021 · 12 comments
Labels
A-ECS Entities, components, systems, and events C-Feature A new feature, making something new possible C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Implementation This issue is ready for an implementation PR. Go for it!

Comments

@alice-i-cecile
Member

ElliotB256/bevy_bench#2 reveals that while Bevy can be quite competitive with the performance of other Rust ECS backends, its parallel iteration performance is quite poor.

This is likely to be particularly true with unpredictable workloads.

Once we have a good benchmark suite for this we can work on improving the performance in other issues.

@alice-i-cecile alice-i-cecile added C-Feature A new feature, making something new possible A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times labels May 15, 2021
@mockersf
Member

mockersf commented May 15, 2021

The bench from https://github.com/ElliotB256/bevy_bench/blob/master/examples/bevy.rs does not look that far from https://github.com/cart/ecs_bench_suite/blob/bevy-benches/src/bevy/heavy_compute.rs; are there specific things to add?

The results also seem to be in the same ballpark as those reported.

From ecs_bench_suite:

| bench | legion (*) | legion 0.2.4 | bevy | specs |
| --- | --- | --- | --- | --- |
| heavy_compute | 0.701ms (0.723ms) | 4.34ms | 1.06ms | 0.995ms |

From ElliotB256:

| ECS | create entities | run loop |
| --- | --- | --- |
| bevy | 0.01 | 1.11 |
| specs | 0.03 | 0.54 |
| legion | 0.82 | 0.70 |

@alice-i-cecile
Member Author

Adding that benchmark suite to Bevy itself will help reduce the need for this issue for sure.

The other concern I have is that perfectly predictable, uniform workloads like those shown in both of those examples are not always representative, and will underestimate the potential gains of more sophisticated strategies.

Parallel pathfinding might be a good example of an unpredictable but highly parallelizable task to supplement the example you linked.

@ElliotB256

Is it worth noting that my bench uses a simpler operation, e.g. multiplying a couple of floats instead of inverting a matrix 100 times? Each parallel task is much smaller.

@alice-i-cecile
Member Author

Yep: ideally we could vary the weight of the tasks to assess the relative overhead of each engine.
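One way to do this is a workload function with a tunable weight parameter, so the same benchmark can be swept from tiny tasks (where scheduler overhead dominates) to heavy ones (where compute dominates). This is a hypothetical std-only sketch; the function name and its loop body are arbitrary stand-ins, not part of any existing suite:

```rust
use std::hint::black_box;

// Hypothetical tunable per-task workload: `weight` controls how much
// CPU time each "entity" costs. Sweeping weight from ~1 to ~100 would
// expose the relative per-task overhead of each engine.
fn work(seed: f32, weight: u32) -> f32 {
    let mut acc = seed;
    for _ in 0..weight {
        // black_box keeps the optimizer from folding the loop away
        acc = black_box(acc.sin().mul_add(1.0001, 0.1));
    }
    acc
}

fn main() {
    // Light tasks stress the scheduler; heavy tasks stress the ALU.
    let light: f32 = (0..10_000).map(|i| work(i as f32, 1)).sum();
    let heavy: f32 = (0..1_000).map(|i| work(i as f32, 100)).sum();
    assert!(light.is_finite() && heavy.is_finite());
}
```

Plotting time against weight for each engine would then separate fixed per-task overhead from raw throughput.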

@ElliotB256

I'd be happy to make a PR adding some benchmarks for this issue.

@ElliotB256

Apologies, I was away from this for so long.

I've added an example of a parallel_light_compute benchmark to https://github.com/ElliotB256/ecs_bench_suite/tree/parallel_light_compute. This benchmark covers the case where some very short task must be performed for each entity; a realistic example would be updating the position of an entity based on its velocity. It compares to the existing heavy_compute benchmark as follows:

| Benchmark | Description |
| --- | --- |
| heavy_compute | 1,000 entities. The parallel iterator inverts a matrix 100 times per entity. |
| parallel_light_compute | 10,000 entities. The parallel iterator inverts a matrix once per entity. |

The numbers on my PC are as follows:

| Benchmark | Time |
| --- | --- |
| parallel_light_compute/legion | 106.60 us |
| parallel_light_compute/legion (packed) | 112.58 us |
| parallel_light_compute/bevy | 1.0066 ms |
| parallel_light_compute/hecs | 100.08 us |
| parallel_light_compute/shipyard | 142.53 us |
| parallel_light_compute/specs | 108.00 us |

@ElliotB256

The batch size used for bevy in the above tests was 64. Changing to a batch size of 1024 improves the result to 907.58 us, but it is still behind the other libraries.

For different batch sizes:

| Batch size | Time |
| --- | --- |
| 8 | 1.9473 ms |
| 64 | 1.0444 ms |
| 256 | 1.0048 ms |
| 1024 | 960.13 us |
| 4096 | 1.0296 ms |
| 10,000 | 1.2633 ms |
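To make the tradeoff behind these numbers concrete: fixed-size batching in the style of Bevy's `par_for_each` batch_size argument splits the data into chunks and runs each chunk as its own task. Smaller batches balance load better but pay per-task overhead more often; larger batches do the opposite. A std-only sketch (the helper name is invented; this is not Bevy's actual task-pool implementation, which reuses pool threads rather than spawning OS threads per batch):

```rust
use std::thread;

// Run `f` over every element, one scoped thread per `batch_size` chunk.
// Illustrates fixed-size batching, not Bevy's real scheduler.
fn par_for_each_batched<F>(data: &mut [f32], batch_size: usize, f: F)
where
    F: Fn(&mut f32) + Sync,
{
    let f = &f;
    thread::scope(|s| {
        for chunk in data.chunks_mut(batch_size) {
            // Each chunk is a disjoint mutable slice, so this is safe
            // to process in parallel.
            s.spawn(move || chunk.iter_mut().for_each(f));
        }
    });
}

fn main() {
    let mut data = vec![2.0_f32; 10_000];
    // 10,000 elements with batch_size 1024 gives 10 tasks.
    par_for_each_batched(&mut data, 1024, |x| *x *= 3.0);
    assert!(data.iter().all(|&x| x == 6.0));
}
```

With batch size 8 this scheme would create over a thousand tasks, which matches the table above: the per-task overhead swamps the tiny workload.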

It's not clear to me what the equivalent batch size is for the other tests. legion and specs (maybe the others too?) use rayon for parallel iteration, but I'm unfamiliar with its internals. Some sources claim that by default it splits every time (see e.g. this PR and this SO), but that doesn't seem quite right, and the Rayon page says:

> Parallel iterators take care of deciding how to divide your data into tasks; it will dynamically adapt for maximum performance.
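The splitting rayon describes is recursive rather than fixed-size: halve the work until a piece falls below some threshold, then run it sequentially. A std-only illustration of that shape (not rayon's actual work-stealing implementation, which also adapts the split depth dynamically):

```rust
use std::thread;

// Recursively halve the slice until a piece is at or below `min_len`,
// running the two halves on separate scoped threads. This mimics the
// divide-and-conquer shape rayon describes; rayon itself uses a
// work-stealing pool rather than spawning OS threads.
fn split_apply(data: &mut [f32], min_len: usize) {
    if data.len() <= min_len {
        for x in data.iter_mut() {
            *x = x.sqrt(); // stand-in for the per-entity work
        }
    } else {
        let mid = data.len() / 2;
        let (left, right) = data.split_at_mut(mid);
        thread::scope(|s| {
            s.spawn(|| split_apply(left, min_len));
            s.spawn(|| split_apply(right, min_len));
        });
    }
}

fn main() {
    let mut data: Vec<f32> = (0..10_000).map(|i| i as f32).collect();
    split_apply(&mut data, 256);
    assert!((data[100] - 10.0).abs() < 1e-4);
}
```

Under this scheme the effective "batch size" is not fixed up front, which is one reason a direct comparison with Bevy's batch_size parameter is hard.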

@ElliotB256

@mockersf pointed out on Discord that my example was creating a new task pool every iteration. After changing it so the task pool is initialised beforehand, the results are:

| Batch size | Time |
| --- | --- |
| 8 | 1.177 ms |
| 64 | 234.13 us |
| 256 | 149.48 us |
| 1024 | 130.48 us |
| 4096 | 207.13 us |
| 10,000 | 485.55 us |

This is much better. It is still not quite as good as the others, but at the optimum batch size bevy is at least competitive.

@ElliotB256

By chance I just saw on the Discord:

> iamseb: I'm trying to run query.par_for_each across a large set of entities, and noticing that my CPU is only engaging half my available cores. I tried running https://github.com/bevyengine/bevy/blob/main/examples/async_tasks/async_compute.rs and setting the sleep timer to 1 sec for all tasks and noticed the same thing, with the example taking twice the expected time to run. In previous versions of bevy that used rayon it would max out the CPU. Is this a design decision, or is there some config I'm missing for the task scheduler?

> MinerSebas: Bevy has three different thread pools which together use the whole CPU (https://github.com/bevyengine/bevy/blob/main/crates/bevy_core/src/task_pool_options.rs#L56-L81). You can adjust the ratios used by inserting the DefaultTaskPoolOptions resource before adding the default plugins.

Does anyone know if this is relevant for the benchmark?
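For reference, inserting the resource mentioned above would look roughly like this. This is a sketch against the bevy 0.5-era API; the builder and constructor names (App::build, DefaultTaskPoolOptions::with_num_threads) may differ in other versions, so treat it as a config fragment rather than a definitive recipe:

```rust
use bevy::core::DefaultTaskPoolOptions;
use bevy::prelude::*;

fn main() {
    App::build()
        // Must be inserted before DefaultPlugins so the task pools are
        // created with these settings. (bevy 0.5-era API sketch;
        // exact names may vary between versions.)
        .insert_resource(DefaultTaskPoolOptions::with_num_threads(8))
        .add_plugins(DefaultPlugins)
        .run();
}
```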

@ElliotB256

Checked, and it isn't; all cores are used in the profiler.

```rust
use bevy_ecs::prelude::*;
use bevy_tasks::TaskPool;
use cgmath::*;

#[derive(Copy, Clone)]
struct Position(Vector3<f32>);

#[derive(Copy, Clone)]
struct Rotation(Vector3<f32>);

#[derive(Copy, Clone)]
struct Velocity(Vector3<f32>);

fn main() {
    let mut world = World::default();

    world.spawn_batch((0..10_000).map(|_| {
        (
            Matrix4::<f32>::from_angle_x(Rad(1.2)),
            Position(Vector3::unit_x()),
            Rotation(Vector3::unit_x()),
            Velocity(Vector3::unit_x()),
        )
    }));

    // The task pool is created once, outside the loop.
    let pool = TaskPool::new();

    for _ in 0..100_000 {
        let mut query = world.query::<(&mut Position, &mut Matrix4<f32>)>();
        query.par_for_each_mut(&mut world, &pool, 1024, |(mut pos, mut mat)| {
            *mat = mat.invert().unwrap();
            pos.0 = mat.transform_vector(pos.0);
        });
    }
}
```


@james7132
Member

With the inclusion of cart's ecs_bench_suite in #4225, do we need to cover this more? busy_systems, contrived, and heavy_compute seem to cover it fairly well already.
