Add parallel iteration benchmarks #2173
The bench from https://github.com/ElliotB256/bevy_bench/blob/master/examples/bevy.rs doesn't look that far from https://github.com/cart/ecs_bench_suite/blob/bevy-benches/src/bevy/heavy_compute.rs; are there specific things to add? The results also seem to be in the same ballpark as those reported.
|
Adding that benchmark suite to Bevy itself will help reduce the need for this issue for sure. The other concern that I have is that perfectly predictable and uniform workloads like those shown in both of those examples are not always representative and will underestimate the potential gains of more sophisticated strategies. Parallel pathfinding might be a good example of an unpredictable but highly parallelizable task to supplement the example you linked. |
Is it worth noting that my bench is using a simpler operation - e.g. multiplying a couple of floats, instead of inverting a matrix 100 times? Each parallel task is much smaller. |
Yep: ideally we could vary the weight of the tasks to try and assess the relative overhead in each engine. |
I'd be happy to make a PR for adding some benchmarks for this issue |
Apologies, I was away from this for so long. I've added an example of a
The numbers on my PC are as follows:
|
The batch size used for bevy in the above tests was 64. Changing to a batch size of 1024 offers an improvement. For different batch sizes:
It's not clear to me what the equivalent batch size is for the other tests. legion and specs (maybe the others too?) use rayon for parallel iteration, but I'm unfamiliar with the internals. Some sources claim it splits every time by default (see e.g. this PR and this SO answer), but that doesn't seem quite right, and the Rayon page says:
|
@mockersf pointed out on discord that my example was creating a new task pool every iteration. Changing it so the task pool is initialised beforehand, the results are:
This is much better. It is still not quite as good as the others, but at optimal batch sizes Bevy is at least competitive. |
By chance I just saw on the discord:
Does anyone know if this is relevant for the benchmark? |
Checked, and it isn't: all cores are used in the profiler.
|
With the inclusion of cart's ecs_bench_suite in #4225, do we need to cover this more? |
ElliotB256/bevy_bench#2 reveals that while Bevy can be quite competitive with the performance of other Rust ECS backends, its parallel iteration performance is quite poor.
This is likely to be particularly true with unpredictable workloads.
Once we have a good benchmark suite for this we can work on improving the performance in other issues.