Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internalize task distinction into TaskPool #4740

Closed
wants to merge 108 commits into from

Conversation

james7132
Copy link
Member

@james7132 james7132 commented May 13, 2022

Objective

Fixes #1907. IO and Async Compute threads are sitting idle even when the compute threads are at capacity. This results in poor thread utilization.

Solution

Move the distinction between compute, async compute, and IO threads out of userspace, and internalize the distinction into TaskPool. This creates three tiers of threads and tasks: one for each group. Higher priority tasks can run on lower priority threads, if and only if they're currently idle, but the lower priority threads will prioritize their specific workload over higher priority ones. Priority goes compute > io > async compute.

For example, heavy per-frame compute workload, async compute and IO otherwise sitting idle: compute tasks will be scheduled onto async compute and IO threads whenever idle. Any IO task that is awoken will take precedence over the compute tasks.

In another case, a heavy IO workload will schedule onto idle async compute threads, but will never use a compute thread. This prevents lower priority tasks from starving out higher priority tasks.

The priority scheme chosen assumes well-behaved scheduling that adheres to the following principles:

  • Compute tasks are latency sensitive and generally do not hold the thread indefinitely or at the very minimum for time longer than the course of a frame.
  • IO tasks are not CPU intensive and will do minimal processing. Generally will yield readily at any incoming or outbound communication, and thus will only generally be under consistent heavy load only in specific circumstances.
  • Async compute tasks may run for a very long time, across multiple frames, and will generally not yield at any point during a given task's execution. Only under heavy load when a developer explicitly needs it, and can otherwise be used for IO or compute in the general case.

TODO:

  • Add two additional executors to TaskPool
  • Set up TaskPool threads to run tasks from all 3 with some prioritization
  • Add user-facing APIs for choosing which priority to spawn tasks under.
  • Remove the usages newtypes. Fix up dependent code.
  • Properly set up individual thread groups and their individual prioritization schemes.
  • Update docs.

Changelog

Changed: Newtypes of TaskPool (ComputeTaskPool, AsyncComputeTaskPool, and IoTaskPool) have been replaced with TaskPool::spawn_as with the appropriate TaskGroup. Thread utilization across these segments has been generally been increased via internal task prioritization. For convenience, TaskPool::spawn will always spawn on the compute task group.

Migration Guide

ComputeTaskPool, AsyncComputeTaskPool, and IoTaskPool have been removed. These individual TaskPools have been merged together, and can be accessed by Res<TaskPool>. To spawn a task on a specific segment of the threads, use TaskPool::spawn_as with the appropriate TaskGroup.

Before:

fn system(task_pool: Res<AsyncComputeTaskPool>) {
   task_pool.spawn(async {
       ...
   });
}

After:

fn system(task_pool: Res<TaskPool>) {
   task_pool.spawn(TaskGroup::AsyncCompute, async {
       ...
   });
}

@james7132 james7132 marked this pull request as draft May 13, 2022 18:05
@james7132 james7132 added C-Performance A change motivated by improving speed, memory usage or compile times A-Tasks Tools for parallel and async work X-Controversial There is active debate or serious implications around merging this PR C-Breaking-Change A breaking change to Bevy's public API that needs to be noted in a migration guide labels May 13, 2022
@hymm
Copy link
Contributor

hymm commented May 13, 2022

I had a slightly different idea on how to combine the task pools. I wanted to give tasks spawned on a scope the highest priority; give tasks spawned onto the task pool a lower priority; and then add a new spawn_blocking function that works like it does in tokio where these tasks are spawned onto a different set of threads. It reuses an idle thread if there are any or spawns a new thread when there aren't any.

@james7132
Copy link
Member Author

james7132 commented May 13, 2022

I had a slightly different idea on how to combine the task pools. I wanted to give tasks spawned on a scope the highest priority; give tasks spawned onto the task pool a lower priority; and then add a new spawn_blocking function that works like it does in tokio where these tasks are spawned onto a different set of threads. It reuses an idle thread if there are any or spawns a new thread when there aren't any.

I think the first two parts are still reconcilable with the current proposed approach; however, the problem with spawn_blocking as a general approach is that if a compute thread is idle, you can easily starve out the compute tasks, which is unacceptable in a game engine. Spinning up and down threads is also an overhead we generally want to avoid too, particularly since it can cause even heavier thrashing on the CPU for the other threads.

@hymm
Copy link
Contributor

hymm commented May 13, 2022

however, the problem with spawn_blocking as a general approach is that if a compute thread is idle, you can easily starve out the compute tasks, which is unacceptable in a game engine.

isn't that a problem with this approach too. If a user starts too many blocking tasks the compute task pool won't have any threads to work with either.

edit: thinking on this more I guess. The approach here prevents the user from running more than x blocking tasks at the same time. So the surprise for the user is that when they run too many blocking tasks, they only get x running in parallel. Even if they were expecting them to all run in parallel. In the case I suggest, they run as many as they want, but suddenly their game starts stuttering because they've used all the threads.

@james7132
Copy link
Member Author

james7132 commented May 13, 2022

isn't that a problem with this approach too. If a user starts too many blocking tasks the compute task pool won't have any threads to work with either.

If and only if they start it as a normal compute task, something that would require spawn_blocking should be scheduled to the async compute segment, where under this approach, it'll be delayed until the thread(s) are available.

@james7132
Copy link
Member Author

bors try

bors bot added a commit that referenced this pull request May 14, 2022
@bors
Copy link
Contributor

bors bot commented May 14, 2022

try

Build failed:

@james7132 james7132 marked this pull request as ready for review May 14, 2022 20:57
bors bot pushed a commit that referenced this pull request Nov 2, 2022
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
@james7132
Copy link
Member Author

james7132 commented Nov 5, 2022

@cart @hymm I've mostly updated the PR and this is once again up for review. #4466 added an unusual interaction since we need a temporary Executor, which I originally made with the assumption that it would be static and never be dropped. There were a few optimizations that definitely decrease overhead when (re)scheduling tasks. I can make another "SimpleExecutor" (as a Send-only MPSC executor) for this case that resolves this, but the problem remains that Executor is not meant to handle being dropped.

Should we:

  1. Take the perf hit and revert those changes, supporting dropping Executor and by proxy TaskPool?
  2. Or go all in on a global TaskPool and keep the optimizations?

2 would move us closer to a tokio-esque implementation where there's a single global runtime, while 1 would keep what we have and retain the approach that bevy_tasks has had thus far. Which one should we pursue? I'm personally inclined to push for 2 since:

  • it lets us make those assumptions and cut more overhead out
  • allows for more ergonomic initialization (i.e. #[tokio::main] style #[bevy::main]),
  • this PR already reduces/removes the need to pass around &TaskPool or some newtype of it.
  • provides a bit of familiarity for people using tokio's APIs if there's a clear mapping between the two.

@hymm
Copy link
Contributor

hymm commented Nov 7, 2022

We don't necessarily need a temporary executor. We just need a way of running tasks that only run on the thread the scope is running on. This could potentially be a thread local storage queue that can't be stolen from.

A consideration here is that in #6503, I removed the temporary executor for a global main thread executor that has run time checks to only allow it to be ticked on the main thread. But not sure if that is going to get merged or not.

@hymm
Copy link
Contributor

hymm commented Nov 7, 2022

I'm leaning towards option 2. Seems like there isn't much need for users to construct their own taskpools now that we're using global task pools. The one issue might be if users will want to reconfigure the number of threads or thread distribution during execution.

On another note, it might be a good idea to port the micro benchmarks over from async_executor to get a better idea of how the current executor compares to the original.

@hymm hymm mentioned this pull request Nov 10, 2022
2 tasks
@hymm
Copy link
Contributor

hymm commented Nov 15, 2022

Will this completely block us off from using the tokio ecosystem forever on our task pools? Or would we still be able to switch the tokio executor into our task pools somehow even if it doesn't have all the niceties (priorities)?

@hymm
Copy link
Contributor

hymm commented Nov 15, 2022

My biggest problem with reviewing this PR is that I have no idea how to understand the changes that were made to the async executor. I can look at the diff https://github.com/james7132/bevy/compare/b5fc0d2..task-pool-internalization and figure out what it's doing, but I have no context for why some things were changed. If you could port over the tests and benches from async executor I would have a lot more confidence that things are working as they're supposed to. Bonus points if you can get loom working somehow.

@inodentry
Copy link
Contributor

Just something to note: this sort of PR, where we have very tricky performance implications due to multithreading, different CPU cores being involved, etc, can be very sensitive to CPU hardware boost behaviors and such!

I am worried that there is a high risk of benchmarks being wildly inaccurate (when it comes to judging how much more/less efficient bevy's scheduling has become), due to different thread utilization, causing different boost clocks of different CPUs, etc. The benchmark results may therefore be skewed because of the particular CPU model of the computer they are run on.

I recommend, if possible, to benchmark this PR (and other multithreading/scheduling things) on a machine where all CPU cores can be forced to run at the same clock speeds (via BIOS settings, etc, to disable boost and clock frequency scaling).

Otherwise we can't really know how much we actually win/lose due to better multithreading / thread utilization, and how much of it was due to the particular CPU hardware boosting differently due to its cores being loaded differently.

This kind of thing is important in general, but for these sorts of PRs especially so. I would not trust benchmarks that were not done that way.

@alice-i-cecile alice-i-cecile added the D-Complex Quite challenging from either a design or technical perspective. Ask for help! label Jan 3, 2023
ItsDoot pushed a commit to ItsDoot/bevy that referenced this pull request Feb 1, 2023
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by bevyengine#2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like bevyengine#4998. Fixes bevyengine#4996. Fixes bevyengine#6081. Fixes bevyengine#5285. Fixes bevyengine#5054. Supersedes bevyengine#2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la bevyengine#4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
@goshhhy
Copy link

goshhhy commented Feb 2, 2023

I recommend, if possible, to benchmark this PR (and other multithreading/scheduling things) on a machine where all CPU cores can be forced to run at the same clock speeds (via BIOS settings, etc, to disable boost and clock frequency scaling).

i have a machine with 272 threads, where the cores are all running the same speed, and as long as at least 3 cores are active then there is only 100mhz difference between base and boost clocks (1400MHz vs 1500MHz; 1600 is possible with 2 or fewer cores active), and i may be able to disable boost entirely.

i also have a simulation workload for that machine, which is affected by #1907, with rather severe underutilization of threads compared to what i would hope for.

i will test this PR with that machine/workload and share the results.

@james7132
Copy link
Member Author

Going to return to the drawing board on this one. Taking on an entire fork of async_executor might not be in our interests. In the meantime, I'm potentially looking to experiment with switchyard, which provides better per-thread control over task allocation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Tasks Tools for parallel and async work C-Breaking-Change A breaking change to Bevy's public API that needs to be noted in a migration guide C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help X-Controversial There is active debate or serious implications around merging this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using ComputeTaskPool for a par_for_each query only uses half of available logical cores
10 participants