Lazily create threads in ThreadPool #576
Comments
I believe amethyst/amethyst#780 may be caused by a similar issue. I recently noticed that the amethyst game engine uses a surprising amount of CPU when idle (15%-25% CPU usage with no application logic running). When I profiled my application, most of the CPU time was actually being used by rayon worker threads spinning idly while waiting for more work. Rayon has a mechanism for automatically putting worker threads to sleep when there's no work, but I believe this mechanism is being thwarted by the fact that amethyst adds new work every frame. This means that each frame, once all of the frame's work is done, every worker thread spin-waits for a while before being put back to sleep.
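For concreteness, the pattern that triggers this looks roughly like the sketch below (a simplified game loop, not Amethyst's actual dispatcher). Each frame fans a short burst of jobs into the global rayon pool and then goes quiet until the next frame, so the workers repeatedly spin through their idle back-off without ever sleeping for long.

```rust
use std::time::{Duration, Instant};

fn main() {
    // A simplified game loop: each frame fans out a little work to the
    // global rayon pool, then sleeps until the next frame.
    for _frame in 0..600 {
        let frame_start = Instant::now();

        rayon::scope(|s| {
            // Stand-ins for per-frame systems; each finishes quickly.
            for system in 0..4 {
                s.spawn(move |_| {
                    let _ = (0..10_000).fold(system, |a, b| a ^ b);
                });
            }
        });

        // The pool is now idle, but every worker still spins through its
        // back-off before sleeping; the next frame then wakes them all again.
        let elapsed = frame_start.elapsed();
        if elapsed < Duration::from_millis(16) {
            std::thread::sleep(Duration::from_millis(16) - elapsed);
        }
    }
}
```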
For the Amethyst use case it could also be useful to make the sleep timings configurable. In our case we need the workers to go to sleep much sooner than they do by default.
RE the subject, I think we've mostly decided we'd prefer not to dynamically adjust the thread count, as this implies a heap of complexity of its own. We've tried to keep the API such that it would be possible, though.

I have thought about how we could be more careful about wakeups, though. Right now, all sleeping threads in a given thread pool wait on a common condvar, and we mostly wake for two reasons: new work is available, or some stolen work is done. (A third reason is shutdown, but this is uncontroversial -- it must wake all threads to clean up and exit.) For new work, it seemed like a good idea to wake everyone, so idle threads can start stealing right away.

We could also consider giving each thread its own condvar to allow fine-grained wake-ups. That makes broadcasts harder, but that should be on the cold path anyway, so maybe this doesn't matter.
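A rough sketch of the per-thread-condvar idea, with invented names and none of rayon's real sleep bookkeeping: each worker parks on its own mutex/condvar pair, so the hot "new work" path can wake exactly one sleeper, while the cold path (e.g. shutdown) walks the whole list.

```rust
use std::sync::{Arc, Condvar, Mutex};

// One sleep slot per worker thread (hypothetical; not rayon's real internals).
struct WorkerSleep {
    asleep: Mutex<bool>,
    cvar: Condvar,
}

struct Sleepers {
    workers: Vec<Arc<WorkerSleep>>,
}

impl Sleepers {
    /// Wake a single sleeping worker, if any (the hot "new work" path).
    fn wake_one(&self) {
        for w in &self.workers {
            let mut asleep = w.asleep.lock().unwrap();
            if *asleep {
                *asleep = false;
                w.cvar.notify_one();
                return;
            }
        }
    }

    /// Wake everyone (the cold path, e.g. pool shutdown).
    fn wake_all(&self) {
        for w in &self.workers {
            let mut asleep = w.asleep.lock().unwrap();
            *asleep = false;
            w.cvar.notify_one();
        }
    }
}

impl WorkerSleep {
    /// Called by a worker that found nothing to steal.
    fn sleep(&self) {
        let mut asleep = self.asleep.lock().unwrap();
        *asleep = true;
        while *asleep {
            asleep = self.cvar.wait(asleep).unwrap();
        }
    }
}
```

The broadcast becomes a loop over every worker, but as noted above that should only happen on the cold path.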
Does the complexity arise from the direct act of adding/removing threads, or from the new need to monitor thread utilization? The reason I ask is that for my use case specifically I can monitor whether the thread pool is meeting our targets pretty easily, and if we'd benefit from adding another thread I could just call something to add one explicitly. I for one do not mind the idea of controlling this completely manually.
Yeah, making decisions about dynamic threads is hard, and I think we'd more likely be open to manual controls to let the pool owner decide. Internally we'd still have to figure out how to safely update shared state.

Ideally, I'd hope we can improve wakeups/etc. enough that you don't need to worry about it, but manual controls might be useful for other reasons too.
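Under today's API, the closest thing to a manual control is for the pool owner to build a replacement pool with the desired thread count and route new work into it, since an existing rayon pool can't be resized in place. A rough sketch (the `resize_pool` helper is made up; `ThreadPoolBuilder` and `install` are rayon's existing API):

```rust
use rayon::{ThreadPool, ThreadPoolBuilder};

/// Hypothetical "manual control": since a rayon pool can't be resized in
/// place, the owner builds a replacement pool with a new thread count and
/// routes subsequent work into it.
fn resize_pool(_old: ThreadPool, num_threads: usize) -> ThreadPool {
    // The old pool winds down after it is dropped and its outstanding work completes.
    ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .build()
        .expect("failed to build replacement pool")
}

fn main() {
    let pool = ThreadPoolBuilder::new().num_threads(2).build().unwrap();
    // ... monitoring decides we would benefit from more parallelism ...
    let pool = resize_pool(pool, 4);
    pool.install(|| {
        // Work submitted here now runs on the 4-thread pool.
        println!("running on {} threads", rayon::current_num_threads());
    });
}
```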
Let's see if the benchmarks can answer whether that helps.
If dynamically scaling up the number of threads isn't an option, could we at least have a saner default for high-core-count machines? There's really no reason to take all 56 LPs in a machine by default.
I think both of these solutions would go a long way on a high-core-count machine. I'd personally lean toward having a condvar per thread, since it would reduce contention quite significantly - having all 56 threads wake up seems like overkill.
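In the meantime, the pool size can already be capped on the user side with rayon's `ThreadPoolBuilder` (or the `RAYON_NUM_THREADS` environment variable); the cap of 8 below is just an example:

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    // Cap the global pool instead of taking every logical processor.
    // Setting RAYON_NUM_THREADS=8 in the environment has the same effect.
    ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .expect("global pool already initialized");

    let sum: u64 = (0..1_000_000u64).into_par_iter().sum();
    println!("{}", sum);
}
```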
Note: I've opened rayon-rs/rfcs#5, which targets exactly this issue. I've also got a branch that implements the proposed changes (in fact, it implements several variations thereof). In particular, that branch should only start a small-ish number of threads, and it no longer tries to keep all threads awake all of the time.
Closing because this issue is solved through other means.
Hi,
I just ran into an issue where I hit pretty major CPU utilization while doing almost no work on a 56-LP machine. After profiling, it looked like most of the time was spent on a condition variable, with 56 threads at a time waiting for more work to consume.
For low-core-count machines (<8) it might make sense to reserve the whole system for potential parallelism, but on these beefy machines you end up spending a lot of time waiting for nothing and burning cycles in the condition variable. One alternative is to look at what TBB does for this problem: it seems to lazily allocate threads and deallocate them after a cooldown period, depending on the workload.
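To make the TBB-style behaviour concrete, here is a toy sketch of a pool that only spawns a worker when work arrives and no worker is idle, and retires workers after an idle cooldown. All names are invented and there is no work stealing; it illustrates the lifecycle, not rayon's internals.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

struct State {
    jobs: VecDeque<Job>,
    idle: usize,    // workers currently parked on the condvar
    workers: usize, // workers currently alive
}

struct Shared {
    state: Mutex<State>,
    cvar: Condvar,
}

pub struct LazyPool {
    shared: Arc<Shared>,
    max_threads: usize,
    idle_timeout: Duration,
}

impl LazyPool {
    pub fn new(max_threads: usize, idle_timeout: Duration) -> Self {
        LazyPool {
            shared: Arc::new(Shared {
                state: Mutex::new(State { jobs: VecDeque::new(), idle: 0, workers: 0 }),
                cvar: Condvar::new(),
            }),
            max_threads,
            idle_timeout,
        }
    }

    pub fn spawn<F: FnOnce() + Send + 'static>(&self, job: F) {
        let mut state = self.shared.state.lock().unwrap();
        state.jobs.push_back(Box::new(job));
        if state.idle > 0 {
            // An idle worker exists; wake exactly one instead of broadcasting.
            self.shared.cvar.notify_one();
        } else if state.workers < self.max_threads {
            // No idle worker: lazily start a new one.
            state.workers += 1;
            let shared = Arc::clone(&self.shared);
            let timeout = self.idle_timeout;
            thread::spawn(move || worker_loop(&shared, timeout));
        }
        // Otherwise the job queues until a busy worker finishes something.
    }
}

fn worker_loop(shared: &Shared, idle_timeout: Duration) {
    let mut state = shared.state.lock().unwrap();
    loop {
        if let Some(job) = state.jobs.pop_front() {
            drop(state); // run the job without holding the lock
            job();
            state = shared.state.lock().unwrap();
            continue;
        }
        // No work: park on the condvar, but give up after the cooldown.
        state.idle += 1;
        let (guard, result) = shared.cvar.wait_timeout(state, idle_timeout).unwrap();
        state = guard;
        state.idle -= 1;
        if result.timed_out() && state.jobs.is_empty() {
            // Retire this worker after a full cooldown with nothing to do.
            state.workers -= 1;
            return;
        }
    }
}
```

The key points are that `spawn` wakes at most one sleeper instead of broadcasting, and that a worker only exits after a full `idle_timeout` with an empty queue, which mirrors the cooldown behaviour described above.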