Limiting parallelism for certain map operations in a Rayon parallel computation #1193
-
I have a secrets detection app, Nosey Parker, that uses Rayon for the bulk of its parallel execution. The parallel computation, very roughly, is to find files on disk and scan them, all in parallel with as many cores as are available.

I have been looking more closely at the performance of this application, and I see that in most environments, even with fast SSDs, the best performance for reading files from disk comes from using fewer parallel workers than the number of CPU cores (maybe half). Going beyond that burns lots of CPU time in the kernel without actually speeding anything up. However, the actual scanning of files once read from disk is compute-bound, benefiting from as many cores as are available.

I have been staring at the Rayon docs and poking through the code, but have not seen an obvious way to limit the computation's degree of parallelism for just the I/O part without also limiting parallelism for the compute-bound part. Any guidance on how to achieve this?
Replies: 3 comments 7 replies
-
I don't think we have a way to do this in a single iterator chain, but you could probably use separate thread pools for each part.
-
Here is a blog post from 2020 about using multiple thread pools in Rayon: https://pkolaczk.github.io/multiple-threadpools-rust/. It's a bit old now but still relevant. In that blog post, a separate thread pool is used for each kind of work. Notably, the multiple thread pools in the blog post all send their outputs to the same shared destination.

Multiple thread pools could instead be connected sequentially using channels. But I wonder if there is a better way to do this in Rayon?
-
What I have ended up doing is using multiple Rayon pools, connected with a pair of crossbeam channels. The output elements of the first Rayon pool get written to the channel; the second Rayon pool reads from the channel using `par_bridge()`. This has worked well. It also allows, for example, initializing thread-local state (such as GPU contexts) on each worker of a particular Rayon thread pool.

Surely there is room to do better in terms of efficiency and parallel scalability when connecting the two Rayon thread pools. In particular, instead of using a single shared pair of crossbeam channels to connect them, it might be possible to directly multiplex the outputs of the first pool onto the inputs of the second pool (e.g., 4 threads onto 8), with an added work-stealing scheme. But that would be a lot more complicated than a single crossbeam channel.