Limiting parallelism for certain map operations in a Rayon parallel computation #1193
-
I have a secrets detection app, Nosey Parker, that uses Rayon for the bulk of its parallel execution. The parallel computation, very roughly, is to find files on disk and scan them, all in parallel with as many cores as are available.

I have been looking more closely at the performance of this application, and I see that in most environments, even with fast SSDs, the best performance for reading files from disk comes from using fewer parallel workers than the number of CPU cores (maybe half). Going beyond that burns lots of CPU time in the kernel without actually speeding anything up. However, the actual scanning of files once read from disk is compute-bound, benefiting from as many cores as are available.

I have been staring at the Rayon docs and poking through the code, but have not seen an obvious way to limit the computation's degree of parallelism for just the I/O part without also limiting parallelism for the compute-bound part. Any guidance on how to achieve this?
Replies: 3 comments 7 replies
-
I don't think we have a way to do this in a single iterator chain, but you could probably use separate thread pools for each part.
-
Here is a blog post from 2020 about using multiple thread pools in Rayon: https://pkolaczk.github.io/multiple-threadpools-rust/. It's a bit old now but still relevant. In that blog post, a separate thread pool is used for each kind of work. Notably, the multiple thread pools in the blog post all send their outputs to the same shared destination.

Multiple thread pools could instead be connected sequentially using channels. But I wonder if there is a better way to do this in Rayon?
-
What I have ended up doing is using multiple Rayon pools, connected with a pair of crossbeam channels. The output elements of the first Rayon pool get written to the channel; the second Rayon pool reads from the channel using `par_bridge()`. This has worked well. It also allows, for example, initializing thread-local state (such as GPU contexts) on each worker of a particular Rayon thread pool.

Surely there is room to do better in terms of efficiency and parallel scalability when connecting the two Rayon thread pools. In particular, instead of using a single shared pair of crossbeam channels to connect them, it might be possible to directly multiplex the outputs of the first pool onto the inputs of the second pool (e.g., 4 threads onto 8), with an added work-stealing scheme. But that would be a lot more complicated than a single crossbeam channel.