Parallel Runner Trait and Implementations
This discussion aims to shape the future implementation of the parallel execution of Par. It would most likely be a part of v2. The goal is to make the parallel execution logic a trait so that it is:

- extendable,
- easily replaceable, simply as a parameter to the parallel iterator,
- defined per computation rather than as a global setting,

and, as a natural consequence,

- testable,
- tunable.
In the next section, the current parallel executor is briefly summarized; it could serve as the default implementation of the trait to be defined. Then, some ideas on the trait and its implementations are provided as a basis for the discussion.
Current Parallel Execution
Current parallel execution is governed by the Runner. The Runner takes three pieces of information:

- num-threads: Maximum number of threads.
- chunk-size: Minimum or exact number of inputs each thread pulls from the concurrent iterator each time it becomes idle.
- computation-class: Currently, computations are divided into three classes, which may hint at different strategies to the runner:
  - Collect: All items will be processed and written concurrently.
  - Reduce: All items will be processed but not written concurrently; each thread carries out its own reduction.
  - EarlyReturn: Execution stops the first time a condition is satisfied, as in the find method. Stopping is handled by immediately skipping to the end of the concurrent iterator; each thread then knows it must return once it becomes idle again and sees that the iterator is consumed. Note that a very large chunk size is most damaging to performance in this scenario.
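For intuition, the three classes roughly map to different families of consuming methods. The snippet below is illustrative only, using Rust's standard sequential Iterator as a stand-in for the parallel counterparts:

```rust
fn main() {
    let inputs: Vec<u64> = (0..1024).collect();

    // Collect-like: every item is processed and the results are materialized.
    let doubled: Vec<u64> = inputs.iter().map(|x| x * 2).collect();

    // Reduce-like: every item is processed and folded into a single value;
    // in the parallel setting, each thread first folds its own partition.
    let total: u64 = inputs.iter().map(|x| x * 2).sum();

    // EarlyReturn-like: iteration stops at the first match; in the parallel
    // setting, the runner skips the iterator to its end to stop all threads.
    let found: Option<&u64> = inputs.iter().find(|x| **x > 100);

    assert_eq!(doubled[1], 2);
    assert_eq!(total, 1023 * 1024);
    assert_eq!(found, Some(&101));
}
```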
The executor spawns threads sequentially, one after the other. It is capable of avoiding unnecessary threads by checking the concurrent iterator's status in between two spawns. However, there is little time to decide here: in the majority of computations, unless the computation is trivially cheap, it quickly spawns the desired number of threads.
The chunk size can be set exactly by the caller, in which case there is no work left for the executor. When it is set to Auto, it is converted into a minimum chunk size by a heuristic that depends on the input concurrent iterator.
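For illustration only, such a heuristic might derive a minimum chunk size from the iterator's remaining length, when it is known; the constants below are made up, not the ones used by the current executor:

```rust
/// A hypothetical Auto heuristic: derive a minimum chunk size from the
/// iterator's remaining length, if known. All constants are illustrative.
fn auto_min_chunk_size(remaining_len: Option<usize>, num_threads: usize) -> usize {
    match remaining_len {
        // aim for roughly 16 pulls per thread, clamped to a sane range
        Some(len) => (len / (num_threads * 16)).clamp(1, 1 << 12),
        // unknown length: fall back to a small constant minimum
        None => 16,
    }
}
```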
Then, in between spawning threads, the executor has the opportunity to observe the progress of the already spawned threads and to spawn the next thread with a larger chunk size. Although this seems very limited, it is already very effective in reducing the overhead of parallelization. It could, however, provide much more capability if the executor had the opportunity to decide on the chunk size every time a thread requests to pull from the concurrent iterator.
Runner Trait - Step 1 - Dynamic Chunk Sizes
In a first attempt, we can leave the decision on the number of threads aside and let the runner control the parallel execution by deciding on the chunk sizes. The problem it must then solve is:

What is the optimal chunk size given:

- the computation class,
- the number of spawned threads,
- the input of the computation as a concurrent iterator,
- the number of elements already pulled from the concurrent iterator,
- the number of times each thread pulled from the iterator and how long each iteration took.
Let's define some helper types and a draft for the trait.
```rust
enum ComputationClass {
    Collect,
    Reduce,
    EarlyReturn,
}

struct ChunkExecution {
    len: usize,
    start_time: u64,
    completion_time: Option<u64>,
}

struct ThreadProgress {
    chunks: Vec<ChunkExecution>,
}

struct ExecutionStatus {
    threads_progresses: Vec<ThreadProgress>,
}

trait ParallelRunner {
    /// Returns the chunk size the idle thread must pull from the concurrent iterator.
    /// The thread must return when this method returns 0.
    fn chunk_size(class: ComputationClass, current_status: ExecutionStatus) -> usize;
}
```
Then, each thread that becomes idle first calls the chunk_size method and then proceeds according to the runner's decision.
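For illustration, the loop each idle worker thread runs might look roughly as follows, built on the draft types above; the ConcurrentIter trait with its pull method is a placeholder, and ComputationClass is assumed to derive Copy:

```rust
/// Placeholder for a concurrent iterator: `pull` hands out up to `n`
/// elements and returns an empty vector once the iterator is consumed.
trait ConcurrentIter<T> {
    fn pull(&self, n: usize) -> Vec<T>;
}

/// A hypothetical worker loop: each time the thread becomes idle, it asks
/// the runner for the next chunk size, pulls that many elements, and
/// processes them. Assumes `ComputationClass: Copy`.
fn worker_loop<T, I: ConcurrentIter<T>, R: ParallelRunner>(
    class: ComputationClass,
    iter: &I,
    snapshot: impl Fn() -> ExecutionStatus,
    mut process: impl FnMut(T),
) {
    loop {
        // ask the runner how much to pull next; 0 means terminate
        let chunk_size = R::chunk_size(class, snapshot());
        if chunk_size == 0 {
            return;
        }
        let chunk = iter.pull(chunk_size);
        if chunk.is_empty() {
            return; // the concurrent iterator is consumed
        }
        chunk.into_iter().for_each(&mut process);
    }
}
```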
How is a dynamic chunk size decision helpful?
As a rule of thumb, we want to work with a chunk size large enough to make the parallelization overhead negligible. Now assume a long iterator with many elements. We would tend to start with a not-so-small chunk size, say 64. But at the beginning, we don't know how long each individual computation takes. This approach would lead to very poor results if each computation is very lengthy and the computation class is early return (find). Parallel execution would most likely be orders of magnitude slower than sequential if the element we are looking for is the first one. This would then force us to be more conservative in the chunk size decision for early return scenarios, but then we would likely suffer from parallelization overhead when the computations are much simpler.
But if the runner can observe the execution and decide accordingly, it can make good decisions.
Consider the same scenario, but starting with a chunk size of 1. In the heavy computation case, the runner's chunk size method can simply keep returning 1, as it sees that each computation takes long enough to outweigh the parallelization overhead. We would be correct and efficient.
If the computations are easy, the threads would compute the single elements very quickly and ask for the next chunk size. This would be visible to the runner, and the runner could decide to increase the chunk size, say to 4, and in the next call to 8, and so on. Whenever it sees that the computation time outweighs the parallelization overhead, it can stop increasing. We would still be correct and efficient.
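A minimal sketch of such an observe-and-increase runner, written against the draft types above; the overhead constant and the averaging logic are illustrative assumptions rather than measured values:

```rust
/// Made-up estimate of the per-pull parallelization cost in nanoseconds;
/// in practice this would be measured or tuned.
const OVERHEAD_NANOS: u64 = 25_000;

struct DoublingRunner;

impl ParallelRunner for DoublingRunner {
    fn chunk_size(_class: ComputationClass, current_status: ExecutionStatus) -> usize {
        // aggregate observations from all completed chunks so far
        let (mut num_elements, mut total_nanos, mut last_len) = (0u64, 0u64, 1usize);
        for thread in &current_status.threads_progresses {
            for chunk in &thread.chunks {
                if let Some(completed) = chunk.completion_time {
                    num_elements += chunk.len as u64;
                    total_nanos += completed.saturating_sub(chunk.start_time);
                    last_len = last_len.max(chunk.len);
                }
            }
        }

        if num_elements == 0 {
            return 1; // no observations yet: start with a single element
        }

        // keep the chunk size once the work per chunk dominates the
        // parallelization overhead; otherwise keep doubling
        let avg_nanos_per_element = total_nanos / num_elements;
        if avg_nanos_per_element.saturating_mul(last_len as u64) >= OVERHEAD_NANOS {
            last_len
        } else {
            last_len * 2
        }
    }
}
```

Note that this sketch never returns 0; termination is driven by the concurrent iterator being consumed, as in the worker loop sketch above.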
Runner Trait - Step 2 - Dynamic Number of Threads
We would always give the caller the opportunity to set an upper bound on the number of threads that can be used for the computation. If it is omitted, we can assume that all available threads may be used. Let's call this number N.
Most parallel computations are not ideal; we get diminishing returns as we add more threads.
If the objective is to minimize the computation time, using N threads is almost always optimal (unless the computation is trivially cheap).
But if the objective is to maximize the efficient usage of the threads, there is no dominating solution. We need to find the sweet spot; but the sweet spot is not well defined, it is a preference.
Even currently in v1, the caller can benchmark and find the desired sweet spot and set a maximum number of threads. This could be good enough in the future as well.
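For instance, a caller might pin the benchmarked sweet spot per computation; the snippet below is a hypothetical usage sketch, assuming the par() entry point and the num_threads configuration method of the current v1 API:

```rust
use orx_parallel::*;

// hypothetical stand-in for an expensive per-element computation
fn heavy_computation(x: &u64) -> u64 {
    (0..1_000u64).fold(*x, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

fn main() {
    let inputs: Vec<u64> = (0..10_000).collect();

    // cap this particular computation at the benchmarked sweet spot
    let result: Vec<u64> = inputs.par().num_threads(8).map(heavy_computation).collect();

    assert_eq!(result.len(), inputs.len());
}
```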
One alternative, on the other hand, would be to give more responsibility to the ParallelRunner trait; in particular, to decide whether or not, or when, to spawn a new thread.
Should we spawn a new thread given:

- the current execution status,
- the number of available threads?
```rust
trait ParallelRunner {
    /// Returns the chunk size the idle thread must pull from the concurrent iterator.
    /// The thread must return when this method returns 0.
    fn chunk_size(class: ComputationClass, current_status: ExecutionStatus) -> usize;

    /// Returns whether or not a new thread needs to be spawned.
    fn do_spawn(
        class: ComputationClass,
        current_status: ExecutionStatus,
        num_available_threads: usize,
    ) -> bool;
}
```
An example use case could be as follows. Consider a computation where the time taken by individual elements is highly unpredictable, running in a resource-constrained environment. A strategy such as the following can be implemented through the parallel runner trait (a sketch follows the list):
- start sequentially,
- when T seconds have passed and the iterator has not yet reached 20%, spawn a second thread,
- when 2T seconds have passed and the iterator has not yet reached 50%, spawn two more threads,
- and so on.
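A rough sketch of this staged strategy follows. The checkpoint fractions, the interval T, and the simplified signature are illustrative assumptions; in the trait form, the progress and elapsed time would be derived from ExecutionStatus:

```rust
use std::time::{Duration, Instant};

/// A sketch of the staged spawning strategy; all thresholds are placeholders.
struct StagedRunner {
    start: Instant,
    t: Duration,
}

impl StagedRunner {
    /// `progress`: fraction of the iterator pulled so far, in [0, 1].
    /// `num_spawned`: number of threads already spawned.
    fn do_spawn(&self, progress: f64, num_spawned: usize, num_available: usize) -> bool {
        if num_spawned >= num_available {
            return false; // never exceed the caller's upper bound
        }
        let elapsed = self.start.elapsed();
        match num_spawned {
            // after T seconds, if less than 20% is done, add a second thread
            1 => elapsed >= self.t && progress < 0.2,
            // after 2T seconds, if less than 50% is done, go up to four threads
            2 | 3 => elapsed >= 2 * self.t && progress < 0.5,
            // further stages would follow the same pattern
            _ => false,
        }
    }
}
```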
This could give very detailed control to the implementor. Most likely, though, the complexity of implementing such a strategy would be worthwhile only in very critical use cases.