
Default split strategy for GreedyScheduler #124

Closed
fredrikekre opened this issue Sep 26, 2024 · 9 comments

@fredrikekre
Contributor

The default split strategy is :scatter for GreedyScheduler (and :batch for the others). I didn't see this being discussed in #77 -- what was the rationale?

In my case I essentially want to use the GreedyScheduler as an "ntasks-limited" version of DynamicScheduler, but I was wondering why they have different defaults for this.
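
For concreteness, roughly what I mean (just a sketch; `f` and `xs` are placeholders, keyword names as in the docs):

```julia
using OhMyThreads: tforeach, DynamicScheduler, GreedyScheduler

f(x) = sum(abs2, x)                 # placeholder workload
xs = [rand(100) for _ in 1:10_000]  # placeholder data

# DynamicScheduler: one task per chunk, default split = :batch.
tforeach(f, xs; scheduler = DynamicScheduler())

# GreedyScheduler: at most `ntasks` tasks pulling work from a shared channel,
# default split = :scatter and chunking = false.
tforeach(f, xs; scheduler = GreedyScheduler(ntasks = 4))
```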

@carstenbauer
Member

The default split strategy is :scatter for GreedyScheduler (and :batch for the others). I didn't see this being discussed in #77 -- what was the rationale?

Good question. I can't recall it. @MasonProtter any idea why we have this different default?

(In any case, speaking about different defaults, it's worth noting that the GreedyScheduler has chunking=false.)

In my case I essentially want to use the GreedyScheduler as an "ntasks-limited" version of DynamicScheduler, but I was wondering why they have different defaults for this.

Maybe I misunderstand but doesn't minchunksize, once we have it (#114), give you what you want?

@fredrikekre
Contributor Author

Maybe I misunderstand but doesn't minchunksize, once we have it (#114), give you what you want?

No, I don't think so. I am after the decoupling of nchunks and ntasks that the GreedyScheduler offers and I don't see how minchunksize would help with that in the DynamicScheduler.

@MasonProtter
Member

Good question. I can't recall it. @MasonProtter any idea why we have this different default?

Honestly, I don't remember either. Maybe the motivation was that if you're using the GreedyScheduler, then that's because each iteration is likely taking an unpredictable amount of time, so a round robin approach might help. But that's just a guess; I have no idea.

@carstenbauer
Member

carstenbauer commented Sep 27, 2024

Maybe the motivation was that if you're using the GreedyScheduler, then that's because each iteration is likely taking an unpredictable amount of time, so a round robin approach might help

That was it. We argued that the GreedyScheduler is mostly useful for non-uniform workload - because otherwise the other schedulers have less overhead - and for non-uniform workload the :roundrobin / :scatter split strategy is typically advantageous.

This line of reasoning still seems sound to me, but I'm generally open to changing the default. It might be insightful to test the argument with benchmarks, especially in light of #121, which makes the GreedyScheduler more efficient.

No, I don't think so. I am after the decoupling of nchunks and ntasks that the GreedyScheduler offers and I don't see how minchunksize would help with that in the DynamicScheduler.

I see. Out of curiosity, why are you after the decoupling of nchunks and ntasks? Do you have a very non-uniform (or perhaps even unknown/dynamic) workload?

@fredrikekre
Contributor Author

Out of curiosity, why are you after the decoupling of nchunks and ntasks? Do you have a very non-uniform (or perhaps even unknown/dynamic) workload?

Yes, potentially. The case I am considering is assembly of finite element matrices. In many cases, depending on the problem you are solving, all elements are identical, and then of course just splitting into nthreads chunks and letting each thread process a single chunk is more or less optimal (Ferrite-FEM/Ferrite.jl#1068 (comment), and with thread pinning (thanks for that package!): Ferrite-FEM/Ferrite.jl#1068 (comment)). However, if you simulate plastic deformation the workload will become extremely uneven: the elements that enter the plastic region will take significantly longer to compute.

The reason I want to limit the total number of tasks is that you typically need task-local scratch data which, again depending on the problem, can be somewhat large and/or expensive to set up.
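
Roughly the pattern I have in mind (a sketch only; `Scratch`, `setup_scratch`, `assemble_element!`, `K`, `f`, `elements`, and `problem` are placeholders for the Ferrite-specific pieces):

```julia
using OhMyThreads: tforeach, GreedyScheduler, TaskLocalValue

# Each task sets up its scratch data (cell cache, local matrix/vector, ...) once.
scratch = TaskLocalValue{Scratch}(() -> setup_scratch(problem))

# Limit the number of tasks (and hence scratch allocations), while the greedy
# scheduler hands out elements dynamically so that expensive (plastic) elements
# don't stall a statically assigned chunk.
tforeach(eachindex(elements); scheduler = GreedyScheduler(ntasks = Threads.nthreads())) do i
    assemble_element!(K, f, elements[i], scratch[])  # scratch[] is per-task
end
```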

and for non-uniform workload the :roundrobin / :scatter split strategy is typically advantageous.

I guess this depends on whether you expect the workload to be sorted or not. If it is truly random, then the strategy shouldn't matter, I think?

@carstenbauer
Member

carstenbauer commented Sep 27, 2024

The reason I want to limit the total number of tasks is that you typically need task-local scratch data which, again depending on the problem, can be somewhat large and/or expensive to set up.

The task-local scratch data setup I get (in fact, we mention it in the docs). What I'm curious about is why you need decoupled chunking. By assigning N chunks to T tasks you effectively bundle ("chunk") again: One bundle being the chunks that each task processes. Couldn't you just as well have simply used one chunk per task (N == T) (which corresponds to the "bundle") instead?

In the context of non-uniform workload, the decoupling, as implemented by the greedy scheduler, makes sense to me, because what will eventually constitute a "bundle" is dynamically decided. In some loose sense, the greedy scheduler is a dynamic variant of :roundrobin. And I guess what I'm asking myself is: when is this dynamic greedy approach better than the regular dynamic and static schedulers with split=:roundrobin?
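
In scheduler terms, the two variants I'm comparing would be something like this (a sketch; `T` and the factor 10 are arbitrary, keyword names as in the docs):

```julia
using OhMyThreads: DynamicScheduler, GreedyScheduler

T = Threads.nthreads()

# One task per chunk; each chunk is an interleaved (round-robin) slice.
dyn = DynamicScheduler(; nchunks = T, split = :roundrobin)

# N chunks but only T tasks; chunks are assigned to tasks dynamically
# via a shared channel.
greedy = GreedyScheduler(; ntasks = T, chunking = true, nchunks = 10 * T)
```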

and for non-uniform workload the :roundrobin / :scatter split strategy is typically advantageous.

I guess this depends on whether you expect the workload to be sorted or not.

The workload doesn't have to be sorted I think. With :roundrobin you (kind of) randomize across the workload. Even for non-sorted (but non-uniform) workload distributions this should (conceptually) be beneficial. For example, consider a bimodal workload distribution (which isn't sorted). With :consecutive, the tasks that get chunks from the middle of the distribution (low workload) will eventually have to wait for the other tasks to finish. With :roundrobin, each chunk will get high-workload and low-workload bits. (Not saying that it will be perfectly balanced, of course.)
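
Schematically, for 12 items split into 3 chunks (indices only; not the exact ChunkSplitters output, just to illustrate the two strategies):

```julia
# :batch / :consecutive: contiguous blocks; one chunk may end up with only the
# cheap middle of a bimodal workload.
consecutive = [1:4, 5:8, 9:12]

# :scatter / :roundrobin: interleaved with stride 3; every chunk samples the
# whole range, mixing cheap and expensive items.
roundrobin = [1:3:10, 2:3:11, 3:3:12]
```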

If it is truly random then the strategy shouldn't matter I think?

I agree.

@fredrikekre
Contributor Author

The task-local scratch data setup I get (in fact, we mention it in the docs). What I'm curious about is why you need decoupled chunking. By assigning N chunks to T tasks you effectively bundle ("chunk") again: One bundle being the chunks that each task processes. Couldn't you just as well have simply used one chunk per task (N == T) (which corresponds to the "bundle") instead?

Maybe I am misunderstanding you, but this wouldn't result in the workstealing behavior of the greedy scheduler, right?

@fredrikekre
Contributor Author

Anyway, this can probably be closed and certainly shouldn't hold up the new release. I was just curious.

@carstenbauer
Member

Maybe I am misunderstanding you, but this wouldn't result in the workstealing behavior of the greedy scheduler, right?

It wouldn't. It just isn't obvious to me in what cases the greedy scheduler performs better than :roundrobin.

Anyway, this can probably be closed and certainly shouldn't hold up the new release. I was just curious.

Yep, I think the current default split=:roundrobin makes sense for the greedy scheduler. Unless someone complains about it and, ideally, can provide some benchmarks, I don't see a reason to change the default.
