-
Notifications
You must be signed in to change notification settings - Fork 29
scx: Pick an idle cpu when select runqueue with WF_EXEC #221
Conversation
If the caller of select_task_rq_scx() calls with WF_EXEC, it's a valuable balancing opportunity to perform task migration. Although runqueue can't dictate where the task is going to run, we can still take advantage of this good opportunity by selecting an idle core, rather than doing nothing. Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! Left a comment for some discussion.
if (unlikely(wake_flags & WF_EXEC)) | ||
return prev_cpu; | ||
if (unlikely(wake_flags & WF_EXEC)) { | ||
int idle_cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if we're going to do this we should probably use scx_select_cpu_dfl()
. But we have to be a bit careful here, not every scheduler will use built-in idle tracking, so that may actually cause a scheduler that implements ops.update_idle()
(and doesn't specify SCX_OPS_KEEP_BUILTIN_IDLE
) to be evicted. Maybe we want something like this:
if (unlikely(wake_flags & WF_EXEC)) {
if (static_branch_likely(&scx_builtin_idle_enabled)) {
bool found;
return scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
} else {
return prev_cpu;
}
}
A couple more thoughts:
The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE
when sched_ext is enabled, despite there being some performance implications so that we never skip running ops.select_cpu()
on the wakeup path.
So if we decide to do this, we might also want to introduce a new op flag for this which controls whether it's enabled (e.g. SCX_OPS_AUTO_IDLE_EXEC
) so that implementations can choose to opt in or out of letting the core scheduler reserve idle CPUs on the exec path.
@htejun wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, at WF_EXEC
, the cache footprint is low; however, unlike kernel scheduler, migration is often frequent and easy for SCX scheds. There is no built-in resistance to migration in SCX, so the benefit of extra migration opportunity is questionable. So, introducing unconditional migration to an idle CPU as the default behavior doesn't make much sense. It's likely to just add unnecessary migration overhead in exec path.
However, it could be that if an SCX scheduler implements high CPU stickiness like the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying - if such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU. There's no scenario where that's always a win. The scheduler must evaluate whether each relatively low cost migration opportunity is worth taking, which is what the kernel scheduler is doing too. Imagine a scheduler with high CPU stickiness and there are two CPUs. One CPU is loaded lower and a task is exec'ing there. The other CPU happened to be idle at that exact moment. If the task gets migrated to the more loaded CPU, it will just create more future need for migrations in the other direction and situations like this are quite likely in schedulers with high CPU stickiness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, it could be that if an SCX scheduler implements high CPU stickiness like the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying - if such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.
Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more thoughts:
The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling
TTWU_QUEUE
when sched_ext is enabled, despite there being some performance implications so that we never skip runningops.select_cpu()
on the wakeup path.
Interesting, so now EXT scheduling class is isolated from the rest of scheduling class like fair, deadline and so on, I suppose it should at least have the same priority level like normal class ? If that's the case then if we load BPF scheduler in the kernel, sched_entity
in kernel are still going to be scheduled by the default scheduler like CFS/EEVDF
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU. There's no scenario where that's always a win. The scheduler must evaluate whether each relatively low cost migration opportunity is worth taking, which is what the kernel scheduler is doing too. Imagine a scheduler with high CPU stickiness and there are two CPUs. One CPU is loaded lower and a task is exec'ing there. The other CPU happened to be idle at that exact moment. If the task gets migrated to the more loaded CPU, it will just create more future need for migrations in the other direction and situations like this are quite likely in schedulers with high CPU stickiness.
Then that's why BPF scheduler has more advantage since it can take more controls on migrations due to different scheduling policies , we should leave this part for per-scheduler implementation to decide, rather than perform a no-brain migration whenever the migration cost is cheap. Thanks for providing this great insight .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.
No problem ! Thanks for sharing these insights and feedbacks. I'll close the PR later.
Summary
If the caller of
select_task_rq_scx()
calls withWF_EXEC
, it's a valuable balancing opportunity to perform task migration. Although runqueue can't dictate where the task is going to run, we can still take advantage of this good opportunity by selecting an idle core, rather than doing nothing.Note
This might be an over-concerned case for current implementation of sched_ext since there's no caller like
sched_exec()
under /kernel/sched/core.c which would callselect_task_rq()
withWF_EXEC
. However it's still worth considering the changing and maybe add a proper caller withWF_EXEC
.