address issue with max concurrent and work fetch #5755

davidpanderson · 2024-08-13T01:31:37Z

Fixes #5749 (hopefully)

Max concurrent is a limit on jobs, not processor instances.
The work fetch logic made the erroneous implicit assumption
that all jobs use 1 CPU.
So e.g. if a project has max concurrent 4,
and the client has two 2-CPU jobs (4 CPUs in use),
it will think (if work buf is zero) that the project
is already at its limit and there's no point in fetching more work.
But in fact the project could use 8 CPUs (4 jobs of 2 CPUs each),
so on a host with at least 8 CPUs, 4 are idle.

Fix: if a project has MC constraints,
then for each resource compute 'mc_max_could_use':
the max # of instances the project could use, given its MC constraints.
Use this to compute the project's shortfall,
and hence to decide whether to fetch work from it.

Note: the way mc_max_could_use is computed is crude;
it takes the max over all apps,
when it's possible that only one of them has a MC constraint.
This could result in limited over-fetching,
but that's preferable to under-fetching and starvation.
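
A minimal sketch of the idea, for the CPU resource only. It is not the code in this PR: the `App` struct, the function signatures, and the `host_cpus` cap are assumptions made for illustration, and "max over all apps" is read here as the largest per-job CPU usage among the project's apps.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical type for illustration; not BOINC's actual data structure.
struct App {
    double cpus_per_job;  // CPU instances used by one job of this app
};

// Upper bound on the CPU instances a project could use under a
// max-concurrent (MC) limit: the job limit times the largest per-job
// usage among its apps, capped at the number of CPUs on the host.
// Taking the max over all apps is deliberately crude: it can
// overestimate when only some apps are MC-constrained.
double mc_max_could_use(const std::vector<App>& apps,
                        int max_concurrent,
                        double host_cpus) {
    assert(max_concurrent > 0);
    double max_per_job = 0.0;
    for (const App& a : apps) {
        max_per_job = std::max(max_per_job, a.cpus_per_job);
    }
    return std::min(host_cpus, max_concurrent * max_per_job);
}

// Instances the project could use but is not using now.
// A positive value means fetching more work is worthwhile.
double shortfall(double could_use, double in_use) {
    return std::max(0.0, could_use - in_use);
}
```

With the example above (max concurrent 4, 2-CPU jobs, 4 CPUs in use, and an assumed 8-CPU host), this gives a bound of 8 and a shortfall of 4, so work would be fetched. Under the old 1-CPU-per-job assumption the bound is 4 and the shortfall is 0.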

Sim: show app name in timeline

AenBleidd · 2024-08-13T10:46:36Z

@davidpanderson, are you sure this is the fix for #5743, not for #5749?

davidpanderson · 2024-08-13T19:24:10Z

Oops! Fixed.

AenBleidd force-pushed the dpa_max_concurrent2 branch from ea60096 to fabdd13 Compare August 13, 2024 08:34

AenBleidd approved these changes Aug 13, 2024

AenBleidd added C: Client - Scheduler Policy P: Minor T: Bugfix PR: Reviewed labels Aug 13, 2024

AenBleidd added this to the Client/Manager 8.0.5 milestone Aug 13, 2024

AenBleidd merged commit af6c23c into master Aug 13, 2024
146 checks passed

AenBleidd deleted the dpa_max_concurrent2 branch August 13, 2024 19:30

This pull request was closed.
