Server: Bug: Runtime of MT tasks underestimated in sched_send #4151

RichardHaselgrove · 2021-01-15T11:29:04Z

Describe the bug
When a client requests work, the request is expressed in core-seconds. When filling that request, the server doesn't factor in how many cores each task uses.

Expected behavior
The server needs to estimate the real cpu usage, across all applicable cores, for MT tasks.

Log entries

14/01/2021 19:30:21 |  | [work_fetch] Request work fetch: project work fetch resumed by user
14/01/2021 19:30:22 |  | [work_fetch] target work buffer: 8640.00 + 864.00 sec
14/01/2021 19:30:22 |  | [work_fetch] shortfall 38016.00 nidle 4.00 saturated 0.00 busy 0.00
14/01/2021 19:30:22 | Milkyway@Home | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices
14/01/2021 19:30:24 | Milkyway@Home | Scheduler request completed: got 23 new tasks
14/01/2021 19:30:24 | Milkyway@Home | [sched_op] estimated total CPU task duration: 39233 seconds

11/01/2021 12:00:39 | PrimeGrid | CPU needs work - buffer low
11/01/2021 12:00:39 | PrimeGrid | [work_fetch] request: CPU (38016.00 sec, 4.00 inst) Intel GPU (0.00 sec, 0.00 inst)
11/01/2021 12:00:39 | PrimeGrid | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices
11/01/2021 12:00:40 | PrimeGrid | Scheduler request completed: got 48 new tasks
11/01/2021 12:00:40 | PrimeGrid | [sched_op] estimated total CPU task duration: 56610 seconds

In both cases, the 'target work buffer' is expressed in wall time - 9,504 seconds, or about 2 hours 40 minutes.
But the 'request' is for four times that much - over 10 hours of core-time.
In both cases, the scheduler went on adding MT tasks to the reply as if they were single-threaded until the work request was fulfilled. Work for over ten hours of wall time was delivered (more for PrimeGrid, which still uses DCF).
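
To make the arithmetic behind those figures explicit, here is a minimal sketch. The function names are hypothetical and are not taken from the BOINC source. It assumes the request is simply the buffer length multiplied by the number of idle instances, which is consistent with the log lines above but is not a statement of how the client computes it in general.

```cpp
#include <cassert>

// Instance-seconds needed to fill an empty buffer on `nidle` idle CPU instances.
// buf_min and buf_extra are the two wall-time terms of "target work buffer".
// (Hypothetical helper, for illustration only.)
double cpu_request_seconds(double buf_min, double buf_extra, double nidle) {
    assert(buf_min >= 0 && buf_extra >= 0 && nidle >= 0);
    return (buf_min + buf_extra) * nidle;
}

// Wall time needed to work through `total_estimate` seconds of estimated task
// duration on a host with `ncpus` cores, when each task occupies
// `cores_per_task` of them. (Hypothetical helper, for illustration only.)
double wall_time_to_drain(double total_estimate, double ncpus, double cores_per_task) {
    assert(ncpus > 0 && cores_per_task > 0 && cores_per_task <= ncpus);
    return total_estimate * cores_per_task / ncpus;
}
```

With the Milkyway@Home figures, `cpu_request_seconds(8640, 864, 4)` gives 38016, matching the logged request. If the 39233 seconds delivered were single-threaded tasks, `wall_time_to_drain(39233, 4, 1)` gives 9808.25 seconds, close to the 9,504-second buffer. If each task occupies all four cores (an assumption; the log does not state the thread count), `wall_time_to_drain(39233, 4, 4)` gives 39233 seconds, nearly 11 hours.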

The server estimates task duration at
https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L429 (estimate_duration_unscaled) and
https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L477 (estimate_duration)
Neither routine considers the core loading of the MT tasks allocated.

avg_ncpus is available in the HOST_USAGE structure, and should be considered in either estimate_duration_unscaled or estimate_duration.
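
A rough sketch of the difference this would make, using a hypothetical stand-in for the send loop rather than the real code in sched_send.cpp. It assumes identical tasks and ignores every other limit the scheduler applies. Only avg_ncpus is a name from the source; everything else is invented for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the scheduler's send loop: keep adding identical
// tasks until the client's request (in instance-seconds) is used up.
// wall_estimate: estimated wall-clock duration of one task, in seconds.
// avg_ncpus:     cores one task occupies (1 for single-threaded).
// charge_cores:  false = behaviour reported above, true = proposed accounting.
int tasks_sent(double request_seconds, double wall_estimate,
               double avg_ncpus, bool charge_cores) {
    assert(wall_estimate > 0 && avg_ncpus > 0);
    // Proposed accounting charges each task for every core it occupies.
    const double cost = charge_cores ? wall_estimate * avg_ncpus : wall_estimate;
    int sent = 0;
    while (request_seconds > 0) {
        request_seconds -= cost;
        ++sent;
    }
    return sent;
}
```

Taking the Milkyway@Home log above, the average estimate per task is 39233 / 23 seconds. `tasks_sent(38016, 39233.0 / 23, 4, false)` returns 23, the number actually delivered. Assuming each task uses four cores (not stated in the log), `tasks_sent(38016, 39233.0 / 23, 4, true)` returns 6.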


RichardHaselgrove · 2022-10-29T11:20:16Z

This issue has been raised again by the CPDN project, which is testing and preparing to release a new multi-threaded application to process IFS climate models. These will be large tasks, with heavy resource demand: this bug will significantly delay the climate research, because too many tasks will be downloaded by the initial few machines.

We're heading towards a new server release to facilitate #4871 - it would be nice if somebody could code the solution to this trivial oversight before then. But I don't have access to a project server for testing, and I won't code without being able to test my own work.

AenBleidd · 2022-10-29T11:36:46Z

@davidpanderson, could you please take a look?

AenBleidd added C: Server - Scheduler E: to be determined P: Major T: Defect labels Apr 25, 2021

AenBleidd added this to the Server milestone Apr 25, 2021

davidpanderson mentioned this issue Oct 31, 2022

scheduler: when sending a job, decrement work request time by the correct amount. #4992

Merged

AenBleidd added the R: fixed label Nov 1, 2022

lfield closed this as completed in #4992 Nov 9, 2022

AenBleidd modified the milestones: Server, Server Release 1.4.1 Aug 14, 2023
