Server: Bug: Runtime of MT tasks underestimated in sched_send #4151

Closed
RichardHaselgrove opened this issue Jan 15, 2021 · 2 comments · Fixed by #4992

Comments

@RichardHaselgrove
Contributor

Describe the bug
When a client requests work, the request is expressed in core-seconds. When filling that request, the server does not account for the number of cores each multi-threaded (MT) task uses.

Expected behavior
The server needs to estimate the real CPU usage, across all applicable cores, for MT tasks.

Log entries

14/01/2021 19:30:21 |  | [work_fetch] Request work fetch: project work fetch resumed by user
14/01/2021 19:30:22 |  | [work_fetch] target work buffer: 8640.00 + 864.00 sec
14/01/2021 19:30:22 |  | [work_fetch] shortfall 38016.00 nidle 4.00 saturated 0.00 busy 0.00
14/01/2021 19:30:22 | Milkyway@Home | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices
14/01/2021 19:30:24 | Milkyway@Home | Scheduler request completed: got 23 new tasks
14/01/2021 19:30:24 | Milkyway@Home | [sched_op] estimated total CPU task duration: 39233 seconds
11/01/2021 12:00:39 | PrimeGrid | CPU needs work - buffer low
11/01/2021 12:00:39 | PrimeGrid | [work_fetch] request: CPU (38016.00 sec, 4.00 inst) Intel GPU (0.00 sec, 0.00 inst)
11/01/2021 12:00:39 | PrimeGrid | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices
11/01/2021 12:00:40 | PrimeGrid | Scheduler request completed: got 48 new tasks
11/01/2021 12:00:40 | PrimeGrid | [sched_op] estimated total CPU task duration: 56610 seconds

In both cases, the 'target work buffer' is expressed in wall time - 9,504 seconds, or about 2 hours 40 minutes.
But the 'request' is for four times that much - over 10 hours of core-time.
In both cases, the scheduler kept adding MT tasks to the reply as if they were single-threaded until the work request was fulfilled. Work for over ten hours of wall-time was delivered (more for PrimeGrid, which still uses DCF).
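The arithmetic behind those figures can be checked with a short sketch. The constants are copied from the Milkyway@Home log entries above. The function names are made up for illustration, and multiplying by the idle device count is inferred from the logged numbers, not taken from the client source.

```cpp
#include <cassert>

// Figures copied from the Milkyway@Home log entries above.
constexpr double kBufferBase  = 8640.0;  // first term of "target work buffer", seconds
constexpr double kBufferExtra = 864.0;   // second term of "target work buffer", seconds
constexpr double kIdleDevices = 4.0;     // "nidle 4.00" / "4.00 devices"

// Wall-time buffer the client wants to keep filled.
inline double wall_buffer_seconds() {
    return kBufferBase + kBufferExtra;
}

// Size of the work request: wall-time buffer multiplied by idle devices
// (an inference from the log, see the note above).
inline double requested_core_seconds() {
    return wall_buffer_seconds() * kIdleDevices;
}

// Convert a duration in seconds to hours.
inline double to_hours(double seconds) {
    assert(seconds >= 0.0);
    return seconds / 3600.0;
}
```

With these inputs the buffer is 9,504 seconds (2.64 hours) and the request is 38,016 core-seconds (10.56 hours), matching the `shortfall` and `CPU work request` values in the log.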

The server estimates task duration at
https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L429 (estimate_duration_unscaled) and
https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L477 (estimate_duration)
Neither routine considers the core loading of the MT tasks allocated.

avg_ncpus is available in the HOST_USAGE structure, and should be considered in either estimate_duration_unscaled or estimate_duration.
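A minimal sketch of the idea, not the actual BOINC code and not the change made in #4992: `HostUsageSketch`, `core_seconds` and `tasks_to_send` are hypothetical names, and the fill loop is a simplified stand-in for the scheduler's real logic. It only illustrates how weighting each task's wall-time estimate by `avg_ncpus` changes the number of tasks needed to satisfy a request in core-seconds.

```cpp
#include <cassert>

// Hypothetical stand-in for the one HOST_USAGE field that matters here.
struct HostUsageSketch {
    double avg_ncpus;  // cores the app version is expected to keep busy
};

// Convert a wall-time estimate into the core-seconds the task will consume.
inline double core_seconds(double wall_estimate, const HostUsageSketch& hu) {
    assert(hu.avg_ncpus > 0.0);
    return wall_estimate * hu.avg_ncpus;
}

// Count how many identical tasks are sent before the request is used up.
// With scale_by_ncpus == false each task is charged its wall time only,
// which is the behaviour described in this issue.
inline int tasks_to_send(double request_core_seconds, double wall_estimate,
                         const HostUsageSketch& hu, bool scale_by_ncpus) {
    assert(wall_estimate > 0.0);
    const double per_task =
        scale_by_ncpus ? core_seconds(wall_estimate, hu) : wall_estimate;
    double remaining = request_core_seconds;
    int sent = 0;
    while (remaining > 0.0) {
        remaining -= per_task;  // charge the task against the request
        ++sent;
    }
    return sent;
}
```

For the 38,016 core-second request above and an assumed 1,000-second task using 4 cores (illustrative figures, not taken from either project), the unscaled count is 39 tasks and the scaled count is 10. How values of `avg_ncpus` below 1 should be treated is a design question this sketch does not settle.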

@RichardHaselgrove
Contributor Author

This issue has been raised again by the CPDN project, which is testing and preparing to release a new multi-threaded application to process IFS climate models. These will be large tasks, with heavy resource demand: this bug will significantly delay the climate research, because too many tasks will be downloaded by the initial few machines.

We're heading towards a new server release to facilitate #4871 - it would be nice if somebody could code the solution to this trivial oversight before then. But I don't have access to a project server for testing, and I won't code without being able to test my own work.

@AenBleidd
Member

@davidpanderson, could you please take a look?
