
Work fetch stops before per-app concurrency limit is reached even when CPUs are idle #5749

Closed
wujj123456 opened this issue Aug 11, 2024 · 6 comments · Fixed by #5755

Comments

@wujj123456

wujj123456 commented Aug 11, 2024

Describe the bug
The host has 8C/16T; currently 4 threads are busy running 2 ATLAS jobs while the remaining 12 are idle. The preference is set to use 100% of CPUs. The ATLAS app from LHC@Home is limited to 4 concurrent jobs with 2 CPUs each in app_config. All other projects are set to request no new work, so LHC will only provide work for the ATLAS app, per the online preferences.

While ATLAS alone can't fill all cores, I expect the work-fetch logic to get at least 4 tasks to max out the concurrency limit, occupying 8 threads. However, fetching consistently stops at 2 ATLAS WUs, with no other work running.

Only when I increase work_buf_min_days beyond the estimated finish time of each ATLAS WU does fetching resume; the new WUs then start right away, maxing out the concurrency limit.

Steps To Reproduce

  1. Apply app_config.xml for LHC as shown in app_config.txt
  2. Set all other projects except LHC@Home to request no new work.
  3. Set work_buf_min_days to be below estimated time of an ATLAS WU, like 0.1. Set work_buf_additional_days to 0. Set max_ncpus_pct to 100.
  4. The client fetches work until half of the concurrency limit is hit, instead of the full limit. A manual update won't trigger any work fetch either.
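For reference, an app_config.xml matching the description in step 1 would look roughly like the following. This is a sketch reconstructed from the report (4 concurrent ATLAS jobs, 2 CPUs each); the actual attached app_config.txt may differ, and any plan_class line it contains is omitted here:

```xml
<app_config>
    <app>
        <name>ATLAS</name>
        <!-- at most 4 ATLAS tasks running at once -->
        <max_concurrent>4</max_concurrent>
    </app>
    <app_version>
        <app_name>ATLAS</app_name>
        <!-- each task uses 2 CPUs -->
        <avg_ncpus>2</avg_ncpus>
    </app_version>
</app_config>
```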

Log with work fetch flag enabled is attached:
lhc_fetch.txt
Relevant parts (based on my guess) below.

[---] [work_fetch] shortfall 103680.00 nidle 12.00 saturated 0.00 busy 0.00
[LHC@home] [work_fetch] REC 1097.107 prio -1.009 can request work
[LHC@home] [work_fetch] share 1.000
[LHC@home] [work_fetch] using MC shortfall 0.000000 instead of shortfall 103680.000000
[LHC@home] Not requesting tasks: don't need (CPU: ; AMD/ATI GPU: )

I've also uploaded my state, prefs, and cc_config files to the simulator as scenario 210. Simulation 0 also failed to fetch work, though the log looks a bit different.
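The "using MC shortfall 0.000000 instead of shortfall" line above is the telling part: when a max_concurrent limit is set, the client substitutes a max-concurrent-aware shortfall for the plain one, and here it comes out zero even though two task slots under the limit are still free. A rough sketch of what a slot-aware shortfall could look like (function and variable names are hypothetical illustrations, not BOINC's actual code):

```python
def mc_shortfall(max_concurrent, cpus_per_task, running_tasks, buffer_secs):
    """Seconds of work needed to keep unused concurrency slots busy.

    Hypothetical illustration -- not BOINC's real implementation.
    """
    # Task slots still free under the per-app concurrency limit
    free_slots = max(0, max_concurrent - running_tasks)
    # Each free slot could keep cpus_per_task CPUs busy for the buffer period
    return free_slots * cpus_per_task * buffer_secs

# Reporter's setup: limit 4, 2 CPUs/task, 2 tasks running, 0.1-day buffer
print(mc_shortfall(4, 2, 2, 0.1 * 86400))
```

With 2 free slots this value is positive, so a work request should go out; the log's 0.0 instead suppresses the request entirely.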

Expected behavior
Given that the concurrency limit is 4 while only 2 WUs are running, I expect the client to fetch at least another 2 WUs regardless of what work_buf_min_days is. If I remove the concurrency limit with no other changes to preferences, the client proceeds to fetch enough work to fill all cores, which is the expected behavior.
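The arithmetic behind this expectation can be stated compactly. A small sketch (hypothetical helper, not BOINC code) of how many tasks a limit-aware fetch should request:

```python
def tasks_to_request(max_concurrent, cpus_per_task, running_tasks, idle_cpus):
    # Slots still free under the per-app concurrency limit
    free_slots = max(0, max_concurrent - running_tasks)
    # Tasks the idle CPUs could actually accommodate
    cpu_room = idle_cpus // cpus_per_task
    return min(free_slots, cpu_room)

# 8C/16T host: limit 4, 2 CPUs/task, 2 running, 12 threads idle
print(tasks_to_request(4, 2, 2, 12))  # → 2
```

The same arithmetic covers the 16C/32T case mentioned later: limit 8, 4 running, 28 threads idle gives 4 more tasks.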


System Information

  • OS: Ubuntu 24.04
  • BOINC Version: 7.24.1 (as packaged by Ubuntu)

Additional context
I was able to reproduce this on a different 16C/32T host by setting the ATLAS concurrency limit to 8 with 2 CPUs per WU as well. Fetch stops at 4 WUs on that machine, instead of 8.

@wujj123456 wujj123456 changed the title Per-app concurrency limit prevents work fetch and causes cores to be idle even when the limit has not been reached Work fetch stops before per-app concurrency limit is reached even when CPUs are idle Aug 11, 2024
@davidpanderson
Contributor

Please try to reproduce this on the client simulator:
https://boinc.berkeley.edu/dev/sim_web.php

Doing so makes it easy to fix things like this; otherwise it's hard.

@wujj123456
Author

Please try to reproduce this on the client simulator: https://boinc.berkeley.edu/dev/sim_web.php

Doing so makes it easy to fix things like this; otherwise it's hard.

I've already done that with scenario 210, linked in the post. As shown in simulation 0, other than starting the two existing tasks, I didn't see any log lines indicating work fetch either.

@davidpanderson
Contributor

Oops! I didn't see that. Thanks - I'll take a look at this soon.

@AenBleidd
Member

This issue is going to be closed as 'resolved'. That way, tonight (in ~6 hours) we will get a nightly build.
@wujj123456, I'll send you instructions later on how to install and test it.
Thank you in advance.

@AenBleidd
Member

@wujj123456, please use these instructions to install the nightly build: https://github.com/BOINC/boinc/wiki/Linux-DEB-and-RPM-support
It would be nice if you could test it and report back whether the issue is fixed.
Thank you in advance!

@wujj123456
Author

@AenBleidd @davidpanderson Thank you very much for the quick fix and clear test instructions. :-)
I just installed the nightly build on a host suffering from the issue. After a client restart, it immediately started fetching up to the concurrent limit. Top notch!
