
Work fetch stops before per-app concurrency limit is reached even when CPUs are idle #5749

Closed
wujj123456 opened this issue Aug 11, 2024 · 6 comments · Fixed by #5755

Comments

@wujj123456

wujj123456 commented Aug 11, 2024

Describe the bug
The host has 8C/16T; currently 4 threads are busy running 2 ATLAS jobs while the remaining 12 are idle. The preference is set to use 100% of CPUs. The ATLAS app from LHC@Home is limited to 4 concurrent jobs with 2 CPUs each in app_config. All other projects are set to request no new work, so LHC will only provide work for the ATLAS app, per the online preferences.

While ATLAS alone can't fill all cores, I expect the work-fetch logic to get at least 4 tasks to max out the concurrency limit, occupying 8 threads. However, fetching consistently stops at 2 ATLAS WUs, with no other work running.

Only when I increase work_buf_min_days beyond the estimated finish time of each ATLAS WU does fetching resume; the new WUs then start right away, maxing out the concurrency limit.

Steps To Reproduce

  1. Apply app_config.xml for LHC as shown in app_config.txt
  2. Set all other projects except LHC@Home to request no new work.
  3. Set work_buf_min_days to be below estimated time of an ATLAS WU, like 0.1. Set work_buf_additional_days to 0. Set max_ncpus_pct to 100.
  4. The client fetches work until half of the concurrency limit is hit, instead of the full limit. A manual update won't trigger any work fetch either.
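For reference, an app_config.xml matching the description in step 1 would look roughly like the following. This is a sketch reconstructed from the report (4 concurrent ATLAS jobs, 2 CPUs each); the actual attached app_config.txt may differ, and any plan_class line it contains is omitted here:

```xml
<app_config>
    <app>
        <name>ATLAS</name>
        <!-- at most 4 ATLAS tasks running at once -->
        <max_concurrent>4</max_concurrent>
    </app>
    <app_version>
        <app_name>ATLAS</app_name>
        <!-- each task uses 2 CPUs -->
        <avg_ncpus>2</avg_ncpus>
    </app_version>
</app_config>
```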

Log with work fetch flag enabled is attached:
lhc_fetch.txt
Relevant parts (based on my guess) below.

[---] [work_fetch] shortfall 103680.00 nidle 12.00 saturated 0.00 busy 0.00
[LHC@home] [work_fetch] REC 1097.107 prio -1.009 can request work
[LHC@home] [work_fetch] share 1.000
[LHC@home] [work_fetch] using MC shortfall 0.000000 instead of shortfall 103680.000000
[LHC@home] Not requesting tasks: don't need (CPU: ; AMD/ATI GPU: )

I've also uploaded my state, prefs, and cc_config files to the simulator as scenario 210. Simulation 0 also failed to fetch work, though the log looks a bit different.
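The "using MC shortfall 0.000000 instead of shortfall" line above is the telling part: when a max_concurrent limit is set, the client substitutes a max-concurrent-aware shortfall for the plain one, and here it comes out zero even though two task slots under the limit are still free. A rough sketch of what a slot-aware shortfall could look like (function and variable names are hypothetical illustrations, not BOINC's actual code):

```python
def mc_shortfall(max_concurrent, cpus_per_task, running_tasks, buffer_secs):
    """Seconds of work needed to keep unused concurrency slots busy.

    Hypothetical illustration -- not BOINC's real implementation.
    """
    # Task slots still free under the per-app concurrency limit
    free_slots = max(0, max_concurrent - running_tasks)
    # Each free slot could keep cpus_per_task CPUs busy for the buffer period
    return free_slots * cpus_per_task * buffer_secs

# Reporter's setup: limit 4, 2 CPUs/task, 2 tasks running, 0.1-day buffer
print(mc_shortfall(4, 2, 2, 0.1 * 86400))
```

With 2 free slots this value is positive, so a work request should go out; the log's 0.0 instead suppresses the request entirely.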

Expected behavior
Given that the concurrency limit is 4 while only 2 WUs are running, I expect the client to fetch at least another 2 WUs regardless of what work_buf_min_days is. If I remove the concurrency limit with no other changes to preferences, the client proceeds to fetch enough work to fill all cores, which is the expected behavior.
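The arithmetic behind this expectation can be stated compactly. A small sketch (hypothetical helper, not BOINC code) of how many tasks a limit-aware fetch should request:

```python
def tasks_to_request(max_concurrent, cpus_per_task, running_tasks, idle_cpus):
    # Slots still free under the per-app concurrency limit
    free_slots = max(0, max_concurrent - running_tasks)
    # Tasks the idle CPUs could actually accommodate
    cpu_room = idle_cpus // cpus_per_task
    return min(free_slots, cpu_room)

# 8C/16T host: limit 4, 2 CPUs/task, 2 running, 12 threads idle
print(tasks_to_request(4, 2, 2, 12))  # → 2
```

The same arithmetic covers the 16C/32T case mentioned later: limit 8, 4 running, 28 threads idle gives 4 more tasks.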


System Information

  • OS: Ubuntu 24.04
  • BOINC Version: 7.24.1 (as packaged by Ubuntu)

Additional context
I was able to reproduce this on a different 16C/32T host by setting the ATLAS concurrency limit to 8 with 2 CPUs per WU as well. Fetch stops at 4 WUs on that machine, instead of 8.

@wujj123456 wujj123456 changed the title Per-app concurrency limit prevents work fetch and causes cores to be idle even when the limit has not been reached Work fetch stops before per-app concurrency limit is reached even when CPUs are idle Aug 11, 2024
@davidpanderson
Contributor

Please try to reproduce this on the client simulator:
https://boinc.berkeley.edu/dev/sim_web.php

Doing so makes it easy to fix things like this; otherwise it's hard.

@wujj123456
Author

Please try to reproduce this on the client simulator: https://boinc.berkeley.edu/dev/sim_web.php

Doing so makes it easy to fix things like this; otherwise it's hard.

I've already done that with scenario 210, linked in the post. As shown in simulation 0, other than starting the two existing tasks, I didn't see any log lines indicating work fetch either.

@davidpanderson
Contributor

Oops! I didn't see that. Thanks - I'll take a look at this soon.

@AenBleidd
Member

This issue is going to be closed as 'resolved'. That way, tonight (in ~6 hours) we will get a nightly build.
@wujj123456, I'll send you instructions later on how to install and test it.
Thank you in advance.

@AenBleidd
Member

@wujj123456, please use these instructions to install the nightly build: https://github.com/BOINC/boinc/wiki/Linux-DEB-and-RPM-support
It would be nice if you could test it and report back whether the issue is fixed.
Thank you in advance!

@wujj123456
Author

@AenBleidd @davidpanderson Thank you very much for the quick fix and clear test instructions. :-)
I just installed the nightly build on a host suffering from the issue. After a client restart, it immediately started fetching up to the concurrent limit. Top notch!
