Client starts set of jobs too large for system memory #5641

Closed
davidpanderson opened this issue Jun 6, 2024 · 0 comments · Fixed by #5642

(This problem was reported by Glenn from CPDN)

For each runnable job we have several estimates of its future working set size (WSS):

a) the project-supplied value rsc_memory_bound
b) APP_VERSION::max_working_set_size:
   the max measured WSS of jobs using this app version
   since the client started (not saved to state file)
c) if the job has already run, ACTIVE_TASK::working_set_size_smoothed:
   a recent average (on the order of 1 min) of its WSS

Current policy:
In job scheduling, for job WSS we use
   if job has run: c)
   else if b) is nonzero: b)
   else a)
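
The current policy can be sketched as below. This is only an illustration: the JobEstimates struct and the function name are hypothetical stand-ins, not the client's actual data structures; only the three field names come from the description above.

```cpp
// Hypothetical, simplified stand-in for the per-job data the client holds.
struct JobEstimates {
    double rsc_memory_bound;           // a) project-supplied bound, in bytes
    double max_working_set_size;       // b) max WSS measured for this app version; 0 if none yet
    double working_set_size_smoothed;  // c) recent average WSS of this job (valid only if it has run)
    bool has_run;                      // whether the job has already run
};

// Current policy: c) if the job has run, else b) if nonzero, else a).
double current_policy_wss(const JobEstimates& j) {
    if (j.has_run) return j.working_set_size_smoothed;
    if (j.max_working_set_size != 0) return j.max_working_set_size;
    return j.rsc_memory_bound;
}
```

Note that once a job has run, the project's bound a) is never consulted, however small the measured value c) is.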

Problem:
CPDN jobs run for a few minutes with small WSS (say, 1MB).
Then they grow to full WSS (say, 6GB).
There are various scenarios in which this leads to problems.
E.g. suppose the host has 16GB RAM and 8 cores.
It gets 8 CPDN jobs.
It starts 2 of them.
Their WSS is measured as 1MB.
On the next reschedule the client starts 2 more CPDN jobs.
Eventually all 4 jobs expand to 6GB WSS.
The total (24GB) is bigger than RAM and maybe bigger than swap space.
Some of the jobs fail with memory allocation failures.
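
The arithmetic of this scenario can be checked with a small sketch. It assumes a deliberately simplified admission rule (start jobs while the sum of per-job WSS estimates fits in RAM) and ignores RAM-usage preferences and every other scheduling constraint; jobs_admitted and peak_demand are hypothetical helpers, not client functions.

```cpp
// Hypothetical admission rule: keep starting jobs (up to max_jobs) while the
// sum of the per-job WSS estimates still fits in RAM. All sizes are in bytes.
int jobs_admitted(double ram_bytes, int max_jobs, double est_wss_per_job) {
    int n = 0;
    double used = 0;
    while (n < max_jobs && used + est_wss_per_job <= ram_bytes) {
        used += est_wss_per_job;
        ++n;
    }
    return n;
}

// Memory actually needed once every running job has grown to its full WSS.
double peak_demand(int njobs, double full_wss) {
    return njobs * full_wss;
}
```

With 16GB of RAM and a cap of 4 jobs, a 1MB estimate admits all 4 (peak demand 24GB), while a 6GB estimate admits only 2 (peak demand 12GB).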

Solution:
Change the WSS policy to:
   if job has run: max(a, c)
   else: max(a, b)
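
A minimal sketch of the proposed policy follows. It assumes that a job which has not yet run uses max(a, b), since c) exists only after the job has run. As before, the JobEstimates struct and the function name are hypothetical and not taken from the client source.

```cpp
#include <algorithm>

// Hypothetical, simplified stand-in for the per-job data the client holds.
struct JobEstimates {
    double rsc_memory_bound;           // a) project-supplied bound, in bytes
    double max_working_set_size;       // b) max WSS measured for this app version; 0 if none yet
    double working_set_size_smoothed;  // c) recent average WSS of this job (valid only if it has run)
    bool has_run;                      // whether the job has already run
};

// Proposed policy: never estimate below the project-supplied bound a).
double proposed_policy_wss(const JobEstimates& j) {
    if (j.has_run) {
        return std::max(j.rsc_memory_bound, j.working_set_size_smoothed);
    }
    // Assumption: c) is unavailable before the job has run, so use a) and b).
    return std::max(j.rsc_memory_bound, j.max_working_set_size);
}
```

For a CPDN-like job with a 6GB bound that currently measures 1MB, this returns 6GB, so the scheduler accounts for the job's eventual growth.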

Note:
What happens if project's rsc_memory_bound is wildly wrong?
   if too large:
      client may run fewer jobs for that project
   if too small:
      same as current: client may run too many jobs,
      and they may fail (and possibly cause jobs of other projects to fail)
      with malloc failure

@davidpanderson davidpanderson changed the title Client starts jobs too large for system memory Client starts set of jobs too large for system memory Jun 6, 2024