Set worker memory limits at OS level? #6177
I'm inclined to say this should be the responsibility of the deployment. At the generic level this library usually operates, I consider this low-level configuration rather hard to maintain.
cgroups should be a pretty stable API at this point. If we were just talking about
Fair, but if it's the deployment's responsibility, then I think we shouldn't have the memory limit feature in the nanny at all. The way it's implemented isn't reliable enough. To me, it's both simple to implement and quite useful, so I think it's reasonable for it to be the nanny's responsibility. But I'd be fine with removing the limit too.
+1.
I am open to this if it actually solves a problem. I am used to having a resource / cluster manager around killing misbehaving pods, so I am a bit biased. If this would be helpful for most users, I am open to it, but I would like to get some feedback from people who are actually working with deployments. @dchudz @jacobtomlinson any opinions? Would this be helpful? Would you prefer implementing this as part of the deployment, or should dask do this? Just a bunch of questions in the meantime.
Update here: unsurprisingly, you can't normally use cgroups if you're already inside a Docker container. I think we should still try to do it in dask (for non-containerized workloads, it would be helpful) and fall back on polling when cgroups can't be used. Related: https://stackoverflow.com/questions/32534203/mounting-cgroups-inside-a-docker-container
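For reference, a minimal sketch of what that fallback decision might look like, assuming the usual cgroup mount point; none of this is a dask API, it just checks whether the cgroup filesystem is writable (which it usually isn't inside an unprivileged container):

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumed mount point


def memory_limit_strategy() -> str:
    """Return "cgroups" when an OS-level limit looks possible, else "polling"."""
    if os.path.isdir(CGROUP_ROOT) and os.access(CGROUP_ROOT, os.W_OK):
        return "cgroups"
    return "polling"


print(memory_limit_strategy())
```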
This feels like a duplicate of #4558, and the case you describe could be useful there. I generally agree with @fjetter that this should be the responsibility of the deployment tooling or the OOM killer. I don't think Dask itself should be tinkering with cgroups, especially if it requires elevated privileges in a container environment. I wonder if there is an alternative where we could just get things to trigger the OOM killer as expected.
Forgot to write this, but for posterity: I get the sentiment that deployment tooling should be responsible for setting memory limits if desired, but that's not quite the model that dask offers. The Nanny is, in effect, a deployment tool offered by dask. Its job is to manage a Worker subprocess, kill it if it uses too much memory, and restart it if it dies. So I'd argue it's entirely within scope for the Nanny to enforce memory limits at a system level, since it's a deployment tool.
That's worth some research. Based on my understanding of the problem in #6110 (comment), though, it would basically involve disabling the disk cache, which is of course not acceptable. My guess is that anything we could do here would be more intrusive and fiddly than using cgroups.
The implication that polling "does not actually work" feels very drastic to me. It works fine if there is a (small) swap file mounted. It breaks specifically when the OS starts deallocating the executable's memory, which only happens after the swap file is full.
I can think of ways to reduce the problem (rough sketch below). For example:

- We could have a dynamic polling interval which automatically drops to as little as 1 ms when you approach the 95% threshold.
- We could be a lot more conservative in setting the automatic memory limit. E.g. we can easily detect with psutil if there's a swap file and take an educated guess that
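A rough, illustrative sketch of those first two ideas, assuming psutil is available; the thresholds and intervals are made up for illustration and this is not dask code:

```python
import psutil


def next_poll_interval(rss: int, limit: int) -> float:
    """Poll faster as the worker approaches its memory limit."""
    frac = rss / limit
    if frac >= 0.95:
        return 0.001  # 1 ms near the termination threshold
    if frac >= 0.80:
        return 0.05
    return 0.5  # relaxed polling far from the limit


def suggested_limit(configured_limit: int) -> int:
    """Be more conservative when there is no swap to absorb overshoot."""
    if psutil.swap_memory().total == 0:
        return int(configured_limit * 0.9)  # illustrative safety margin
    return configured_limit
```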
A couple of interesting things I've found on the topic. Not solutions dask would implement, but just useful for learning more. It could be worth playing with various
My reading of all the above comments is that this only applies to linux workers that do not have swap configured and are not running in containers. I would be curious to know what percentage of workers that is as I think most are already enforcing cgroups at the resource manager level. Basically, every way I deploy Dask these days is either inside a container or on an HPC.
Who are the affected users of this problem?
Correct, for the absence of a swap file.
I'm not sure if this comment belongs here or in a new issue. cgroups have two limits, a 'hard' limit and a 'soft' limit. For v2 (and I think v1) the cgroup docs state.
The v1 docs are a bit less clear, but I suspect the same mechanism kicks in. It might explain the heavy swapping without an OOMKill happening that @gjoseph92 is talking about. I think I'd strongly prefer that dask attempt to stay under the soft limit, and that it can automatically detect/use that limit. Without doing so it's just going to either end up in swap hell or get OOMKilled with no warning. However dask achieves this, it should be comfortably compatible with an unprivileged container and a non-root user.
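For reference, on a cgroup v2 (unified) hierarchy the hard and soft limits are exposed as `memory.max` and `memory.high`. A minimal detection sketch, assuming the standard unified mount point and skipping resolution of the process's own sub-cgroup (real code would read `/proc/self/cgroup`):

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumed cgroup v2 mount point.
CGROUP_V2 = Path("/sys/fs/cgroup")


def read_v2_memory_limits() -> Tuple[Optional[int], Optional[int]]:
    """Return (hard_limit, soft_limit) in bytes; None means no limit is set."""

    def read(name: str) -> Optional[int]:
        try:
            raw = (CGROUP_V2 / name).read_text().strip()
        except OSError:
            return None
        return None if raw == "max" else int(raw)

    # memory.max is the hard limit, memory.high is the soft (throttling) limit.
    return read("memory.max"), read("memory.high")


print(read_v2_memory_limits())
```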
I've created an MR (#7051) to move this forward a bit with respect to detecting and obeying existing cgroup limits.
The thing is, the whole system is designed so that it's resilient to an abrupt OOM kill.
Both are failure states though, an OOMKill just being more recoverable; neither is desirable.
In #6110 (comment), we found that workers were running themselves out of memory to the point where the machines became unresponsive. Because the memory limit in the Nanny is implemented at the application level, and in a periodic callback no less, there's nothing stopping workers from successfully allocating more memory than they're allowed to, as long as the Nanny doesn't catch them.
And as it turns out, if you allocate enough memory that you start heavily swapping (my guess, unconfirmed), but not so much that you get OOMkilled by the OS, it seems that you can effectively lock up the Nanny (and worker) Python processes, so the bad worker never gets caught, and everything just hangs. Memory limits are an important failsafe for stability, to un-stick this sort of situation.
A less brittle solution than this periodic callback might be to use the OS to enforce hard limits. The logical approach would just be `resource.setrlimit(resource.RLIMIT_RSS, (memory_limit_in_bytes, memory_limit_in_bytes))`. However, it turns out that `RLIMIT_RSS` is not supported on newer Linux kernels. The solution nowadays appears to be cgroups. Also relevant: https://jvns.ca/blog/2017/02/17/mystery-swap, https://unix.stackexchange.com/a/621576.
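For illustration only (not a proposed fix): the rlimit API takes a `(soft, hard)` pair, and the nearest limit that modern kernels do enforce is `RLIMIT_AS`, which caps virtual address space rather than resident memory, so it tends to fire far too early for mmap-heavy workloads:

```python
import resource

memory_limit_in_bytes = 4 * 2**30  # example: 4 GiB

# RLIMIT_RSS exists in the API but is ignored by modern Linux kernels.
# RLIMIT_AS *is* enforced, but it limits virtual address space, not RSS,
# so mmap-heavy libraries can hit it long before real memory runs out.
soft, hard = memory_limit_in_bytes, memory_limit_in_bytes
resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```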
We could use `memory.memsw.limit_in_bytes` to limit total RAM+swap usage, or `memory.limit_in_bytes` to limit just RAM usage, or some smart combo of both. (Allowing a little swap might still be good for unmanaged memory.)

Obviously, this whole discussion is Linux-specific. I haven't found (or tried that hard to find) macOS and Windows approaches; I think there might be something for Windows, and it sounds like probably not for macOS. We can always keep the current periodic callback behavior around for them, though.
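As a rough illustration of what the cgroup v1 route would involve (it needs a writable cgroup filesystem and typically root; the group name is arbitrary, the paths assume the common v1 layout, and this is not proposed dask code):

```python
import os
from pathlib import Path


def limit_with_cgroup_v1(limit_bytes: int, name: str = "dask-worker") -> None:
    """Create a v1 memory cgroup, cap its RAM (and RAM+swap), and join it."""
    group = Path("/sys/fs/cgroup/memory") / name  # assumed v1 mount point
    group.mkdir(exist_ok=True)
    (group / "memory.limit_in_bytes").write_text(str(limit_bytes))
    # RAM+swap accounting is only available if the kernel enables swapaccount.
    memsw = group / "memory.memsw.limit_in_bytes"
    if memsw.exists():
        memsw.write_text(str(limit_bytes))
    # Move the current process (e.g. the worker subprocess) into the group.
    (group / "cgroup.procs").write_text(str(os.getpid()))
```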
cc @fjetter