
Set worker memory limits at OS level? #6177

Open
gjoseph92 opened this issue Apr 22, 2022 · 16 comments
Labels
memory, stability (Issue or feature related to cluster stability, e.g. deadlock)

Comments

@gjoseph92
Collaborator

gjoseph92 commented Apr 22, 2022

In #6110 (comment), we found that workers were running themselves out of memory to the point where the machines became unresponsive. Because the memory limit in the Nanny is implemented at the application level, and in a periodic callback no less, there's nothing stopping workers from successfully allocating more memory than they're allowed to, as long as the Nanny doesn't catch them.

And as it turns out, if you allocate enough memory that you start heavily swapping (my guess, unconfirmed), but not so much that you get OOMkilled by the OS, it seems that you can effectively lock up the Nanny (and worker) Python processes, so the bad worker never gets caught, and everything just hangs. Memory limits are an important failsafe for stability, to un-stick this sort of situation.

A less brittle solution than this periodic callback might be to use the OS to enforce hard limits.

The logical approach would just be resource.setrlimit(resource.RLIMIT_RSS, memory_limit_in_bytes). However, it turns out that RLIMIT_RSS is not supported on newer Linux kernels. The solution nowadays appears to be cgroups.
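For illustration, this is roughly what that call would look like (the limit value is just an example); on modern Linux kernels it runs but is effectively a no-op, since RLIMIT_RSS is ignored:

```python
import resource

# Example only: RLIMIT_RSS is accepted but not enforced on modern Linux kernels.
memory_limit_in_bytes = 4 * 2**30  # 4 GiB, illustrative value
resource.setrlimit(resource.RLIMIT_RSS, (memory_limit_in_bytes, memory_limit_in_bytes))
```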

Also relevant: https://jvns.ca/blog/2017/02/17/mystery-swap, https://unix.stackexchange.com/a/621576.

We could use memory.memsw.limit_in_bytes to limit total RAM+swap usage, or memory.limit_in_bytes to limit just RAM usage, or some smart combo of both. (Allowing a little swap might still be good for unmanaged memory.)
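For concreteness, a rough, untested sketch of what the cgroup v1 variant could look like (assumes write access to /sys/fs/cgroup/memory and that swap accounting is enabled; the function name is made up):

```python
import os

def set_cgroup_v1_memory_limit(cgroup_name: str, limit_bytes: int, pid: int) -> None:
    """Create a child cgroup, set RAM and RAM+swap limits, and move `pid` into it."""
    base = os.path.join("/sys/fs/cgroup/memory", cgroup_name)
    os.makedirs(base, exist_ok=True)
    # Hard RAM limit; exceeding it triggers reclaim and eventually the OOM killer.
    with open(os.path.join(base, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    # RAM+swap limit; must be >= memory.limit_in_bytes, and only exists if the
    # kernel was booted with swap accounting enabled.
    with open(os.path.join(base, "memory.memsw.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    # Move the worker process into the new cgroup.
    with open(os.path.join(base, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```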

Obviously, this whole discussion is Linux-specific. I haven't found (or tried that hard to find) macOS and Windows approaches; I think there might be something for Windows, but probably not for macOS. We can always keep the current periodic-callback behavior around for them, though.

cc @fjetter

@gjoseph92 added the stability and memory labels Apr 22, 2022
@fjetter
Member

fjetter commented Apr 22, 2022

I'm inclined to say this should be the responsibility of the deployment. At the generic level this library usually operates, I consider this low-level configuration rather hard to maintain.

@gjoseph92
Collaborator Author

I consider this low-level configuration rather hard to maintain

cgroups should be a pretty stable API at this point. If we were just talking about resource.setrlimit, would you feel the same way? Is it just that cgroups sound too low-level/complex? Because it sounds like cgroups is just the modern equivalent of ulimit -m / resource.setrlimit.

I'm inclined to say this should be the responsibility of the deployment

Fair, but if it's the deployment's responsibility, then I think we shouldn't have the memory limit feature at all in the nanny. The way it's implemented isn't reliable enough.

To me, it's both simple to implement and quite useful, so I think it's reasonable for it to be the nanny's responsibility. But I'd be fine with removing the limit too.

@crusaderky
Collaborator

+1.
I like the idea of using the OS when possible and only falling back on the nanny polling system when it's not available.

@fjetter
Member

fjetter commented Apr 26, 2022

I am open to this if it actually solves a problem. I am used to having a resource/cluster manager around that kills misbehaving pods, so I am a bit biased. If this would be helpful for most users, I'd like to get some feedback from people who are actually working with deployments.

@dchudz @jacobtomlinson any opinions? Would this be helpful? Would you prefer implementing this as part of the deployment or should dask do this?

Just a bunch of questions in the meantime:

  • Can we control cgroups from Python?
  • Are any elevated permissions required?
  • Is there an equivalent for Windows, or how would we deal with it there?

@gjoseph92
Collaborator Author

Update here: unsurprisingly, you can't normally use cgroups if you're already inside a Docker container.

I think we should still try to do it in dask (for non-containerized workloads, it would be helpful) and fall back on polling if /sys/fs/cgroup/<cgroup-name> isn't writeable.
But as @fjetter said, if the deployment system is using containers, it will need to either set the limit itself, or give dask permissions to access cgroups.

https://stackoverflow.com/questions/32534203/mounting-cgroups-inside-a-docker-container
https://groups.google.com/g/kubernetes-dev/c/TBNzAxNXPOA
(I also tested this by SSHing into a Coiled worker.)
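Something like this hypothetical check could decide between the cgroup path and the polling fallback (the function name and default path are illustrative):

```python
import os

def can_set_cgroup_limits(cgroup_dir: str = "/sys/fs/cgroup/memory") -> bool:
    """Return True if we appear to be able to create/write cgroups ourselves."""
    # Inside most containers this directory is read-only (or absent), so we'd
    # fall back to the existing periodic-callback monitoring in the Nanny.
    return os.path.isdir(cgroup_dir) and os.access(cgroup_dir, os.W_OK)
```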

@jacobtomlinson
Member

This feels like a duplicate of #4558, and the case you describe could be useful there. I generally agree with @fjetter that this should be the responsibility of the deployment tooling or the OOM killer. I don't think Dask itself should be tinkering with cgroups, especially if it requires elevated privileges in a container environment.

I wonder if there is an alternative where we could just get things to trigger the OOM as expected.

@gjoseph92
Collaborator Author

Forgot to write this, but for posterity: I get the sentiment that deployment tooling should be responsible for setting memory limits if desired, but that's not quite the model that dask offers.

The Nanny is, in effect, a deployment tool offered by dask. Its job is to manage a Worker subprocess, kill it if it uses too much memory, and restart it if it dies. So I'd argue it's entirely within scope for the Nanny to enforce memory limits at a system level, since it's a deployment tool.

  1. If you're using a Nanny, it must be the one to set the system memory limit. If you set the limit on the outer Nanny process, instead of the inner worker process, then when the worker uses too much memory, the whole Nanny will get killed. Your worker won't be restarted any more.
    1. Obviously you might be using a deployment tool like k8s pods, which restarts the process automatically. But in that case, you didn't need a Nanny at all, and shouldn't be using one. So that's not relevant here. We're only talking about Nanny-based deployments where distributed.worker.memory.terminate is set.
  2. Dask offers the option to configure a memory kill threshold right now. If we offer the option, I just think it should be implemented in a way that actually works. If we don't want to implement it in a way that actually works (cgroups), we should probably not offer it at all, and instead say in the docs that we recommend using a deployment tool like XXX to enforce memory limits and restart your worker processes instead of a Nanny.

I wonder if there is an alternative where we could just get things to trigger the OOM as expected

That's worth some research. Based on my understanding of the problem in #6110 (comment), though, it would basically involve disabling the disk cache, which is of course not acceptable. My guess is that anything we could do here would be more intrusive and fiddly than using cgroups.

@crusaderky
Collaborator

If we don't want to implement it in a way that actually works (cgroups)

The implication that polling "does not actually work" feels very drastic to me. It works fine if there is a (small) swap file mounted. It breaks specifically when the OS starts deallocating the executable's memory, which only happens after the swap file is full.

@crusaderky
Collaborator

crusaderky commented Jul 22, 2022

I can think of ways to reduce the problem. For example:

We could have a dynamic polling interval which automatically drops to as little as 1ms when you approach the 95% threshold.
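A rough sketch of what that could look like (thresholds and intervals are placeholders, not a concrete proposal):

```python
def poll_interval(memory_fraction: float) -> float:
    """Seconds until the next memory check, given the fraction of the limit in use."""
    if memory_fraction < 0.80:
        return 0.1    # relaxed cadence while memory use is comfortable
    if memory_fraction < 0.95:
        return 0.01   # tighten as we approach the threshold
    return 0.001      # 1 ms once we're at or above the 95% threshold
```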

We could be a lot more conservative in setting the automatic memory limit. E.g. we can easily detect with psutil whether there's a swap file and take an educated guess (see the sketch after this list):

  • If there's a swap file, we're likely on somebody's laptop or on a traditional multi-purpose server. Memory in use by other processes will likely fluctuate wildly over time and it's OK to rely on a bit of swapping out - keep current behaviour.
  • If there's no swap file, it's very likely we're in a container or a specialised VM. Calculate max_memory as a function of the memory available at startup instead of total mounted memory, and add some headroom for good measure.
  • Alternatively, we could deprecate automatic max_memory if there's no swap file.
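Sketch of the heuristic above, assuming psutil is available (the headroom factor is a placeholder):

```python
import psutil

def guess_memory_limit(headroom: float = 0.1) -> int:
    """Educated guess at an automatic memory limit based on swap presence."""
    vm = psutil.virtual_memory()
    if psutil.swap_memory().total > 0:
        # Laptop / multi-purpose server: keep current behaviour (total memory).
        return vm.total
    # Container / specialised VM: base the limit on what's available at startup,
    # and leave some headroom since there's no swap to absorb spikes.
    return int(vm.available * (1 - headroom))
```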

@gjoseph92
Collaborator Author

gjoseph92 commented Aug 5, 2022

Couple of interesting things I've found on the topic. Not solutions dask would implement, but useful for learning more.

It could be worth playing with various /proc/sys/vm settings, like disabling overcommitting. Dask would not set these directly, but they could be things we recommend in the docs, and deployment systems like dask-cloudprovider, dask-k8s, coiled, etc. might be able to set them.
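For reference, the current overcommit policy can be inspected like this (0 = heuristic overcommit, 1 = always overcommit, 2 = never overcommit); again, something for docs or deployment tooling rather than dask itself:

```python
# Read-only peek at the kernel's overcommit policy; changing it requires root
# (e.g. sysctl vm.overcommit_memory=2) and is a deployment-level decision.
with open("/proc/sys/vm/overcommit_memory") as f:
    print(f.read().strip())
```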

@jacobtomlinson
Member

My reading of all the above comments is that this only applies to linux workers that do not have swap configured and are not running in containers.

I would be curious to know what percentage of workers that is as I think most are already enforcing cgroups at the resource manager level. Basically, every way I deploy Dask these days is either inside a container or on an HPC.

LocalCluster is likely on a machine with swap, SSHCluster is likely also on machines with swap, KubeCluster uses Kubernetes, EC2Cluster/AzureVMCluster/GCPCluster all use containers inside the VMs, ECSCluster uses containers, SLURMCluster/PBSCluster/etc use HPC resource managers that generally enforce cgroups.

Who are the affected users of this problem?

@crusaderky
Collaborator

My reading of all the above comments is that this only applies to linux workers that do not have swap configured and are not running in containers.

Correct regarding the absence of a swap file.
I think, however, that the paging out of executable memory is something that @gjoseph92 observed on Coiled - e.g. Docker images. Do I understand correctly?

@shughes-uk
Contributor

I'm not sure if this comment belongs here, or a new issue.

cgroups have two limits, a 'hard' limit and a 'soft' limit. For v2 (and I think v1), the cgroup docs state:

If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure

The v1 docs are a bit less clear, but I suspect the same mechanism kicks in. It might explain the heavy swapping without an OOMKill happening that @gjoseph92 is talking about.

I think I'd strongly prefer that dask attempt to stay under the soft limit, and automatically detect and use it. Without doing so, it's just going to either end up in swap hell or get OOMKilled with no warning.

However dask achieves this, it should be comfortably compatible with an unprivileged container and a non-root user.
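A rough sketch of detecting existing cgroup v2 limits from inside the worker (paths assume the unified v2 hierarchy as typically seen inside a container; "max" means no limit is set):

```python
def read_cgroup_v2_memory_limits(base: str = "/sys/fs/cgroup") -> dict:
    """Return the 'soft' (memory.high) and 'hard' (memory.max) limits, if any."""
    limits = {}
    for name in ("memory.high", "memory.max"):
        try:
            with open(f"{base}/{name}") as f:
                raw = f.read().strip()
            limits[name] = None if raw == "max" else int(raw)
        except FileNotFoundError:
            limits[name] = None  # not on cgroup v2, or no memory controller here
    return limits
```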

@shughes-uk
Contributor

shughes-uk commented Sep 21, 2022

I've created an MR (#7051) to move this forward a bit, with respect to detecting and obeying existing cgroup limits.

@crusaderky
Collaborator

I think I'd strongly prefer that dask attempt to stay under the soft limit, and automatically detect and use it. Without doing so, it's just going to either end up in swap hell or get OOMKilled with no warning.

The thing is, the whole system is designed to be resilient to an abrupt OOM kill.
What it's not resilient to is a worker becoming sluggish (but not completely unresponsive) due to swap-file thrashing.
So an OOM kill is always preferable.

@shughes-uk
Copy link
Contributor

Both are failure states, though; an OOMKill is just more recoverable. Neither is desirable.
