Docker Driver Fails With Upper Limit of 262144 CPU Shares #7731

Open

herter4171 opened this issue Apr 15, 2020 · 6 comments

Labels
stage/accepted theme/client theme/driver/docker type/bug

Comments

@herter4171

Nomad version

Nomad v0.11.0 (5f8fe0a)

Operating system and Environment details

Amazon Linux 2 with a fixed head node and an auto-scaling group whose scaling is driven by a custom cloud metric derived from Nomad state.

Issue

This came up in the course of troubleshooting issue #7681, and while my intent isn't to issue-spam you guys, I think this is a separate problem that is actively holding back some of my work, unlike the former.

Anyway, I'm experiencing a Docker driver failure due to an apparent upper limit on CPU shares. I have tested this on c5.18xlarge instances and again on m5a.24xlarge instances, with identical results:

[screenshot: Docker driver task error referencing container_linux.go / process_linux.go and the 262144 CPU shares limit]

I can't even find process_linux.go in the source, so I'm really at a loss here. Any help is greatly appreciated.

Reproduction steps

Submit a job that allocates more than 262144 CPU shares on an instance large enough for it to be placed, and the Docker driver should fail in the manner I've described.
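For concreteness, a minimal job spec along these lines should trigger it. The job/task names and image below are just placeholders; the only part that matters is the cpu value, which Nomad hands to Docker 1:1 as CPU shares.

```hcl
job "cpu-shares-repro" {
  datacenters = ["dc1"]

  group "repro" {
    task "sleeper" {
      driver = "docker"

      config {
        image   = "alpine:3.11"
        command = "sleep"
        args    = ["600"]
      }

      resources {
        # Nomad's cpu resource is in MHz and is passed straight through to
        # Docker as CPU shares, so any value above 262144 hits the limit once
        # the job lands on a node with that much compute available.
        cpu    = 270000
        memory = 128
      }
    }
  }
}
```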

@shishir-a412ed
Contributor

@herter4171 You won't find container_linux.go or process_linux.go in the Nomad or Docker codebases. The error is coming from the container runtime (runc).

https://github.com/opencontainers/runc/blob/master/libcontainer/container_linux.go#L349

@herter4171
Author

@shishir-a412ed, thanks for pointing me in the right direction. 262144 is all over the place there.

@tgross
Member

tgross commented Apr 17, 2020

@herter4171 just a heads-up, that value is the maximum cpu_share parameter value from the Linux kernel. I'm not sure if there's a tunable for that.
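For anyone curious, the kernel side is easy to poke at directly. The sketch below is only an illustration, and it assumes cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu, root privileges, and a scratch cgroup created beforehand with `mkdir /sys/fs/cgroup/cpu/demo`. As best I can tell, the kernel accepts an oversized write but clamps the stored value, and the runtime error comes from that mismatch being noticed afterwards.

```go
// Rough illustration only, not a supported interface. Assumes Linux, cgroup v1,
// the cpu controller at /sys/fs/cgroup/cpu, root, and an existing "demo" cgroup.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const sharesFile = "/sys/fs/cgroup/cpu/demo/cpu.shares"

	// Ask for more shares than the kernel's maximum (1 << 18 = 262144).
	if err := os.WriteFile(sharesFile, []byte("300000"), 0o644); err != nil {
		panic(err)
	}

	// The write succeeds, but reading the file back shows the clamped value
	// (expected: 262144), not the 300000 that was requested.
	raw, err := os.ReadFile(sharesFile)
	if err != nil {
		panic(err)
	}
	fmt.Println("cpu.shares after writing 300000:", strings.TrimSpace(string(raw)))
}
```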

@herter4171
Author

Hey @tgross, thanks for the heads-up. This was the first time that troubleshooting led me to a Torvalds repo, and I knew to abandon all hope without being well-versed in operating systems. Hats off to you and your team for the CPU burst capability. Running with 262144 shares on a c5.24xlarge takes up 90% of the capacity, so we'll be just fine, even if another job allocates the remainder from time to time.

@schmichael
Member

schmichael commented Feb 5, 2021

Reopening as the root bug here is Nomad's 1:1 mapping of MHz to shares. I think Nomad can even change that in a way that fixes this bug and preserves backward-compatible behavior: a 10:1 or 128:1 or similar mapping should preserve the relative cpu share weights while keeping within the valid value range.

This problem is going to get more common as high-core-count machines see wider use.
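To make the idea concrete, here's a rough sketch of what a scaled mapping could look like; the divisor and the names are purely illustrative, not a committed design.

```go
package main

import "fmt"

// Illustrative constants only; the divisor is the open question (10, 128, ...).
const (
	minShares = 2      // kernel minimum cpu.shares value
	maxShares = 262144 // kernel maximum cpu.shares value (1 << 18)
	divisor   = 10     // hypothetical 10:1 MHz-to-shares mapping
)

// mhzToShares converts a task's allocated CPU (in MHz) to a cpu.shares value.
// Because every task is divided by the same constant, relative weights between
// tasks on a node are preserved while staying inside the kernel's valid range.
func mhzToShares(mhz int) int {
	shares := mhz / divisor
	if shares < minShares {
		shares = minShares
	}
	if shares > maxShares {
		shares = maxShares
	}
	return shares
}

func main() {
	for _, mhz := range []int{500, 2500, 100000, 300000} {
		fmt.Printf("cpu = %6d MHz -> cpu.shares = %d\n", mhz, mhzToShares(mhz))
	}
}
```

The trade-off is resolution at the very low end: tiny allocations all collapse toward the kernel minimum, which seems acceptable.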

@schmichael schmichael reopened this Feb 5, 2021
@schmichael schmichael added the stage/accepted, theme/client, theme/driver/docker, and type/bug labels Feb 5, 2021
@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@flyinprogrammer
Contributor

As brought up previously: #4899 (comment)

https://bugs.openjdk.java.net/browse/JDK-8146115

If cpu_shares has been setup for the container, the number_of_cpus() will be calculated based on cpu_shares()/1024. 1024 is the default and standard unit for calculating relative cpu usage in cloud based container management software.

By this logic, it seems the OpenJDK community believes there will never be more than 256 (262144/1024) cores in a machine, or that they're willing to propose a kernel patch when the time comes 😂

"10:1 or 128:1 or similar mapping"

I'm worried that your proposed fix is going to just add another layer of broken to this lasagna of madness.

It seems to me the better solution would be to either go all-in on cpu-shares being relative to the other processes running on the machine (in the context of the magic number 1024), or go all-in on CFS quotas like k8s has.

As far as backwards compatibility is concerned, why not just implement a new resource constraint, cpu-shares, which would let us get this right, similar to the cpu-cores resource proposed in #8473?
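For reference, the CFS-quota math I mean is roughly the following. This is only an illustrative sketch using Docker/runc's usual 100ms default period, not anything Nomad does today.

```go
package main

import "fmt"

// Sketch of the CFS-quota style of limit (what k8s and docker --cpus use): a
// task gets an absolute slice of CPU time per period instead of a relative
// weight, so the cpu.shares ceiling never comes into play.
const cfsPeriodUs = 100000 // common 100ms default for cpu.cfs_period_us

// coresToCFSQuota returns a cpu.cfs_quota_us value that allows `cores` worth
// of CPU time every period.
func coresToCFSQuota(cores float64) int64 {
	return int64(cores * cfsPeriodUs)
}

func main() {
	for _, cores := range []float64{0.5, 2, 96} {
		fmt.Printf("%5.1f cores -> cpu.cfs_quota_us = %d (cpu.cfs_period_us = %d)\n",
			cores, coresToCFSQuota(cores), cfsPeriodUs)
	}
}
```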
