
1.3.0 container memory constraints not in effect leading to OOMs #13031

Closed
djenriquez opened this issue May 16, 2022 · 6 comments · Fixed by #13058
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · stage/needs-investigation · theme/cgroups (cgroups issues) · type/bug
Milestone: 1.3.2


Nomad version

Nomad v1.2.4 (9f21b72)
Nomad v1.3.0 (52e95d6)

Operating system and Environment details

Amazon Linux 2

Issue

We recently upgraded our servers and Nomad clients to 1.3.0 and saw a substantial increase in OOMs reported for containerized Java applications in our system. Investigation shows that the OOMs are all being reported on Nomad 1.3.0 clients.

When comparing the docker inspect output of two allocations for the same job, I noticed the following difference:
v1.3.0: "CgroupParent": "cpuset",
v1.2.4: "CgroupParent": "",

And in the environment variables, this new env var pops up for 1.3.0: "NOMAD_PARENT_CGROUP=/nomad",
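
For anyone who wants to make the same comparison programmatically, here is a minimal sketch (not from the original report) using the Docker Go SDK (github.com/docker/docker/client) to print the two fields that differ; it assumes the task's container ID is passed as the first argument:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/docker/docker/client"
)

func main() {
	// Connect using the standard DOCKER_HOST / socket environment settings.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// os.Args[1] is assumed to be the container ID of the Nomad task.
	info, err := cli.ContainerInspect(context.Background(), os.Args[1])
	if err != nil {
		panic(err)
	}

	fmt.Println("CgroupParent:", info.HostConfig.CgroupParent) // "cpuset" on 1.3.0, "" on 1.2.4
	fmt.Println("Memory limit:", info.HostConfig.Memory)       // hard limit in bytes
}
```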

Also, an early clue was that only Java applications using -XX:MaxRAMPercentage (and related flags) were affected; Java applications with hard-coded -Xmx and -Xms values were not. Since -XX:MaxRAMPercentage sizes the heap as a percentage of the memory the JVM detects (the container's cgroup limit when one is visible, otherwise the host's physical memory), this led us to suspect that cgroup limits were not being respected, which ended up being the case.
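
As a rough way to confirm what limit a container is actually running under, a sketch like the following (not part of the original report; it assumes cgroups v1 with the memory controller mounted at its default path) can be run inside the container to show what a MaxRAMPercentage of 50 should resolve to:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// cgroups v1 memory limit as seen from inside the container.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		panic(err)
	}
	limit, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		panic(err)
	}

	// With -XX:MaxRAMPercentage=50 the JVM heap should be roughly half of this.
	fmt.Printf("cgroup memory limit: %d MiB\n", limit/(1<<20))
	fmt.Printf("expected max heap at 50%%: %d MiB\n", limit/(1<<21))
}
```

When no limit has been applied, that file reports an effectively unbounded value, which is consistent with the JVM falling back to sizing the heap from the host's memory.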

We believe this issue is related to #12274. When looking at our client attributes, we do see a cgroup version reported, and we are on v1, which led us to believe this change was supposed to be a no-op for us, but unfortunately it was not.
[Screenshot: client attributes showing the cgroup version]


djenriquez commented May 16, 2022

To add additional context:

In 1.2.4, we had a job defined with 4000MB of memory and a -XX:MaxRAMPercentage of 50. When looking at the Java heap, we see it is allocated ~2000MB, which is what we expect.

On 1.3.0, that same job was scheduled an allocation that reported a heap of ~8000MB, completely ignoring the 4000MB limit we had set, presumably because the JVM was sizing its heap from the host's memory rather than the task's limit.

Lastly, completely rolling back our clients to 1.2.4 resolved this OOM problem for us.

shoenig added this to Needs Triage in Nomad - Community Issues Triage via automation on May 17, 2022
shoenig self-assigned this on May 17, 2022

shoenig commented May 17, 2022

Thanks for reporting @djenriquez, indeed it is surprising to see any difference on a system where cgroups v1 is in use. Other than the new environment variable, nothing in that code path should have changed.

That Docker now sees a CgroupParent set is definitely a red flag and gives me a starting point for where to look.

Edit: here's the bit that changed the way docker gets configured:

in drivers/docker/driver.go

+       // Extract the cgroup parent from the nomad cgroup (bypass the need for plugin config)
+       parent, _ := cgutil.SplitPath(task.Resources.LinuxResources.CpusetCgroupPath)
+
        hostConfig := &docker.HostConfig{
+               CgroupParent: parent,
+
                Memory:            memory,            // hard limit
                MemoryReservation: memoryReservation, // soft limit

I'm not 100% sure this is the root cause; I might need something to reproduce with after gating this field on cgroups v2.
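
For reference, here is a minimal sketch of the kind of gate described above (this is not the actual patch that landed in #13058; it assumes cgutil exposes a package-level UseV2 flag reporting whether the host runs cgroups v2):

```go
// Only pass a cgroup parent to Docker on cgroups v2 hosts; on cgroups v1
// hosts leave it empty so the behavior matches 1.2.x.
// cgutil.UseV2 is assumed to be a package-level flag set when the
// client fingerprints the host's cgroups version.
parent := ""
if cgutil.UseV2 {
	parent, _ = cgutil.SplitPath(task.Resources.LinuxResources.CpusetCgroupPath)
}

hostConfig := &docker.HostConfig{
	CgroupParent: parent,

	Memory:            memory,            // hard limit
	MemoryReservation: memoryReservation, // soft limit
}
```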

shoenig added the theme/cgroups label on May 17, 2022
shoenig moved this from Needs Triage to In Progress in Nomad - Community Issues Triage on May 17, 2022
shoenig added this to the 1.3.x milestone on May 17, 2022
shoenig added the stage/accepted label on May 17, 2022
Nomad - Community Issues Triage automation moved this from In Progress to Done May 24, 2022
binelson commented:

It looks like a fix for this was merged, but it isn't included in the 1.3.1 release. Can we get a release cut to fix this issue? We are also having issues with Java apps using the -XX:MaxRAMPercentage flag after upgrading to 1.3.1.


tgross commented Jun 23, 2022

1.3.1 was a security and panic fix, so it didn't include the rest of the work merged into the 1.3.x series. We have a Nomad 1.3.2 planned soonish that'll include this.

binelson commented:

Great, thanks @tgross

lgfa29 modified the milestones: 1.3.x, 1.3.2 on Aug 24, 2022
github-actions bot commented:

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Dec 22, 2022