Allow for cgroups in MEMORY_SIZE detection #2345

Merged
merged 3 commits on Mar 10, 2021

Conversation

@sxa (Member) commented Mar 9, 2021

This is a bit unpleasant, but if we're running in a restricted docker container we need to restrict the number of CPUs and the amount of RAM we select, otherwise it will mis-detect the physical machine's capacity, overwhelm the container, and probably exhaust the container's RAM. While that won't affect other containers, it will likely cause lots of timeouts as the container tries to deal with all the process contention, and result in potentially intermittent, hard-to-diagnose test failures. I'm not sure how many Linux systems lack /sys/fs/cgroup, but I've added detection to cover the case where it doesn't exist (it should fall back to the same values as before - see the sketch below).

This is an enhancement to #1427 and will hopefully resolve adoptium/infrastructure#2002 (Also Ref: General container issue #2138)

I'm going to run some more testing on this, but it seems to do the right thing in several of the checks I've done.
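
For illustration, a minimal sketch of the kind of fallback described above, assuming the standard cgroup v1 mount point (variable names and layout here are illustrative, not the exact script contents):

```sh
# Hedged sketch: use the cgroup v1 memory limit when the hierarchy exists,
# otherwise fall back to the host's physical RAM as before.
MEM_LIMIT_FILE=/sys/fs/cgroup/memory/memory.limit_in_bytes
if [ -r "$MEM_LIMIT_FILE" ]; then
  MEMORY_SIZE=$(( $(cat "$MEM_LIMIT_FILE") / 1024 / 1024 ))               # MiB from the container limit
else
  MEMORY_SIZE=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))  # MiB from the host
fi
```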

Signed-off-by: Stewart X Addison sxa@redhat.com

@sxa sxa added the bug label Mar 9, 2021
@sxa sxa added this to the March 2021 milestone Mar 9, 2021
@sxa sxa requested review from jerboaa and smlambert March 9, 2021 17:14
@sxa sxa self-assigned this Mar 9, 2021
@sxa (Member, Author) commented Mar 9, 2021

Based on running the two jobs 7544 and 7545 on a host that was showing problems, it is now doing a better job of restricting itself. The load on the host is around 15-17 when running two 8-core docker containers, and each is using about 2GB of the 8GB it has been allocated. This contrasts with a load of over 100 and all 8GB in use before this patch was applied.

@sxa sxa marked this pull request as draft March 9, 2021 17:46
@sxa sxa force-pushed the cgroup_coredetect branch 2 times, most recently from 16ff43f to 5df1afa on March 9, 2021 18:00
@sxa sxa marked this pull request as ready for review March 9, 2021 18:08
@sxa (Member, Author) commented Mar 9, 2021

Taking this back to draft as CGPROCS isn't reliable enough.

@sxa sxa marked this pull request as draft March 9, 2021 18:13
Signed-off-by: Stewart X Addison <sxa@redhat.com>
Signed-off-by: Stewart X Addison <sxa@redhat.com>
@sxa (Member, Author) commented Mar 9, 2021

Updated to use a slightly messy stripping of the last five characters from cpu.cfs_quota_us instead of dividing it by 100000, since the quoting for that division gets a bit messy when we're already inside backquote characters (see the sketch below). This should be adequate for almost all cases unless less than one core is made available, which is fairly unlikely... Back in review.

Final (hopefully!) testing of jdk_time_1 with ten iterations at:

[For historic reference, the previous version of this PR that didn't work well used a count of entries in /sys/fs/cgroup/cpu/cpu.cfs_quota_us]
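
As a hedged illustration of the two equivalent calculations described above, assuming the default 100000 µs CFS period and a cgroup v1 mount (not the exact code in this PR):

```sh
QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)   # -1 means no quota is set
if [ "$QUOTA" -gt 0 ]; then
  CORES_BY_DIVISION=$(( QUOTA / 100000 ))          # divide by the default period
  CORES_BY_STRIPPING=${QUOTA%?????}                # same value by dropping the last five digits
fi
```

If the quota is below 100000 (i.e. less than one core), the stripping approach yields an empty string, which is the caveat noted above.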

@sxa sxa marked this pull request as ready for review March 9, 2021 19:56
@jerboaa (Contributor) commented Mar 10, 2021

@sxa Sorry, but what is this cpu.cfs_quota_us fiddling supposed to do? Are you inferring CPU settings from cpu.cfs_quota_us only? That doesn't seem right. I'd think you'd need at least some form of fraction between cpu.cfs_quota_us and cpu.cfs_period_us. But then there is also cpu.shares, different ways the quota could be enforced, and so on. Not sure that's worth it. See also:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu

Are you sure you need to do anything in terms of cgroup CPU quotas at all? As an initial shot I'd only factor in the cgroup (v1!) memory limit; tests will likely run better with that in place, and it's probably what you are observing now anyway. If you really need CPU too, this is a bit of a can of worms, as CPU can be limited in a variety of ways. It will get even more fun once you add cgroups v2 hosts and docker containers into the mix.
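
For reference, the quota/period fraction described here would look roughly like this against a cgroup v1 hierarchy (an illustrative sketch, not code from this PR):

```sh
QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
PERIOD=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [ "$QUOTA" -gt 0 ] && [ "$PERIOD" -gt 0 ]; then
  CPUS=$(( (QUOTA + PERIOD - 1) / PERIOD ))   # round the quota/period fraction up to whole CPUs
else
  CPUS=$(nproc)                               # no quota in place: use the host's CPU count
fi
```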

@sxa (Member, Author) commented Mar 10, 2021

I'm not sure of anything at the moment :-) I was just trying to find something that would stop the jobs failing from being overloaded after detecting significantly more resources than the containers actually had, and this was the first option that seemed to work. This close to the JDK16 GA I'm keen to get something in that will help, even if we change it to something more suitable later. I fully admit that the quick look I had last night didn't give me a good answer on how to derive a suitable figure, so I'm very much open to other options 😁

I'll give it a shot with just the memory limit today and see if that works. It's certainly a far more definite figure than the CPU values I'm picking up.

@jerboaa (Contributor) left a comment

This patch seems OK.

@sxa sxa changed the title from "Allow for cgroups in NPROC/MEMORY_SIZE detection" to "Allow for cgroups in MEMORY_SIZE detection" on Mar 10, 2021
@sxa (Member, Author) commented Mar 10, 2021

Looks to be working - the container I'm testing with is sitting at less than its capacity with just the memory check :-)
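
One detail worth noting for the memory-only check: an unconstrained cgroup v1 container reports an effectively infinite limit (a value near 2^63), so a sketch like the following clamps it against the host's physical RAM (variable names here are illustrative assumptions, not the script's actual names):

```sh
HOST_MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
CG_LIMIT=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo 0)
# Use the cgroup limit only when it is a real restriction below the host's RAM.
if [ "$CG_LIMIT" -gt 0 ] && [ $(( CG_LIMIT / 1024 )) -lt "$HOST_MEM_KB" ]; then
  MEMORY_SIZE_MB=$(( CG_LIMIT / 1024 / 1024 ))
else
  MEMORY_SIZE_MB=$(( HOST_MEM_KB / 1024 ))
fi
```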

Signed-off-by: Stewart X Addison <sxa@redhat.com>
@smlambert (Contributor) left a comment

Thanks @sxa !


Successfully merging this pull request may close these issues.

Many extended.openjdk tests timing out on aarch64