Allow for cgroups in MEMORY_SIZE detection #2345

Merged
merged 3 commits on Mar 10, 2021

Conversation

@sxa (Member) commented Mar 9, 2021

This is a bit unpleasant, but if we're running in a restricted docker container we need to restrict the number of CPUs and the amount of RAM we select, otherwise it will mis-detect the physical machine's capacity, overwhelm the container, and probably exhaust the container's RAM. While that won't affect other containers, it will likely cause lots of timeouts as the container tries to deal with all the process contention, and result in potentially intermittent, hard-to-diagnose test failures. I'm not sure how many Linux systems lack /sys/fs/cgroup, but I've added detection to cover the case where it doesn't exist (it should fall back to the same values as before - see the sketch below).

This is an enhancement to #1427 and will hopefully resolve adoptium/infrastructure#2002 (Also Ref: General container issue #2138)

I'm going to run some more testing on this, but it seems to do the right thing in several of the checks I've done.
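
For illustration, a minimal sketch of the kind of fallback described above, assuming the standard cgroup v1 mount point (variable names and layout here are illustrative, not the exact script contents):

```sh
# Hedged sketch: use the cgroup v1 memory limit when the hierarchy exists,
# otherwise fall back to the host's physical RAM as before.
MEM_LIMIT_FILE=/sys/fs/cgroup/memory/memory.limit_in_bytes
if [ -r "$MEM_LIMIT_FILE" ]; then
  MEMORY_SIZE=$(( $(cat "$MEM_LIMIT_FILE") / 1024 / 1024 ))               # MiB from the container limit
else
  MEMORY_SIZE=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))  # MiB from the host
fi
```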

Signed-off-by: Stewart X Addison sxa@redhat.com

@sxa sxa added the bug label Mar 9, 2021
@sxa sxa added this to the March 2021 milestone Mar 9, 2021
@sxa sxa requested review from jerboaa and smlambert March 9, 2021 17:14
@sxa sxa self-assigned this Mar 9, 2021
@sxa (Member, Author) commented Mar 9, 2021

Based on running the two jobs 7544 and 7545 on a host that was showing problems, it is now doing a better job of restricting itself. The load on the host is around 15-17 when running two 8-core docker containers, and each is using about 2GB of the 8GB it has been allocated. This contrasts with a load of over 100 and all 8GB in use before this patch was applied.

@sxa sxa marked this pull request as draft March 9, 2021 17:46
@sxa sxa force-pushed the cgroup_coredetect branch 2 times, most recently from 16ff43f to 5df1afa on March 9, 2021 18:00
@sxa sxa marked this pull request as ready for review March 9, 2021 18:08
@sxa (Member, Author) commented Mar 9, 2021

Taking this back to draft as CGPROCS isn't reliable enough.

@sxa sxa marked this pull request as draft March 9, 2021 18:13
Signed-off-by: Stewart X Addison <sxa@redhat.com>
Signed-off-by: Stewart X Addison <sxa@redhat.com>
@sxa (Member, Author) commented Mar 9, 2021

Updated to use a slightly messy stripping of the last five characters from cpu.cfs_quota_us instead of dividing it by 100000, since the quoting for that division gets a bit messy when we're already inside backquote characters (see the sketch below). This should be adequate for almost all cases unless less than one core is made available, which is fairly unlikely... Back in review.

Final (hopefully!) testing of jdk_time_1 with ten iterations at:

[For historic reference, the previous version of this PR that didn't work well used a count of entries in /sys/fs/cgroup/cpu/cpu.cfs_quota_us]
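
As a hedged illustration of the two equivalent calculations described above, assuming the default 100000 µs CFS period and a cgroup v1 mount (not the exact code in this PR):

```sh
QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)   # -1 means no quota is set
if [ "$QUOTA" -gt 0 ]; then
  CORES_BY_DIVISION=$(( QUOTA / 100000 ))          # divide by the default period
  CORES_BY_STRIPPING=${QUOTA%?????}                # same value by dropping the last five digits
fi
```

If the quota is below 100000 (i.e. less than one core), the stripping approach yields an empty string, which is the caveat noted above.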

@sxa sxa marked this pull request as ready for review March 9, 2021 19:56
@jerboaa (Contributor) commented Mar 10, 2021

@sxa Sorry, but what is this cpu.cfs_quota_us fiddling supposed to do? Are you inferring CPU settings from cpu.cfs_quota_us only? That doesn't seem right. I'd think you'd need at least some form of fraction between cpu.cfs_quota_us and cpu.cfs_period_us. But then there is also cpu.shares, different ways the quota could be enforced, and so on. Not sure that's worth it. See also:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu

Are you sure you need to do anything in terms of cgroup CPU quotas at all? As an initial shot I'd only factor in the cgroup (v1!) memory limit; tests will likely run better with that in place, and it's probably what you are observing now anyway. If you really need CPU too, this is a bit of a can of worms, as CPU can be limited in a variety of ways. It will get even more fun once you add cgroups v2 hosts and docker containers into the mix.
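
For reference, the quota/period fraction described here would look roughly like this against a cgroup v1 hierarchy (an illustrative sketch, not code from this PR):

```sh
QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
PERIOD=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [ "$QUOTA" -gt 0 ] && [ "$PERIOD" -gt 0 ]; then
  CPUS=$(( (QUOTA + PERIOD - 1) / PERIOD ))   # round the quota/period fraction up to whole CPUs
else
  CPUS=$(nproc)                               # no quota in place: use the host's CPU count
fi
```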

@sxa (Member, Author) commented Mar 10, 2021

I'm not sure of anything at the moment :-) I was just trying to find something that would stop the jobs failing from being overloaded after detecting significantly more resources than the containers actually had, and this was the first option that seemed to work. This close to the JDK16 GA I'm keen to get something in that will help, even if we change it to something more suitable later. I fully admit that the quick look I had last night didn't give me a good answer on how to derive a suitable figure, so I'm very much open to other options 😁

I'll give it a shot with just the memory limit today and see if that works. It's certainly a far more definite figure than the CPU values I'm picking up.

@jerboaa (Contributor) left a comment

This patch seems OK.

@sxa sxa changed the title from "Allow for cgroups in NPROC/MEMORY_SIZE detection" to "Allow for cgroups in MEMORY_SIZE detection" on Mar 10, 2021
@sxa (Member, Author) commented Mar 10, 2021

Looks to be working - the container I'm testing with is sitting at less than its capacity with just the memory check :-)
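
One detail worth noting for the memory-only check: an unconstrained cgroup v1 container reports an effectively infinite limit (a value near 2^63), so a sketch like the following clamps it against the host's physical RAM (variable names here are illustrative assumptions, not the script's actual names):

```sh
HOST_MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
CG_LIMIT=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo 0)
# Use the cgroup limit only when it is a real restriction below the host's RAM.
if [ "$CG_LIMIT" -gt 0 ] && [ $(( CG_LIMIT / 1024 )) -lt "$HOST_MEM_KB" ]; then
  MEMORY_SIZE_MB=$(( CG_LIMIT / 1024 / 1024 ))
else
  MEMORY_SIZE_MB=$(( HOST_MEM_KB / 1024 ))
fi
```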

Signed-off-by: Stewart X Addison <sxa@redhat.com>
@smlambert (Contributor) left a comment

Thanks @sxa !


Successfully merging this pull request may close these issues.

Many extended.openjdk tests timing out on aarch64