
Identify which tests seem unstable in docker containers #2138

Open
sxa opened this issue Dec 29, 2020 · 23 comments


sxa commented Dec 29, 2020

This is partially for my own notes, but it needs to be looked at, and may also be covered elsewhere. It looks like the DDR stuff (not too surprising) will need some work.

Others (on an initial, not very deep, look) seem OK.

Memo to self - how to check for RAM/CPU limits in a container:

  • CPU: wc -l /sys/fs/cgroup/cpu,cpuacct/cgroup.procs (not accurate)
  • RAM: expr $(cat /sys/fs/cgroup/memory/memory.limit_in_bytes) / 1024 / 1024 / 1024 (or divide by 1073741824)
  • Show stats: while true; do clear && uptime && docker stats --no-stream; sleep 60; done
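The one-liners above can be wrapped into a small script. This is a sketch, not something from the issue: the helper names (bytes_to_gib, memory_limit_bytes, cpu_quota) are mine, and the cgroup v2 paths are an assumption for newer hosts (the issue only mentions the v1 paths).

```shell
#!/bin/sh
# Convert a byte count to whole GiB (1 GiB = 1073741824 bytes, as noted above).
bytes_to_gib() {
    expr "$1" / 1073741824
}

# Read the container's memory limit from whichever cgroup layout is present.
memory_limit_bytes() {
    if [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
        cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # cgroup v1
    elif [ -f /sys/fs/cgroup/memory.max ]; then
        cat /sys/fs/cgroup/memory.max                     # cgroup v2 ("max" means unlimited)
    fi
}

# A more reliable CPU check than counting cgroup.procs: the CFS quota/period
# ratio gives the effective CPU cap (e.g. 400000/100000 = 4 CPUs).
cpu_quota() {
    q=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us 2>/dev/null || echo -1)
    p=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us 2>/dev/null || echo 100000)
    if [ "$q" -gt 0 ]; then expr "$q" / "$p"; else echo "unlimited"; fi
}
```

Example: a container started with --memory=6g reports 6442450944 in memory.limit_in_bytes, and bytes_to_gib 6442450944 prints 6.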

sxa commented Jan 4, 2021

NOTE - runs on the Fedora docker image, testing after patching and rebooting the server:


sxa commented Jan 4, 2021

Also trying on a couple of X64 docker images (Fedora 33 and Ubuntu 20.04)


sxa commented Jan 6, 2021

NUMA interrogation is failing in Docker

[EDIT: The issue shows up with just numactl -s in the container. A resolution is to start the container with --cap-add=sys_nice, which gives the container access to the CPU scheduling options - see the docker docs for details]
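A sketch of the workaround: the image name is a placeholder, and building the command as a string is just for illustration; the point is the capability flag.

```shell
#!/bin/sh
# Sketch: grant the container the capability that numactl -s interrogation
# needs. "test-image:latest" is a hypothetical image name, not one from
# the issue; only the --cap-add flag is the actual fix being described.
docker_cmd="docker run --rm --cap-add=sys_nice test-image:latest numactl -s"
echo "$docker_cmd"
```

Without the capability, numactl's scheduling calls (e.g. sched_setaffinity-related interrogation) are blocked by Docker's default capability set.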


sxa commented Jan 6, 2021

Core dump generation is also failing (I've tried starting the container with various options that might help, but to no avail so far) ... potentially the same as described in adoptium/run-aqa#59

[EDIT: The (host) systems on which core files were not being produced had |/usr/share/apport/apport %p %s %c %d %P %E in /proc/sys/kernel/core_pattern - changing it to core resolves it (but we'll need to make that persistent) - raised https://github.com/adoptium/infrastructure/issues/1817]
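The diagnosis above (a core_pattern that pipes to apport suppresses plain core files in containers) can be checked mechanically. This is a sketch: is_piped_core_pattern is my helper name, not anything from the issue or a standard tool.

```shell
#!/bin/sh
# Sketch: detect whether the kernel core_pattern pipes cores to a handler
# process (a pattern starting with '|', such as the apport line quoted above).
# Inside a container the handler usually doesn't exist, so no core appears.
is_piped_core_pattern() {
    case "$1" in
        \|*) return 0 ;;   # starts with '|': cores are piped to a program
        *)   return 1 ;;   # plain pattern: a file like "core" is written
    esac
}

# On a real host you would check and, if needed, reset it, e.g.:
#   pat=$(cat /proc/sys/kernel/core_pattern)
#   is_piped_core_pattern "$pat" && echo core | sudo tee /proc/sys/kernel/core_pattern
# (and persist via a kernel.core_pattern entry in /etc/sysctl.d/ for reboots)
```

Note core_pattern is host-global: containers share the host kernel, so it must be fixed on the host, not in the image.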


sxa commented Jan 9, 2021

Also not specific to docker, but we have seen instances of this when LANG is not set to en_US.UTF-8. It occurs only with OpenJ9 sanity.openjdk on JDK 11 and above (not seen on 8 so far):

21:41:41  ACTION: main -- Failed. Execution failed: `main' threw exception: java.util.IllformedLocaleException: Ill-formed language: c.u [at index 0]
21:41:41  REASON: User specified action: run main/othervm -Duser.language.display=ja -Duser.language.format=zh LocaleCategory 
21:41:41  TIME:   8.802 seconds
21:41:41  messages:

This will be progressed via adoptium/run-aqa#59
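Since the IllformedLocaleException above appears when LANG is unset or "C", a pre-test guard like the following could normalise it. This is a sketch of my own, assuming en_US.UTF-8 (the known-good value from the comment) as the fallback; effective_lang is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: fall back to the known-good locale before running sanity.openjdk.
# The failure above ("Ill-formed language: c.u") suggests the C/POSIX locale
# leaking into a Locale constructor, so anything non-UTF-8 is replaced.
effective_lang() {
    case "${1:-}" in
        ''|C|C.UTF-8|POSIX) echo "en_US.UTF-8" ;;  # unset or minimal locale: use the safe default
        *)                  echo "$1" ;;           # a real locale: leave it alone
    esac
}

export LANG="$(effective_lang "${LANG:-}")"
```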


sophia-guo commented Jan 14, 2021

Grinder on testc-packet-fedora33-amd-2 and got

ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- https://github.com/AdoptOpenJDK/openjdk-tests.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/AdoptOpenJDK/openjdk-tests.git/': OpenSSL SSL_connect: Connection reset by peer in connection to github.com:443 

https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox/203/console

I assume testc-packet-fedora33-amd-2 is a docker container?


sxa commented Jan 15, 2021

I assume testc-packet-fedora33-amd-2 is a docker container?

Yes - it's a docker container.

Hmmm, that's a bit odd ... it's also nothing to do with the test if it's failing that early in the process. I've re-run it as 205 and it completed without any fatal failures, so hopefully it won't recur, but if you see any further instances let me know so we can check whether it happens regularly.
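For transient network failures like the "Connection reset by peer" clone error above, a simple retry wrapper is the usual mitigation. This is a sketch of mine, not an Adopt script: the retry helper name and the fixed one-second backoff are assumptions.

```shell
#!/bin/sh
# Sketch: retry a flaky command up to N times with a short pause between
# attempts, e.g. for git fetches that hit transient connection resets.
retry() {
    n=$1; shift          # first argument: maximum number of attempts
    i=0
    while :; do
        "$@" && return 0          # success: stop retrying
        i=$((i + 1))
        [ "$i" -ge "$n" ] && return 1   # out of attempts: report failure
        sleep 1                   # brief pause before the next attempt
    done
}

# Example usage (hypothetical; mirrors the failing Jenkins fetch above):
#   retry 3 git fetch --tags --force --progress origin
```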

@smlambert
Contributor

From https://adoptopenjdk.slack.com/archives/C5219G28G/p1612761729068300, we should check whether the timeouthandler added to openj9 openjdk test runs is able to write a System dump in dockerized environment.


knn-k commented Feb 25, 2021

I wonder if eclipse-openj9/openj9#12038 is another example of failure in docker environments or not.
"AssertionError: Free Physical Memory size cannot be greater than total Physical Memory Size."


sxa commented Feb 25, 2021

I wonder if eclipse-openj9/openj9#12038 is another example of failure in docker environments or not.
"AssertionError: Free Physical Memory size cannot be greater than total Physical Memory Size."

Hmmm, interesting thought. Certainly possible, but this is the first I've heard of it. Some of the containers we have are capped in terms of CPU and RAM, which could explain why you wouldn't necessarily be able to replicate it locally without doing the same.


jerboaa commented Mar 2, 2021

sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:

java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
java.lang.OutOfMemoryError: Java heap space
	at TimSortStackSize2.createArray(TimSortStackSize2.java:164)
	at TimSortStackSize2.doTest(TimSortStackSize2.java:59)
	at TimSortStackSize2.main(TimSortStackSize2.java:43)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.lang.Thread.run(Thread.java:748)

java/util/ResourceBundle/Bug4168625Test.java.Bug4168625Test 
14:10:19  ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details

java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java.LFSingleThreadCachingTest 
Unexpected exit from test [exit code: 137]

See:
https://ci.adoptopenjdk.net/view/Test_upstream/job/Test_openjdk8_hs_sanity.openjdk_x86-64_linux_upstream/75/

Especially LFSingleThreadCachingTest.java looks like an OOM kill. Would be nice to overlay that failure with the kernel OOM kill logs.
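A quick way to back up the OOM-kill suspicion: exit codes above 128 mean the process died from signal (code - 128), so 137 = 128 + 9 = SIGKILL, which is exactly what the kernel OOM killer sends. The exit_signal helper below is my own sketch for decoding such codes.

```shell
#!/bin/sh
# Sketch: decode a jtreg "Unexpected exit from test [exit code: 137]".
# Shells report signal deaths as 128 + signal number, so:
#   137 -> signal 9  (SIGKILL, e.g. the kernel OOM killer)
#   139 -> signal 11 (SIGSEGV)
exit_signal() {
    if [ "$1" -gt 128 ]; then
        expr "$1" - 128
    else
        echo 0   # normal exit, not a signal death
    fi
}

# To overlay with the kernel OOM log on the host, something like:
#   dmesg | grep -i 'killed process'
```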


sxa commented Mar 2, 2021

The above error was on test-docker-fedora33-x64-2, hosted on test-packet-ubuntu2004-amd-1. Those systems were all started with 4 cores and 6GB allocated to them. Re-testing at https://ci.adoptopenjdk.net/job/Grinder/7350 (failed, but I'm not sure if it's the same failure). Correct test from upstream at https://ci.adoptopenjdk.net/job/Grinder/7351

@smlambert In the log Severin referenced above it gives the Grinder re-run link for the individual test as https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=oracle&BUILD_LIST=openjdk&PLATFORM=x86-64_linux_xl&TARGET=jdk_lang_1 which is clearly wrong as it doesn't reference upstream and the PLATFORM has _xl in it - is that a bug?

EDIT: https://ci.adoptopenjdk.net/job/Grinder/7353/console passed on a real machine (IBMCLOUD RHEL8) but https://ci.adoptopenjdk.net/job/Grinder/7350/console failed on the machine mentioned above (both the jdk_lang_1 target)


sxa commented Mar 4, 2021

Potential resource starvation reported by @lumpfish on build-docker-fedora33-armv8-3 in adoptium/infrastructure#2002 - I see a "docker day" in my near future ... (will diagnose using jdk_time_1):

06:58:21 TEST RESULT: Error. Program `/home/jenkins/workspace/Test_openjdk16_hs_extended.openjdk_aarch64_linux/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 1006476ms).


sxa commented Mar 8, 2021

At the moment at least some docker images hosted on build-packet-ubuntu1804-armv8-1 (U1804b_2223 in particular, this job currently running) and docker-packet-ubuntu2004-amd-1 (U2004_2224 in particular, this job currently running) are using a lot of CPU, so they potentially need to be properly capped. The failures seen above may well only be occurring on those systems.

When the systems are quiesced tomorrow (since we're running the weekend pipelines for JDK16 again due to adoptium/ci-jenkins-pipelines#87) I can look at adjusting the capping of the tests.

Related to @lumpfish's jdk_time_1 failure, I have one pass at https://ci.adoptopenjdk.net/job/Grinder/7515/ on build-docker-ubuntu1804-armv8-2, but all other attempts on that machine failed.
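Capping a container at start time, per the "4 cores and 6GB" allocation mentioned earlier, looks like the following. A sketch only: the container and image names are placeholders, and only the --cpus and --memory flags are the substance.

```shell
#!/bin/sh
# Sketch: start a test container with explicit CPU and memory caps so it
# cannot starve other executors on the host. Names are hypothetical;
# the 4-core/6GB values match the allocation described in this thread.
cap_cmd="docker run -d --cpus=4 --memory=6g --name test-docker-fedora33-x64-2 fedora:33"
echo "$cap_cmd"
```

With --memory set, the limit shows up inside the container in the cgroup files quoted at the top of this issue, so the memo-to-self commands can verify the cap took effect.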


sxa commented Mar 9, 2021

OK, I've brought the following offline for now while investigations occur, as some of these have shown problems with jdk_time_1
(build-docker--armv8- nodes hosted on build-packet-ubuntu1804-armv8-1 and docker-packet-ubuntu2004-intel-1):

  • fedora33-2 fedora33-3 fedora33-4 fedora33-5 ubuntu1804-2 ubuntu1804-3 ubuntu1804-4 ubuntu1804-5 ubuntu1804-6 ubuntu1804-armv8l-1 (Hosted on build-packet-ubuntu1804-armv8-1)
  • And test-docker-fedora33-x64-3 which has been showing issues too

jdk_time_1 has passed on the alibaba arm node and also on test-docker-fedora-x64-1 (it failed at 7531, though, but at least it's not a consistently recurring problem on all Fedora systems, as it passed at 7506!)


sxa commented Mar 9, 2021

sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:
java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
java/util/ResourceBundle/Bug4168625Test.java.Bug4168625Test
java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java.LFSingleThreadCachingTest
[full failure details quoted from the comment above]

This looks to be the same issue that's covered in #2310 and not specific to docker


sxa commented Mar 10, 2021

With the merging of #2345 I've brought most systems back online - I've left these offline:
build-docker-fedora33-armv8-5 build-docker-ubuntu1804-5 build-docker-ubuntu1804-6

[EDIT: Load on the machine during the nightly testing is sitting at under 16 and there are 64 cores so I have re-enabled these three remaining executors]

@sophia-guo
Contributor

Another one adoptium/adoptium#63 (comment)


sxa commented Aug 10, 2021

@sophia-guo That looks like the tests have a dependency on the fakeroot tool, which I wasn't aware we required. Can you supply a Grinder re-run link for that problem? I'm not sure it'll be specific to docker - we do not have fakeroot available on all of our systems at present.

@smlambert
Contributor

Example run in Grinder: https://ci.adoptopenjdk.net/job/Grinder/1203

Rerun in Grinder on same machine link

@sophia-guo
Contributor

@sxa If I log in to the test machine I can run fakeroot, which probably means it is installed by default on Linux. aarch64 has the same issue though, for which I have opened an issue in infra: adoptium/infrastructure#2291


sophia-guo commented Oct 20, 2021

On arm JDK11, the following tests:
java/beans/PropertyChangeSupport/Test4682386.java.Test4682386
java/beans/XMLEncoder/Test4631471.java.Test4631471
java/beans/XMLEncoder/Test4903007.java.Test4903007
java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest

consistently pass on non-docker machines and fail on docker ones.
#2989 (comment)

https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_arm_linux_testList_2/9/


sophia-guo commented May 19, 2022

java/beans/PropertyEditor/TestFontClassJava.java.TestFontClassJava
java/beans/PropertyEditor/TestFontClassValue.java.TestFontClassValue
java/beans/XMLEncoder/Test4631471.java.Test4631471
java/beans/XMLEncoder/Test4903007.java.Test4903007
java/beans/XMLEncoder/javax_swing_DefaultCellEditor.java.javax_swing_DefaultCellEditor
java/beans/XMLEncoder/javax_swing_JTree.java.javax_swing_JTree
javax/imageio/plugins/shared/ImageWriterCompressionTest.java.ImageWriterCompressionTest

error message:

Stacktrace
Execution failed: `main' threw exception: java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null    
Standard Output
Property class: class java.awt.Font
PropertyEditor class: class com.sun.beans.editors.FontEditor
    
Standard Error
java.lang.NullPointerException: Cannot load from short array because "sun.awt.FontConfiguration.head" is null
	at java.desktop/sun.awt.FontConfiguration.getVersion(FontConfiguration.java:1262)
	at java.desktop/sun.awt.FontConfiguration.readFontConfigFile(FontConfiguration.java:224)

https://ci.adoptopenjdk.net/job/Test_openjdk18_hs_extended.openjdk_x86-64_linux_testList_2/26/

#3640
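The FontConfiguration NullPointerException above is the classic symptom of a container image with no fonts installed (the JDK finds no font configuration to read). A hedged pre-check: has_fonts is my helper name, and the package suggestions in the comments are common choices, not ones confirmed by this issue.

```shell
#!/bin/sh
# Sketch: check whether a container image has any font files installed,
# since java.desktop's FontConfiguration NPEs when none are found.
has_fonts() {
    dir=${1:-/usr/share/fonts}   # default font directory on most Linux distros
    [ -d "$dir" ] && [ -n "$(find "$dir" -type f -print 2>/dev/null | head -1)" ]
}

# If this fails inside the image, install a font package in the Dockerfile, e.g.:
#   dnf install -y fontconfig dejavu-sans-fonts      # Fedora (hypothetical choice)
#   apt-get install -y fontconfig fonts-dejavu-core  # Ubuntu (hypothetical choice)
```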
