Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some aix machine got error message : no space left #2448

Closed
sophia-guo opened this issue Jan 18, 2022 · 26 comments
Closed

some aix machine got error message : no space left #2448

sophia-guo opened this issue Jan 18, 2022 · 26 comments

Comments

@sophia-guo
Copy link

32 tests failed with no space left.

https://ci.adoptopenjdk.net/job/Test_openjdk17_hs_extended.openjdk_ppc64_aix_testList_2/19/#showFailuresLink

@aixtools
Copy link
Contributor

  • I see, now, that the storage provisioned for the last three servers was 62G rather than 80G.

  • Also see that test-osuosl-aix715-ppc64-3 had a slightly smaller sized /home than the other 7 osuosl servers (30+ rather than 32G) - this has been fixed to 32G.

  • re-running the failed tests in grinder, but that may be different than a job triggered by the build system.

  • please keep monitoring

  • I'll grow the storage to 80G asap.

@aixtools
Copy link
Contributor

  • The attempt to repeat the test failed (from upstream, iirc), so have to wait for the next run.
  • will work on growing storage as soon as I can, i.e., not done yet.

@smlambert
Copy link
Contributor

You can repeat the test with the minor change to switch SDK_RESOURCE=nightly (which will take the nightly build from the API, versus having to wait for an upstream build to be present).

@aixtools
Copy link
Contributor

System storage is updated on -3 and -4, enlarging /home to 40G

@aixtools
Copy link
Contributor

Except for build-1 (adopt01), all osuosl systems have /home==40G.

Re-running test on nightly asap.

@aixtools
Copy link
Contributor

-- My conclusion is that the test string "Np space on Device:" is misleading.

Running on test-osuosl-aix71-ppc64-2 in /home/jenkins/workspace/Grinder (aka adopt06)

00:31:16.657  TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Too few bytes transferred: 2139930624 < 2147484671

00:31:16.658  ACTION: testng -- Failed. Execution failed: `main' threw exception: java.lang.Exception: failures: 2
00:31:16.658  REASON: User specified action: run testng/timeout=300 Transfer4GBFile 

00:31:16.658  STDOUT:
00:31:16.658  test Transfer4GBFile.xferTest04(): failure
00:31:16.658  java.io.IOException: No space left on device
00:31:16.658  	at java.base/java.io.FileOutputStream.writeBytes(Native Method)
00:31:16.658  	at java.base/java.io.FileOutputStream.write(FileOutputStream.java:349)
00:31:16.658  	at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:234)
00:31:16.658  	at java.base/sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:313)
00:31:16.658  	at java.base/sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:318)
00:31:16.658  	at java.base/sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:160)
00:31:16.658  	at java.base/java.io.OutputStreamWriter.flush(OutputStreamWriter.java:248)
00:31:16.658  	at java.base/java.io.BufferedWriter.flush(BufferedWriter.java:257)
00:31:16.658  	at Transfer4GBFile.initTestFile(Transfer4GBFile.java:170)
00:31:16.658  	at Transfer4GBFile.xferTest04(Transfer4GBFile.java:57)
00:31:16.658  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
00:31:16.658  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
...

00:31:16.660  TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.Exception: failures: 2

++++++++
New Test ??
++++++++
00:31:18.017  TEST: java/nio/channels/FileChannel/TransferOverlappedFileChannel.java
00:31:18.017  TEST JDK: /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image
...
00:31:18.037  ACTION: main -- Failed. Execution failed: `main' threw exception: java.io.IOException: No space left on device
00:31:18.037  ACTION: main -- Failed. Execution failed: `main' threw exception: java.io.IOException: No space left on device
00:31:18.037  REASON: Assumed action based on file name: run main TransferOverlappedFileChannel 
00:31:18.037  TIME:   0.372 seconds
00:31:18.037  messages:
00:31:18.037  command: main TransferOverlappedFileChannel
00:31:18.037  reason: Assumed action based on file name: run main TransferOverlappedFileChannel 
00:31:18.037  Mode: agentvm
00:31:18.037  Agent id: 7
00:31:18.037  elapsed time (seconds): 0.372
00:31:18.037  configuration:
00:31:18.037  Boot Layer
00:31:18.037    class path: /home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/javatest.jar 
00:31:18.037                /home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jtreg.jar 
00:31:18.037                /home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/junit.jar 
00:31:18.037                /home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/hamcrest.jar 
00:31:18.037                /home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/testng.jar 
00:31:18.037                /home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jcommander.jar 
00:31:18.037                /home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/guice.jar 
00:31:18.037    patch:      java.base /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16426672445135/jdk_nio_0/work/patches/java.base
00:31:18.037  
00:31:18.037  Test Layer
00:31:18.037    class path: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16426672445135/jdk_nio_0/work/classes/java/nio/channels/FileChannel/TransferOverlappedFileChannel.d
00:31:18.037                /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels
...
00:31:18.042  TEST RESULT: Failed. Execution failed: `main' threw exception: java.io.IOException: No space left on device
00:31:18.042  --------------------------------------------------

00:31:21.552  TEST: java/nio/channels/FileChannel/TransferToChannel.java
00:31:21.552  TEST JDK: /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image
00:31:21.552  
00:31:21.552  ACTION: build -- Passed. Build successful
00:31:21.552  REASON: Named class compiled on demand
00:31:21.552  TIME:   0.225 seconds
00:31:21.552  messages:
00:31:21.552  command: build TransferToChannel
00:31:21.552  reason: Named class compiled on demand
00:31:21.552  Test directory:
00:31:21.552    compile: TransferToChannel
00:31:21.552  elapsed time (seconds): 0.225
00:31:21.552  
00:31:21.552  ACTION: compile -- Passed. Compilation successful
00:31:21.552  REASON: .class file out of date or does not exist
00:31:21.552  TIME:   0.225 seconds
00:31:21.552  messages:
00:31:21.552  command: compile /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/FileChannel/TransferToChannel.java
00:31:21.552  reason: .class file out of date or does not exist

...
00:31:21.553          -classpath /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16426672445135/jdk_nio_0/work/classes/java/nio/channels/FileChannel/TransferToChannel.d:/home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/FileChannel:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/javatest.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jtreg.jar \
00:31:21.553          TransferToChannel
00:31:21.553  STDERR:
00:31:21.557  java.io.IOException: No space left on device

Later on there seems to be a core dump related failure:

00:48:46.005  ACTION: main -- Failed. Unexpected exit from test [exit code: 134]
00:48:46.005  REASON: User specified action: run main/othervm/timeout=150 TestMaxCachedBufferSize 
00:48:46.005  TIME:   0.545 seconds
00:48:46.005  messages:
00:48:46.005  command: main TestMaxCachedBufferSize
00:48:46.005  reason: User specified action: run main/othervm/timeout=150 TestMaxCachedBufferSize 
00:48:46.005  Mode: othervm [/othervm specified]
00:48:46.005  Additional options from @modules: --add-modules java.management
00:48:46.005  elapsed time (seconds): 0.545
00:48:46.005  configuration:
00:48:46.005  Boot Layer
00:48:46.005    add modules: java.management
00:48:46.005  
00:48:46.005  STDOUT:
00:48:46.005  #
00:48:46.005  # A fatal error has been detected by the Java Runtime Environment:
00:48:46.005  #
00:48:46.005  #  SIGBUS (0xa) at pc=0x000000000000e898, pid=21561436, tid=258
00:48:46.005  #
00:48:46.005  # JRE version:  (17.0.1+12) (build )
00:48:46.005  # Java VM: OpenJDK 64-Bit Server VM (17.0.1-beta+12-202201182341, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, aix-ppc64)
00:48:46.005  # Problematic frame:
00:48:46.005  # C  0x000000000000e898
00:48:46.005  #
00:48:46.005  # Core dump will be written. Default location: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16426672445135/jdk_nio_0/work/scratch/core or core.21561436

yet, on AIX (errpt):

LABEL: CORE_DUMP
IDENTIFIER: A924A5FC

Date/Time: Thu Jan 20 09:09:57 2022
Sequence Number: 337
Machine Id: 00FAC25F4B00
Node Id: ojdk06
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

    Recommended Actions
    CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

    Recommended Actions
    RERUN THE APPLICATION PROGRAM
    IF PROBLEM PERSISTS THEN DO THE FOLLOWING
    CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
6
USER'S PROCESS ID:
21561436
FILE SYSTEM SERIAL NUMBER
12
INODE NUMBER
240003
CORE FILE NAME
/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16426672445135/jdk_nio_0/work/scratch/core
PROGRAM NAME
java
STACK EXECUTION DISABLED
0
COME FROM ADDRESS REGISTER
??
PROCESSOR ID
hw_fru_id: N/A
hw_cpu_id: N/A

ADDITIONAL INFORMATION
abort F8
??

- in short, this never was an issue with disk space. There are other issues blocking these.
- when did/do these tests pass on  AIX?

@aixtools
Copy link
Contributor

From this: https://ci.adoptopenjdk.net/job/Grinder/3073/testReport/junit/java_nio_file_Files_walkFileTree_SkipSiblings/java/SkipSiblings/

  • And looking what is on the system - again, the error message: No space ..." is misleading because errpt never reports FS space exhaustion.
  • looking at /tmp, all the files (the tree did not get cleaned up completely) - there are 67 items left - 47 not a directory, 20 are a directory.
  • ulimit may shed some light
root@p9-aix1-adopt05:[/root]ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 2000
pipe size            (512 bytes, -p) 64
stack size              (kbytes, -s) 4194304
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
root@p9-aix1-adopt05:[/root]ulimit
unlimited
root@p9-aix1-adopt05:[/root]su - jenkins
jenkins@p9-aix1-adopt05:[/home/jenkins]ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) unlimited
pipe size            (512 bytes, -p) 64
stack size              (kbytes, -s) 4194304
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
jenkins@p9-aix1-adopt05:[/home/jenkins]ulimit -n
unlimited
  • it seems surprising (to me) that jenkins thinks it can have unlimited filedescriptors open. (That is what the settings say, but)
root@p9-aix1-adopt06:[/root]/usr/bin/grep -p jenkins: /etc/security/limits
jenkins:
        fsize = -1
        core = -1
        cpu = -1
        data = -1
        rss = -1
        stack = -1
        nofiles = -1

root@p9-aix1-adopt06:[/root]/usr/bin/grep -p default: /etc/security/limits
default:
        fsize = 2097151
        core = 2097151
        cpu = -1
        data = 262144
        rss = 65536
        stack = 65536
        nofiles = 2000

root@p9-aix1-adopt06:[/root]/usr/bin/grep -p root: /etc/security/limits
root:
        fsize = -1
        cpu = -1
        data = -1
        stack = -1
        core = -1
        rss = -1
  • In short, this is a test that needs deeper examination as the message does not appear accurate. What is the test actually doing - does it have any relationship with reality

@aixtools
Copy link
Contributor

please add label: testFail

@aixtools
Copy link
Contributor

Ram again, and saved workspace. When running manually - as jenkins (su - jenkins from root) - I do not see any errors. Sadly.
grinder_3074.txt

@sxa sxa added the testFail label Jan 26, 2022
@sxa
Copy link
Member

sxa commented Jan 26, 2022

@sophia-guo You mentioned in the call today that you were still seeing this problem during this week. Can you provide a link to the log please. Also can you confirm if this is limited to test-osuosl-aix715-ppc64-3 and if not please update the title of this issue accordingly.

@sophia-guo
Copy link
Author

sophia-guo commented Jan 26, 2022

No, this is not limited to test-osuosl-aix715-ppc64-3

The run on build-osuosl-aix71-ppc64-2 Jan 24
https://ci.adoptopenjdk.net/job/Grinder/3272/

Earlier one

  • run on `test-osuosl-aix715-ppc64-4` Jan 20
    
     https://ci.adoptopenjdk.net/job/Test_openjdk17_hs_extended.openjdk_ppc64_aix_testList_2/20/
    
  • run on test-osuosl-aix72-ppc64-2 Jan20

https://ci.adoptopenjdk.net/job/Test_openjdk17_hs_extended.openjdk_ppc64_aix_testList_1/20/

@sophia-guo sophia-guo changed the title test-osuosl-aix715-ppc64-3 : no space left some aix machine got error message : no space left Jan 26, 2022
@aixtools
Copy link
Contributor

  • I have been doing as much testing as I can. Recently: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/3309/ on build-2 with MANY failures when testing for 'jdk_custom` and CUSTOM_TARGET=java/nio
  • Of the individual test specified the file system reported is /tmp, not /home as I originally expected.
  • At this moment I think the test is inaccurate: a resource is being consumed, but system logging shows that no file systems have run out of space. Need to better understand the purpose and ways of these tests to come to a conclusion.

@aixtools
Copy link
Contributor

So, yesterday - not sure why yet - test-osuosl-aix715-ppc64-4 did run out of FS space. I know because there is an errpt:

root@adopt08:[/tmp]errpt -aj F7FA22C9
---------------------------------------------------------------------------
LABEL:          J2_FS_FULL
IDENTIFIER:     F7FA22C9

Date/Time:       Thu Jan 20 09:30:11 2022
Sequence Number: 62
Machine Id:      00FB3A2C4C00
Node Id:         adopt08
Class:           O
Type:            INFO
WPAR:            Global
Resource Name:   SYSJ2

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

        Recommended Actions
        INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
        REMOVE UNNECESSARY DATA FROM FILE SYSTEM
        USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp

@aixtools
Copy link
Contributor

aixtools commented Jan 27, 2022

So, as I watch system 'in action' I am getting the impression that the way some 2G and 4G transfers are 'so-called' being performed is to transfer them via /tmp (just saw /tmp on test-osuosl-aix715-ppc64-3 fill up /tmp).

So, in order to accomodate these new test commands I am restoring /home back top 32G and adding 8 G to /tmp (so it grows to 10G).

@aixtools
Copy link
Contributor

Looking at test: ./openjdk/jdk/test/jdk/java/nio/channels/DatagramChannel/SendReceiveMaxSize.java I am wondering if the default udp network sizing is getting in the way:

root@nim.bak:[/home/root]no -h udp_sendspace
Help for tunable udp_sendspace:
Purpose:
Specifies the system default socket buffer size (in bytes) for sending UDP data.

Values:
        Default: 9216
        Range: 4096 - 9223372036854775807
        Type: Connect
        Unit: byte
Tuning:
The udp_sendspace attribute must specify a socket buffer size less than or equal to the setting of the sb_max attribute. udp_sendspace must be at least as large as the largest datagram size that the application will send. Increase size, preferably to multiple of 4096.

root@nim.bak:[/home/root]no -h udp_recvspace
Help for tunable udp_recvspace:
Purpose:
Specifies the system default socket buffer size for receiving UDP data.

Values:
        Default: 42080
        Range: 4096 - 9223372036854775807
        Type: Connect
        Unit: byte
Tuning:
Change when nonzero n in netstat -s report of udp: n socket buffer overflows. The udp_recvspace parameter must specify a socket buffer size less than or equal to the setting of the sb_max parameter. Increase size, preferably to multiple of 4096.
  • netstat -s stats (after many days, except adopt03 which was rebooted this morning)
adopt01:
        251193 dropped due to no socket
        3203896 broadcast/multicast datagrams dropped due to no socket
==============
adopt02:
        250096 dropped due to no socket
        3203893 broadcast/multicast datagrams dropped due to no socket
==============
adopt03:
        527 dropped due to no socket
        7936 broadcast/multicast datagrams dropped due to no socket
==============
adopt04:
        5887989 dropped due to no socket
        10689117 broadcast/multicast datagrams dropped due to no socket
==============
adopt05:
        850200 dropped due to no socket
        10597917 broadcast/multicast datagrams dropped due to no socket
==============
adopt06:
        850089 dropped due to no socket
        10708896 broadcast/multicast datagrams dropped due to no socket
==============
adopt07:
        172502 dropped due to no socket
        2142938 broadcast/multicast datagrams dropped due to no socket
==============
adopt08:
        161167 dropped due to no socket
        2102010 broadcast/multicast datagrams dropped due to no socket
==============
  • And in the test (originally written for macos)
/*
 * @test
 * @bug 8239355 8242885 8240901
 * @key randomness
 * @summary Check that it is possible to send and receive datagrams of
 *          maximum size on macOS.
 * @library /test/lib
 * @build jdk.test.lib.net.IPSupport
 * @run testng/othervm SendReceiveMaxSize
 * @run testng/othervm -Djava.net.preferIPv4Stack=true SendReceiveMaxSize
 */
...
public class SendReceiveMaxSize {
    private final static int IPV4_SNDBUF = 65507;
    private final static int IPV6_SNDBUF = 65527;
    private final static Class<IOException> IOE = IOException.class;
    private final static Random random = RandomFactory.getRandom();
  • Given above - it seems reasonable to increase both udp settings on AIX

  • following the helptext for udp_sendspace and the code above : 65536 (multiple of 4096)

  • following best practice performance for udp_recvspace: start with 10x sendspace (655360)

  • short term: will zero all the stats, and then re-run jdk_custom_0 (that runs this test and has been failing on test-osuosl-aix715-ppc64-3 aka adopt07) and see what the netstat -s stats return.

  • same test on adopt08 (-4) where it test just passed.

@aixtools
Copy link
Contributor

OK - lots of detail.

  • short summary
  • job-3337 (after manual, not permanent change) - a two tests out of many fail, but the error is "language error" rather than timeout, lack of space, etc..
  • job 3327 (on test-3) and job-3335 (on test-4) are very similiar - the first test (SendReceiveMaxSize.java) does not time out - instead:
00:15:14.742  ACTION: testng -- Failed. Execution failed: `main' threw exception: java.lang.Exception: failures: 2
00:15:14.742  REASON: User specified action: run testng/othervm SendReceiveMaxSize 
00:15:14.742  TIME:   0.718 seconds

Gory Details


job 3337:

00:15:14.742          -classpath /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16433043487557/jdk_nio_0/work/classes/java/nio/channels/DatagramChannel/SendReceiveMaxSize.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16433043487557/jdk_nio_0/work/classes/test/lib:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/testng.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jcommander.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/guice.jar /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel/SendReceiveMaxSize.java
00:15:14.742  
00:15:14.742  ACTION: testng -- Failed. Execution failed: `main' threw exception: java.lang.Exception: failures: 2
00:15:14.742  REASON: User specified action: run testng/othervm SendReceiveMaxSize 
00:15:14.742  TIME:   0.718 seconds
00:15:14.742  messages:
00:15:14.742  command: testng SendReceiveMaxSize
00:15:14.742  reason: User specified action: run testng/othervm SendReceiveMaxSize 
00:15:14.742  Mode: othervm [/othervm specified]
00:15:14.742  elapsed time (seconds): 0.718
00:15:14.742  configuration:
00:15:14.742  STDOUT:
00:15:14.742  Seed from RandomFactory = 1691768899053978340L
...
00:15:14.743  ===============================================
00:15:14.743  java/nio/channels/DatagramChannel/SendReceiveMaxSize.java
00:15:14.743  Total tests run: 8, Passes: 6, Failures: 2, Skips: 0
00:15:14.743  ===============================================
00:15:14.743  
00:15:14.743  STDERR:
00:15:14.743  java.lang.Exception: failures: 2
00:15:14.743  	at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:96)
00:15:14.743  	at com.sun.javatest.regtest.agent.TestNGRunner.main(TestNGRunner.java:54)
00:15:14.743  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
00:15:14.743  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
00:15:14.743  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
00:15:14.743  	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
00:15:14.743  	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
00:15:14.743  	at java.base/java.lang.Thread.run(Thread.java:833)
00:15:14.743  
00:15:14.743  JavaTest Message: Test threw exception: java.lang.Exception: failures: 2
00:15:14.743  JavaTest Message: shutting down test
00:15:14.743  
00:15:14.743  STATUS:Failed.`main' threw exception: java.lang.Exception: failures: 2

with job 3327 (on test-3)

00:27:54.538          -classpath /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16432908593865/jdk_nio_0/work/classes/java/nio/channels/DatagramChannel/SendReceiveMaxSize.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16432908593865/jdk_nio_0/work/classes/test/lib:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/testng.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jcommander.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/guice.jar /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel/SendReceiveMaxSize.java
00:27:54.539  
00:27:54.539  ACTION: testng -- Error. Program `/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 960603ms).
00:27:54.539  REASON: User specified action: run testng/othervm SendReceiveMaxSize 
00:27:54.539  TIME:   960.609 seconds
00:27:54.539  messages:
00:27:54.539  command: testng SendReceiveMaxSize
00:27:54.539  reason: User specified action: run testng/othervm SendReceiveMaxSize 
00:27:54.539  Mode: othervm [/othervm specified]
00:27:54.539  Timeout information:
00:27:54.539  Running jstack on process 18546722
00:27:54.539  2022-01-27 14:04:08
00:27:54.539  Full thread dump OpenJDK 64-Bit Server VM (17.0.2-beta+8-202201252341 mixed mode):
00:27:54.539  
00:27:54.539  Threads class SMR info:
00:27:54.539  _java_thread_list=0x0000000114265d30, length=13, elements={
00:27:54.539  0x000000011024ce90, 0x0000000112cf6e70, 0x0000000112cf8810, 0x000000011317c150,
00:27:54.539  0x000000011317f470, 0x0000000113182970, 0x0000000113186c90, 0x0000000113c08d70,
00:27:54.539  0x00000001140268b0, 0x00000001147bd450, 0x0000000114c230f0, 0x0000000114eccd90,
00:27:54.539  0x0000000114265610
00:27:54.539  }
00:27:54.539  
...
00:27:54.541          -Dtest.class.path.prefix=/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16432908593865/jdk_nio_0/work/classes/java/nio/channels/DatagramChannel/SendReceiveMaxSize.d:/home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16432908593865/jdk_nio_0/work/classes/test/lib \
00:27:54.541          -ea \
00:27:54.541          -esa \
00:27:54.541          -Xmx512m \
00:27:54.541          -XX:+UseCompressedOops \
00:27:54.541          -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/jdk/jtreg/native \
00:27:54.541          com.sun.javatest.regtest.agent.MainWrapper /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16432908593865/jdk_nio_0/work/java/nio/channels/DatagramChannel/SendReceiveMaxSize.d/testng.0.jta java/nio/channels/DatagramChannel/SendReceiveMaxSize.java false SendReceiveMaxSize
00:27:54.541  
00:27:54.541  TEST RESULT: Error. Program `/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 960603ms).
00:27:54.541  --------------------------------------------------
00:32:50.196  TEST: java/nio/channels/FileChannel/Transfer2GPlus.java

or job 3335 (on test -4, before no - network options changes)
00:29:27.947          -classpath /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16433015622546/jdk_nio_0/work/classes/java/nio/channels/DatagramChannel/SendReceiveMaxSize.d:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16433015622546/jdk_nio_0/work/classes/test/lib:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/testng.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jcommander.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/guice.jar /home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel/SendReceiveMaxSize.java
00:29:27.947  
00:29:27.947  ACTION: testng -- Error. Program `/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 960859ms).
00:29:27.947  REASON: User specified action: run testng/othervm SendReceiveMaxSize 
00:29:27.947  TIME:   960.876 seconds
00:29:27.947  messages:
00:29:27.947  command: testng SendReceiveMaxSize
00:29:27.947  reason: User specified action: run testng/othervm SendReceiveMaxSize 
00:29:27.947  Mode: othervm [/othervm specified]
00:29:27.947  Timeout information:
00:29:27.947  Running jstack on process 15597716
00:29:27.947  2022-01-27 17:02:37
00:29:27.947  Full thread dump OpenJDK 64-Bit Server VM (17.0.2-beta+8-202201252341 mixed mode):
...
00:29:27.961          -Dtest.class.path.prefix=/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16433015622546/jdk_nio_0/work/classes/java/nio/channels/DatagramChannel/SendReceiveMaxSize.d:/home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/jdk/java/nio/channels/DatagramChannel:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16433015622546/jdk_nio_0/work/classes/test/lib \
00:29:27.961          -ea \
00:29:27.961          -esa \
00:29:27.961          -Xmx512m \
00:29:27.961          -XX:+UseCompressedOops \
00:29:27.961          -Djava.library.path=/home/jenkins/workspace/Grinder/openjdkbinary/openjdk-test-image/jdk/jtreg/native \
00:29:27.961          com.sun.javatest.regtest.agent.MainWrapper /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16433015622546/jdk_nio_0/work/java/nio/channels/DatagramChannel/SendReceiveMaxSize.d/testng.0.jta java/nio/channels/DatagramChannel/SendReceiveMaxSize.java false SendReceiveMaxSize
00:29:27.961  
00:29:27.961  TEST RESULT: Error. Program `/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java' timed out (timeout set to 960000ms, elapsed time including timeout handling was 960859ms).
00:29:27.961  --------------------------------------------------
Aborted by Michael Felt

@sophia-guo
Copy link
Author

@sxa
Copy link
Member

sxa commented Jul 12, 2022

test-osuosl-aix715-ppc64-2

That machine currently has a 31Gb /home with 2.5Gb in use. If it hit an out of space message it would suggest some significant problem occurring during that job (Excessive number or size of core file sizes perhaps?).

If we don't spot the failure it may be worth re-running it with KEEP_WORKSPACE in order to determine what is using up space during the execution of that job.

It looks like @aixtools is currently re-running an equivalent Grinder at https://ci.adoptopenjdk.net/job/Grinder/5168/console on the machine so I will leave it to him to check

@aixtools
Copy link
Contributor

re: no space left:

15:08:18  test Transfer4GBFile.xferTest04(): failure
15:08:18  java.io.IOException: No space left on device

Perhaps it is not /home that is overrunning, but /tmp or /var (or /) where information is being stored as part of the xfer.

For whatever reason - errpt is not signalling which filesystem is overloaded. Maybe the error message is wrong ??

@aixtools
Copy link
Contributor

IMHO - there are so many errors - focusing on this one is a waste of time - more likely an artifact of something else (e.g., 5+ core dumps with core files)

I am more concerned by:

Failed

java/nio/channels/FileChannel/Transfer2GPlus.java.Transfer2GPlus (from java_nio_channels_FileChannel_Transfer2GPlus.java)

Failing for the past 1 build (Since #5168 )

Stacktrace

Execution failed: `main' threw exception: java.lang.RuntimeException: Too few bytes transferred: 2126643200 < 2147484671    

Standard Output

Transferred 2147484671 bytes
    

Standard Error

java.lang.RuntimeException: Too few bytes transferred: 2126643200 < 2147484671
	at Transfer2GPlus.testToWritableByteChannel(Transfer2GPlus.java:128)
	at Transfer2GPlus.main(Transfer2GPlus.java:59)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:312)
	at java.base/java.lang.Thread.run(Thread.java:833)

JavaTest Message: Test threw exception: java.lang.RuntimeException
JavaTest Message: shutting down test

</pre><!--EndFragment-->
Failed

java/nio/channels/FileChannel/Transfer2GPlus.java.Transfer2GPlus (from java_nio_channels_FileChannel_Transfer2GPlus.java)
Failing for the past 1 build (Since
#5168 )
Took 17 sec.
Stacktrace

Execution failed: `main' threw exception: java.lang.RuntimeException: Too few bytes transferred: 2126643200 < 2147484671

Standard Output

Transferred 2147484671 bytes

Standard Error

java.lang.RuntimeException: Too few bytes transferred: 2126643200 < 2147484671
at Transfer2GPlus.testToWritableByteChannel(Transfer2GPlus.java:128)
at Transfer2GPlus.main(Transfer2GPlus.java:59)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:312)
at java.base/java.lang.Thread.run(Thread.java:833)

JavaTest Message: Test threw exception: java.lang.RuntimeException
JavaTest Message: shutting down test

@aixtools
Copy link
Contributor

re: filespace. Re-read the whole message tree - it seems to be /tmp that is running out of space (0 bytes atm):

root@adopt06:[/tmp]/usr/bin/du -sg *dat
2.00    dst17555747448270386703.dat
1.98    dst3224747580702339507.dat
0.00    src8049571388609044249.dat

root@adopt06:[/tmp]ls -l *.dat
-rw------- 1 jenkins staff 2147484671 Jul 12 13:08 dst17555747448270386703.dat
-rw-r--r-- 1 jenkins staff 2126643200 Jul 12 13:08 dst3224747580702339507.dat
-rw------- 1 jenkins staff 2147484671 Jul 12 13:07 src8049571388609044249.dat

Will try increasing /tmp..

@aixtools
Copy link
Contributor

p.s.: https://ci.adoptopenjdk.net/job/Test_openjdk17_j9_extended.functional_ppc64_aix/143/ looks stalled - as /tmp had zero (0) bytes free.

03:49:10  JVMSHRC561E Failed to initialize the shared classes cache, there is not enough space in the file system. Available free disk space bytes = 0, requested bytes = 16777216.

Maybe the run should be aborted - better restarted - now that /tmp has been enlarged to 8GB.

@aixtools
Copy link
Contributor

Increasing the size of /tmp seems to be resolving this (and more??).
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5175/

I'll re-run on system ppc64-2 (as https://ci.adoptopenjdk.net/job/Grinder/5184/) and ppc64-3 (with old size of /tmp) as https://ci.adoptopenjdk.net/job/Grinder/5185/.

@aixtools
Copy link
Contributor

@sophia-guo : you say critical, I say solved.

In short, /tmp at 4G to too small for testing. The others, surprisedly had all been manually set to a larger value - which is why this did not occur on other systems (recently).

imho: this can be closed.

@sxa
Copy link
Member

sxa commented Jul 14, 2022

I'm a little surprised that the test is using /tmp rather than the jenkins workspace, so that has been useful to discover, but I agree that the solution means this can be closed - thanks @aixtools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

No branches or pull requests

4 participants