Some platforms failing to run CITGM #779

Closed
BethGriggs opened this issue Jan 8, 2020 · 14 comments

@BethGriggs
Member

When running CITGM on v10.x releases I'm hitting issues with a few platforms failing to run CITGM.

On ppcle-ubuntu1404 - https://ci.nodejs.org/job/citgm-smoker/nodes=ppcle-ubuntu1404/2202/console

18:28:31 Build timed out (after 30 minutes). Marking the build as failed.

And on aix61-ppc64 - https://ci.nodejs.org/job/citgm-smoker/nodes=aix61-ppc64/2201/console

13:59:42 { [Error: EEXIST: file already exists, rmdir '/ramdisk0/citgm/127a72a6-0c9e-437c-99a6-bef6fe6b624a/npm_config_tmp']
13:59:42   errno: -17,
13:59:42   code: 'EEXIST',
13:59:42   syscall: 'rmdir',
13:59:42   path:
13:59:42    '/ramdisk0/citgm/127a72a6-0c9e-437c-99a6-bef6fe6b624a/npm_config_tmp' }

ping @nodejs/build @nodejs/platform-ppc

@sam-github

For ppcle-ubuntu, I think it's time we stopped running citgm on it, now that 8.x is EOL.

We don't use it for releases, and it's questionable whether we should use it for tests of any kind. centos7-ppcle is now our preferred build and test platform for 10.x and above: https://github.com/nodejs/build/blob/adcc075b3e17a99f321e258a9a1fb99c8ce43412/jenkins/scripts/VersionSelectorScript.groovy#L35-L37

@nodejs/platform-ppc -- any reason to not just remove it?

@sam-github

AIX6.1 isn't supported by Node.js 10.x: https://github.com/nodejs/node/blob/v10.x/BUILDING.md#supported-platforms-1

Perhaps we should immediately start using AIX7.1 for citgm? I think we were waiting, because it might be destabilizing... but AIX6.1 isn't so stable.

@sam-github

git grep rmdir gives me nothing in citgm, but it does call rimraf: https://github.com/nodejs/citgm/blob/master/lib/temp-directory.js#L41

Those directories do exist on the target system, but that doesn't seem wrong:

# ls -l /ramdisk0/citgm/*                                                                                                                                            
/ramdisk0/citgm/126d78c6-074f-403b-aaf1-cb64aad3fec8:
total 12424
drwxrwxr-x   12 iojs     staff          4096 Jan 08 06:00 esprima
-rw-r--r--    1 iojs     staff       6352334 Jan 08 05:58 esprima-4.0.1.tgz
drwxr-xr-x    3 iojs     staff           256 Jan 08 05:58 home
drwxr-xr-x    4 iojs     staff           256 Jan 08 05:59 npm_config_tmp

/ramdisk0/citgm/127a72a6-0c9e-437c-99a6-bef6fe6b624a:
total 16
drwxrwxr-x    3 iojs     staff          4096 Jan 08 06:00 clinic
drwxr-xr-x    3 iojs     staff           256 Jan 08 06:00 home
drwxr-xr-x    3 iojs     staff          4096 Jan 08 06:00 npm_config_tmp

And I can rmdir them with rimraf:

-bash-4.3$ ./smoker/bin/node ./smoker/lib/node_modules/citgm/node_modules/rimraf/bin.js  /ramdisk0/citgm/127a72a6-0c9e-437c-99a6-bef6fe6b624a                
-bash-4.3$ ./smoker/bin/node ./smoker/lib/node_modules/citgm/node_modules/rimraf/bin.js  /ramdisk0/citgm/126d78c6-074f-403b-aaf1-cb64aad3fec8 
-bash-4.3$ pwd
/home/iojs/build/workspace/citgm-smoker/nodes/aix61-ppc64
-bash-4.3$ env | grep LIB
LIBPATH=/home/iojs/gmake/opt/freeware/lib:/home/iojs/gcc-6.3.0-1/opt/freeware/lib/gcc/powerpc-ibm-aix6.1.0.0/6.3.0/pthread/ppc64:/home/iojs/gcc-6.3.0-1/opt/freeware/lib

So, I've no idea why citgm failed to. Try again?
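
One possible reading of the EEXIST above, offered as a guess rather than a diagnosis: POSIX allows rmdir to report a non-empty directory as either ENOTEMPTY or EEXIST, so the error may simply mean the directory was not empty at the moment it was removed. A minimal sketch of that behaviour follows; the `citgm-sketch-` prefix, the `late-write.txt` file, and the `rmdirErrorCode` helper are all hypothetical names used for illustration, not anything from citgm or rimraf.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Error codes POSIX permits rmdir to use for "directory not empty".
const NOT_EMPTY_CODES = new Set(['ENOTEMPTY', 'EEXIST']);

// Hypothetical helper: try to rmdir and return the error code, or null on success.
function rmdirErrorCode(dir) {
  try {
    fs.rmdirSync(dir);
    return null;
  } catch (err) {
    return err.code;
  }
}

// Simulate a straggler writing into the directory just before removal.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'citgm-sketch-'));
const straggler = path.join(dir, 'late-write.txt');
fs.writeFileSync(straggler, 'written after cleanup started');

const codeWhileBusy = rmdirErrorCode(dir); // which code you get is platform dependent
fs.unlinkSync(straggler);
const codeWhenEmpty = rmdirErrorCode(dir); // succeeds once the directory is empty

console.log({ codeWhileBusy, codeWhenEmpty });
```

If that reading is right, removing the directory by hand afterwards would succeed, because nothing is writing into it any more by then.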

@BethGriggs
Member Author

BethGriggs commented Jan 8, 2020

I've kicked off the aix61-ppc run a few times with the same results - the error seems consistent.

For ppcle-ubuntu1404, I'm currently doing a run on the previous release (v10.18.0) to check that we haven't regressed anything to cause the failure - https://ci.nodejs.org/job/citgm-smoker/nodes=ppcle-ubuntu1404/2203/console

Based on your comments, it does not sound like it would be worth holding up v10.18.1 (nodejs/node#31248) for these two platforms' CITGM results?

@sam-github

I assume aix61-ppc has run without this problem before? Is this a new citgm version?

I'm running by hand with CITGM_LOGLEVEL set to silly on test-osuosl-aix61-ppc64_be-3. It's running, whereas the logs you pointed to showed it dying pretty much on startup.

I kicked off a build from the jenkins UI, loglevel silly: https://ci.nodejs.org/job/citgm-smoker/2205

Even though it would only run on aix61-ppc, and that has no citgm-smoker running on it now, it is waiting for https://ci.nodejs.org/job/citgm-smoker/2204/ . That's not ideal, IMO, but I'm not going to mess with the concurrency setting unilaterally.

Once it starts, I can check to see how it's doing.

@nodejs/citgm-admins ^-- can that be changed? I understand not wanting multiple citgm jobs running on the same host at the same time, but it seems that if none is running on aix61, I should be able to run one.

@BethGriggs
Member Author

@rsam I've just cancelled the ppcle-ubuntu1404 job that was running. For v10.18.0 it didn't time out... but there was a point during that job where it hung for ~27 minutes. That makes it hard to tell whether it is a coincidence (with it being so near to the 30-minute timeout) or whether something new in v10.18.1 is actually causing it to time out.

From https://ci.nodejs.org/job/citgm-smoker/nodes=ppcle-ubuntu1404/2204/console:

21:02:31 error: @nearform/bubbleprof done| done - the test suite for @nearform/bubbleprof version 3.0.1 failed
21:28:51 error: failure             | The canary is dead: 

And yes, we've released a new CITGM version recently (~6 in the past month).

Ref AIX 6.1 - https://ci.nodejs.org/job/citgm-smoker/nodes=aix61-ppc64/2205/console - looks like it is progressing further 🤞

@richardlau
Member

> I assume aix61-ppc has run without this problem before? Is this a new citgm version?

Yes. There have been new versions of citgm but they've been updates to the lookup table rather than changes to actual citgm functionality.

> @nodejs/citgm-admins ^-- can that be changed? I understand not wanting multiple citgm jobs running on the same host at the same time, but it seems that if none is running on aix61, I should be able to run one.

The issue wasn't multiple citgm jobs running on the same host (which isn't possible in our CI as we only have one node per host and a node can only run one job at a time) -- the concurrency limit was put in place to prevent non-CITGM jobs being backed up if lots of CITGM jobs had been submitted. Personally I have no issues removing/relaxing the concurrency but I wasn't the one who put it there in the first place. Let's move that conversation to nodejs/build#1882.

And AFAIK the concurrency setting is at the job level, not the axis level.

@sam-github

citgm/191bf25b-ae39-4006-b2a6-74867c022e41:
total 12424
drwxrwxr-x   12 iojs     staff          4096 Jan  8 14:39 esprima
-rw-r--r--    1 iojs     staff       6352334 Jan  8 14:38 esprima-4.0.1.tgz
drwxr-xr-x    3 iojs     staff           256 Jan  8 14:38 home
drwxr-xr-x    4 iojs     staff           256 Jan  8 14:38 npm_config_tmp

citgm/191bf25b-ae39-4006-b2a6-74867c022e41/npm_config_tmp:
total 0
drwxr-xr-x    3 iojs     staff           256 Jan  8 14:38 npm-13762772-2808b0bd
drwxr-xr-x    2 iojs     staff           256 Jan  8 14:38 npm-13762774-d3b4917f

citgm/7e764d75-3efb-461c-a2d7-90f08e47fd6d:
total 16
drwxrwxr-x    3 iojs     staff          4096 Jan  8 14:40 clinic
drwxr-xr-x    3 iojs     staff           256 Jan  8 14:39 home
drwxr-xr-x    3 iojs     staff          4096 Jan  8 14:40 npm_config_tmp

citgm/7e764d75-3efb-461c-a2d7-90f08e47fd6d/npm_config_tmp:
total 0
drwx------    2 iojs     staff           256 Jan  8 14:39 clinic-test-cewqab

Failed the same way. Above is what it looks like on the fs. Two citgm/UUID dirs get created per run, but the second one, the one that failed, has that odd clinic-test-cewqab directory in it. I wonder if the clinic module is using the npm test directory to store test output? The other npm_config_tmp only has npm-* entries in it.

@richardlau
Member

> Failed the same way. Above is what it looks like on the fs. Two citgm/UUID dirs get created per run, but the second one, the one that failed, has that odd clinic-test-cewqab directory in it. I wonder if the clinic module is using the npm test directory to store test output? The other npm_config_tmp only has npm-* entries in it.

It's probably writing to the tempdir, which citgm redirects to the npm_config_tmp dir it creates:

options.env['TEMP'] = context.npmConfigTmp;
options.env['TMP'] = context.npmConfigTmp;
options.env['TMPDIR'] = context.npmConfigTmp;
(this isn't new, citgm has been doing this for quite some time).
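
A minimal sketch of what that redirection does to a child process, assuming the child picks its temp location via `os.tmpdir()` (which reads `TMPDIR`, `TMP` and `TEMP` on POSIX, and `TEMP`/`TMP` on Windows). The `sandbox` directory, the `runWithRedirectedTmp` helper, and the `clinic-test-` prefix are stand-ins chosen for illustration; this is not citgm's or clinic's actual code.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for the npm_config_tmp directory citgm creates.
const sandbox = fs.mkdtempSync(path.join(os.tmpdir(), 'npm_config_tmp-'));

// Child script: create a temp dir the way a test suite might, then print its path.
const childScript = `
  const fs = require('fs'), os = require('os'), path = require('path');
  process.stdout.write(fs.mkdtempSync(path.join(os.tmpdir(), 'clinic-test-')));
`;

// Run the child with all three temp variables pointing at tmpDir.
function runWithRedirectedTmp(tmpDir) {
  const env = { ...process.env, TEMP: tmpDir, TMP: tmpDir, TMPDIR: tmpDir };
  return execFileSync(process.execPath, ['-e', childScript], { env, encoding: 'utf8' });
}

const created = runWithRedirectedTmp(sandbox);
const landedInSandbox = path.dirname(created) === sandbox;
console.log({ created, landedInSandbox });

// Clean up the illustration directory (fs.rmSync needs Node.js 14.14 or later).
fs.rmSync(sandbox, { recursive: true, force: true });
```

Under those assumptions, anything a module's tests write to "the temp directory" ends up inside the directory that citgm later tries to delete.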

@BethGriggs
Member Author

For AIX, it was the clinic module causing the Jenkins run to not complete. When I ran against my PR in #780 the rest of the results were published. I can reproduce this issue on previous versions of v10.x - so I have marked it as skip on AIX for now.

For ppcle-ubuntu1404, the CITGM run did not time out when running on test-osuosl-ubuntu1404-ppc64_le-2 rather than test-osuosl-ubuntu1404-ppc64_le-4. Perhaps there is something up with that machine? (cc: @AshCripps)

@richardlau
Member

I think we're seeing the same clinic issue on Windows that we saw on AIX, e.g. https://ci.nodejs.org/job/citgm-smoker/2199/nodes=win-vs2017/console

18:02:05 error: clinic npm:         | npm-test Timed Out  
18:02:05 error: failure             | Test Timed Out      
18:02:08 { [Error: EBUSY: resource busy or locked, rmdir 'C:\Users\ADMINI~1\AppData\Local\Temp\4573097d-4499-4983-a4de-89db228e5837\clinic']
18:02:08   errno: -4082,
18:02:08   code: 'EBUSY',
18:02:08   syscall: 'rmdir',
18:02:08   path:
18:02:08    'C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\4573097d-4499-4983-a4de-89db228e5837\\clinic' }
18:02:08 Build step 'Conditional steps (multiple)' marked build as failure

@sam-github

I wonder if clinic causes a segfault for test purposes, and the kernels won't delete the binaries while they are being core dumped, or something of the like?

@richardlau
Member

Since the clinic tests have timed out, I'm wondering if clinic is spawning processes and our timeout logic in citgm only kills the parent process, leaving something behind that is still writing into the directory.
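
One way a timeout could take descendants down along with the parent, sketched under the assumption of a POSIX host: start the child as the leader of its own process group and signal the whole group on timeout. `runWithTimeout` is a hypothetical helper written for this illustration, not citgm's timeout logic, and the negative-pid form of `process.kill` does not work on Windows.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: run a command and, on timeout, kill its whole process
// group instead of only the direct child. POSIX only.
function runWithTimeout(command, args, timeoutMs) {
  return new Promise((resolve, reject) => {
    // detached: true makes the child the leader of a new process group,
    // so processes it spawns stay in that group unless they leave it.
    const child = spawn(command, args, { detached: true, stdio: 'ignore' });
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      try {
        process.kill(-child.pid, 'SIGKILL'); // negative pid targets the group
      } catch (err) {
        if (err.code !== 'ESRCH') reject(err); // ESRCH: group already gone
      }
    }, timeoutMs);

    child.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on('exit', (code, signal) => {
      clearTimeout(timer);
      resolve({ code, signal, timedOut });
    });
  });
}
```

For example, `runWithTimeout(process.execPath, ['-e', 'setTimeout(() => {}, 60000)'], 300)` would be expected to resolve with `timedOut` set once the group has been killed. A descendant that deliberately moves itself into a different process group would still escape this approach.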

@richardlau
Member

I think this was addressed by #829.
