Reprovision the linuxone s390x machines #2080

Closed
AshCripps opened this issue Dec 6, 2019 · 65 comments · Fixed by #2104

Comments

@AshCripps
Member

AshCripps commented Dec 6, 2019

The current linuxone machines (*-rhel72-s390x-*) will be switched off in the new year, and we are being migrated to a new datacentre's machines (EDIT(sam): corrected from "new data").

We have been given access to the five replacement machines, and work is under way to ansible them and set them up for CI.

@AshCripps
Member Author

AshCripps commented Dec 13, 2019

All five machines have been ansibled and a build+test passes.

Next steps:

  • Add machines to Jenkins
  • Test builds pass on the release and test machines
  • Begin to switch over the jobs to these new machines (will need new jobs I think)
  • Deprovision the old s390x machines as they will be turned off by the end of the year

@richardlau
Member

  • Begin to switch over the jobs to these new machines (will need new jobs I think)

We shouldn't need new jobs.

@sam-github
Contributor

Note: need to change the /data reference to:

/home/iojs/git/io.js.reference

in the main job when I switch the labels.

@sam-github
Contributor

sam-github commented Dec 18, 2019

Failed:

10:03:04 Started by upstream project "node-test-commit-linuxone-sam-github" build number 8
10:03:04 originally caused by:
10:03:04  Started by user Sam Roberts
10:03:04 Running as SYSTEM
10:03:04 [EnvInject] - Loading node environment variables.
10:03:04 Building remotely on test-ibm-rhel7-s390x-4 (7.7 s390x-RedHatEnterprise RedHatEnterprise-7.7 rhel7-s390x s390x-RedHatEnterprise-7.7 RedHatEnterprise s390x) in workspace /home/iojs/build/workspace/node-test-commit-linuxone-sam-github/nodes/rhel7-s390x
10:03:04 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 148.100.86.94/148.100.86.94:38882
10:03:04 java.lang.NoClassDefFoundError: Could not initialize class com.sun.proxy.$Proxy11
10:03:04 Caused: java.io.IOException: Remote call on JNLP4-connect connection from 148.100.86.94/148.100.86.94:38882 failed

Tried again, I think https://ci.nodejs.org/computer/test-ibm-rhel7-s390x-4 is broken, taking it offline.

@sam-github
Contributor

sam-github commented Dec 18, 2019

not ok 2109 sequential/test-timers-throw-reschedule # TODO : Fix flaky test

ci-release:

If builds pass on all the machines, I will change the build label in node-test-commit-linuxone from rhel72-s390x to rhel7-s390x. Once that's green for a while, we'll remove the old machines.

@richardlau
Member

If builds pass on all the machines, I will change the build label in node-test-commit-linuxone from rhel72-s390x to rhel7-s390x. Once that's green for a while, we'll remove the old machines.

We will need to update the other jobs that build on Linux One (e.g. CITGM, libuv, V8, node-addons-api) before removing the old machines. I can start updating some of those in a few hours' time.

@sam-github
Contributor

Thanks for the reminder, and that'd be great, thanks.

Switch node-test-commit-linuxone to the new machines:

@richardlau
Member

richardlau commented Dec 19, 2019

@richardlau
Member

not ok 348 - udp_multicast_join
# timeout
# Output from process `udp_multicast_join`: (no output)
not ok 349 - udp_multicast_join6
# timeout
# Output from process `udp_multicast_join6`: (no output)

Firewall config? cc @AshCripps @sam-github
FYI @nodejs/libuv
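
When comparing runs on the old and new machines, it can help to pull just the failing test names out of the TAP output. A minimal sketch, assuming the log is TAP on stdin in the format shown above; `failed_tap_tests` is a hypothetical helper name, not part of any existing job:

```shell
#!/bin/sh
# Sketch: list the names of failing tests from TAP output on stdin.
# failed_tap_tests is a hypothetical helper name.
failed_tap_tests() {
  # Keep only "not ok N - name" lines and strip the prefix.
  # The "- " separator is treated as optional.
  sed -n 's/^not ok [0-9][0-9]* \(- \)\{0,1\}//p'
}
```

Piping a saved console log through `failed_tap_tests` on both sets of machines and diffing the results would show which failures are new.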

@richardlau
Member

* https://ci.nodejs.org/job/libuv-test-commit-linux/
  Test build: https://ci.nodejs.org/job/libuv-test-commit-linux/1718/nodes=rhel7-s390x/
  Looks like some tests failed that weren't seen before:
not ok 348 - udp_multicast_join
# timeout
# Output from process `udp_multicast_join`: (no output)
not ok 349 - udp_multicast_join6
# timeout
# Output from process `udp_multicast_join6`: (no output)

Firewall config? cc @AshCripps @sam-github
FYI @nodejs/libuv

Looks like we (=@miladfarca) tweaked something on the old machines: libuv/libuv#2185 (comment)

@AshCripps
Member Author

AshCripps commented Dec 19, 2019

@richardlau thought it had to be there for a reason. I can add the rules back to the ansible; I'll PR them in shortly.

I know they failed because the dest file /etc/sysconfig/iptables doesn't exist on the machines, so I'll need to create it first.

@AshCripps
Member Author

So I've been able to add the rules back to the ansible and the file gets updated.
I had to create the file using iptables-save.

The issue now is that I can't restart iptables to get the rules to take effect, because iptables.service is not running. From a quick google, iptables.service doesn't run by default because firewalld is used instead, so I have to install a yum package to get the service.

This is getting beyond my depth, so I was wondering if anyone else had any ideas or knows if I'm missing anything else?
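
For reference, a minimal sketch of the usual firewalld-to-iptables switch on RHEL 7. It assumes the `iptables-services` package is what provides `iptables.service`, that the rules file /etc/sysconfig/iptables already exists, and that the script runs as root. The `run` helper and the `DRY_RUN` variable are hypothetical, added only so the steps can be previewed without touching the host; this is not necessarily what the ansible change will end up doing.

```shell
#!/bin/sh
# Sketch only: switch a RHEL 7 host from firewalld to the iptables service.
# Assumption: the iptables-services package supplies iptables.service.
# DRY_RUN and run() are hypothetical helpers, not part of any existing playbook.
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command in dry-run mode, execute it otherwise.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

switch_to_iptables() {
  run systemctl stop firewalld
  run systemctl disable firewalld
  run yum install -y iptables-services
  run systemctl enable iptables
  # Restarting makes the service load the rules in /etc/sysconfig/iptables.
  run systemctl restart iptables
}

switch_to_iptables
```

By default this only prints the commands; running it with `DRY_RUN=0` as root would apply them.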

@richardlau
Member

The V8 job has socket time outs on the new machines (#2080 (comment)). Does anyone (@nodejs/v8-update @nodejs/platform-s390 ?) know if the v8test label that is on https://ci.nodejs.org/computer/test-linuxonecc-rhel72-s390x-1/ implies any additional set up that was done there that we would need to replicate to the new machines?

@sam-github
Contributor

For the v8 builds, it looks like gn and ninja need copying to all the rhel-s390x hosts. We generally do this by packing the redistributables into a package and putting them on the ci download host, but that takes infra privs. Efforts are underway by @miladfarca to get the upstream google projects to include these binaries, at which point this won't be necessary anymore.

The old machines look like:

% parallel-ssh -h hosts/rhel72-s390x -i ls -l '/home/iojs/build-tools/{gn,ninja}'
[1] 08:19:15 [SUCCESS] test-linuxonecc-rhel72-s390x-2
-rwx--xr-x 1 root root 4116371 Apr  3  2018 /home/iojs/build-tools/gn
-rwxr-xr-x 1 root root 3448520 Apr  3  2018 /home/iojs/build-tools/ninja
[2] 08:19:15 [FAILURE] test-linuxonecc-rhel72-s390x-3 Exited with error code 2
Stderr: ls: cannot access /home/iojs/build-tools/gn: No such file or directory
ls: cannot access /home/iojs/build-tools/ninja: No such file or directory
[3] 08:19:15 [SUCCESS] test-linuxonecc-rhel72-s390x-1
lrwxrwxrwx 1 iojs iojs      20 Aug  9 16:50 /home/iojs/build-tools/gn -> /data/iojs/gn/out/gn
-rwxr-xr-x 1 root root 3448520 Apr  3  2018 /home/iojs/build-tools/ninja
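
A small sketch of a per-host check for the same two binaries, which could be run on each new machine to confirm the tools are in place. The function name `check_build_tools` is hypothetical; the default directory is the one shown in the listing above.

```shell
#!/bin/sh
# Sketch: report which of gn and ninja are missing or not executable.
# check_build_tools is a hypothetical helper name.
check_build_tools() {
  dir="${1:-/home/iojs/build-tools}"
  status=0
  for tool in gn ninja; do
    # -x follows symlinks, so a link to a real binary counts as present.
    if [ -x "$dir/$tool" ]; then
      echo "ok: $dir/$tool"
    else
      echo "missing: $dir/$tool"
      status=1
    fi
  done
  return "$status"
}
```

Calling `check_build_tools` with no argument checks /home/iojs/build-tools and returns non-zero if either tool is absent, so it can be used as a pass/fail probe.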

@richardlau
Member

I'll try to update the CITGM jobs later this evening (which are the remaining jobs that I know of -- please comment if there are any that have been missed).

@sam-github
Contributor

I copied the build-tools directory from test-linuxonecc-rhel72-s390x-2 onto test-ibm-rhel7-s390x-*. That should unblock this for now, but the gn and ninja binaries should be curled, put into the infrastructure downloads, and ansibled.

@richardlau
Member

I copied the build-tools directory from test-linuxonecc-rhel72-s390x-2 onto test-ibm-rhel7-s390x-*. That should unblock this for now, but the gn and ninja binaries should be curled, put into the infrastructure downloads, and ansibled.

Thanks. Started another test build: https://ci.nodejs.org/job/node-test-commit-v8-linux/2717/

@richardlau
Member

I copied the build-tools directory from test-linuxonecc-rhel72-s390x-2 onto test-ibm-rhel7-s390x-*. That should unblock this for now, but the gn and ninja binaries should be curled, put into the infrastructure downloads, and ansibled.

Thanks. Started another test build: https://ci.nodejs.org/job/node-test-commit-v8-linux/2717/

Unfortunately this has also failed with socket time outs.

@sam-github
Contributor

That sounds like a symptom of #2104, @AshCripps has a plan there, he thinks he's found the right way to switch the systems from firewalld to iptables.

@richardlau
Member

That sounds like a symptom of #2104, @AshCripps has a plan there, he thinks he's found the right way to switch the systems from firewalld to iptables.

If it helps, we remove firewalld on centos and fedora: #1879
Looks like we used to tweak firewalld rules on fedora but dropped it in #1977

@AshCripps
Member Author

Tried again, I think https://ci.nodejs.org/computer/test-ibm-rhel7-s390x-4 is broken, taking it offline.

@rsam Had a discussion with @sxa555 and we couldn't see anything wrong with -4 aside from a connection issue, so I restarted the agent and ran a build on it, which passed - https://ci.nodejs.org/view/All/job/node-test-commit-linuxone-sam-github/15/nodes=rhel7-s390x4/

@richardlau
Member

Reopening as the issue(s) building/testing V8 are still to be resolved.

@AshCripps
Member Author

whoops hit the wrong button

@richardlau
Member

whoops hit the wrong button

I don't think you did -- GitHub autoclosed because #2104 had a "fixes" line.

@miladfarca

miladfarca commented Dec 28, 2019

devtoolset packages are only available through Red Hat. The devtoolset-6 RPMs can be found on box; I have sent invites in case you need them, and they can be installed using yum. For details on devtoolset and how to access, install, and use other versions, look under issue number 11 in our internal node zenhub page, where I added some details some time ago.

GN has started using C++17 features since around Sep and needs gcc >=7 to compile. You can use any machine with devtoolset to compile and copy the binary. Seems like the latest GN on the other s390 machines is working. We are working on adding cross compiled GN to devtoolset natively.

Regarding the V8 failure messages, make sure you have all the prerequisite packages installed (e.g. pkg-config) and your env variables are set accordingly. Here are the packages we install on our Ubuntu test boxes:

pkg-config 
libnss3-dev 
libcups2-dev 
libglib2.0-dev 
libpango1.0-dev 
libgconf2-dev
libgnome-keyring-dev
libatk1.0-dev
libgtk-3-dev

@AshCripps
Member Author

AshCripps commented Dec 30, 2019

pkg-config 
libnss3-dev 
libcups2-dev 
libglib2.0-dev 
libpango1.0-dev 
libgconf2-dev
libgnome-keyring-dev
libatk1.0-dev
libgtk-3-dev

I found libgconf2 was missing; installing it pulled in glib as well. Kicked off a build to test - https://ci.nodejs.org/job/node-test-commit-v8-linux/2746/

EDIT: it passed! 🎉 https://ci.nodejs.org/job/node-test-commit-v8-linux/2746/nodes=rhel7-s390x,v8test=v8test/

AshCripps pushed a commit to AshCripps/build that referenced this issue Dec 30, 2019
@AshCripps
Member Author

I have disabled the three old machines in Jenkins and am waiting to see if there's any fallout before deleting them, but all jobs should be using the new rhel7-s390x labeled machines.

@sam-github
Contributor

Took the old rhel s390x machine offline in ci-release.

The last step I need someone from @nodejs/releasers to check: is the release ssh key setup correctly?

Please try:

ssh release-ibm-rhel7-s390x-1

I'm not on releasers, so I don't have the private key and can't confirm. I can ssh in using the IBM mgmt key, so I copied the authorized ssh key from the AIX machines.

I think that's all the release specific setup needed.

@targos
Member

targos commented Dec 31, 2019

Releasers don't ssh into release machines. I'm not aware of having access to any of the existing ones.

@sam-github
Contributor

I looked more closely, it looks like the only people with ssh access to all the release machines are:

% ls .gpg 
'bugs@bergstroem.nu'         'reis@janeasystems.com'
'michael_dawson@ca.ibm.com'  'rod@vagg.org'

@jbergstroem @joaocgreis @rvagg @mhdawson --- can one of you confirm that ssh release-ibm-rhel7-s390x-1 works for you with the nodejs_build_release private key?
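
A non-interactive way to run that check, as a rough sketch. `check_release_ssh` is a hypothetical helper, and the key path in the usage note is an assumption about where the private key is stored locally; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/bin/sh
# Sketch: confirm that key-based login to a release host works.
# check_release_ssh is a hypothetical helper name.
check_release_ssh() {
  host="$1"
  key="$2"
  # BatchMode disables password prompts, so a bad key fails instead of hanging.
  if ssh -o BatchMode=yes -o ConnectTimeout=10 -i "$key" "$host" true; then
    echo "ssh ok: $host"
  else
    echo "ssh failed: $host"
    return 1
  fi
}
```

For example, `check_release_ssh release-ibm-rhel7-s390x-1 ~/.ssh/nodejs_build_release` (key path assumed) prints one line and returns non-zero on failure.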

@joaocgreis
Member

@sam-github it does not work for me (other release servers work though).

@sam-github
Contributor

@joaocgreis do you think you could fix that? I'm not a member of the releasers team, so I don't have access to the secrets, so I'm not in a great position to put those secrets on the machine. If my copy of the .ssh authorized keys across didn't work, I'm not sure what else I can do.

@sam-github
Contributor

@joaocgreis contacted me offline and helped get the correct releasers key onto the release box, so anyone with release infrastructure access should be able to ssh in.

@mhdawson
Member

mhdawson commented Jan 6, 2020

@sam-github ssh-d into the release rhel7 with the nodejs_build_release key.

@sam-github
Contributor

sam-github commented Jan 13, 2020

@sam-github
Contributor

https://ci.nodejs.org/computer/test-ibm-rhel7-s390x-1/ -- does anybody know where the labels come from?

rhel7-s390x is explicitly configured in https://ci.nodejs.org/computer/test-ibm-rhel7-s390x-1/configure, and it is used; all the others are unused (they have no jobs that use them). I was going to clean up the unused labels, but I don't see how. It's messy and confusing, but it doesn't cause any operational problems.

@sam-github
Contributor

Nothing left to do, the label cleanup is unrelated tidying.

@richardlau
Member

richardlau commented Jan 13, 2020

https://ci.nodejs.org/computer/test-ibm-rhel7-s390x-1/ -- does anybody know where the labels come from?

rhel7-s390x is explicitly configured in https://ci.nodejs.org/computer/test-ibm-rhel7-s390x-1/configure, and it is used; all the others are unused (they have no jobs that use them). I was going to clean up the unused labels, but I don't see how. It's messy and confusing, but it doesn't cause any operational problems.

I think they're coming from this plugin: https://github.com/jenkinsci/platformlabeler-plugin
[Screenshot: Jenkins configuration page where the plugin is enabled]

@sam-github
Contributor

@richardlau OK, that makes sense. I'll assume we are using that.

@richardlau
Member

@richardlau OK, that makes sense. I'll assume we are using that.

We are. When I first posted I was searching through the list of plugins installed in our Jenkins for the word label and picked out the most plausible sounding one (after reading the description on the linked page). I've subsequently found the configuration page where it's enabled and posted the screenshot as an edit to my previous comment.
