MacStadium maintenance window on January 23rd #3616
Comments
Update (9 AM ET):
|
Update (10 AM ET):
|
We will need to recover the machines manually to get the Orka cluster working again. cc: @nodejs/build. I am not available today, but I can potentially work on it tomorrow; feel free to take the lead if you want. IMPORTANT: You can use this table (#3240 (comment)) as a reference for where to locate the VMs within the cluster, so that the VMs stay aligned with the inventory. |
I am afraid that I won't be able to work on it today; I will only be able to start on it next Monday. 😓 |
@UlisesGascon thanks for working on it. One question: is the machine recovery needed because the machines were not shut down properly (I only noticed the original issue too late to help out), or would it have been required regardless? |
This situation is a bit tricky, drawing from past experiences such as #3112. The VMs allocated in specific slots, including port mapping, are expected to be shut down and effectively 'removed' from the Orka cluster nodes. Once the cluster is back, a manual relocation process is necessary to create new VMs using the images. This ensures the correct slots are filled, maintaining the expected mapping from the inventory and Jenkins (IPs and ports). In this case, we didn't I'll have a clearer picture on Monday. Unfortunately, I haven't been able to connect yet to check the status of the cluster or the nodes after the upgrade. |
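The slot check described above (each VM must land on the node and port that the inventory and Jenkins expect) can be sketched roughly as follows. This is only an illustration: the node names, port numbers, and the `Slot` and `find_misplaced` helpers are hypothetical, and the real mapping lives in the table referenced in #3240.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Slot:
    """Where a VM is expected to live (hypothetical model, not the Orka API)."""
    node: str      # Orka cluster node hosting the VM
    ssh_port: int  # port mapped to the VM's SSH service


def find_misplaced(
    expected: Dict[str, Slot], actual: Dict[str, Slot]
) -> Dict[str, Tuple[Slot, Optional[Slot]]]:
    """Return every VM whose actual slot differs from the expected one.

    A VM missing from `actual` is reported with None as its actual slot.
    """
    problems = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            problems[name] = (want, got)
    return problems


# Assumed example data: these nodes and ports are made up for illustration.
EXPECTED = {
    "test-orka-macos11-x64-1": Slot("node-a", 8823),
    "test-orka-macos11-x64-2": Slot("node-b", 8824),
}
ACTUAL = {
    "test-orka-macos11-x64-1": Slot("node-a", 8823),
    "test-orka-macos11-x64-2": Slot("node-a", 8825),
}

for name, (want, got) in find_misplaced(EXPECTED, ACTUAL).items():
    print(f"{name}: expected {want}, found {got}")
```

Any VM reported here would need to be recreated from its image in the expected slot before Jenkins can reach it again.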
@UlisesGascon I don't think I'm up to speed enough to do the bring-back, but if a second set of hands would be helpful when you have time to look at it and I'm around, I'm happy to get on a call and help if that makes sense. |
I will start to work on it now |
The 10.15 machines are back. I am working to re-ansible the macOS 11 VMs, but the process is taking time |
I'm currently facing some challenges with the LLVM installation on macOS 11. The build process seems unusually time-consuming, taking hours (whereas I recall it used to take around 30 minutes). The process was so lengthy that the Ansible SSH connection timed out. So, I changed the strategy and executed this step manually (via SSH). I'm also puzzled about why the applied patch is |
So, the machines made some progress during the night. Currently the machines are still installing dependencies (after restoring the SSH sessions that were lost due to timeouts). I am not sure why it is so slow, but we are making progress. |
I think these long compile/install steps are due to Homebrew removing support for outdated macOS (it has to install deps from source instead of downloading prebuilt binaries). |
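One way to check this hypothesis is to look at which prebuilt bottles a formula offers. The sketch below assumes the output shape of `brew info --json=v2 <formula>`, where bottles are listed under `bottle.stable.files` keyed by an OS tag such as `big_sur`; the sample JSON is made up for illustration and `has_bottle` is a hypothetical helper, not part of Homebrew.

```python
import json
from typing import Dict


def has_bottle(brew_info_json: str, os_tag: str) -> Dict[str, bool]:
    """Map each formula name to whether a prebuilt bottle exists for os_tag.

    Expects JSON shaped like the output of `brew info --json=v2 <formula>`.
    A formula with no matching bottle has to be built from source.
    """
    data = json.loads(brew_info_json)
    result = {}
    for formula in data.get("formulae", []):
        stable = formula.get("bottle", {}).get("stable") or {}
        files = stable.get("files", {})
        # "all" marks a platform-independent bottle usable on any OS.
        result[formula["name"]] = os_tag in files or "all" in files
    return result


# Made-up sample resembling the real output, trimmed to the relevant keys.
SAMPLE = """
{
  "formulae": [
    {"name": "example-a",
     "bottle": {"stable": {"files": {"sonoma": {}, "ventura": {}}}}},
    {"name": "example-b",
     "bottle": {"stable": {"files": {"ventura": {}, "big_sur": {}}}}},
    {"name": "example-c", "bottle": {}}
  ]
}
"""

for name, available in has_bottle(SAMPLE, "big_sur").items():
    status = "bottle available" if available else "builds from source"
    print(f"{name}: {status}")
```

On a real machine the input could come from something like `subprocess.run(["brew", "info", "--json=v2", "llvm"], capture_output=True, text=True).stdout`.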
This makes total sense. We need to commit the image changes after this process because the recovery process is very long. |
The ansible process worked fine, I will finish soon with the manual steps for |
So, I am still working on the macOS 11 test machines; the dependency build is quite long |
🥳 I will commit the image changes once the queue is reduced to zero, to avoid creating further bottlenecks for the PRs. Here are the first jobs from the queue; I will check that they are passing before committing the images:
Update: the CI jobs were fine as far as I can see. |
I will start with the image commit, so I will eventually disconnect the machines from Jenkins while doing the commit. |
I got an error while connecting to the VPN. I created a support ticket SERVICE-178721. |
The login error got solved, but I needed to open a separate ticket to ask for support, as I am getting errors while saving the changes: ticket SERVICE-178790 |
I think it is fine for now, so I am closing this issue in order to unblock #3642 |
As described in ticket: SERVICE-176962
Potentially affected machines:
- release-orka-macos11-x64-1 (tracked on Jenkins)
- test-orka-macos10.15-x64-1 (test-orka-macos10.15-x64-1 is DOWN jenkins-alerts#1237)
- test-orka-macos10.15-x64-2 (test-orka-macos10.15-x64-2 is DOWN jenkins-alerts#703)
- test-orka-macos11-x64-1 (test-orka-macos11-x64-1 is DOWN jenkins-alerts#1238)
- test-orka-macos11-x64-2 (test-orka-macos11-x64-2 is DOWN jenkins-alerts#1239)

Next steps

I am not sure that I will be able to manage the "save and shut down" for the VMs before the deadline (tomorrow). Is anyone available to do it (@nodejs/build)?

test-orka-macos10.15-x64-1:
- test-orka-macos10.15-x64-1 in Orka cluster
- test-orka-macos10.15-x64-1
- test-orka-macos10.15-x64-1 in Jenkins

test-orka-macos10.15-x64-2:
- test-orka-macos10.15-x64-2 in Orka cluster
- test-orka-macos10.15-x64-2
- test-orka-macos10.15-x64-2 in Jenkins

test-orka-macos11-x64-1:
- test-orka-macos11-x64-1 in Orka cluster
- test-orka-macos11-x64-1
- test-orka-macos11-x64-1 in Jenkins

test-orka-macos11-x64-2:
- test-orka-macos11-x64-2 in Orka cluster
- test-orka-macos11-x64-2
- test-orka-macos11-x64-2 in Jenkins

release-orka-macos11-x64-1:
- release-orka-macos11-x64-1 in Orka cluster
- release-orka-macos11-x64-1
- release-orka-macos11-x64-1
- release-orka-macos11-x64-1 in Jenkins