ARM runner has been stuck for multiple days #80

Open
Biswa96 opened this issue Oct 16, 2023 · 23 comments

Comments

@Biswa96
Member

Biswa96 commented Oct 16, 2023

This CI job has been running for days: https://github.com/msys2-arm/msys2-autobuild/actions/runs/6508662089

@jeremyd2019

@Biswa96 Biswa96 changed the title from "ARM runner has been stuck for mutiple days" to "ARM runner has been stuck for multiple days" Oct 16, 2023
@jeremyd2019
Member

It managed to hang right as I was leaving on a long weekend trip, and I didn't notice until I got back. I wanted to get a fresh runner going anyway, for the latest Windows Updates, but was waiting until I got back to try to avoid any issues while I was gone 😁. New runner is going now.

@Biswa96
Member Author

Biswa96 commented Oct 19, 2023

@jeremyd2019 Could you check whether this CI job is stuck again? https://github.com/msys2-arm/msys2-autobuild/actions/runs/6570301888

@lazka
Member

lazka commented Nov 24, 2023

@jeremyd2019
Member

I've been ruminating on the idea of some sort of 'watchdog' to detect and kill stuck pacman processes automatically, but I haven't settled on the best language/technology to do so. Python seems most convenient since autobuild is already Python: I could add a background thread, like I did for polling the token, but I'm not familiar with the process querying/killing modules.

What I've got so far is a Cygwin command to get the Cygwin pid of the process I want to kill (what I really want is the child pacman process; this gets the newest pacman process older than 1800 seconds):

pgrep -xn -O 1800 pacman

coupled with the script I already had (because when stuck in this state, Cygwin kill is not sufficient):
https://github.com/jeremyd2019/winautoconfig/blob/master/msys2-runner-setup/setupscripts/wkill.sh
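A rough Python sketch of the background-thread watchdog idea described above. It is only an illustration under stated assumptions: `ProcInfo`, `find_stale`, and `Watchdog` are hypothetical names, and the `list_processes` and `kill` callables are placeholders for whatever process-query and kill mechanism ends up being used (for example the `pgrep` command and `wkill.sh` script mentioned above). The selection rule mirrors `pgrep -xn -O 1800 pacman`: the newest process with an exact name match that is older than the age limit.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ProcInfo:
    """Hypothetical process record; not part of autobuild."""
    pid: int
    name: str
    start_time: float  # seconds since the epoch


def find_stale(procs: Iterable[ProcInfo], now: float,
               name: str = "pacman", max_age: float = 1800.0) -> Optional[ProcInfo]:
    """Return the newest process named `name` older than `max_age` seconds, if any."""
    stale = [p for p in procs if p.name == name and now - p.start_time > max_age]
    return max(stale, key=lambda p: p.start_time, default=None)


class Watchdog(threading.Thread):
    """Periodically kills a stale process using caller-supplied callables."""

    def __init__(self, list_processes: Callable[[], Iterable[ProcInfo]],
                 kill: Callable[[int], None],
                 interval: float = 60.0, max_age: float = 1800.0) -> None:
        super().__init__(daemon=True)
        self._list_processes = list_processes  # assumption: supplied by the caller
        self._kill_pid = kill                  # assumption: supplied by the caller
        self._interval = interval
        self._max_age = max_age
        self._stop_event = threading.Event()

    def check_once(self, now: Optional[float] = None) -> Optional[ProcInfo]:
        """Run a single check; returns the process that was killed, or None."""
        now = time.time() if now is None else now
        victim = find_stale(self._list_processes(), now, max_age=self._max_age)
        if victim is not None:
            self._kill_pid(victim.pid)
        return victim

    def run(self) -> None:
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop_event.wait(self._interval):
            self.check_once()

    def stop(self) -> None:
        self._stop_event.set()
```

Keeping the query and kill steps as injected callables means the timing logic can be tested without touching real processes, and the kill step can be swapped out if a plain kill turns out not to be enough.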

@Biswa96
Member Author

Biswa96 commented Nov 26, 2023

It would be clearer if the reason for such CI failures were explained.

@jeremyd2019
Member

jeremyd2019 commented Nov 28, 2023

lost power, so any lack of runner in the near future will be due to that

power is back

@lazka lazka pinned this issue Dec 29, 2023
@lazka
Member

lazka commented Dec 29, 2023

@lazka
Member

lazka commented Feb 11, 2024

@jeremyd2019
Member

Unstuck it. The PowerShell variant in git-for-windows/git-for-windows-automation#61 (comment) was intriguing; it seems like it could be close to being turned into a 'watchdog'. It would just need to also query the CreationDate field to find any pacman processes that have been running a long time (like a half hour? or hour?), and then be set up to run continuously (scheduled task?). Of course, I'd much rather get whatever bug is causing this fixed...
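The CreationDate age check could look roughly like the Python sketch below. It assumes the timestamp arrives as a DMTF datetime string (the form `Get-WmiObject` reports for `Win32_Process.CreationDate`; `Get-CimInstance` returns a DateTime object instead, in which case no parsing is needed). The function names `parse_dmtf` and `running_too_long` and the 30-minute default are illustrative choices, not taken from the linked comment.

```python
from datetime import datetime, timedelta, timezone


def parse_dmtf(stamp: str) -> datetime:
    """Parse a DMTF datetime such as '20240211090000.000000+060'.

    Layout: yyyymmddHHMMSS.mmmmmm followed by a sign and a three-digit
    UTC offset in minutes.
    """
    base = datetime.strptime(stamp[:21], "%Y%m%d%H%M%S.%f")
    offset_minutes = int(stamp[21:25])  # e.g. '+060' -> 60, '-480' -> -480
    return base.replace(tzinfo=timezone(timedelta(minutes=offset_minutes)))


def running_too_long(creation_date: str, now: datetime,
                     limit: timedelta = timedelta(minutes=30)) -> bool:
    """True if the process started more than `limit` before `now`.

    `now` must be timezone-aware, e.g. datetime.now(timezone.utc).
    """
    return now - parse_dmtf(creation_date) > limit
```

A watchdog would apply `running_too_long` to each pacman process it finds and kill the ones that exceed the limit.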

@lazka
Member

lazka commented May 3, 2024

@jeremyd2019
Member

There's a stuck job now, but it doesn't seem to be the runner this time. Probably something on GitHub's end.

@Biswa96
Member Author

Biswa96 commented Jul 1, 2024

@jeremyd2019
Member

jeremyd2019 commented Jul 1, 2024

This seems to be a different issue. I think maybe the machine rebooted. I did a quick check and didn't notice any excess packages installed.

@Biswa96
Member Author

Biswa96 commented Jul 26, 2024

@lazka
Member

lazka commented Aug 6, 2024

"echo: write error: No space left on device"

@jeremyd2019
Member

What?!? I deleted some of the cruft under %USERPROFILE% (go, .cargo mainly) and freed up some space. Will try to build Rust again.

@lazka
Member

lazka commented Aug 6, 2024

Is #76 related?

Otherwise, try good old WinDirStat :)

@Biswa96
Member Author

Biswa96 commented Aug 25, 2024

@lazka
Member

lazka commented Sep 8, 2024

@lazka
Member

lazka commented Oct 19, 2024

@lazka
Member

lazka commented Oct 19, 2024

https://github.com/msys2-arm/msys2-autobuild/actions/runs/11418583395/job/31772208187

seems to have gotten unstuck and errored out after some hours. (or anyone poked at it?)

Unrelated note: Runner groups are now available for everyone it seems. Not that it makes much difference with the current setup with a separate org, but good to know: https://github.blog/changelog/2024-10-17-actions-runner-groups-now-available-for-organizations-on-free-plan/

@jeremyd2019
Member

jeremyd2019 commented Oct 20, 2024

> seems to have gotten unstuck and errored out after some hours. (or anyone poked at it?)

I killed the child pacman process, as usual.

> Unrelated note: Runner groups are now available for everyone it seems. Not that it makes much difference with the current setup with a separate org, but good to know: https://github.blog/changelog/2024-10-17-actions-runner-groups-now-available-for-organizations-on-free-plan/

Yeah, I could do away with the extra labels to differentiate between autobuild and CI instances and enforce it with runner groups instead, presumably.


3 participants