FreeBSD tasks signaled to exit, not restarted, likely not OOM ? #592
Comments
Indeed, it seems they were preempted! Cirrus CI relies on GCE's startup script to bootstrap the Cirrus CI Agent on a VM and start executing the task. There were some issues with it on FreeBSD (see #594). Cirrus CI also relies on the shutdown script to detect preemptions: if the shutdown script is executed while a task is still executing, that means something from the outside killed it, e.g. a potential preemption. I've checked the most recent task, 4958657070759936, and the VM was preempted but the shutdown script was never executed. @emaste who is preparing and publishing these VMs? I would like to get in touch with them to add some integration tests verifying that new VMs are compatible with Cirrus CI (startup and shutdown scripts are executed as expected).
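The detection rule described in this comment can be sketched roughly as follows. This is only an illustration of the reasoning, not Cirrus CI's actual code: the names `ExitCause` and `classify_exit` are hypothetical, and the two boolean inputs are assumed to be known from elsewhere.

```python
from enum import Enum


class ExitCause(Enum):
    COMPLETED = "completed"
    EXTERNAL_KILL = "external_kill"  # e.g. a potential preemption
    UNKNOWN = "unknown"


def classify_exit(task_finished: bool, shutdown_hook_fired: bool) -> ExitCause:
    """Hypothetical helper: classify why a task stopped running."""
    if task_finished:
        # The task ran to completion, so a later shutdown is expected.
        return ExitCause.COMPLETED
    if shutdown_hook_fired:
        # The shutdown script ran while the task was still executing:
        # something outside the task stopped the VM.
        return ExitCause.EXTERNAL_KILL
    # The task stopped without the shutdown script reporting anything.
    return ExitCause.UNKNOWN
```

The task mentioned above falls into the last branch: the VM was preempted, but because the shutdown script never ran, a rule like this has no way to tell.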
I'm having the same issue with FreeBSD images. I need to re-run them and then they sometimes work as expected.
Under discussion on the FreeBSD-cloud mailing list: https://lists.freebsd.org/pipermail/freebsd-cloud/2020-March/000229.html
I just tested a shutdown script using the 12.1 RELEASE image and the latest 13-CURRENT image and the shutdown script was run, so I'm not sure what's happening.
@fkorotkov where does "Signaled to exit!" come from? Is it from a Cirrus script?
@fkorotkov I created a VM with shutdown-script custom metadata set to:
and then saw a new entry appear for each time I shut the VM down.
@fkorotkov can you share your shutdown script please?
@fkorotkov ah, so the "Signaled to exit!" is from the standard FreeBSD shutdown mechanism terminating all processes. I agree having FreeBSD-Cirrus CI would be very valuable; do you have suggestions on how to test the operation of the Cirrus shutdown script?
I've just tried it myself and I see that the shutdown script was executed. Digging through the GCP docs, I've found some limitations of shutdown scripts:
So GCP itself does not guarantee that the script will be executed every time. On second thought, I've just added an additional check for this "Signaled to exit!", so in a little while this will be an additional indication of preemption, to not rely completely on
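A minimal sketch of what such an additional check might look like, assuming the task log is available as a list of lines. The names `SIGNAL_MARKER` and `looks_preempted` are hypothetical, and this is not the actual Cirrus CI implementation; only the marker string comes from this thread.

```python
SIGNAL_MARKER = "Signaled to exit!"  # message seen in the failing task logs


def looks_preempted(log_lines, shutdown_hook_fired):
    """Hypothetical check: treat either signal as a hint of preemption."""
    if shutdown_hook_fired:
        # The shutdown script reported in, which is the original signal.
        return True
    # Fallback for when the shutdown script never ran: look for the
    # message printed when the OS terminates the task's processes.
    return any(SIGNAL_MARKER in line for line in log_lines)
```

For example, `looks_preempted(["pkg install -y cmake", "Signaled to exit!"], False)` returns `True` even though the shutdown script was never observed.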
I don't know why GCP has that limitation, but you could hook into FreeBSD's standard shutdown system to report that the instance is terminating - i.e. a /usr/local/etc/rc.d/cirrus script with a
What is the status of this?
Seems I'm still getting an occasional "Signaled to exit!" failure, e.g. https://cirrus-ci.com/task/5995876548083712
Me too. Saw one yesterday, two more just now: Happens while running
Any update here?
Any updates on this? We continue to see jobs get randomly killed, including non-FreeBSD ones. See the
Just to confirm, this happens on FreeBSD and which other OSes?
Sorry, I missed @swills' message. I think the original issue should be fixed as of early May, when Cirrus, besides the stop hook, started to check VM statuses before deleting them, detecting preemption in another way. I'll try to collect more data to verify it. @timwoj's case looks different. It's Linux and I see that the corresponding container wasn't preempted. Since the task continued execution it means that
I'm still restarting about one FreeBSD job in ten. I'll start collecting URLs of failing tasks again, so you can take a look.
@Minoru, sorry to hear that. Links will be helpful. 🙌
Of course now that I'm waiting for failures, they happen less often… But here's one that occurred just now (with FreeBSD): https://cirrus-ci.com/task/6014718282301440 I also had a couple Linux jobs fail in the same way: I'll keep collecting them in hopes of amassing enough info that the root cause becomes obvious.
Yay, failures happen more often now:
The issue was that VMs were getting preempted (shutting down due to preemption and causing signaled to exit) while GCE API was not showing the VM as preempted. I've implemented another mechanism of detecting such cases. Let's see how it goes. 🤞
It's been two weeks, and I haven't seen this issue pop up in that time. I'd consider it fixed now. Thanks for the hard work, @fkorotkov!
Awesome! Glad to hear it! 🙌
Hi, sorry for being the party pooper, but it seems like it's still happening:
I think it's also happening again for me:
I'm seeing it as well:
Maybe related to #591, but wasn't sure if they actually tracked it down to OOM, I've seen several FreeBSD tasks with "Signaled to exit!" at various points:
https://cirrus-ci.com/task/5566901552152576
https://cirrus-ci.com/task/6189245568122880
https://cirrus-ci.com/task/4958657070759936
https://cirrus-ci.com/task/6374248700706816
https://cirrus-ci.com/task/6374248700706816
https://cirrus-ci.com/task/6568457424601088
The OOM angle didn't seem right because those two are very early on, just installing dependency packages. I haven't yet checked how much memory to expect at that point, but if installing packages takes more than 8GB, I guess I'd be surprised!
Is there any way to get more information on why those are killed? If it's just preemption, I expected an auto-restart. Or let me know if there are other limits to be aware of, like disk usage, that could explain it (what's the default limit in that case?)