Currently, if the host reports a boot failure over the IPCC link, we respond by recording the information and resuming normal business. When the host immediately follows that up with a reboot request, we dutifully reboot the host.
Because we haven't taken any additional actions to fix the boot failure (by, for instance, flipping the host flash mux), this will probably always produce a reboot loop.
While this sort of reboot loop is likely not destructive, it's distracting: the machine cycles, the logs/ringbufs get overwritten, power is wasted, etc. I think after a boot failure like this, we should probably not attempt to boot the host until we have reason to believe the failure has been repaired.
The SP itself doesn't have sufficient context to know how to "repair" such a failure. If the failure was hit while attempting a recover image boot through Wicket, for instance, we specifically do not want to do an automatic slot fallback. If we hit it during a production software upgrade, we might, depending on circumstances, want to do a slot fallback. The right answer in basically all cases appears to be: escalate to the control plane, where context is more easily available.
So, I think we should stop rebooting the host after a boot failure, period, and wait for messages over the network. The boot failure is stored where the control plane can get to it (in the control-plane-agent). If we had a way of proactively sounding an alarm, we could do that, but for now it would have to be polled.
Concretely, I discussed this briefly with @wesolows and the simplest thing appears to be:
- Honor reboot requests from the host normally, except:
- If we get a host boot failure message, set a flag that causes the next reboot request to be interpreted as "power down and intervene."
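The two rules above can be sketched as a small state machine. This is only an illustration, not the SP firmware's actual code: `HostMessage`, `HostAction`, and `RebootPolicy` are hypothetical names, and the sketch assumes the flag is consumed by the first reboot request that follows a boot failure.

```rust
/// Messages the host can send over IPCC (hypothetical subset, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostMessage {
    BootFailed,
    RequestReboot,
}

/// What the SP does with the host in response (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostAction {
    /// Record the event and carry on.
    None,
    /// Honor the request and reboot the host.
    Reboot,
    /// Power the host down and wait for the control plane to intervene.
    PowerOff,
}

/// Tracks whether a boot failure was reported since the last reboot request.
#[derive(Debug, Default)]
pub struct RebootPolicy {
    boot_failed: bool,
}

impl RebootPolicy {
    pub fn handle(&mut self, msg: HostMessage) -> HostAction {
        match msg {
            HostMessage::BootFailed => {
                // Remember the failure; the message alone changes no power state.
                self.boot_failed = true;
                HostAction::None
            }
            HostMessage::RequestReboot => {
                // `take` reads the flag and resets it to false, so only the
                // *next* reboot request after a failure is reinterpreted.
                if std::mem::take(&mut self.boot_failed) {
                    HostAction::PowerOff
                } else {
                    HostAction::Reboot
                }
            }
        }
    }
}
```

With this shape, a reboot request that is not preceded by a boot failure still reboots the host as before. Only the sequence "boot failed, then reboot request" ends in a power-off.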
This is in response to #1613. If the host reports a boot failure (such
as a phase mismatch, but not limited to that reason), simply rebooting it
blindly is unlikely to fix the problem. We need intervention from a
higher power (the control plane) to fix the issue.
So to avoid a bootloop that wastes energy and overwrites our circular
buffers with spam, this change alters the response to the IPCC Request
Reboot message if received shortly after a Boot Failed message -- it is
interpreted as a power off request.
Fixes #1614.
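The commit message describes the reboot request as being reinterpreted only when it arrives "shortly after" a Boot Failed message. One way to express that timing condition is sketched below. The window length, the millisecond timestamps, and the names `BootFailTracker`, `RebootResponse`, and `BOOT_FAIL_WINDOW_MS` are all assumptions made for illustration, not taken from the actual change.

```rust
/// Assumed window: how long after a Boot Failed message a reboot request
/// is still treated as a power-off request. The real value is not stated.
pub const BOOT_FAIL_WINDOW_MS: u64 = 5_000;

/// How the SP responds to an IPCC reboot request (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RebootResponse {
    Reboot,
    PowerOff,
}

/// Remembers when the host last reported a boot failure (hypothetical type).
#[derive(Debug, Default)]
pub struct BootFailTracker {
    last_boot_fail_ms: Option<u64>,
}

impl BootFailTracker {
    /// Record the time at which a Boot Failed message was received.
    pub fn on_boot_failed(&mut self, now_ms: u64) {
        self.last_boot_fail_ms = Some(now_ms);
    }

    /// Decide how to treat a reboot request received at `now_ms`.
    pub fn on_request_reboot(&mut self, now_ms: u64) -> RebootResponse {
        // `take` clears the recorded failure, so it affects one request only.
        match self.last_boot_fail_ms.take() {
            Some(t) if now_ms.saturating_sub(t) <= BOOT_FAIL_WINDOW_MS => {
                RebootResponse::PowerOff
            }
            // No recent failure (or it is too old): reboot as usual.
            _ => RebootResponse::Reboot,
        }
    }
}
```

A time window like this keeps a stale failure report from turning a much later, unrelated reboot request into a power-off. Whether the real change uses a timer or a simple flag is not specified here.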
(Fallout from #1613)