SP should not auto-reboot host in response to a host-reported boot failure #1614

cbiffle · 2024-02-14T20:58:11Z

(Fallout from #1613)

Currently, if the host reports a boot failure over the IPCC link, we respond by recording the information and resuming normal business. When the host immediately follows that up with a reboot request, we dutifully reboot the host.

Because we haven't taken any additional actions to fix the boot failure (by, for instance, flipping the host flash mux), this will probably always produce a reboot loop.

While this sort of reboot loop is likely not destructive, it's distracting: the machine cycles, the logs/ringbufs get overwritten, power is wasted, etc. I think after a boot failure like this, we should probably not attempt to boot the host until we have reason to believe the failure has been repaired.

The SP itself doesn't have sufficient context to know how to "repair" such a failure. If the failure was hit while attempting a recovery image boot through Wicket, for instance, we specifically do not want to do an automatic slot fallback. If we hit it during a production software upgrade, we might, depending on circumstances, want to do a slot fallback. The right answer in basically all cases appears to be: escalate to the control plane, where context is more easily available.

So, I think we should stop rebooting the host after a boot failure, period, and wait for messages over the network. The boot failure is stored in a place the control plane can get to it (in the control-plane-agent). If we had a way of proactively sounding an alarm, we could do that, but for now it'd have to be polled.

Concretely, I discussed this briefly with @wesolows and the simplest thing appears to be:

- Honor reboot requests from the host normally, except:
- If we get a host boot failure message, set a flag that causes the next reboot request to be interpreted as "power down and intervene."
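
To make the two rules above concrete, here is a minimal sketch of the flag logic as a small state machine. It is not the actual host-sp-comms code: `HostMessage`, `Action`, `RebootPolicy`, and `handle` are hypothetical names invented for illustration, and clearing the flag once it has been consumed is an assumption about what "the next reboot request" means.

```rust
/// Hypothetical subset of host-to-SP messages (illustrative names only,
/// not the real IPCC message types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HostMessage {
    BootFailed,
    RequestReboot,
}

/// Hypothetical actions the SP could take in response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    /// Store the failure where the control plane can retrieve it.
    RecordFailure,
    /// Power-cycle the host as requested.
    RebootHost,
    /// Power the host down and wait for outside intervention.
    PowerOffHost,
}

/// Tracks whether a boot failure has been reported since the last reboot
/// request was handled.
#[derive(Debug, Default)]
struct RebootPolicy {
    boot_failure_pending: bool,
}

impl RebootPolicy {
    fn handle(&mut self, msg: HostMessage) -> Action {
        match msg {
            HostMessage::BootFailed => {
                // Remember the failure so the next reboot request is
                // treated differently.
                self.boot_failure_pending = true;
                Action::RecordFailure
            }
            HostMessage::RequestReboot if self.boot_failure_pending => {
                // Assumption: the flag applies to one request only, so it
                // is cleared here.
                self.boot_failure_pending = false;
                Action::PowerOffHost
            }
            // No failure reported: honor the request normally.
            HostMessage::RequestReboot => Action::RebootHost,
        }
    }
}
```

In this sketch, a `RequestReboot` with no preceding `BootFailed` yields `RebootHost`, while a `BootFailed` followed by `RequestReboot` yields `RecordFailure` and then `PowerOffHost`. Keeping the decision in one `match` means the "except" case sits directly next to the normal path.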

This is in response to #1613. If the host reports a boot failure (such as a phase mismatch, but not limited to that reason), simply rebooting it blindly is unlikely to fix the problem. We need intervention from a higher power (the control plane) to fix the issue. So, to avoid a bootloop that wastes energy and overwrites our circular buffers with spam, this change alters the response to the IPCC Request Reboot message when it is received shortly after a Boot Failed message: it is interpreted as a power-off request. Fixes #1614.

cbiffle linked a pull request Feb 14, 2024 that will close this issue

host-sp-comms: do not reboot host on boot failure #1618

Open
