host-sp-comms: do not reboot host on boot failure #1618

Status: Open · wants to merge 1 commit into master
34 changes: 33 additions & 1 deletion task/host-sp-comms/src/main.rs
@@ -258,6 +258,13 @@ struct ServerImpl {
reboot_state: Option<RebootState>,
host_kv_storage: HostKeyValueStorage,
hf_mux_state: Option<HfMuxState>,

/// Set when we receive a Boot Failed message from the host. This flag
/// changes the behavior in response to the next Request Reboot message. It
/// is cleared at Request Reboot, *or* at any of the messages sent during
/// host boot, to try to keep our state from getting out of sync.
host_boot_just_failed: bool,
}

impl ServerImpl {
@@ -290,6 +297,8 @@ impl ServerImpl {
reboot_state: None,
host_kv_storage: HostKeyValueStorage::claim_static_resources(),
hf_mux_state: None,

host_boot_just_failed: false,
}
}

@@ -701,13 +710,24 @@ impl ServerImpl {

// We defer any actions until after we've serialized our response to
// avoid borrow checker issues with calling methods on `self`.
//
// NOTE: setting any action will also clear the `host_boot_just_failed`
// flag, to avoid keeping incorrect state around. Since we have no
// explicit way of tracking the host boot software's lifecycle, this is
// a compromise.
let mut action = None;

let response = match request {
HostToSp::_Unused => {
Some(SpToHost::DecodeFailure(DecodeFailureReason::Deserialize))
}
HostToSp::RequestReboot => {
action = Some(Action::RebootHost);
if core::mem::take(&mut self.host_boot_just_failed) {
Member: The use of mem::take here is very cute! But, IIUC, since we clear the flag later if action is Some, we don't actually need to clear it when we read the flag's value here, do we? Since both branches set action to Some regardless?

Up to you whether this actually ought to change, but at a first glance it kind of appeared that this was the only case where we clear the flag, so it was a bit less obvious that any branch where action is Some clears the flag (despite the comment). 🤷‍♀️

Collaborator (author): Yeah, I wound up doing both here to make it really really unlikely that we forget to clear the flag. Arguably unnecessary. What do you think would be clearest?

Member: I think maybe leaving it as is, but commenting that this also gets cleared because we've set action = Some(...)? Honestly, I think it's not really a big deal --- mostly, I wanted to make sure my understanding was correct.

Collaborator (author): The fact that it threw you suggests that I need to make it clearer, so I'll see about adding some comments.
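The pattern under discussion can be shown in isolation. This is a minimal sketch (not the host-sp-comms code itself) of what core::mem::take does with a bool flag: it returns the current value and resets the flag to its Default (false) in one expression.

```rust
// Demonstrates the read-and-clear semantics of core::mem::take on a bool.
fn take_flag(flag: &mut bool) -> bool {
    core::mem::take(flag)
}

fn main() {
    let mut host_boot_just_failed = true;
    // Reading the flag also clears it...
    assert!(take_flag(&mut host_boot_just_failed));
    assert!(!host_boot_just_failed);
    // ...so a second read observes the cleared state.
    assert!(!take_flag(&mut host_boot_just_failed));
}
```

This is why the branch condition itself already clears the flag, independently of the later blanket clear when `action` is `Some`.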

// Do _not_ reboot. Reinterpret this as a poweroff request.
action = Some(Action::PowerOffHost);
Contributor: I'm happy to look at this more tomorrow, but we might need to address the TODO I left in handle_jefe_notification. There may be a sequence of reboot operations where we still reboot even if we set this action.

Collaborator (author): Good to know, let's talk tomorrow.

Contributor: I think I have convinced myself this is okay (where "this" is really "the handling of self.reboot_state"):

  1. self.reboot_state starts out None
  2. It can only go from None to Some(_) inside a call to power_off_host(true)
  3. Once it goes to Some(_), it will eventually cause the host to come back on through one of these paths:
  • If power_off_host(true) succeeds in asking the sequencer to go to A2, it's set to Some(WaitingForA2). Once we get a notification from jefe that we've entered A2, it will change to Some(WaitingInA2RebootDelay) and start the Timers::WaitingInA2ToReboot timer. Once that timer fires, we'll enter handle_reboot_waiting_in_a2_timer, set the state to A0, and set self.reboot_state back to None.
  • If power_off_host(true) fails to ask the sequencer to go to A2 because we're already in A2, it's set to Some(WaitingInA2RebootDelay) and we start the Timers::WaitingInA2ToReboot timer. Once that timer fires, we'll enter handle_reboot_waiting_in_a2_timer, set the state to A0, and set self.reboot_state back to None.

power_off_host(true) is only called in two places: if we get a jefe notification that we've gone to A0Reset, or if the host sends us HostToSp::RequestReboot. If either of those happened and before we go through the above timer-based process (~5 seconds) to reboot the host we landed here with a host boot fail, the PowerOffHost would be effectively ignored because we'd already be in the process of rebooting the host. But that doesn't seem likely, and if it happens it's maybe fine to honor the "first request" that triggered the reboot?
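The timer-driven sequence walked through above can be condensed into a small state machine. This is a hypothetical model for illustration only: the names WaitingForA2 and WaitingInA2RebootDelay mirror the comment, but the real RebootState, jefe notification handling, and timer plumbing in host-sp-comms differ.

```rust
// Condensed sketch of the reboot_state transitions described above.
#[derive(Debug, PartialEq)]
enum RebootState {
    WaitingForA2,
    WaitingInA2RebootDelay,
}

struct Server {
    reboot_state: Option<RebootState>,
}

impl Server {
    // power_off_host(true) begins a reboot sequence; the starting state
    // depends on whether the sequencer is already in A2.
    fn power_off_host(&mut self, reboot: bool, already_in_a2: bool) {
        if reboot {
            self.reboot_state = Some(if already_in_a2 {
                RebootState::WaitingInA2RebootDelay
            } else {
                RebootState::WaitingForA2
            });
        }
    }

    // jefe notifies us that we've entered A2.
    fn on_entered_a2(&mut self) {
        if self.reboot_state == Some(RebootState::WaitingForA2) {
            self.reboot_state = Some(RebootState::WaitingInA2RebootDelay);
        }
    }

    // The WaitingInA2ToReboot timer fired: go back to A0 and reset.
    fn on_a2_reboot_timer(&mut self) {
        if self.reboot_state == Some(RebootState::WaitingInA2RebootDelay) {
            // (would set the power state to A0 here)
            self.reboot_state = None;
        }
    }
}

fn main() {
    let mut s = Server { reboot_state: None };
    s.power_off_host(true, false);
    s.on_entered_a2();
    s.on_a2_reboot_timer();
    assert_eq!(s.reboot_state, None); // host is on its way back to A0
}
```

Either path through the model ends with reboot_state back at None, matching the argument that a Some(_) state always eventually brings the host back on.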

Collaborator (author): Currently, if the host sends back-to-back "power off" and "reboot" requests, we'll behave ... rather strangely. I wonder if it might be worth adding code to stop listening to IPCC entirely starting at any reboot/powerdown request and continuing up until we believe the host to have been powered back on.

} else {
action = Some(Action::RebootHost);
}
None
}
HostToSp::RequestPowerOff => {
@@ -777,6 +797,11 @@ impl ServerImpl {
for b in &mut self.host_kv_storage.last_boot_fail[n..] {
*b = 0;
}

// Record that this failure happened, which will change the
// behavior of the reboot request we expect to follow shortly.
self.host_boot_just_failed = true;
Contributor: Do we care about the reason (which we're ..'ing out in the match arm), or should all boot failure reasons be treated equally? Maybe a question for @citrus-it

Collaborator (author): It would be my strong preference to have the same behavior regardless of reported reason, but I'm open to changing that if somebody wants to make an argument!

Contributor: We don't ever expect to see any boot failures apart from in the lab, where they're triggered pretty often in my experience. It just takes somebody doing something like configuring net boot and forgetting to start the boot server on the butler, or starting it with a typo in the options. We should probably send a heads-up about this new behaviour.

It is very unlikely that any of the boot failure reasons indicate that there would be a different result if retried, I think they should all be treated equally as requiring intervention.
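The "treat all reasons equally" choice shows up in the code as a `..` pattern that discards the payload. This is a hypothetical sketch (the real HostToSp message and reason types in host-sp-comms differ) of how ignoring the reason looks in a match arm:

```rust
// Hypothetical message types, for illustration only.
#[allow(dead_code)]
enum BootFailReason {
    NetBootTimeout,
    BadImage,
    Other,
}

#[allow(dead_code)]
enum HostToSp {
    BootFail { reason: BootFailReason },
}

// The `..` deliberately discards the reason: every boot failure is
// treated identically, as requiring operator intervention.
fn handle(msg: HostToSp, host_boot_just_failed: &mut bool) {
    match msg {
        HostToSp::BootFail { .. } => *host_boot_just_failed = true,
    }
}

fn main() {
    let mut flag = false;
    handle(
        HostToSp::BootFail { reason: BootFailReason::NetBootTimeout },
        &mut flag,
    );
    assert!(flag);
}
```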

Contributor: FWIW, I did just hit this exact case the other day personally, because I wasn't fast enough in the lab with net boot / was partially distracted. Is there something that makes it very clear from the sequencer state or related that we're in this case, as a human looking at the system?

Collaborator (author): The behavior as implemented in this commit won't affect the sequencer state -- the machine will just power itself down on boot failure and wait to be turned back on. We don't currently have a great way of exposing "the last problem with this was _____" data in production builds. As a user, what sort of experience would you hope for?


Some(SpToHost::Ack)
}
HostToSp::HostPanic { .. } => {
@@ -793,6 +818,10 @@ impl ServerImpl {
for b in &mut self.host_kv_storage.last_panic[n..] {
*b = 0;
}

// Neither set nor clear the host_boot_just_failed flag, since
// this message often immediately follows a boot failure.

Some(SpToHost::Ack)
}
HostToSp::GetStatus => {
@@ -888,6 +917,9 @@ impl ServerImpl {
// Now that all buffer borrowing is done, we can borrow `self` mutably
// again to perform any necessary action.
if let Some(action) = action {
// Defensively clear this flag to avoid carrying stale state.
self.host_boot_just_failed = false;

match action {
Action::RebootHost => self.power_off_host(true),
Action::PowerOffHost => self.power_off_host(false),