-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
host-sp-comms: do not reboot host on boot failure #1618
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -258,6 +258,13 @@ struct ServerImpl { | |
reboot_state: Option<RebootState>, | ||
host_kv_storage: HostKeyValueStorage, | ||
hf_mux_state: Option<HfMuxState>, | ||
|
||
/// Set when we receive a Boot Failed message from the host. This flag | ||
/// changes the behavior in response to the next Request Reboot message. It | ||
/// is cleared at Request Reboot, *or* at any of the messages sent during | ||
/// host boot, to try and keep our state from getting out of sync for | ||
/// whatever reason. | ||
host_boot_just_failed: bool, | ||
} | ||
|
||
impl ServerImpl { | ||
|
@@ -290,6 +297,8 @@ impl ServerImpl { | |
reboot_state: None, | ||
host_kv_storage: HostKeyValueStorage::claim_static_resources(), | ||
hf_mux_state: None, | ||
|
||
host_boot_just_failed: false, | ||
} | ||
} | ||
|
||
|
@@ -701,13 +710,24 @@ impl ServerImpl { | |
|
||
// We defer any actions until after we've serialized our response to | ||
// avoid borrow checker issues with calling methods on `self`. | ||
// | ||
// NOTE: setting any action will also clear the `host_boot_just_failed` | ||
// flag, to avoid keeping incorrect state around. Since we have no | ||
// explicit way of tracking the host boot software's lifecycle, this is | ||
// a compromise. | ||
let mut action = None; | ||
|
||
let response = match request { | ||
HostToSp::_Unused => { | ||
Some(SpToHost::DecodeFailure(DecodeFailureReason::Deserialize)) | ||
} | ||
HostToSp::RequestReboot => { | ||
action = Some(Action::RebootHost); | ||
if core::mem::take(&mut self.host_boot_just_failed) { | ||
// Do _not_ reboot. Reinterpret this as a poweroff request. | ||
action = Some(Action::PowerOffHost); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm happy to look at this more tomorrow, but we might need to address the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good to know, let's talk tomorrow. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I have convinced myself this is okay (where "this" is really "the handling of
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently, if the host sends back-to-back "power off" and "reboot" requests, we'll behave ... rather strangely. I wonder if it might be worth adding code to stop listening to IPCC entirely starting at any reboot/powerdown request and continuing up until we believe the host to have been powered back on. |
||
} else { | ||
action = Some(Action::RebootHost); | ||
} | ||
None | ||
} | ||
HostToSp::RequestPowerOff => { | ||
|
@@ -777,6 +797,11 @@ impl ServerImpl { | |
for b in &mut self.host_kv_storage.last_boot_fail[n..] { | ||
*b = 0; | ||
} | ||
|
||
// Record that this failure happened, which will change the | ||
// behavior of the reboot request we expect to follow shortly. | ||
self.host_boot_just_failed = true; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we care about the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be my strong preference to have the same behavior regardless of reported reason, but I'm open to changing that if somebody wants to make an argument! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't ever expect to see any boot failures apart from in the lab where they're triggered pretty often in my experience. It just takes somebody to do something like configure net boot and forget to start the boot server on the butler, or start it with a typo in the options. We should probably send a heads-up about this new behaviour. It is very unlikely that any of the boot failure reasons indicate that there would be a different result if retried, I think they should all be treated equally as requiring intervention. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, I did just hit this exact case the other day personally because I wasn't fast enough in the lab with net boot/partially distracted. Is there something that makes it very clear from the sequencer state or related that we're in this case as a human looking at the system? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The behavior as implemented in this commit won't affect the sequencer state -- the machine will just power itself down on boot failure and wait to be turned back on. We don't currently have a great way of exposing "the last problem with this was _____" data in production builds. As a user, what sort of experience would you hope for? |
||
|
||
Some(SpToHost::Ack) | ||
} | ||
HostToSp::HostPanic { .. } => { | ||
|
@@ -793,6 +818,10 @@ impl ServerImpl { | |
for b in &mut self.host_kv_storage.last_panic[n..] { | ||
*b = 0; | ||
} | ||
|
||
// Neither set nor clear the host_boot_just_failed flag, for | ||
// this message often immediately follows a boot failure. | ||
|
||
Some(SpToHost::Ack) | ||
} | ||
HostToSp::GetStatus => { | ||
|
@@ -888,6 +917,9 @@ impl ServerImpl { | |
// Now that all buffer borrowing is done, we can borrow `self` mutably | ||
// again to perform any necessary action. | ||
if let Some(action) = action { | ||
// Protectively clear this flag to avoid keeping state. | ||
self.host_boot_just_failed = false; | ||
|
||
match action { | ||
Action::RebootHost => self.power_off_host(true), | ||
Action::PowerOffHost => self.power_off_host(false), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of
mem::take
here is very cute! But, IIUC, since we clear the flag later ifaction
isSome
, we don't actually need to clear it when we read the flag's value here, do we? Since both branches setaction
toSome
regardless?Up to you whether this actually ought to change, but at a first glance it kind of appeared that this was the only case where we clear the flag, so it was a bit less obvious that any branch where
action
isSome
clears the flag (despite the comment). 🤷♀️There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wound up doing both here to make it really really unlikely that we forget to clear the flag. Arguably unnecessary. What do you think would be clearest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe leaving it as is, but commenting that this also gets cleared because we've set
action = Some(...)
? Honestly, I think it's not really a big deal --- mostly, I wanted to make sure my understanding was correct.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that it threw you suggests that I need to make it more clear, so I'll see about adding some comments.