Diagnostic features for update failures #1867

Open
cbiffle opened this issue Sep 12, 2024 · 6 comments
Labels
robustness: Fixing this would improve robustness of deployed firmware
service processor: Related to the service processor.


@cbiffle
Collaborator

cbiffle commented Sep 12, 2024

We had a Sidecar SP fail an update at a customer site today in a rather ambiguous manner. This issue is intended to collect ideas for diagnostic tools that would have helped today, so that we can hopefully build them before this recurs much more.

One possibility is that this is simply an MGS timeout that has drifted out of sync with how long Sidecar takes to boot in practice. We know Sidecar boot is nondeterministic (https://github.com/oxidecomputer/hardware-sidecar/issues/741) so if the timeout is marginal, it could happen rarely for certain units.

Potential root causes I've floated, and tools that might help distinguish them, include:

  • Update was written incorrectly due to corruption, off-by-one, or other bug.
    • The ability to read out, or at least hash, the idle bank would let us check for this.
  • SP rebooted into new image which was written correctly, but something failed during initialization that prevented the network from coming up.
    • It would be really great to be able to read a dump from a previous boot out of the dump area, to see if anything panicked last boot.
  • Sidecar may have taken too long to start up for the timeout in MGS, and this might all be an illusion.
    • MGS may want to revise up that timeout. (I would also argue for making it configurable, for the next time this happens.)
    • We should take a pass over Sidecar startup and check for any optimizations we could make there.
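
As a rough illustration of the first idea (reading out or hashing the idle bank), here is a minimal host-side sketch. It assumes the tooling can obtain either the raw idle-bank bytes or a hash of them. Neither capability exists today, which is the point of this issue. The function names are hypothetical, and FNV-1a is only a stand-in for whatever hash a real tool would use (likely a cryptographic one such as SHA-256).

```rust
/// FNV-1a, 64-bit. Stand-in for illustration only; a real tool would
/// more likely use a cryptographic hash.
fn fnv1a64(data: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hash-only check (hypothetical): the SP reports a hash over the first
/// `expected.len()` bytes of the idle bank and we compare it with the
/// hash of the image we meant to write.
fn hash_matches(expected: &[u8], reported_hash: u64) -> bool {
    fnv1a64(expected) == reported_hash
}

/// Full-readout check (hypothetical): returns the offset of the first
/// byte that differs from the expected image, or `None` if the image is
/// present intact. Bytes in the bank beyond the image length are ignored,
/// since the bank may be larger than the image.
fn first_mismatch(expected: &[u8], readout: &[u8]) -> Option<usize> {
    let common = expected.len().min(readout.len());
    if let Some(i) = (0..common).find(|&i| expected[i] != readout[i]) {
        return Some(i);
    }
    if readout.len() < expected.len() {
        // The readout ended before the image did.
        Some(common)
    } else {
        None
    }
}
```

A mismatch at offset 0 would hint at a shifted (off-by-one style) write, while a mismatch deep in the image would look more like corruption. The hash-only variant can only say that something differs, not where.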

Please add more ideas.

@cbiffle cbiffle added service processor Related to the service processor. robustness Fixing this would improve robustness of deployed firmware labels Sep 12, 2024
@labbott
Collaborator

labbott commented Sep 12, 2024

We need a way, besides the ringbuf, for the RoT to tell us that it triggered a bank swap.

@cbiffle
Collaborator Author

cbiffle commented Sep 12, 2024

Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. oxidecomputer/management-gateway-service#284
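
To make the margin explicit, here is a small sketch of the arithmetic using the numbers above. It assumes the variability adds on top of the observed 26 seconds, which is a worst-case reading; the 26-second boot could equally have been on the slow end. The function name is made up for illustration.

```rust
use std::time::Duration;

/// Headroom left by `timeout` in milliseconds, assuming the worst case is
/// the observed boot time plus the full observed variability.
/// A negative result means the worst case exceeds the timeout.
fn timeout_margin_ms(timeout: Duration, observed_boot: Duration, variability: Duration) -> i128 {
    let worst_case = observed_boot + variability;
    timeout.as_millis() as i128 - worst_case.as_millis() as i128
}
```

With a 30 second timeout, a 26 second boot, and 5 seconds of variability, this gives -1000 ms: under that worst-case reading, a slow boot would overrun the timeout by about a second.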

@labbott
Collaborator

labbott commented Sep 12, 2024

Read out/measure the auxflash (thanks @lzrd for the idea)

@jgallagher
Contributor

Ask the SP what its current time is (this is exclusively upstack work; the MgsRequest::CurrentTime message already exists and is supported on the SP: oxidecomputer/management-gateway-service#283)
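
One way the reported time might be used upstack, sketched under a loud assumption: that the value the SP returns behaves like an uptime counter that restarts at boot. That is not confirmed in this thread, and the types and names below are hypothetical, not the actual MGS API.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum RebootEvidence {
    /// The SP's uptime is short enough that it must have booted after
    /// the reset was requested.
    RebootedSinceRequest,
    /// The SP has been up for longer than the time since the request,
    /// so the reset apparently never took effect.
    NoRebootSinceRequest,
}

/// `sp_uptime` is ASSUMED to be time since SP boot (hypothetical semantics).
/// `since_reset_request` is measured on the MGS side. `slack` absorbs
/// network latency and clock skew between the two.
fn classify_reboot(
    sp_uptime: Duration,
    since_reset_request: Duration,
    slack: Duration,
) -> RebootEvidence {
    if sp_uptime <= since_reset_request + slack {
        RebootEvidence::RebootedSinceRequest
    } else {
        RebootEvidence::NoRebootSinceRequest
    }
}
```

If that assumption holds, a check like this could help separate "the SP rebooted but came up late" from "the SP never rebooted at all" once the SP is reachable again.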

@cbiffle
Collaborator Author

cbiffle commented Sep 12, 2024

In case anyone's investigating cores from this, we also get a crash of both the sequencer and power tasks. These crashes are both deliberate in the code and don't appear to be related, except possibly in causing some of the boot time nondeterminism.

https://github.com/oxidecomputer/hubris/blob/master/task/power/src/bsp/sidecar_bcd.rs#L42C1-L42C33

https://github.com/oxidecomputer/hubris/blob/master/drv/sidecar-seq-server/src/main.rs#L919

@cbiffle
Collaborator Author

cbiffle commented Sep 16, 2024

> Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. oxidecomputer/management-gateway-service#284

@rmustacc pointed out that at least part of the variability will be coming from our accidental hardware random number generator: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/116

One such debugging session, with logs and stuff, here: https://github.com/oxidecomputer/hardware-sidecar/issues/830
