Diagnostic features for update failures #1867
Comments
We need a way, besides the ringbuf, for the RoT to tell us that it triggered a bank swap
Note that the successful SP reboot we observed took 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. See oxidecomputer/management-gateway-service#284
Read out/measure the auxflash (thanks @lzrd for the idea)
Ask the SP what its current time is (this is exclusively upstack work; the
In case anyone's investigating cores from this, we also get a crash of both the sequencer and power tasks. These crashes are both deliberate in the code and don't appear to be related, except possibly in causing some of the boot time nondeterminism. https://github.com/oxidecomputer/hubris/blob/master/drv/sidecar-seq-server/src/main.rs#L919
@rmustacc pointed out that at least part of the variability will be coming from our accidental hardware random number generator: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/116
One such debugging session, with logs and stuff, here: https://github.com/oxidecomputer/hardware-sidecar/issues/830
We had a sidecar SP fail update at a customer site today in a rather ambiguous manner. This issue is intended to collect ideas for diagnostic tools we could have built that would have helped today, so that we can hopefully build them before this reproduces much more.
One possibility is that this is simply an MGS timeout that has drifted out of sync with how long Sidecar takes to boot in practice. We know Sidecar boot is nondeterministic (https://github.com/oxidecomputer/hardware-sidecar/issues/741), so if the timeout is marginal, the failure could happen rarely, and only on certain units.
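To make "marginal" concrete, here is a rough sketch of the arithmetic using the numbers quoted in the comments (a 26 second observed reboot, a 30 second timeout, and 5+ seconds of variability). The function names are hypothetical, and it assumes the variability can add directly to the observed boot time, which we haven't confirmed.

```python
def timeout_margin_s(observed_boot_s: float, timeout_s: float) -> float:
    """Seconds of headroom between one observed boot and the timeout."""
    return timeout_s - observed_boot_s


def is_marginal(observed_boot_s: float, timeout_s: float, variability_s: float) -> bool:
    """True if a boot slower by `variability_s` would reach or exceed the timeout.

    Assumption: the variability is additive on top of the observed boot time.
    """
    return observed_boot_s + variability_s >= timeout_s


if __name__ == "__main__":
    # Numbers taken from the comments on this issue.
    observed, timeout, variability = 26.0, 30.0, 5.0
    print(timeout_margin_s(observed, timeout))        # 4.0 seconds of headroom
    print(is_marginal(observed, timeout, variability))  # True
```

Under that assumption the headroom (4 seconds) is smaller than the variability we've already seen (5+ seconds), which is consistent with a rare timeout on slower-booting units. It doesn't rule out the other root causes.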
Potential root causes I've floated, and tools that might help distinguish them, include:
Please add more ideas.