Sidecar requires a power cycle or RPC call to get out of A2 #1448

Closed
mkeeter opened this issue Jun 28, 2023 · 4 comments
@mkeeter
Collaborator

mkeeter commented Jun 28, 2023

After the thermal loop shut it down (#1441), @bnaecker couldn't power the Sidecar back up:

ben@drteeth ~ $ pilot sp ls -t sidecar
MAC               SERIAL      TYPE    IMAGE            IP
a8:40:25:05:1e:00 BRM23230007 sidecar ac75c9dd4fc33ee7 fe80::aa40:25ff:fe05:1e00
a8:40:25:05:20:00 BRM23230004 sidecar ac75c9dd4fc33ee7 fe80::aa40:25ff:fe05:2000
ben@drteeth ~ $ pilot sp status BRM23230007
BRM23230007        off (A2)
ben@drteeth ~ $ pilot sp on BRM23230007
BRM23230007        ok
ben@drteeth ~ $ pilot sp status BRM23230007
BRM23230007        off (A2)

(chat timestamp)

Instead, we had to manually call humility rpc -c Sequencer.clear_tofino_seq_error, which isn't possible in systems without udprpc (i.e. our production configuration).

It should be possible to get the system out of A2 without this extra RPC call, using the stock faux-mgs power-state subcommand.

@mkeeter mkeeter added this to the FCS milestone Jun 28, 2023
@mkeeter mkeeter changed the title Sidecar requires a power cycle to get out of A2 Sidecar requires a power cycle or RPC call to get out of A2 Jun 28, 2023
@mkeeter
Collaborator Author

mkeeter commented Jun 28, 2023

Oddly, the Sidecar doesn't get stuck in A2 in the general case. On niles, I went from A0 -> A2 -> A0 without any issues. That was admittedly a Sidecar rev B; I haven't tested on a rev C yet.

@arjenroodselaar I wonder whether the troublesome Sidecar did actually experience a power fault somewhere in the process of powering down?

@arjenroodselaar
Contributor

Are you sure you tripped the sequencer during that sequence on niles? Alternatively, if the sequencer task gets restarted, it clears any outstanding faults. Did you trip the task or SP by chance, causing an implicit clear?

match &server.tofino.sequencer.status().unwrap().abort {

Either way, yes, adding a faux-mgs call is appropriate, since we'll need to be able to do this manually on chassis not running udprpc.
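
To make the failure mode discussed so far concrete, here is a minimal sketch of a latched sequencer fault. It is a toy model, not the Hubris sequencer code: `Sequencer`, `PowerState`, `PowerOnError`, and every method name are hypothetical, and it assumes the fault is a simple latch that a task restart resets. One simplification to note: in the report above the power-on request returned "ok" while the state stayed at A2, whereas this model surfaces the refusal as an explicit error.

```rust
// Hypothetical model of a latched sequencer fault. None of these names
// come from the real firmware; they exist only for illustration.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PowerState {
    A2,
    A0,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PowerOnError {
    /// An earlier abort is still latched, so power-on is refused.
    FaultLatched,
}

#[derive(Debug)]
struct Sequencer {
    state: PowerState,
    fault_latched: bool,
}

impl Sequencer {
    /// Models a fresh task start (assumption: starts in A2, no fault).
    fn new() -> Self {
        Sequencer { state: PowerState::A2, fault_latched: false }
    }

    /// Models an abort such as a thermal shutdown: drop to A2 and latch.
    fn abort(&mut self) {
        self.state = PowerState::A2;
        self.fault_latched = true;
    }

    /// Models the explicit "clear error" call.
    fn clear_error(&mut self) {
        self.fault_latched = false;
    }

    /// Plain power-on: refused while a fault is latched.
    fn power_on(&mut self) -> Result<(), PowerOnError> {
        if self.fault_latched {
            return Err(PowerOnError::FaultLatched);
        }
        self.state = PowerState::A0;
        Ok(())
    }

    /// Sketch of the requested behavior: the power-on request itself
    /// clears any stale fault before sequencing up.
    fn power_on_clearing_fault(&mut self) -> Result<(), PowerOnError> {
        self.clear_error();
        self.power_on()
    }
}

fn main() {
    let mut seq = Sequencer::new();
    seq.power_on().unwrap();

    // After an abort, a plain power-on leaves the system in A2.
    seq.abort();
    assert_eq!(seq.power_on(), Err(PowerOnError::FaultLatched));
    assert_eq!(seq.state, PowerState::A2);

    // Path 1: an explicit clear, then power-on.
    seq.clear_error();
    assert_eq!(seq.power_on(), Ok(()));

    // Path 2: a task restart implicitly clears the latch.
    seq.abort();
    seq = Sequencer::new();
    assert_eq!(seq.power_on(), Ok(()));

    // Path 3: a power-on request that clears the fault itself.
    seq.abort();
    assert_eq!(seq.power_on_clearing_fault(), Ok(()));
    println!("final state: {:?}", seq.state);
}
```

In this model, an A0 -> A2 -> A0 round trip that never latches a fault succeeds without any extra call, and so does one where the task restarts in between. Either could account for a test that shows no problem.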

@arjenroodselaar
Contributor

A task to track a more permanent solution: #1457.

@arjenroodselaar
Contributor

With the diff above I think this is sufficiently solved for the short term. We'll work on more involved automation and error reporting in upcoming releases.
