Disable transceivers which reply with too many NACKs #1441
This broadly looks good, but I've left a couple thoughts on where I think we should make changes.
drv/transceivers-server/src/main.rs
Outdated
}
for (step, f) in [
    Transceivers::assert_reset,
    Transceivers::assert_lpmode,
While asserting LPMode makes sense, in reality driving LPMode to some modules while they're unpowered causes other 3V3 issues, so we want this step to be `deassert_lpmode`. Our investigations of this on the board: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/47#issuecomment-1329846157
thanks, I remembered the leakage issue but got the sign wrong here!
drv/transceivers-server/src/udp.rs
Outdated
/// This function reads a `ModuleResultNoFailure` and populates error
/// information at the end of the trailing data buffer. This means it should
/// be called as the last operation before sending the response. For results
/// where a `ModuleResult` is returned, use sandle_errors_and_failures
- /// where a `ModuleResult` is returned, use sandle_errors_and_failures
+ /// where a `ModuleResult` is returned, use handle_errors_and_failures
Fixed, thanks!
drv/transceivers-server/src/udp.rs
Outdated
@@ -742,6 +845,63 @@ impl ServerImpl {
        (count, desired_result)
    }

    fn get_status_v2(
Just putting a reminder here to rename this to align with w/e we land on.
Yes, renamed to get_extended_status
538eddf to 7f8c8c4
I think this and oxidecomputer/transceiver-control#113 are ready to go. I tested on the
(it's probabilistic, since we have to have 3 failures in a row) Once ports are disabled, we see it in
We get reasonable errors from
(time passes, port 1 also gets disabled by Hubris)
Resetting the |
Thanks for tackling this!
A few style nits, but all functionality LGTM!
drv/transceivers-server/src/main.rs
Outdated
            }
        }
    }
    Err(e) => {
        ringbuf_entry!(Trace::TemperatureReadUnexpectedError(i, e));
    }
}

self.consecutive_nacks[i] = if got_nack {
    self.consecutive_nacks[i].wrapping_add(1)
It seems like we'd want to saturate instead of wrap, though I hope that's pretty unlikely in any case.
Good catch, I failed my coin flip. Fixed in 5100704
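For reference, a minimal sketch of the difference discussed above. The helper name `update_nack_count`, the `u8` counter width, and the reset-to-zero branch are assumptions for illustration; only the `wrapping_add` call appears in the diff.

```rust
/// Hypothetical helper (not from the PR): update a per-port counter of
/// consecutive NACKs. The `u8` width is an assumption.
pub fn update_nack_count(current: u8, got_nack: bool) -> u8 {
    if got_nack {
        // Saturating: a counter already at u8::MAX stays there.
        // With wrapping_add it would roll over to 0 and the port
        // would look healthy again.
        current.saturating_add(1)
    } else {
        // Assumed: any non-NACK result resets the streak.
        0
    }
}
```

With wrapping arithmetic, `255u8.wrapping_add(1)` is `0`, which is why a very long NACK streak could silently reset the count.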
@@ -600,9 +649,20 @@ impl ServerImpl {
        // any modules at index 32->63 are not currently supported.
        let invalid_modules = ModuleId(0xffffffff00000000);
There's an `all_sidecar()` helper function to return the lower 32 bits, so you could use that and negate it.
I don't see a function with this name anywhere in Hubris, can you point me to it?
Ah, sorry for the confusion. It’s a public constructor on the ModuleId type.
Hmmm, I see |
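A rough sketch of the suggestion in this thread, using a stand-in type. The `ModuleId` below is a local placeholder, not the real type; that `all_sidecar()` returns a mask of the lower 32 bits is taken from the review comment, and the `Not` implementation is an assumption made so the negation compiles.

```rust
/// Stand-in for the real `ModuleId`; only the shape matters here.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct ModuleId(pub u64);

impl ModuleId {
    /// Assumed behavior, per the review comment: lower 32 bits set.
    pub const fn all_sidecar() -> Self {
        ModuleId(0x0000_0000_ffff_ffff)
    }
}

/// Assumption: bitwise negation is available on the type.
impl core::ops::Not for ModuleId {
    type Output = Self;
    fn not(self) -> Self {
        ModuleId(!self.0)
    }
}

/// Modules at index 32..=63, expressed without a hand-written literal.
pub fn invalid_modules() -> ModuleId {
    !ModuleId::all_sidecar()
}
```

Under these assumptions the result matches the literal `ModuleId(0xffffffff00000000)` from the diff.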
This PR adds support for ports disabled by Hubris (see oxidecomputer/hubris#1441)
- Adds a wider `ExtendedStatus` type and message, which includes a bit for `DISABLED_BY_SP` (and 23 spare bits, just in case)
- Uses the `ExtendedStatus` message in `xcvradm`
- Adds `HwError::DisabledBySp` and `HostRequest::ClearDisableLatch`
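To make the "23 spare bits" arithmetic concrete, here is a sketch of one layout that would produce it. Everything about the layout is an assumption: a 32-bit word, an 8-bit legacy status in the low byte, and `DISABLED_BY_SP` at bit 8. The real `ExtendedStatus` may be arranged differently.

```rust
/// Hypothetical 32-bit status word; not the real `ExtendedStatus`.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct ExtendedStatus(pub u32);

impl ExtendedStatus {
    /// Assumed: the older, narrower status occupies the low 8 bits.
    pub const LEGACY_MASK: u32 = 0xff;
    /// Assumed position for the new flag: first bit above the legacy byte.
    pub const DISABLED_BY_SP: u32 = 1 << 8;
    /// Everything else is spare: 32 - 8 - 1 = 23 bits.
    pub const SPARE_MASK: u32 = !(Self::LEGACY_MASK | Self::DISABLED_BY_SP);

    /// Widen a legacy status byte and optionally set the new flag.
    pub fn from_legacy(status: u8, disabled_by_sp: bool) -> Self {
        let mut bits = u32::from(status);
        if disabled_by_sp {
            bits |= Self::DISABLED_BY_SP;
        }
        ExtendedStatus(bits)
    }

    /// True if the (assumed) `DISABLED_BY_SP` bit is set.
    pub fn is_disabled_by_sp(self) -> bool {
        (self.0 & Self::DISABLED_BY_SP) != 0
    }
}
```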
56f4c85 to 78c735e
We track NACKs on a per-port basis. If we see too many (3), then we disable the port at the Hubris level.
When disabled:
- `HwError::DisabledBySp` error type
- `StatusV2` request/response includes `StatusV2::DISABLED_BY_SP` in the bitfield
Ports must be explicitly re-enabled with the `HostRequest::ClearDisableLatch`.
Open questions:
- `HwError::DisabledBySp` and `HostRequest::ClearDisableLatch`, since I don't love either of them
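The behavior described above (per-port NACK tracking, disable after 3 in a row, explicit re-enable) can be sketched roughly as follows. This is not the code from the PR: `NackTracker`, its fields, the 32-port count, and the method names are hypothetical, and `HwError` here is a local stand-in with only the one variant named in the description.

```rust
/// Assumed port count and the threshold from the description.
pub const NUM_PORTS: usize = 32;
pub const MAX_CONSECUTIVE_NACKS: u8 = 3;

/// Local stand-in for the real error type.
#[derive(Debug, PartialEq, Eq)]
pub enum HwError {
    DisabledBySp,
}

/// Hypothetical tracker: one NACK counter per port plus a disable latch.
pub struct NackTracker {
    consecutive_nacks: [u8; NUM_PORTS],
    disabled: u32, // bit N set means port N is latched off
}

impl NackTracker {
    pub fn new() -> Self {
        Self { consecutive_nacks: [0; NUM_PORTS], disabled: 0 }
    }

    /// Record the outcome of one transaction on `port` (must be < NUM_PORTS).
    pub fn record(&mut self, port: usize, got_nack: bool) {
        let count = self.consecutive_nacks[port];
        let next = if got_nack { count.saturating_add(1) } else { 0 };
        self.consecutive_nacks[port] = next;
        if next >= MAX_CONSECUTIVE_NACKS {
            // Latch: a later success does not clear this bit.
            self.disabled |= 1u32 << port;
        }
    }

    /// Requests against a latched port fail until the latch is cleared.
    pub fn check(&self, port: usize) -> Result<(), HwError> {
        if (self.disabled & (1u32 << port)) != 0 {
            Err(HwError::DisabledBySp)
        } else {
            Ok(())
        }
    }

    /// Counterpart of a `ClearDisableLatch` request for a single port.
    pub fn clear_disable_latch(&mut self, port: usize) {
        self.disabled &= !(1u32 << port);
        self.consecutive_nacks[port] = 0;
    }
}
```

The latch is kept separate from the counter so that a port stays disabled even if a later read happens to succeed, which matches the "must be explicitly re-enabled" wording.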