Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable transceivers which reply with too many NACKs #1441

Merged
merged 18 commits into from
Jun 28, 2023

Conversation

mkeeter
Copy link
Collaborator

@mkeeter mkeeter commented Jun 26, 2023

We track NACKs on a per-port basis. If we see too many (3), then we disable the port at the Hubris level.

When disabled:

  • The port is powered off, reset is asserted, and lpmode is not asserted (to prevent leakage)
  • UDP messages to change power, reset, and lpmode are rejected with a new HwError::DisabledBySp error type
  • Ports are skipped in the thermal loop
  • The new StatusV2 request/response includes StatusV2::DISABLED_BY_SP in the bitfield

Ports must be explicitly re-enabled with the HostRequest::ClearDisableLatch.

Open questions:

  • Bikeshedding about names for HwError::DisabledBySp and HostRequest::ClearDisableLatch, since I don't love either of them
  • Should we also reject the Idol APIs to change transceiver state when disabled? Probably!

@mkeeter mkeeter marked this pull request as draft June 26, 2023 21:22
Cargo.toml Outdated Show resolved Hide resolved
Copy link
Contributor

@Aaron-Hartwig Aaron-Hartwig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This broadly looks good, but I've left a couple thoughts on where I think we should make changes.

}
for (step, f) in [
Transceivers::assert_reset,
Transceivers::assert_lpmode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While asserting LPMode makes sense, in reality driving LPMode to some modules while they're unpowered causes other 3V3 issues, so we want this step to be deassert_lpmode. Our investigations of this on the board: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/47#issuecomment-1329846157

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I remembered the leakage issue but got the sign wrong here!

/// This function reads a `ModuleResultNoFailure` and populates error
/// information at the end of the trailing data buffer. This means it should
/// be called as the last operation before sending the response. For results
/// where a `ModuleResult` is returned, use sandle_errors_and_failures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// where a `ModuleResult` is returned, use sandle_errors_and_failures
/// where a `ModuleResult` is returned, use handle_errors_and_failures

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

drv/transceivers-server/src/udp.rs Show resolved Hide resolved
@@ -742,6 +845,63 @@ impl ServerImpl {
(count, desired_result)
}

fn get_status_v2(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just putting a reminder here to rename this to align with w/e we land on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, renamed to get_extended_status

@mkeeter
Copy link
Collaborator Author

mkeeter commented Jun 27, 2023

I think this and oxidecomputer/transceiver-control#113 are read to go.

I tested on the niles Sidecar by adding a Chaos Monkey that gives a 50% change of failing every transceiver temperature read operation. Sure enough, this caused it to disable transceivers after some amount of time:

 NDX LINE      GEN    COUNT PAYLOAD
   4  336        1        1 GetInterfaceError(0x0, NotInitialized)
   5  336        1        1 GetInterfaceError(0x1, NotInitialized)
   6  336        1        1 GetInterfaceError(0x0, NotInitialized)
   7  336        1        1 GetInterfaceError(0x1, NotInitialized)
   8  336        1        1 GetInterfaceError(0x0, NotInitialized)
   9  336        1        1 GetInterfaceError(0x1, NotInitialized)
  10  336        1        1 GetInterfaceError(0x0, NotInitialized)
  11  336        1        1 GetInterfaceError(0x1, NotInitialized)
  12  295        1        1 GotInterface(0x0, Sff8636)
  13  295        1        1 GotInterface(0x1, Sff8636)
  14  408        1        2 TemperatureReadError(0x0, I2cAddressNack)
  15  408        1        1 TemperatureReadError(0x1, I2cAddressNack)
   0  408        2        1 TemperatureReadError(0x0, I2cAddressNack)
   1  461        2        1 DisablingPorts(LogicalPortMask(0x1))
   2  408        2       33 TemperatureReadError(0x1, I2cAddressNack)
   3  461        2        1 DisablingPorts(LogicalPortMask(0x2))

(it's probabilistic, since we have to have 3 failures in a row)

Once ports are disabled, we see it in xcvradm:

matt@niles ~ () $ ./xcvradm -iaxf4 -pfe80::aa40:25ff:fe05:ff00 -t0,1 status
Port Status
   0 PRESENT | RESET | INTERRUPT | DISABLED_BY_SP
   1 PRESENT | RESET | INTERRUPT | DISABLED_BY_SP

We get reasonable errors from xcvradm when performing Forbidden Operations:

matt@niles ~ () $ ./xcvradm -iaxf4 -pfe80::aa40:25ff:fe05:ff00 -t0,1 read-lower --sff 0 4
Port Data
   1 [0x11,0x08,0x00,0xf0]
Some operations failed, errors below
Port Error
   0 Hardware error accessing module 0: Port is disabled by the SP

(time passes, port 1 also gets disabled by Hubris)

matt@niles ~ () $ ./xcvradm -iaxf4 -pfe80::aa40:25ff:fe05:ff00 -t0,1 enable-power
Some operations failed, errors below
Port Error
   0 Hardware error accessing module 0: Port is disabled by the SP
   1 Hardware error accessing module 1: Port is disabled by the SP

Resetting the disabled flag with xcvradm clear-disable-latch allows us to power the transceivers back on and read from them, although they power off stochastically due to the Chaos Monkey.

Copy link
Contributor

@Aaron-Hartwig Aaron-Hartwig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this!

@mkeeter mkeeter marked this pull request as ready for review June 27, 2023 22:05
Copy link
Contributor

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few style nits, but all functionality LGTM!

drv/transceivers-server/src/main.rs Outdated Show resolved Hide resolved
}
}
}
Err(e) => {
ringbuf_entry!(Trace::TemperatureReadUnexpectedError(i, e));
}
}

self.consecutive_nacks[i] = if got_nack {
self.consecutive_nacks[i].wrapping_add(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we'd want to saturate instead of wrap, though I hope that's pretty unlikely in any case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I failed my coin flip. Fixed in 5100704

drv/transceivers-server/src/udp.rs Outdated Show resolved Hide resolved
@@ -600,9 +649,20 @@ impl ServerImpl {
// any modules at index 32->63 are not currently supported.
let invalid_modules = ModuleId(0xffffffff00000000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an all_sidecar() helper function to return the lower 32 bits, so you could use that and negate it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a function with this name anywhere in Hubris, can you point me to it?

drv/transceivers-server/src/udp.rs Outdated Show resolved Hide resolved
@bnaecker
Copy link
Contributor

bnaecker commented Jun 28, 2023 via email

@mkeeter
Copy link
Collaborator Author

mkeeter commented Jun 28, 2023

Hmmm, I see ModuleId::all(), but that actually sets every bit, including the invalid ones. Anyways, the 0xffffffff00000000 pattern was pre-existing, so I'm fine leaving it as-is.

mkeeter added a commit to oxidecomputer/transceiver-control that referenced this pull request Jun 28, 2023
This PR adds support for ports disabled by Hubris
(see oxidecomputer/hubris#1441)

- Adds a wider `ExtendedStatus` type and message, which includes a bit for
  `DISABLED_BY_SP` (and 23 spare bits, just in case)
- Uses the `ExtendedStatus` message in `xcvradm`
- Adds `HwError::DisabledBySp` and `HostRequest::ClearDisableLatch`
@bnaecker
Copy link
Contributor

@mkeeter Yeah, that's fine it's a nit. Here's the method I was referring to.

@mkeeter mkeeter merged commit b752985 into master Jun 28, 2023
65 checks passed
@mkeeter mkeeter deleted the transceiver-disable branch June 28, 2023 18:14
Aaron-Hartwig added a commit that referenced this pull request Aug 9, 2023
Building on the module disabling feature introduced in #1441, this PR lets that disable get cleared when a disabled module is removed. Now the SP will enable power at the port again, so when a module is inserted it will get powered without intervention from the host.

Fixes #1480
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants