Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SP can come up with a self-assigned MAC address #1034

Closed
jgallagher opened this issue Dec 19, 2022 · 3 comments · Fixed by #1171
Closed

SP can come up with a self-assigned MAC address #1034

jgallagher opened this issue Dec 19, 2022 · 3 comments · Fixed by #1171
Assignees
Milestone

Comments

@jgallagher
Copy link
Contributor

While debugging a presumably-unrelated issue on a gimlet in the compliance rack, we used faux-mgs to reset the SP over the management network via

faux-mgs \
    --listen-addr '[::]:0' \
    --interface axf1 \
    --discovery-addr '[fe80::aa40:25ff:fe04:10c]:11111' \
    reset

The reset took effect immediately (as evidenced by the fans ceasing their screaming), but faux-mgs did not see the SP come back online, and subsequent attempts to talk to it all timed out. After some frantic hair-pulling, we learned that the SP had actually come back fine, but had assigned itself a different MAC/IP address. After sending it a second reset (to its new address):

faux-mgs \
    --listen-addr '[::]:0' \
    --interface axf1 \
    --discovery-addr '[fe80::0c1d:57ff:fe6c:7c38]:11111' \
    reset

it came back with its original, expected MAC / IP address (fe80::aa40:25ff:fe04:10c).

@mkeeter
Copy link
Collaborator

mkeeter commented Dec 19, 2022

Note that the first reset occurred when the front I2C bus was locked, and the VPD is also on the front I2C bus.

@mkeeter mkeeter added this to the MVP milestone Feb 6, 2023
@jgallagher
Copy link
Contributor Author

This happened on sn5 in the lab today; dump below. Of note, control-plane-agent is reporting the correct MAC address, which probably means when the SP first booted, net had the right MAC / IP address, but after the most recent watchdog reset (of which there have been 12000+), net failed to read it and fell back to self-assigned.

hubris.core.0.tar.gz

@bcantrill bcantrill self-assigned this Feb 25, 2023
@bcantrill
Copy link
Collaborator

So, I think the the issue that we saw on sn5 is the front bus lockup as described in oxidecomputer/hardware-gimlet#1815; from the ringbuf for thermal:

   0  586    24310        1 MiscReadFailed(SensorId(0x2), I2cError(BusLockedMux))
   1  586    24310        1 MiscReadFailed(SensorId(0x1), I2cError(BusLockedMux))
   2  586    24310        1 MiscReadFailed(SensorId(0x0), I2cError(BusLockedMux))
   3  586    24310        1 MiscReadFailed(SensorId(0x2), I2cError(BusLockedMux))
   4  586    24310        1 MiscReadFailed(SensorId(0x1), I2cError(BusLockedMux))
   5  586    24310        1 MiscReadFailed(SensorId(0x0), I2cError(BusLockedMux))
   6  586    24310        1 MiscReadFailed(SensorId(0x2), I2cError(BusLockedMux))
   7  586    24310        1 MiscReadFailed(SensorId(0x1), I2cError(BusLockedMux))
   8  586    24310        1 MiscReadFailed(SensorId(0x0), I2cError(BusLockedMux))

And we are bouncing the controllers -- it isn't helping:

% humility -d ./hubris.core.0 ringbuf i2c_server
humility: attached to dump
humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
   6  175    22221        1 Reset(I2C2, PortIndex(0x1))
   7  259    22221        1 Error
   0  175    22222        1 Reset(I2C2, PortIndex(0x1))
   1  259    22222        1 Error
   2  175    22222        1 Reset(I2C2, PortIndex(0x1))
   3  310    22222        8 Error
   4  259    22222        1 Error
   5  175    22222        1 Reset(I2C2, PortIndex(0x1))

(Note in particular the very high generation count there.)

sn5 is (was) running bits that pre-date #1137. This machine has also seen this problem the most frequently (or at least, in the running with the machine in @kc8apf's basement, which has also seen this on a several-times-a-month cadence), so it is going to be interesting to see this on bits that post-date #1137.

That said, @jclulow also saw this issue -- and importantly, saw it on a machine that has the capacity for in situ dumps (#1055/#1099). And the
ring buffer frtom i2c_server is very revealing:

 NDX LINE      GEN    COUNT PAYLOAD
   0  508        1        3 SegmentFailed(ControllerLocked)
   1  323        1        1 Error
   2  175        1        1 Reset(I2C2, PortIndex(0x1))
   3  323        1      949 Error

What this is telling us is potentially consistent with the bad MAC: we tried to configure the muxes on the front bus, and each time it failed indicating that the controller is busy and is refusing to become available. This is normally a condition under which we would reset the controller, but we don't in the case of configuring the muxes -- so the first transaction on this bus (which, empirically from an analyzer on a normal boot, at least can be reading the VPD) would get an error due to the locked controller (and the controller would also be reset). Provided the condition that necessitated the reset was addressed by the reset, subsequent reads would work without incident, which also matches what we saw on BRM42220006 (and on other machines that have seen this).

Now, a little bit of a twist: we appear to hit this very often when we reset the system -- and always this bus (the front bus), and always cleared by a reset. The race appears to be what the first transaction is after boot -- and it is (very often) the thermal loop and/or the power loop (either of which will absorb a single failed I2C transaction without much visible effect). And indeed, to hit this reliably, one can modify the power and thermal tasks (the other two tasks that hit this bus) to sleep for a period before starting, assuring that the first transaction on the front bus is to read the Gimlet VPD (which fails -- and isn't retried).

So why is the first transaction on this bus causing a reset? That remains a mystery, and here is where it gets weird: on the analyzer, the bus looks fine -- up until the point that wait_until_notbusy reads the ISR, after which SCL will remain low until the I2C block is reset. (!!) For example, here is the image on the logic analyzer of this with the reset added once the first mux transaction fails:

2023-02-26-23-05-00_2561x755

Note the 2.14 ms time that SCL is low corresponds to the hl::sleep_for(2) plus the 300 spins. (If we eliminate the call to wait_until_notbusy entirely, we see the same behavior when we first start the I2C transaction -- with SCL going down again until the block is reset.)

It gets weirder: while entirely reproducible, this is also frustratingly sensitive to timing. If a sleep of one millisecond (or more) is inserted before muxes are configured, it goes away. More surprisingly, when the kernel is compiled with traptrace, it goes away. (Indeed, just loading an empty events table somehow makes it go away!) If the order in which the muxes are processed is reversed, it goes away -- and if the order in which pins are processed is reversed, the sleep time to make it go away rises to ~20 ms. (!)

That last data point is especially weird, and it feels like something might be going on with the port muxing -- perhaps similar
to what @cbiffle discovered in #1137. (Of note, this bus shares a controller (I2C2) with the mid bus.) This may also be
related to #1126 and/or #1122.

We should strive to understand why the I2C controller is becoming confused, but in the meantime, a reasonable workaround is to do what we should have done anyway: if, when configuring muxes, we encounter a condition that necessitates a controller reset, we should reset the controller.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants