SP can come up with a self-assigned MAC address #1034
Note that the first reset occurred when the front I2C bus was locked, and the VPD is also on the front I2C bus.
This happened on sn5 in the lab today; dump below. Of note,
So, I think the issue that we saw on sn5 is the front bus lockup as described in oxidecomputer/hardware-gimlet#1815; from the ringbuf for
And we are bouncing the controllers -- it isn't helping:

    % humility -d ./hubris.core.0 ringbuf i2c_server
    humility: attached to dump
    humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
     NDX LINE    GEN COUNT PAYLOAD
       6  175  22221     1 Reset(I2C2, PortIndex(0x1))
       7  259  22221     1 Error
       0  175  22222     1 Reset(I2C2, PortIndex(0x1))
       1  259  22222     1 Error
       2  175  22222     1 Reset(I2C2, PortIndex(0x1))
       3  310  22222     8 Error
       4  259  22222     1 Error
       5  175  22222     1 Reset(I2C2, PortIndex(0x1))

(Note in particular the very high generation count there.) sn5 is (was) running bits that pre-date #1137. This machine has also seen this problem the most frequently (or at least, it is in the running with the machine in @kc8apf's basement, which has also seen this on a several-times-a-month cadence), so it is going to be interesting to see whether this recurs on bits that post-date #1137. That said, @jclulow also saw this issue -- and importantly, saw it on a machine that has the capacity for in situ dumps (#1055/#1099). And the
What this is telling us is potentially consistent with the bad MAC: we tried to configure the muxes on the front bus, and each time it failed, indicating that the controller is busy and refusing to become available. This is normally a condition under which we would reset the controller, but we don't in the case of configuring the muxes -- so the first transaction on this bus (which, empirically from an analyzer on a normal boot, at least can be reading the VPD) would get an error due to the locked controller (and the controller would also be reset). Provided the condition that necessitated the reset was addressed by the reset, subsequent reads would work without incident, which also matches what we saw on BRM42220006 (and on other machines that have seen this).

Now, a little bit of a twist: we appear to hit this very often when we reset the system -- and always on this bus (the front bus), and always cleared by a reset. The race appears to be over what the first transaction after boot is -- and it is (very often) the thermal loop and/or the power loop (either of which will absorb a single failed I2C transaction without much visible effect). And indeed, to hit this reliably, one can modify the

So why is the first transaction on this bus causing a reset? That remains a mystery, and here is where it gets weird: on the analyzer, the bus looks fine -- up until the point that

Note that the 2.14 ms time that SCL is low corresponds to the

It gets weirder: while entirely reproducible, this is also frustratingly sensitive to timing. If a sleep of one millisecond (or more) is inserted before the muxes are configured, it goes away. More surprisingly, when the kernel is compiled with

That last data point is especially weird, and it feels like something might be going on with the port muxing -- perhaps similar

We should strive to understand why the I2C controller is becoming confused, but in the meantime, a reasonable workaround is to do what we should have done anyway: if, when configuring muxes, we encounter a condition that necessitates a controller reset, we should reset the controller.
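To make that workaround concrete, here is a minimal sketch in Rust of the intended control flow. The names here (`configure_mux`, `ControllerError::BusLocked`, `reset`) are hypothetical stand-ins, not the actual Hubris `drv-stm32xx-i2c` API: if configuring a mux fails with a condition that would normally necessitate a controller reset, reset the controller and retry, rather than leaving the locked controller for the first unlucky transaction to trip over.

```rust
/// Illustrative sketch only; the types and error variants are hypothetical
/// stand-ins for whatever the real i2c-server driver uses internally.
#[allow(dead_code)]
#[derive(Debug)]
enum ControllerError {
    /// The controller is busy/locked and needs a reset to recover.
    BusLocked,
    /// Any other failure that does not warrant a reset.
    Other,
}

struct Controller;

impl Controller {
    fn configure_mux(&mut self) -> Result<(), ControllerError> {
        // In the real driver this would program the mux over I2C; here it
        // just stands in for that operation (and always reports a lockup).
        Err(ControllerError::BusLocked)
    }

    fn reset(&mut self) {
        // Stand-in for resetting the I2C controller peripheral.
    }
}

/// Configure the mux; if we hit a condition that necessitates a controller
/// reset, actually reset the controller and retry once, instead of leaving
/// the locked controller for the first real transaction on the bus.
fn configure_mux_with_reset(ctrl: &mut Controller) -> Result<(), ControllerError> {
    match ctrl.configure_mux() {
        Err(ControllerError::BusLocked) => {
            ctrl.reset();
            ctrl.configure_mux()
        }
        other => other,
    }
}

fn main() {
    let mut ctrl = Controller;
    // With the stand-in above this still fails after one reset/retry, but
    // it demonstrates the control flow the workaround is proposing.
    let _ = configure_mux_with_reset(&mut ctrl);
}
```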
While debugging a presumably-unrelated issue on a gimlet in the compliance rack, we used faux-mgs to reset the SP over the management network via
The reset took effect immediately (as evidenced by the fans ceasing their screaming), but faux-mgs did not see the SP come back online, and subsequent attempts to talk to it all timed out. After some frantic hair-pulling, we learned that the SP had actually come back fine, but had assigned itself a different MAC/IP address. After sending it a second reset (to its new address):
it came back with its original, expected MAC / IP address (fe80::aa40:25ff:fe04:10c).
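For context on why a self-assigned MAC shows up as a different address: the fe80::aa40:25ff:fe04:10c address above has the shape of the standard EUI-64 link-local construction, in which the address is derived from the MAC by flipping the universal/local bit of the first octet and inserting ff:fe between the two halves. The sketch below illustrates that mapping; the specific MAC bytes are inferred by reversing the construction from the address above, not taken from the dump.

```rust
/// Derive a link-local IPv6 address from a 48-bit MAC using the standard
/// EUI-64 construction: flip the universal/local bit of the first octet and
/// insert 0xff, 0xfe between the two halves of the MAC.
fn link_local_from_mac(mac: [u8; 6]) -> [u16; 8] {
    let eui64 = [
        mac[0] ^ 0x02, // flip the universal/local bit
        mac[1],
        mac[2],
        0xff,
        0xfe,
        mac[3],
        mac[4],
        mac[5],
    ];
    let mut addr = [0u16; 8];
    addr[0] = 0xfe80; // link-local prefix
    for i in 0..4 {
        addr[4 + i] = u16::from_be_bytes([eui64[2 * i], eui64[2 * i + 1]]);
    }
    addr
}

fn main() {
    // MAC inferred by reversing the EUI-64 mapping from the address quoted
    // above; it is illustrative, not read from the machine.
    let mac = [0xa8, 0x40, 0x25, 0x04, 0x01, 0x0c];
    let groups: Vec<String> = link_local_from_mac(mac)
        .iter()
        .map(|g| format!("{:x}", g))
        .collect();
    // Prints fe80:0:0:0:aa40:25ff:fe04:10c, i.e. fe80::aa40:25ff:fe04:10c.
    println!("{}", groups.join(":"));
}
```

A different self-assigned MAC therefore produces a different interface identifier, which is why faux-mgs could no longer reach the SP at its expected address until the second reset.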