Rev-C gimlet came up with a self-assigned IP address / corrupt VPD #1413
A hubris core in this state is at
Reading the VPD manually (with
All of the header checksums are correct; I didn't bother to check the body checksums. However, notice that the root header checksum ( This is surprising; we'd expect at least one of them to be correct. Doing a bit of math,
The entire header is read in a single call to Summarizing the observations above, we are consistently reading the incorrect header
instead of the correct header
It's surprising that exactly two bytes are corrupted repeatably. I've got a sneaking suspicion about what's going on. There are many AT24CSW080 EEPROMs in the system, all with the same I2C address (0x50), and all behind different muxes. In theory, the mux code ensures that we only talk to one at a time. Reading the VPD header for the Sharkfin A FRU, we see
Here's the fun part: if we take the bitwise AND of this header and our correct header, we get the incorrect header (!!). This suggests that the local VPD and at least one sharkfin VPD are talking simultaneously during the header-read transaction. The eye of Sauron turns to our I2C mux initialization code...
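The bitwise-AND observation follows from I2C being an open-drain bus: any device can pull SDA low, so when two devices transmit at once the controller sees a 0 wherever either one sent a 0. A minimal sketch of that check is below. The function names and the byte values in the usage note are made up for illustration; they are not the headers from this issue.

```rust
/// Models two open-drain I2C devices driving SDA during the same read.
/// The line reads 0 whenever either device pulls it low, so each byte the
/// controller sees is the bitwise AND of the bytes the two devices sent.
/// (Hypothetical helper; not part of Hubris.)
fn wired_and(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b.iter()).map(|(x, y)| x & y).collect()
}

/// Returns true if `observed` is exactly what a simultaneous read of `a`
/// and `b` would produce on a shared open-drain bus.
fn explained_by_collision(observed: &[u8], a: &[u8], b: &[u8]) -> bool {
    observed.len() == a.len() && a.len() == b.len() && wired_and(a, b) == observed
}
```

With illustrative inputs `[0x53, 0xff]` and `[0xef, 0x0f]`, `wired_and` yields `[0x43, 0x0f]`: a collision can only clear bits, never set them, which is a quick way to rule the hypothesis in or out for a given corrupt read.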
This is a thing which "cannot happen" in I2C. The I2C target device is supposed to sense what is on SDA vs what it is driving and abort its transaction if it is not winning (trying to source a 1 by not driving the line and sensing a 0).
If the mux is a pass-gate device, then you can still get weirdness when multiple ports are enabled depending on the value and placement of the pull-up resistors.
In looking at this, a very important piece of evidence is @mkeeter's determination that we had two segments enabled simultaneously. The muxes on the front bus are all PCA9545s; leaving aside the possibility of a stray write to a mux (which is to say: arbitrary data corruption), our code has no ability to enable multiple segments simultaneously. But the front bus does have multiple PCA9545 muxes; is it possible for our code to enable multiple muxes simultaneously?
As it turns out, this is possible under certain error conditions during otherwise routine operation: if we cannot write to a mux to enable (or disable) a segment, we will abort the subsequent operation without updating our in-memory mux state. The problem is that we cannot know if we in fact did write to the mux -- we know only that the write failed (not if/that it was partially successful). Under this condition, more or less anything is possible, including having the wrong segment enabled and (as in this case) having multiple segments enabled.
So how possible are such failures? On the bench, at least, it turns out that they are disconcertingly common: using @cbiffle's facility for UART-encoded tracing we were able to observe that
But is this what happened in the rack? Looking at the gathered dump, no task has restarted:
humility: attached to dump
system time = 11664743
ID TASK GEN PRI STATE
0 jefe 0 0 recv, notif: fault timer(T+57)
1 net 0 5 recv, notif: eth-irq(irq61) wake-timer(T+19)
2 sys 0 1 recv
3 spi2_driver 0 3 recv
4 i2c_driver 0 3 recv
5 spd 0 2 notif: i2c1-irq(irq31/irq32)
6 packrat 0 1 recv
7 thermal 0 5 recv, notif: timer(T+725)
8 power 0 6 recv, notif: timer(T+673)
9 hiffy 0 5 notif: bit31(T+148)
10 gimlet_seq 0 4 recv, notif: timer(T+20)
11 hash_driver 0 2 recv
12 hf 0 3 recv
13 update_server 0 3 recv
14 sensor 0 4 recv, notif: timer(T+259)
15 host_sp_comms 0 7 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent
16 udpecho 0 6 notif: socket
17 udpbroadcast 0 6 notif: bit31(T+295)
18 udprpc 0 6 notif: socket
19 control_plane_agent 0 6 recv, notif: usart-irq(irq37) socket timer
20 sprot 0 4 notif: bit31(T+1)
21 validate 0 5 recv
22 vpd 0 5 recv
23 user_leds 0 2 recv, notif: timer
24 dump_agent 0 6 wait: reply from sprot/gen0
25 sbrmi 0 4 recv
26 idle 0 8 RUNNING
So it is not due to the
This tells us only that we have seen errors -- but provides little additional detail. One additional important detail is that the read of the Gimlet VPD is very early in the life of the system: the first reads on the front bus are to the temp sensors, but the next reads will be to the Gimlet VPD data. (That is, the first mux+segment selected will be the mux+segment that corresponds to the Gimlet VPD data on the front bus.) Given all of this, what in fact seems most likely is that a segment on the non-Gimlet VPD mux remained enabled through an SP reset -- and failed to be disabled despite our two attempts to clear it on initial configuration (for reasons unknown). To fix this, we will:
Running with the mitigations in place and a deliberate stress test (in which we reset the SP in the midst of looping over all VPDs in the system), we were able to induce the condition that the mitigations address: of the 3,072 induced SP resets, we ended up hitting the configuration mitigation (that is, putting the bus in
To configure I2C after an SP reset (and therefore, in the middle of an arbitrary I2C transaction), we effect a bus reset by "clocking through the problem" in
This capture shows why the logic in
We need to restructure
This shows us clocking through some number of bits, and then sending a STOP condition. (We in fact send several STOP conditions -- but the spurious STOP conditions are harmless.)
With the logic fixed, we were able to run through 2,015 SP resets without seeing a single case of the mux state not being able to be set after a single controller reset. (As for why a single controller reset is sometimes required, see #1034.)
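For readers unfamiliar with "clocking through the problem", here is a minimal sketch of the general recovery technique: pulse SCL until the stuck target releases SDA, then assert a STOP. It is not the Hubris implementation. The `BitBangBus` trait, `clock_through`, and the `StuckTarget` mock are all hypothetical names, and the nine-pulse limit is the conventional bus-clear bound (eight data bits plus the ACK slot), assumed here rather than taken from the code under discussion.

```rust
/// Hypothetical bit-banged view of an I2C bus (not the Hubris driver API).
/// Lines are open-drain: `true` releases a line, `false` pulls it low.
trait BitBangBus {
    fn set_scl(&mut self, high: bool);
    fn set_sda(&mut self, high: bool);
    fn sda_is_high(&self) -> bool;
}

/// Pulses SCL until the target lets go of SDA, then asserts one STOP.
/// Returns the number of pulses sent, or None if SDA is still held low
/// after nine pulses (in which case no STOP is attempted).
fn clock_through<B: BitBangBus>(bus: &mut B) -> Option<u32> {
    bus.set_sda(true); // make sure the controller itself is not holding SDA
    let mut pulses = 0;
    while !bus.sda_is_high() {
        if pulses == 9 {
            return None;
        }
        bus.set_scl(false);
        bus.set_scl(true);
        pulses += 1;
    }
    // STOP condition: SDA rises while SCL is high.
    bus.set_scl(false);
    bus.set_sda(false);
    bus.set_scl(true);
    bus.set_sda(true);
    Some(pulses)
}

/// Mock target that was interrupted mid-transfer and holds SDA low until
/// it has seen `bits_left` more falling edges on SCL.
struct StuckTarget {
    scl: bool,
    controller_sda: bool,
    bits_left: u32,
    stop_seen: bool,
}

impl StuckTarget {
    fn new(bits_left: u32) -> Self {
        StuckTarget { scl: true, controller_sda: true, bits_left, stop_seen: false }
    }
}

impl BitBangBus for StuckTarget {
    fn set_scl(&mut self, high: bool) {
        // The mock advances one bit on each falling edge of SCL.
        if self.scl && !high && self.bits_left > 0 {
            self.bits_left -= 1;
        }
        self.scl = high;
    }
    fn set_sda(&mut self, high: bool) {
        // A STOP is only visible once the target is no longer holding SDA.
        if self.scl && !self.controller_sda && high && self.bits_left == 0 {
            self.stop_seen = true;
        }
        self.controller_sda = high;
    }
    fn sda_is_high(&self) -> bool {
        self.controller_sda && self.bits_left == 0
    }
}
```

Calling `clock_through(&mut StuckTarget::new(3))` sends three pulses and then the STOP. The sketch deliberately ignores one real-world wrinkle: a target that is mid-read may release SDA only because the current bit is a 1, so a single STOP attempt can fail to land. That is one plausible reason to send several STOP conditions, as the comment above describes.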
Addresses #1413 by:
- Eliminating the possibility of having in-memory state not reflect hardware state by introducing an explicit state (MuxState::Unknown) denoting unknown mux state for a bus (and therefore necessitating mux initialization on any subsequent transaction)
- Making sure that any failed attempt to set mux state (e.g., in initial configuration) indicates that the bus is in MuxState::Unknown
- Fixing a source of initial configuration failures due to reset during an I2C transaction by correcting the SCL wiggling code to correctly clock through the zombie transaction and assert a STOP condition
- Greatly expanding the ring buffer holding I2C errors (which pushes us over a memory threshold -- but this is clearly a price worth paying)
- Making the ring buffer much more explicit as to observed errors
- Having reset of the controller reset all muxes on a bus rather than merely the mux that was involved in the transaction that induced the need for a reset -- and have all such resets set the state to be MuxState::Unknown
Additionally, this eliminates the selected_mux_segment entry point, which has outlived its usefulness.
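The core idea of the first two items can be sketched as follows. Only the name `MuxState::Unknown` comes from the description above. The other variants, the `MuxHardware` trait, `select`, and the `FlakyMux` mock are hypothetical, and the real driver's types and control flow may differ substantially.

```rust
/// In-memory view of the mux state for one bus. Only `Unknown` is named in
/// the change description; the other variants are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MuxState {
    /// Hardware state cannot be trusted; reinitialize before any access.
    Unknown,
    /// Every segment on every mux is known to be disabled.
    AllDisabled,
    /// Exactly one segment on one mux is known to be enabled.
    Enabled { mux: u8, segment: u8 },
}

#[derive(Debug, PartialEq, Eq)]
struct WriteFailed;

/// Hypothetical hardware interface for the muxes on a bus.
trait MuxHardware {
    fn disable_all(&mut self) -> Result<(), WriteFailed>;
    fn enable(&mut self, mux: u8, segment: u8) -> Result<(), WriteFailed>;
}

/// Selects a segment. The state is set to `Unknown` before any hardware
/// write, so an early return on failure can never leave a stale claim
/// about which segment is enabled.
fn select<H: MuxHardware>(
    hw: &mut H,
    state: &mut MuxState,
    mux: u8,
    segment: u8,
) -> Result<(), WriteFailed> {
    let want = MuxState::Enabled { mux, segment };
    if *state == want {
        return Ok(());
    }
    let prior = std::mem::replace(state, MuxState::Unknown);
    if prior != MuxState::AllDisabled {
        hw.disable_all()?;
    }
    hw.enable(mux, segment)?;
    *state = want;
    Ok(())
}

/// Mock hardware that counts writes and can fail the next one on demand.
struct FlakyMux {
    fail_next: bool,
    writes: u32,
}

impl FlakyMux {
    fn write(&mut self) -> Result<(), WriteFailed> {
        self.writes += 1;
        if std::mem::take(&mut self.fail_next) { Err(WriteFailed) } else { Ok(()) }
    }
}

impl MuxHardware for FlakyMux {
    fn disable_all(&mut self) -> Result<(), WriteFailed> {
        self.write()
    }
    fn enable(&mut self, _mux: u8, _segment: u8) -> Result<(), WriteFailed> {
        self.write()
    }
}
```

The design choice worth noting is the pessimistic update: because a failed write may have partially succeeded, the only safe claim after a failure is that nothing is known, which forces the next transaction to disable everything before enabling its own segment.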
After the fix to #1413, the system is unwilling to ignore mux-related errors -- which has generated a problem on some lab systems that are missing muxes: because we can never get the bus into a known mux state, these systems now become entirely unusable, as even devices that aren't on the missing segments are not accessible. This fixes that by observing that a mux that is affirmatively missing -- that is, one that doesn't reply to its in-band management at all -- can be assumed to have segments that are similarly missing (and therefore as good as disabled). This (naturally) doesn't change the fact that accessing any devices attached to the missing mux will generate an error, but it allows the system to broadly drive on absent accesses to those devices.
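The rule described here, that a mux which never answers counts as disabled while any other failure still leaves the bus state unknown, might be expressed roughly like this. All names (`MuxError`, `MuxPresence`, `classify_disable`, `bus_state_known`) are hypothetical, and the assumption that "doesn't reply at all" maps to a distinct no-acknowledge error is mine, not something the description states.

```rust
/// Hypothetical error kinds from an attempt to write a mux control register.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MuxError {
    /// The mux never acknowledged its address: treated as affirmatively absent.
    NoAck,
    /// Any other failure (timeout mid-transfer, bus error, and so on).
    Other,
}

/// What we may conclude about a mux after trying to disable its segments.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MuxPresence {
    Disabled,
    AssumedAbsent,
}

/// An absent mux cannot have an enabled segment, so it is as good as
/// disabled; every other error is passed through unchanged.
fn classify_disable(result: Result<(), MuxError>) -> Result<MuxPresence, MuxError> {
    match result {
        Ok(()) => Ok(MuxPresence::Disabled),
        Err(MuxError::NoAck) => Ok(MuxPresence::AssumedAbsent),
        Err(e) => Err(e),
    }
}

/// The bus is in a known state only if every mux on it was either
/// disabled successfully or is affirmatively absent.
fn bus_state_known(results: &[Result<(), MuxError>]) -> bool {
    results.iter().all(|r| classify_disable(*r).is_ok())
}
```

Under this sketch, a bus with one healthy mux and one missing mux is usable (`bus_state_known` returns true), while a bus where any mux fails in some other way is not.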
The SP on BRM42220031 in the dogfood rack was updated and it came up with a self-assigned IP address. The packrat ringbuf indicates that the VPD checksum was invalid:
The dogfood documentation says that its host (not SP) IP was previously:
which tracks with what the VPD read shows (the host is two strides ahead, so
0x53
whereas the VPD data has 67 (0x43
))