
Rev-C gimlet came up with a self-assigned IP address / corrupt VPD #1413

Closed
citrus-it opened this issue Jun 14, 2023 · 6 comments · Fixed by #1417

citrus-it commented Jun 14, 2023

The SP on BRM42220031 in the dogfood rack was updated and it came up with a self-assigned IP address. The packrat ringbuf indicates that the VPD checksum was invalid:

jeeves% export HUMILITY_IP=fe80::c1d:7dff:fe2c:9295%rack2sw0tp1
jeeves% export HUMILITY_ARCHIVE=/net/catacomb/data/staff/dock/rack2/mupdate-20230614/build-gimlet-c-image-default.zip
jeeves% humility ringbuf packrat
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   1   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   2   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   3   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   4   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   5   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   6   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   7   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   8   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   9   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
jeeves% humility vpd -r -i 10
humility: connecting to fe80::c1d:7dff:fe2c:9295%rack2sw0tp1
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000019:006:BRM42220031",
        ]),
        ("MAC0", [
            [
                168,
                64,
                37,
                4,
                3,
                67,
                8,
                0,
                8,
            ],
        ]),
    ]),
]

The dogfood documentation says that its host (not SP) IP was previously:

21    fe80::aa40:25ff:fe04:353  BRM42220031 ci 1abdab5/177d123 2023-05-30 15:26

which tracks with what the VPD read shows (the host MAC is two strides ahead, so 0x53, whereas the VPD data has 67 (0x43)).
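
For reference, a minimal Rust sketch of that arithmetic. It assumes the trailing bytes of the MAC0 payload are a 16-bit count (8) and an 8-bit stride (8); that layout is an inference from the bytes above rather than something the dump states.

fn main() {
    // MAC0 payload from the VPD read above: a 6-byte base MAC followed by
    // what reads as a 16-bit count (8) and an 8-bit stride (8) -- that
    // layout is an assumption here, not something stated in the dump.
    let base_mac: [u8; 6] = [0xa8, 0x40, 0x25, 0x04, 0x03, 0x43];
    let stride: u8 = 0x08;

    // The host MAC is two strides past the base, so its final octet is:
    let host_last = base_mac[5] + 2 * stride;
    assert_eq!(host_last, 0x53);

    // fe80::aa40:25ff:fe04:353 is the EUI-64 form of a8:40:25:04:03:53
    // (ff:fe inserted, universal/local bit flipped: a8 -> aa), matching
    // the previously documented host IP.
    println!("expected host MAC: a8:40:25:04:03:{:02x}", host_last);
}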

citrus-it (Contributor, Author) commented:

A hubris core in this state is at /net/catacomb/data/staff/core/hubris-1413/hubris.core.0


mkeeter commented Jun 14, 2023

Reading the VPD manually (with humility hiffy -c Vpd.read -aindex=10,offset=...) and decoding it by hand, I get the following:

0x46, 0x52, 0x55, 0x30,     "FRU0"
0x4c, 0x0, 0x0, 0x0         length = 76
0xfd, 0xc6, 0x3b, 0xdb,     checksum (correct)

    0x42, 0x41, 0x52, 0x43      "BARC"
    0x1f, 0x0, 0x0, 0x0,        length = 31
    0xce, 0x3d, 0xd7, 0xf7      checksum (correct)
    0x30, 0x58, 0x56, 0x31, 0x3a, 0x39, 0x31, 0x33, 0x30, 0x30, 0x30, 0x30, 0x30, 0x31, 0x39, 0x3a, 0x30, 0x30, 0x36, 0x3a, 0x42, 0x52, 0x4d, 0x34, 0x32, 0x32, 0x32, 0x30, 0x30, 0x33, 0x31
    0x0                         padding
    0x39, 0xf7, 0xe6, 0x11      body checksum

    0x4d, 0x41, 0x43, 0x30      "MAC0"
    0x9, 0x0, 0x0, 0x0          length = 9
    0x61, 0x64, 0xd1, 0x7e      checksum (correct)
    0xa8, 0x40, 0x25, 0x4, 0x3, 0x43, 0x8, 0x0, 0x8,
    0x0, 0x0, 0x0,              padding
    0xa0, 0x2e, 0x9b, 0x24      body checksum

0x5d, 0x2b, 0xb6, 0x2c      body checksum

All of the header checksums are correct; I didn't bother to check the body checksums.

However, notice that the root header checksum (0xdb3bc6fd) doesn't match either stored_checksum or computed_checksum in the ringbuf!

This is surprising; we'd expect at least one of them to be correct.

Doing a bit of math, computed_checksum in the ringbuf indicates that we read a tag of FRU0 (correct) and a length of 0:

In [38]: hex(~(0x30555246 *  0x6b32_9f69 + 0) & 0xFFFFFFFF)
Out[38]: '0xdb3bc749'

stored_checksum is all correct except the lowest byte, which is 0x19 instead of 0xfd.
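
For reference, a minimal Rust sketch of the arithmetic above. The checksum relationship (multiply the tag by 0x6b32_9f69, add the length, bitwise-invert) is inferred from the computation above; this is a sketch of that relationship, not necessarily how the tlvc crate actually implements its header checksum.

fn header_checksum(tag: &[u8; 4], len: u32) -> u32 {
    let tag = u32::from_le_bytes(*tag);
    !(tag.wrapping_mul(0x6b32_9f69).wrapping_add(len))
}

fn main() {
    // A length of 0 reproduces computed_checksum from the ringbuf...
    assert_eq!(header_checksum(b"FRU0", 0), 0xdb3b_c749);
    // ...while the true length of 76 reproduces the correct root header
    // checksum decoded by hand above.
    assert_eq!(header_checksum(b"FRU0", 76), 0xdb3b_c6fd);
}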

The entire header is read in a single call to read_exact by the tlvc crate, which in turn makes a single 12-byte I2C transaction.


Summarizing the observations above, we are consistently reading the incorrect header

[0x46, 0x52, 0x55, 0x30, 0x00, 0x00, 0x00, 0x00, 0x19, 0xc6, 0x3b, 0xdb]

instead of the correct header

[0x46, 0x52, 0x55, 0x30, 0x4c, 0x00, 0x00, 0x00, 0xfd, 0xc6, 0x3b, 0xdb]

It's surprising that exactly two bytes are corrupted repeatably.


I've got a sneaking suspicion about what's going on. There are many AT24CSW080 EEPROMs in the system, all with the same I2C address (0x50), and all behind different muxes. In theory, the mux code ensures that we only talk to one at a time.

Reading the VPD header for the Sharkfin A FRU, we see

[0x46, 0x52, 0x55, 0x30, 0x30, 0x00, 0x00, 0x00, 0x19, 0xc7, 0x3b, 0xdb]

Here's the fun part: if we take the bitwise AND of this header and our correct header, we get the incorrect header (!!).

This suggests that the local VPD and at least one sharkfin VPD are talking simultaneously during the header-read transaction.
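
Since I2C is an open-drain bus, two targets driving SDA at the same time produce the wired AND of their data at the controller; a minimal Rust sketch of the byte-wise AND above:

fn main() {
    // The correct local-VPD root header and the Sharkfin A root header...
    let local    = [0x46u8, 0x52, 0x55, 0x30, 0x4c, 0x00, 0x00, 0x00, 0xfd, 0xc6, 0x3b, 0xdb];
    let sharkfin = [0x46u8, 0x52, 0x55, 0x30, 0x30, 0x00, 0x00, 0x00, 0x19, 0xc7, 0x3b, 0xdb];

    // ...ANDed byte-by-byte, as an open-drain bus combines two
    // simultaneously-driving targets...
    let anded: Vec<u8> =
        local.iter().zip(sharkfin.iter()).map(|(a, b)| a & b).collect();

    // ...reproduce exactly the corrupt header we keep reading.
    let corrupt = [0x46u8, 0x52, 0x55, 0x30, 0x00, 0x00, 0x00, 0x00, 0x19, 0xc6, 0x3b, 0xdb];
    assert_eq!(&anded[..], &corrupt[..]);
}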

The eye of Sauron turns to our I2C mux initialization code...


lzrd commented Jun 15, 2023

This is a thing which "cannot happen" in I2C. The I2C target device is supposed to sense what is on SDA versus what it is driving and abort its transaction if it is not winning (i.e., trying to source a 1 by not driving the line, but sensing a 0).
This implies that the mux is not a pass-gate device for SDA. I wonder what it does for SCL.


lzrd commented Jun 15, 2023

If the mux is a pass-gate device, then you can still get weirdness when multiple ports are enabled depending on the value and placement of the pull-up resistors.

bcantrill (Collaborator) commented:

In looking at this, a very important piece of evidence is @mkeeter's determination that we had two segments enabled simultaneously. The muxes on the front bus are all PCA9545s; leaving aside the possibility of a stray write to a mux (which is to say: arbitrary data corruption), our code has no ability to enable multiple segments simultaneously. But the front bus does have multiple PCA9545 muxes; is it possible for our code to enable multiple muxes simultaneously?

As it turns out, this is possible under certain error conditions during otherwise routine operation: if we cannot write to a mux to enable (or disable) a segment, we will abort the subsequent operation without updating our in-memory mux state. The problem is that we cannot know whether we did in fact write to the mux -- we know only that the write failed (not whether it was partially successful). Under this condition, more or less anything is possible, including having the wrong segment enabled and (as in this case) having multiple segments enabled.
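
A hypothetical Rust sketch of that failure mode (the simulated mux and names here are illustrative only, not the actual i2c_server code): if the in-memory state is updated only on a successful write, a failed write can leave it out of sync with the hardware.

/// Simulated PCA9545: the control-register write may "fail" (e.g., a
/// timeout) even though the mux has already latched the new value.
struct SimulatedMux {
    hw_segment: Option<u8>,
}

impl SimulatedMux {
    fn write_control(&mut self, seg: u8, write_fails: bool) -> Result<(), ()> {
        self.hw_segment = Some(seg); // the hardware latches the value...
        if write_fails { Err(()) } else { Ok(()) } // ...but we may see an error
    }
}

fn main() {
    let mut mux = SimulatedMux { hw_segment: None };

    // The pattern described above: in-memory state is updated only on
    // success, and a failed enable/disable simply aborts the operation.
    let mut believed: Option<u8> = None;
    if mux.write_control(2, /* write_fails = */ true).is_ok() {
        believed = Some(2);
    }

    // We now believe no segment is enabled, but the hardware disagrees --
    // the precondition for ending up with multiple segments enabled.
    assert_eq!(believed, None);
    assert_eq!(mux.hw_segment, Some(2));
}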

So how possible are such failures? On the bench, at least, it turns out that they are disconcertingly common: using @cbiffle's facility for UART-encoded tracing, we were able to observe jefe recording a dump from a panicked net task preempting the i2c_server task for ~66ms -- far longer than the 25ms I2C timeout. But because of the byte-atomicity of writes by the I2C controller, this preemption would happen after the write (the I2C writes return indicating both transfer completion and timeout). Which means that on the bench, at least, this situation is quite possible.

But is this what happened in the rack? Looking at the gathered dump, no task has restarted:

humility: attached to dump
system time = 11664743
ID TASK                       GEN PRI STATE    
 0 jefe                         0   0 recv, notif: fault timer(T+57)
 1 net                          0   5 recv, notif: eth-irq(irq61) wake-timer(T+19)
 2 sys                          0   1 recv
 3 spi2_driver                  0   3 recv
 4 i2c_driver                   0   3 recv
 5 spd                          0   2 notif: i2c1-irq(irq31/irq32)
 6 packrat                      0   1 recv
 7 thermal                      0   5 recv, notif: timer(T+725)
 8 power                        0   6 recv, notif: timer(T+673)
 9 hiffy                        0   5 notif: bit31(T+148)
10 gimlet_seq                   0   4 recv, notif: timer(T+20)
11 hash_driver                  0   2 recv
12 hf                           0   3 recv
13 update_server                0   3 recv
14 sensor                       0   4 recv, notif: timer(T+259)
15 host_sp_comms                0   7 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent
16 udpecho                      0   6 notif: socket
17 udpbroadcast                 0   6 notif: bit31(T+295)
18 udprpc                       0   6 notif: socket
19 control_plane_agent          0   6 recv, notif: usart-irq(irq37) socket timer
20 sprot                        0   4 notif: bit31(T+1)
21 validate                     0   5 recv
22 vpd                          0   5 recv
23 user_leds                    0   2 recv, notif: timer
24 dump_agent                   0   6 wait: reply from sprot/gen0
25 sbrmi                        0   4 recv
26 idle                         0   8 RUNNING

So it is not due to the jefe preemption for task dumping; have we seen I2C errors at all? Due to its small size, the ring buffer in i2c_driver is of limited utility, unfortunately:

humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
   7  163        9        1 Reset(I2C2, PortIndex(0x0))
   0  277       10        1 Error
   1  163       10        1 Reset(I2C2, PortIndex(0x0))
   2  173       10        1 ResetMux(M1)
   3  337       10        1 Error
   4  163       10        1 Reset(I2C2, PortIndex(0x0))
   5  173       10        1 ResetMux(M1)
   6  337       10      902 Error

This tells us only that we have seen errors -- but provides little else. One important detail is that the read of the Gimlet VPD comes very early in the life of the system: the first reads on the front bus are to the temp sensors, but the next reads are to the Gimlet VPD data. (That is, the first mux+segment selected will be the mux+segment that corresponds to the Gimlet VPD data on the front bus.)

Given all of this, what in fact seems most likely is that a segment on the non-Gimlet VPD mux remained enabled through an SP reset -- and failed to be disabled despite our two attempts to clear it on initial configuration (for reasons unknown).

To fix this, we will:

  • Eliminate the possibility of having in-memory state not reflect hardware state by introducing an explicit state (MuxState::Unknown) denoting unknown mux state for a bus (and therefore necessitating mux initialization on any subsequent transaction); see the sketch after this list

  • Make sure that any failed attempt to set mux state (e.g., in initial configuration) indicates that the bus is in MuxState::Unknown

  • Greatly expand the ring buffer holding I2C errors (which pushes us over a memory threshold -- but this is clearly a price worth paying)

  • Make this ring buffer much more explicit as to observed errors

  • Have reset of the controller reset all muxes on a bus rather than merely the mux that was involved in the transaction that induced the need for a reset -- and have all such resets set the state to be MuxState::Unknown

  • Eliminate the selected_mux_segment entry point, which has served its purpose
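
As a rough illustration of the first two bullets (illustrative types only, not the actual driver code): any failed mux write leaves the bus in an explicit MuxState::Unknown rather than stale "known" state, forcing re-initialization before the next transaction.

#[derive(Clone, Copy, PartialEq, Debug)]
enum MuxState {
    // We believe we know which segment (if any) is enabled.
    Known(Option<u8>),
    // A failed mux write (or a reset) means our in-memory state can no
    // longer be trusted; the next transaction must re-initialize the
    // muxes before doing anything else.
    Unknown,
}

fn enable_segment(state: &mut MuxState, seg: u8, write_ok: bool) -> Result<(), ()> {
    if write_ok {
        *state = MuxState::Known(Some(seg));
        Ok(())
    } else {
        // We cannot know whether the mux latched the value, so record
        // that the hardware state is unknown.
        *state = MuxState::Unknown;
        Err(())
    }
}

fn main() {
    let mut state = MuxState::Known(None);
    let _ = enable_segment(&mut state, 2, false);
    assert_eq!(state, MuxState::Unknown); // forces re-initialization later
}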

bcantrill (Collaborator) commented:

Running with the mitigations in place and a deliberate stress test (in which we reset the SP in the midst of looping over all VPDs in the system), we were able to induce the condition that the mitigations address: of the 3,072 induced SP resets, we ended up hitting the configuration mitigation (that is, putting the bus in MuxState::Unknown because we were unable to initialize the mux after a failed attempt, controller reset, and a re-attempt) 44 times. This is a clear indicator that this condition can be hit, as our in-memory state will not match the true mux state -- and depending on what that true mux state was and what the first transaction(s) are on the bus, we could potentially get the wrong VPD data. While the fact that we recovered correctly in all cases is reassuring, the failure rate is troublingly high: why are we failing to set our mux state on SP reset roughly 1.4% of the time?

To configure I2C after an SP reset (and therefore, in the middle of an arbitrary I2C transaction), we effect a bus reset by "clocking through the problem" in wiggle_scl; here is a logic analyzer capture of wiggle_scl before a failure to initialize mux state after two attempts:

[logic analyzer capture: 2023-06-18-12-10-47]

This capture shows why the logic in wiggle_scl is not quite complete: we are seeing SDA high and then pulling SDA low (and dropping SCL) to prepare to send a STOP condition (where we will release SCL and then SDA). But if -- as in this capture -- we saw SDA as high not because the bus was idle, but rather because the still-active target is in fact sending a 1, when SCL falls, the target will pull SDA low to send a 0. Our STOP condition does nothing in this case: the initiator releasing SDA doesn't, in fact, release SDA because it's held down by the target. (Which is why the logic trace shows us still in the middle of an I2C transaction.)

We need to restructure wiggle_scl slightly to strictly adhere to the approach outlined in AN-686: we must always send the necessary SCL transitions to clock through any outstanding transaction -- and then finally send the STOP condition. With this logic fixed, here is another capture of wiggle_scl after a mid-transaction SP reset:

[logic analyzer capture: 2023-06-18-12-10-36]

This shows us clocking through some number of bits, and then sending a STOP condition. (We in fact send several STOP conditions -- but the spurious STOP conditions are harmless.)
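
For concreteness, a highly simplified Rust sketch of a bus-clear sequence in this spirit, following the general AN-686 recipe of clocking through any outstanding transaction before issuing the STOP. This is an illustration only -- not the actual wiggle_scl code -- and a real implementation would also insert delays and sample SDA.

// The "bus" here just records the SCL/SDA levels we would drive.
struct Bus {
    trace: Vec<(&'static str, bool)>,
}

impl Bus {
    fn scl(&mut self, released: bool) { self.trace.push(("SCL", released)); }
    fn sda(&mut self, released: bool) { self.trace.push(("SDA", released)); }
}

fn clear_bus(bus: &mut Bus) {
    bus.sda(true); // never drive SDA while clocking

    // Always clock SCL enough times to walk a zombie transaction through
    // its remaining bits (9 clocks covers a byte plus the ACK bit);
    // checking SDA up front is not sufficient, because a still-active
    // target may simply be sending a 1 at that moment.
    for _ in 0..9 {
        bus.scl(false);
        bus.scl(true);
    }

    // Only then generate a STOP: with SCL low, pull SDA low, then release
    // SCL, then release SDA (SDA rising while SCL is high is the STOP).
    bus.scl(false);
    bus.sda(false);
    bus.scl(true);
    bus.sda(true);
}

fn main() {
    let mut bus = Bus { trace: Vec::new() };
    clear_bus(&mut bus);
    for (line, released) in &bus.trace {
        println!("{line}: {}", if *released { "released" } else { "driven low" });
    }
}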

With the logic fixed, we were able to run through 2,015 SP resets without seeing a single case in which the mux state could not be set after (at most) a single controller reset. (As for why a single controller reset is sometimes required, see #1034.)

bcantrill added a commit that referenced this issue Jun 19, 2023
Addresses #1413 by:

- Eliminating the possibility of having in-memory state not reflect
  hardware state by introducing an explicit state (MuxState::Unknown)
  denoting unknown mux state for a bus (and therefore necessitating
  mux initialization on any subsequent transaction)

- Making sure that any failed attempt to set mux state (e.g., in
  initial configuration) indicates that the bus is in
  MuxState::Unknown

- Fixing a source of initial configuration failures due to reset
  during an I2C transaction by correcting the SCL wiggling code to
  correctly clock through the zombie transaction and assert a STOP
  condition

- Greatly expanding the ring buffer holding I2C errors (which pushes
  us over a memory threshold -- but this is clearly a price worth
  paying)

- Making the ring buffer much more explicit as to observed errors

- Having reset of the controller reset all muxes on a bus rather
  than merely the mux that was involved in the transaction that
  induced the need for a reset -- and have all such resets set the
  state to be MuxState::Unknown

Additionally, this eliminates the selected_mux_segment entry point,
which has outlived its usefulness.
bcantrill added a commit that referenced this issue Jun 22, 2023
After the fix to #1413, the system is unwilling to ignore mux-related
errors -- which has generated a problem on some lab systems that are
missing muxes:  because we can never get the bus into a known mux
state, these systems now become entirely unusable, as even devices
that aren't on the missing segments are not accessible.  This fixes
that by observing that a mux that is affirmatively missing -- that is,
one that doesn't reply to its in-band management at all -- can be
assumed to have segments that are similarly missing (and therefore as
good as disabled).  This (naturally) doesn't change the fact that
accessing any devices attached to the missing mux will generate an
error, but it allows the system to broadly drive on, absent accesses
to those devices.