
Rev-C gimlet came up with a self-assigned IP address / corrupt VPD #1413

Closed
citrus-it opened this issue Jun 14, 2023 · 6 comments · Fixed by #1417

citrus-it commented Jun 14, 2023

The SP on BRM42220031 in the dogfood rack was updated and it came up with a self-assigned IP address. The packrat ringbuf indicates that the VPD checksum was invalid:

jeeves% export HUMILITY_IP=fe80::c1d:7dff:fe2c:9295%rack2sw0tp1
jeeves% export HUMILITY_ARCHIVE=/net/catacomb/data/staff/dock/rack2/mupdate-20230614/build-gimlet-c-image-default.zip
jeeves% humility ringbuf packrat
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   1   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   2   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   3   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   4   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   5   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   6   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   7   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   8   51        1        1 MacLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
   9   68        1        1 BarcodeLocalVpdError(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xdb3bc619, computed_checksum: 0xdb3bc749 }))
jeeves% humility vpd -r -i 10
humility: connecting to fe80::c1d:7dff:fe2c:9295%rack2sw0tp1
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000019:006:BRM42220031",
        ]),
        ("MAC0", [
            [
                168,
                64,
                37,
                4,
                3,
                67,
                8,
                0,
                8,
            ],
        ]),
    ]),
]

The dogfood documentation says that its host (not SP) IP was previously:

21    fe80::aa40:25ff:fe04:353  BRM42220031 ci 1abdab5/177d123 2023-05-30 15:26

which tracks with what the VPD read shows (the host MAC is two strides ahead, so 0x53, whereas the VPD data has 67 (0x43)).
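
For reference, a minimal Rust sketch of that arithmetic. It assumes the trailing bytes of the MAC0 payload are a 16-bit count (8) and an 8-bit stride (8); that layout is an inference from the bytes above rather than something the dump states.

fn main() {
    // MAC0 payload from the VPD read above: a 6-byte base MAC followed by
    // what reads as a 16-bit count (8) and an 8-bit stride (8) -- that
    // layout is an assumption here, not something stated in the dump.
    let base_mac: [u8; 6] = [0xa8, 0x40, 0x25, 0x04, 0x03, 0x43];
    let stride: u8 = 0x08;

    // The host MAC is two strides past the base, so its final octet is:
    let host_last = base_mac[5] + 2 * stride;
    assert_eq!(host_last, 0x53);

    // fe80::aa40:25ff:fe04:353 is the EUI-64 form of a8:40:25:04:03:53
    // (ff:fe inserted, universal/local bit flipped: a8 -> aa), matching
    // the previously documented host IP.
    println!("expected host MAC: a8:40:25:04:03:{:02x}", host_last);
}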

citrus-it (Contributor, Author) commented:

A hubris core in this state is at /net/catacomb/data/staff/core/hubris-1413/hubris.core.0


mkeeter commented Jun 14, 2023

Reading the VPD manually (with humility hiffy -c Vpd.read -aindex=10,offset=...) and decoding it by hand, I get the following:

0x46, 0x52, 0x55, 0x30,     "FRU0"
0x4c, 0x0, 0x0, 0x0         length = 76
0xfd, 0xc6, 0x3b, 0xdb,     checksum (correct)

    0x42, 0x41, 0x52, 0x43      "BARC"
    0x1f, 0x0, 0x0, 0x0,        length = 31
    0xce, 0x3d, 0xd7, 0xf7      checksum (correct)
    0x30, 0x58, 0x56, 0x31, 0x3a, 0x39, 0x31, 0x33, 0x30, 0x30, 0x30, 0x30, 0x30, 0x31, 0x39, 0x3a, 0x30, 0x30, 0x36, 0x3a, 0x42, 0x52, 0x4d, 0x34, 0x32, 0x32, 0x32, 0x30, 0x30, 0x33, 0x31
    0x0                         padding
    0x39, 0xf7, 0xe6, 0x11      body checksum

    0x4d, 0x41, 0x43, 0x30      "MAC0"
    0x9, 0x0, 0x0, 0x0          length = 9
    0x61, 0x64, 0xd1, 0x7e      checksum (correct)
    0xa8, 0x40, 0x25, 0x4, 0x3, 0x43, 0x8, 0x0, 0x8,
    0x0, 0x0, 0x0,              padding
    0xa0, 0x2e, 0x9b, 0x24      body checksum

0x5d, 0x2b, 0xb6, 0x2c      body checksum

All of the header checksums are correct; I didn't bother to check the body checksums.

However, notice that the root header checksum (0xdb3bc6fd) doesn't match either stored_checksum or computed_checksum in the ringbuf!

This is surprising; we'd expect at least one of them to be correct.

Doing a bit of math, computed_checksum in the ringbuf indicates that we read a tag of FRU0 (correct) and a length of 0:

In [38]: hex(~(0x30555246 *  0x6b32_9f69 + 0) & 0xFFFFFFFF)
Out[38]: '0xdb3bc749'

stored_checksum is all correct except the lowest byte, which is 0x19 instead of 0xfd.
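
For reference, a minimal Rust sketch of the arithmetic above. The checksum relationship (multiply the tag by 0x6b32_9f69, add the length, bitwise-invert) is inferred from the computation above; this is a sketch of that relationship, not necessarily how the tlvc crate actually implements its header checksum.

fn header_checksum(tag: &[u8; 4], len: u32) -> u32 {
    let tag = u32::from_le_bytes(*tag);
    !(tag.wrapping_mul(0x6b32_9f69).wrapping_add(len))
}

fn main() {
    // A length of 0 reproduces computed_checksum from the ringbuf...
    assert_eq!(header_checksum(b"FRU0", 0), 0xdb3b_c749);
    // ...while the true length of 76 reproduces the correct root header
    // checksum decoded by hand above.
    assert_eq!(header_checksum(b"FRU0", 76), 0xdb3b_c6fd);
}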

The entire header is read in a single call to read_exact by the tlvc crate, which in turn makes a single 12-byte I2C transaction.


Summarizing the observations above, we are consistently reading the incorrect header

[0x46, 0x52, 0x55, 0x30, 0x00, 0x00, 0x00, 0x00, 0x19, 0xc6, 0x3b, 0xdb]

instead of the correct header

[0x46, 0x52, 0x55, 0x30, 0x4c, 0x00, 0x00, 0x00, 0xfd, 0xc6, 0x3b, 0xdb]

It's surprising that exactly two bytes are corrupted repeatably.


I've got a sneaking suspicion about what's going on. There are many AT24CSW080 EEPROMs in the system, all with the same I2C address (0x50), and all behind different muxes. In theory, the mux code ensures that we only talk to one at a time.

Reading the VPD header for the Sharkfin A FRU, we see

[0x46, 0x52, 0x55, 0x30, 0x30, 0x00, 0x00, 0x00, 0x19, 0xc7, 0x3b, 0xdb]

Here's the fun part: if we take the bitwise AND of this header and our correct header, we get the incorrect header (!!).

This suggests that the local VPD and at least one sharkfin VPD are talking simultaneously during the header-read transaction.
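
Since I2C is an open-drain bus, two targets driving SDA at the same time produce the wired AND of their data at the controller; a minimal Rust sketch of the byte-wise AND above:

fn main() {
    // The correct local-VPD root header and the Sharkfin A root header...
    let local    = [0x46u8, 0x52, 0x55, 0x30, 0x4c, 0x00, 0x00, 0x00, 0xfd, 0xc6, 0x3b, 0xdb];
    let sharkfin = [0x46u8, 0x52, 0x55, 0x30, 0x30, 0x00, 0x00, 0x00, 0x19, 0xc7, 0x3b, 0xdb];

    // ...ANDed byte-by-byte, as an open-drain bus combines two
    // simultaneously-driving targets...
    let anded: Vec<u8> =
        local.iter().zip(sharkfin.iter()).map(|(a, b)| a & b).collect();

    // ...reproduce exactly the corrupt header we keep reading.
    let corrupt = [0x46u8, 0x52, 0x55, 0x30, 0x00, 0x00, 0x00, 0x00, 0x19, 0xc6, 0x3b, 0xdb];
    assert_eq!(&anded[..], &corrupt[..]);
}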

The eye of Sauron turns to our I2C mux initialization code...


lzrd commented Jun 15, 2023

This is a thing which "cannot happen" in I2C. The I2C target device is supposed to sense what is on SDA versus what it is driving and abort its transaction if it is not winning (i.e., trying to source a 1 by not driving the line, but sensing a 0).
This implies that the mux is not a pass-gate device for SDA. I wonder what it does for SCL.


lzrd commented Jun 15, 2023

If the mux is a pass-gate device, then you can still get weirdness when multiple ports are enabled depending on the value and placement of the pull-up resistors.

bcantrill (Collaborator) commented:

In looking at this, a very important piece of evidence is @mkeeter's determination that we had two segments enabled simultaneously. The muxes on the front bus are all PCA9545s; leaving aside the possibility of a stray write to a mux (which is to say: arbitrary data corruption), our code has no ability to enable multiple segments simultaneously. But the front bus does have multiple PCA9545 muxes; is it possible for our code to enable multiple muxes simultaneously?

As it turns out, this is possible under certain error conditions during otherwise routine operation: if we cannot write to a mux to enable (or disable) a segment, we will abort the subsequent operation without updating our in-memory mux state. The problem is that we cannot know whether we did in fact write to the mux -- we know only that the write failed (not whether it was partially successful). Under this condition, more or less anything is possible, including having the wrong segment enabled and (as in this case) having multiple segments enabled.
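
A hypothetical Rust sketch of that failure mode (the simulated mux and names here are illustrative only, not the actual i2c_server code): if the in-memory state is updated only on a successful write, a failed write can leave it out of sync with the hardware.

/// Simulated PCA9545: the control-register write may "fail" (e.g., a
/// timeout) even though the mux has already latched the new value.
struct SimulatedMux {
    hw_segment: Option<u8>,
}

impl SimulatedMux {
    fn write_control(&mut self, seg: u8, write_fails: bool) -> Result<(), ()> {
        self.hw_segment = Some(seg); // the hardware latches the value...
        if write_fails { Err(()) } else { Ok(()) } // ...but we may see an error
    }
}

fn main() {
    let mut mux = SimulatedMux { hw_segment: None };

    // The pattern described above: in-memory state is updated only on
    // success, and a failed enable/disable simply aborts the operation.
    let mut believed: Option<u8> = None;
    if mux.write_control(2, /* write_fails = */ true).is_ok() {
        believed = Some(2);
    }

    // We now believe no segment is enabled, but the hardware disagrees --
    // the precondition for ending up with multiple segments enabled.
    assert_eq!(believed, None);
    assert_eq!(mux.hw_segment, Some(2));
}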

So how possible are such failures? On the bench, at least, it turns out that they are disconcertingly common: using @cbiffle's facility for UART-encoded tracing, we were able to observe jefe recording a dump from a panicked net task preempting the i2c_server task for ~66ms -- far longer than the 25ms I2C timeout. But because of the byte-atomicity of writes by the I2C controller, this preemption would happen after the write (the I2C writes return indicating both transfer completion and timeout). Which means that on the bench, at least, this situation is quite possible.

But is this what happened in the rack? Looking at the gathered dump, no task has restarted:

humility: attached to dump
system time = 11664743
ID TASK                       GEN PRI STATE    
 0 jefe                         0   0 recv, notif: fault timer(T+57)
 1 net                          0   5 recv, notif: eth-irq(irq61) wake-timer(T+19)
 2 sys                          0   1 recv
 3 spi2_driver                  0   3 recv
 4 i2c_driver                   0   3 recv
 5 spd                          0   2 notif: i2c1-irq(irq31/irq32)
 6 packrat                      0   1 recv
 7 thermal                      0   5 recv, notif: timer(T+725)
 8 power                        0   6 recv, notif: timer(T+673)
 9 hiffy                        0   5 notif: bit31(T+148)
10 gimlet_seq                   0   4 recv, notif: timer(T+20)
11 hash_driver                  0   2 recv
12 hf                           0   3 recv
13 update_server                0   3 recv
14 sensor                       0   4 recv, notif: timer(T+259)
15 host_sp_comms                0   7 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent
16 udpecho                      0   6 notif: socket
17 udpbroadcast                 0   6 notif: bit31(T+295)
18 udprpc                       0   6 notif: socket
19 control_plane_agent          0   6 recv, notif: usart-irq(irq37) socket timer
20 sprot                        0   4 notif: bit31(T+1)
21 validate                     0   5 recv
22 vpd                          0   5 recv
23 user_leds                    0   2 recv, notif: timer
24 dump_agent                   0   6 wait: reply from sprot/gen0
25 sbrmi                        0   4 recv
26 idle                         0   8 RUNNING

So it is not due to the jefe preemption for task dumping; have we seen I2C errors at all? Due to its small size, the ring buffer in i2c_driver is of limited utility, unfortunately:

humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
   7  163        9        1 Reset(I2C2, PortIndex(0x0))
   0  277       10        1 Error
   1  163       10        1 Reset(I2C2, PortIndex(0x0))
   2  173       10        1 ResetMux(M1)
   3  337       10        1 Error
   4  163       10        1 Reset(I2C2, PortIndex(0x0))
   5  173       10        1 ResetMux(M1)
   6  337       10      902 Error

This tells us only that we have seen errors -- but provides little else. One important detail is that the read of the Gimlet VPD comes very early in the life of the system: the first reads on the front bus are to the temp sensors, but the next reads are to the Gimlet VPD data. (That is, the first mux+segment selected will be the mux+segment that corresponds to the Gimlet VPD data on the front bus.)

Given all of this, what in fact seems most likely is that a segment on the non-Gimlet VPD mux remained enabled through an SP reset -- and failed to be disabled despite our two attempts to clear it on initial configuration (for reasons unknown).

To fix this, we will:

  • Eliminate the possibility of having in-memory state not reflect hardware state by introducing an explicit state (MuxState::Unknown) denoting unknown mux state for a bus (and therefore necessitating mux initialization on any subsequent transaction); see the sketch after this list

  • Make sure that any failed attempt to set mux state (e.g., in initial configuration) indicates that the bus is in MuxState::Unknown

  • Greatly expand the ring buffer holding I2C errors (which pushes us over a memory threshold -- but this is clearly a price worth paying)

  • Make this ring buffer much more explicit as to observed errors

  • Have reset of the controller reset all muxes on a bus rather than merely the mux that was involved in the transaction that induced the need for a reset -- and have all such resets set the state to be MuxState::Unknown

  • Eliminate the selected_mux_segment entry point, which has served its purpose
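
As a rough illustration of the first two bullets (illustrative types only, not the actual driver code): any failed mux write leaves the bus in an explicit MuxState::Unknown rather than stale "known" state, forcing re-initialization before the next transaction.

#[derive(Clone, Copy, PartialEq, Debug)]
enum MuxState {
    // We believe we know which segment (if any) is enabled.
    Known(Option<u8>),
    // A failed mux write (or a reset) means our in-memory state can no
    // longer be trusted; the next transaction must re-initialize the
    // muxes before doing anything else.
    Unknown,
}

fn enable_segment(state: &mut MuxState, seg: u8, write_ok: bool) -> Result<(), ()> {
    if write_ok {
        *state = MuxState::Known(Some(seg));
        Ok(())
    } else {
        // We cannot know whether the mux latched the value, so record
        // that the hardware state is unknown.
        *state = MuxState::Unknown;
        Err(())
    }
}

fn main() {
    let mut state = MuxState::Known(None);
    let _ = enable_segment(&mut state, 2, false);
    assert_eq!(state, MuxState::Unknown); // forces re-initialization later
}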

bcantrill (Collaborator) commented:

Running with the mitigations in place and a deliberate stress test (in which we reset the SP in the midst of looping over all VPDs in the system), we were able to induce the condition that the mitigations address: of the 3,072 induced SP resets, we ended up hitting the configuration mitigation (that is, putting the bus in MuxState::Unknown because we were unable to initialize the mux after a failed attempt, controller reset, and a re-attempt) 44 times. This is a clear indicator that this condition can be hit, as our in-memory state will not match the true mux state -- and depending on what that true mux state was and what the first transaction(s) are on the bus, we could potentially get the wrong VPD data. While the fact that we recovered correctly in all cases is reassuring, the failure rate is troublingly high: why are we failing to set our mux state on SP reset roughly 1.4% of the time?

To configure I2C after an SP reset (and therefore, in the middle of an arbitrary I2C transaction), we effect a bus reset by "clocking through the problem" in wiggle_scl; here is a logic analyzer capture of wiggle_scl before a failure to initialize mux state after two attempts:

[logic analyzer capture: 2023-06-18-12-10-47]

This capture shows why the logic in wiggle_scl is not quite complete: we are seeing SDA high and then pulling SDA low (and dropping SCL) to prepare to send a STOP condition (where we will release SCL and then SDA). But if -- as in this capture -- we saw SDA as high not because the bus was idle, but rather because the still-active target is in fact sending a 1, when SCL falls, the target will pull SDA low to send a 0. Our STOP condition does nothing in this case: the initiator releasing SDA doesn't, in fact, release SDA because it's held down by the target. (Which is why the logic trace shows us still in the middle of an I2C transaction.)

We need to restructure wiggle_scl slightly to strictly adhere to the approach outlined in AN-686: we must always send the necessary SCL transitions to clock through any outstanding transaction -- and then finally send the STOP condition. With this logic fixed, here is another capture of wiggle_scl after a mid-transaction SP reset:

[logic analyzer capture: 2023-06-18-12-10-36]

This shows us clocking through some number of bits, and then sending a STOP condition. (We in fact send several STOP conditions -- but the spurious STOP conditions are harmless.)
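
For concreteness, a highly simplified Rust sketch of a bus-clear sequence in this spirit, following the general AN-686 recipe of clocking through any outstanding transaction before issuing the STOP. This is an illustration only -- not the actual wiggle_scl code -- and a real implementation would also insert delays and sample SDA.

// The "bus" here just records the SCL/SDA levels we would drive.
struct Bus {
    trace: Vec<(&'static str, bool)>,
}

impl Bus {
    fn scl(&mut self, released: bool) { self.trace.push(("SCL", released)); }
    fn sda(&mut self, released: bool) { self.trace.push(("SDA", released)); }
}

fn clear_bus(bus: &mut Bus) {
    bus.sda(true); // never drive SDA while clocking

    // Always clock SCL enough times to walk a zombie transaction through
    // its remaining bits (9 clocks covers a byte plus the ACK bit);
    // checking SDA up front is not sufficient, because a still-active
    // target may simply be sending a 1 at that moment.
    for _ in 0..9 {
        bus.scl(false);
        bus.scl(true);
    }

    // Only then generate a STOP: with SCL low, pull SDA low, then release
    // SCL, then release SDA (SDA rising while SCL is high is the STOP).
    bus.scl(false);
    bus.sda(false);
    bus.scl(true);
    bus.sda(true);
}

fn main() {
    let mut bus = Bus { trace: Vec::new() };
    clear_bus(&mut bus);
    for (line, released) in &bus.trace {
        println!("{line}: {}", if *released { "released" } else { "driven low" });
    }
}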

With the logic fixed, we were able to run through 2,015 SP resets without seeing a single case in which the mux state could not be set after (at most) a single controller reset. (As for why a single controller reset is sometimes required, see #1034.)

bcantrill added a commit that referenced this issue Jun 19, 2023
Addresses #1413 by:

- Eliminating the possibility of having in-memory state not reflect
  hardware state by introducing an explicit state (MuxState::Unknown)
  denoting unknown mux state for a bus (and therefore necessitating
  mux initialization on any subsequent transaction)

- Making sure that any failed attempt to set mux state (e.g., in
  initial configuration) indicates that the bus is in
  MuxState::Unknown

- Fixing a source of initial configuration failures due to reset
  during an I2C transaction by correcting the SCL wiggling code to
  correctly clock through the zombie transaction and assert a STOP
  condition

- Greatly expanding the ring buffer holding I2C errors (which pushes
  us over a memory threshold -- but this is clearly a price worth
  paying)

- Making the ring buffer much more explicit as to observed errors

- Having reset of the controller reset all muxes on a bus rather
  than merely the mux that was involved in the transaction that
  induced the need for a reset -- and have all such resets set the
  state to be MuxState::Unknown

Additionally, this eliminates the selected_mux_segment entry point,
which has outlived its usefulness.
bcantrill added a commit that referenced this issue Jun 22, 2023
After the fix to #1413, the system is unwilling to ignore mux-related
errors -- which has generated a problem on some lab systems that are
missing muxes:  because we can never get the bus into a known mux
state, these systems now become entirely unusable, as even devices
that aren't on the missing segments are not accessible.  This fixes
that by observing that a mux that is affirmatively missing -- that is,
one that doesn't reply to its in-band management at all -- can be
assumed to have segments that are similarly missing (and therefore as
good as disabled).  This (naturally) doesn't change the fact that
accessing any devices attached to the missing mux will generate an
error, but it allows the system to broadly drive on, absent accesses
to those devices.