SP still comes up with self-assigned MAC address after #1171 #1196

rmustacc · 2023-03-10T20:17:15Z

While doing testing with rack 1 (and older RoTs) we've seen this while doing a loop across the fleet of the form: for f in BRM42220060 BRM42220041 BRM42220020 BRM42220035 BRM42220052 BRM42220087 BRM42220077 BRM42220019 BRM42220029 BRM42220075 BRM42220003 BRM44220002 BRM42220089 BRM42220053 BRM42220012 BRM42220068 BRM42220064 BRM42220048 BRM42220056 BRM42220054 BRM42220007 BRM42220042 BRM42220062 BRM42220076 BRM42220044 BRM42220010 BRM42220005 BRM42220024 BRM42220030 BRM42220061; do pfexec pilot sp exec -e 'update sp 0 /data/local/rack1/gimlet/c/sp.10ms/build-gimlet-c.zip' $f && pfexec pilot sp exec -e 'reset' $f; done

I'm grabbing dumps and will upload them to catacomb, but recording this so we don't lose track. Assigning to MVP milestone right now. Feel free to move at your discretion.

The text was updated successfully, but these errors were encountered:

cbiffle · 2023-03-10T20:20:22Z

#1171 doesn't look like a fix for this, it changes the I2C driver to be more tolerant of errors selecting muxes. To avoid coming up with a self-assigned address, the netstack would need to block startup until VPD is available, which could be forever if the bus is locked.

Which in turn kind of suggests to me that a bus lock is happening.

rmustacc · 2023-03-10T22:36:49Z

Dumps are on catacomb from 3 different occurrences in /data/staff/core/rack1/ as hubris.core.<sn>.badvpd. Two spot checked had a similarly suspicious ringbuf entry:

humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
   1  162        1        1 Reset(I2C2, PortIndex(0x1))
   2  536        1        2 SegmentFailed(BusResetMux)
   3  162        1        1 Reset(I2C2, PortIndex(0x1))
   4  536        1        1 SegmentFailed(BusResetMux)
   5  536        1        1 SegmentFailed(BadMuxAddress)
   6  334        1        1 Error
   7  162        1        1 Reset(I2C3, PortIndex(0x0))
   0  334        2      952 Error

bcantrill · 2023-03-13T05:31:18Z

If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in drv_local_vpd, but these all seem to be empty:

$ for f in *vpd ; do humility -d $f tasks net ; humility -d $f ringbuf drv_local_vpd; done
humility: attached to dump
system time = 1248384
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117) wake-timer(T+413)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 1744761
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 742075
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:

Whatever is causing the VPD to be reverting to the default must be happening elsewhere -- and I am wondering if the mux issue is causing us to see the wrong EEPROM? One definite step to take: we should expand the ring buffer in the drv_local_vpd driver to capture every error path.

cbiffle · 2023-03-13T16:45:25Z

(Edit: this was in response to @rmustacc above -- Github decided not to show me Bryan's 11 hour old comment before I posted for some reason.)

That ringbuf Error entry occurs when the underlying write_read operation returns an error, but frustratingly it does not indicate what error. Its comments suggest that the line number (334 here) should indicate it, but that line is recording an Error entry in response to any error code from the next layer down, so it doesn't tell us.

That's probably worth fixing.

cbiffle · 2023-03-13T16:51:44Z

If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in drv_local_vpd

@bcantrill - This appears to be true for only a narrow subset of I2C failures. There are a lot of code paths in drv/local-vpd that can return Err without recording anything -- the ringbuf is only updated from the TlvcRead implementation.

For best assurance, we probably want to record any errors that hit this point in the netstack -- currently they're being discarded by the .ok().

…1204) This is the prep work for oxidecomputer/humility#334 and adds additional instrumentation per the discussion in #1196.

bcantrill · 2023-04-08T17:57:07Z

This issue is still with us! We saw this in the rack on BRM42220070 (dump):

$ humility -d ./hubris.core.BRM42220070.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   50        1        1 LocalVpdError(NoRootChunk)
   1   67        1        1 LocalVpdError(NoRootChunk)
   2   50        1        1 LocalVpdError(NoRootChunk)
   3   67        1        1 LocalVpdError(NoRootChunk)
   4   50        1        1 LocalVpdError(NoRootChunk)
   5   67        1        1 LocalVpdError(NoRootChunk)
   6   50        1        1 LocalVpdError(NoRootChunk)
   7   67        1        1 LocalVpdError(NoRootChunk)
   8   50        1        1 LocalVpdError(NoRootChunk)
   9   67        1        1 LocalVpdError(NoRootChunk)

This is surprising, as it's not a device error, but rather seemingly getting the wrong data from the EEPROM. This eliminates one hypothesis: if we somehow had the wrong mux/segment selected, we would expect to still find a FRU0 tag, and get a LocalVpdError(NoSuchChunk). To explore this a bit better, I added slightly better error recording in #1277. I then put BRM42220067 in a reboot loop whereby we reset the SP when we hit A0PlusHP; after about three hours it hit this condition (
dump):

$ humility -d ./hubris.core.BRM42220067.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   51        1        1 MacLocalVpdError(BadRootChunk)
   1   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   2   51        1        1 MacLocalVpdError(BadRootChunk)
   3   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   4   51        1        1 MacLocalVpdError(BadRootChunk)
   5   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   6   51        1        1 MacLocalVpdError(BadRootChunk)
   7   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   8   51        1        1 MacLocalVpdError(BadRootChunk)
   9   68        1        1 BarcodeLocalVpdError(BadRootChunk)

This is surprising: it means we are getting the wrong data from the device -- and seemingly repeatedly. humility vpd yielded:

$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
humility vpd failed: bad header: HeaderCorrupt { stored_checksum: db3bc619, computed_checksum: db3bc749 }

Checking a Sharkfin VPD, it seem intact:

$ humility vpd --read --id 0
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000021:001:BRM39220519",
        ]),
    ]),
]

After running this, the Gimlet VPD data was retrieved correctly!

$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000019:006:BRM42220067",
        ]),
        ("MAC0", [
            [
                168,
                64,
                37,
                4,
                2,
                5,
                8,
                0,
                8,
            ],
        ]),
    ]),
]

From this point, the condition was seemingly chased away. Restarting gimlet_seq resulted in a correct read of the VPD.

Something very odd is going on; reopening this issue, and getting a Saleae on BRM42220067 while I attempt to reproduce again.

We have seen a series of vexing VPD problems on reset, with a variety of underlying causes. While we fixed several of these, the problem remained (see #1196) -- and one clear cause was a reset during an I2C transaction, resulting in an errant first transaction after the reset. This fix engages in the time-honored tradition of wiggling SCL on reset, and then inducing a STOP condition to effectively terminate any zombie I2C transactions.

bcantrill · 2023-04-28T02:20:10Z

At this point, we believe this issue to be resolved -- and if/as we see this again, a new issue is likely merited.

mkeeter · 2023-06-14T14:37:46Z

It's back!

See #1413; posting here to cross-link the issues.

rmustacc added this to the MVP milestone Mar 10, 2023

bcantrill self-assigned this Mar 13, 2023

bcantrill mentioned this issue Mar 13, 2023

add read_fpga_regs() entry point + sequencer register descriptions #1204

Merged

bcantrill added a commit that referenced this issue Mar 13, 2023

add read_fpga_regs() entry point + sequencer register descriptions (#…

27090b5

…1204) This is the prep work for oxidecomputer/humility#334 and adds additional instrumentation per the discussion in #1196.

jgallagher mentioned this issue Mar 22, 2023

Introduce packrat task for caching VPD #1230

Merged

jgallagher closed this as completed in #1230 Mar 24, 2023

mkeeter reopened this Apr 10, 2023

bcantrill closed this as completed Apr 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SP still comes up with self-assigned MAC address after #1171 #1196

SP still comes up with self-assigned MAC address after #1171 #1196

rmustacc commented Mar 10, 2023

cbiffle commented Mar 10, 2023

rmustacc commented Mar 10, 2023

bcantrill commented Mar 13, 2023

cbiffle commented Mar 13, 2023 •

edited

Loading

cbiffle commented Mar 13, 2023

bcantrill commented Apr 8, 2023

bcantrill commented Apr 28, 2023

mkeeter commented Jun 14, 2023

SP still comes up with self-assigned MAC address after #1171 #1196

SP still comes up with self-assigned MAC address after #1171 #1196

Comments

rmustacc commented Mar 10, 2023

cbiffle commented Mar 10, 2023

rmustacc commented Mar 10, 2023

bcantrill commented Mar 13, 2023

cbiffle commented Mar 13, 2023 • edited Loading

cbiffle commented Mar 13, 2023

bcantrill commented Apr 8, 2023

bcantrill commented Apr 28, 2023

mkeeter commented Jun 14, 2023

cbiffle commented Mar 13, 2023 •

edited

Loading