Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SP still comes up with self-assigned MAC address after #1171 #1196

Closed
rmustacc opened this issue Mar 10, 2023 · 8 comments · Fixed by #1230
Closed

SP still comes up with self-assigned MAC address after #1171 #1196

rmustacc opened this issue Mar 10, 2023 · 8 comments · Fixed by #1230
Assignees
Milestone

Comments

@rmustacc
Copy link
Contributor

While doing testing with rack 1 (and older RoTs) we've seen this while doing a loop across the fleet of the form: for f in BRM42220060 BRM42220041 BRM42220020 BRM42220035 BRM42220052 BRM42220087 BRM42220077 BRM42220019 BRM42220029 BRM42220075 BRM42220003 BRM44220002 BRM42220089 BRM42220053 BRM42220012 BRM42220068 BRM42220064 BRM42220048 BRM42220056 BRM42220054 BRM42220007 BRM42220042 BRM42220062 BRM42220076 BRM42220044 BRM42220010 BRM42220005 BRM42220024 BRM42220030 BRM42220061; do pfexec pilot sp exec -e 'update sp 0 /data/local/rack1/gimlet/c/sp.10ms/build-gimlet-c.zip' $f && pfexec pilot sp exec -e 'reset' $f; done

I'm grabbing dumps and will upload them to catacomb, but recording this so we don't lose track. Assigning to MVP milestone right now. Feel free to move at your discretion.

@rmustacc rmustacc added this to the MVP milestone Mar 10, 2023
@cbiffle
Copy link
Collaborator

cbiffle commented Mar 10, 2023

#1171 doesn't look like a fix for this, it changes the I2C driver to be more tolerant of errors selecting muxes. To avoid coming up with a self-assigned address, the netstack would need to block startup until VPD is available, which could be forever if the bus is locked.

Which in turn kind of suggests to me that a bus lock is happening.

@rmustacc
Copy link
Contributor Author

Dumps are on catacomb from 3 different occurrences in /data/staff/core/rack1/ as hubris.core.<sn>.badvpd. Two spot checked had a similarly suspicious ringbuf entry:

humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
   1  162        1        1 Reset(I2C2, PortIndex(0x1))
   2  536        1        2 SegmentFailed(BusResetMux)
   3  162        1        1 Reset(I2C2, PortIndex(0x1))
   4  536        1        1 SegmentFailed(BusResetMux)
   5  536        1        1 SegmentFailed(BadMuxAddress)
   6  334        1        1 Error
   7  162        1        1 Reset(I2C3, PortIndex(0x0))
   0  334        2      952 Error

@bcantrill
Copy link
Collaborator

If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in drv_local_vpd, but these all seem to be empty:

$ for f in *vpd ; do humility -d $f tasks net ; humility -d $f ringbuf drv_local_vpd; done
humility: attached to dump
system time = 1248384
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117) wake-timer(T+413)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 1744761
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 742075
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:

Whatever is causing the VPD to be reverting to the default must be happening elsewhere -- and I am wondering if the mux issue is causing us to see the wrong EEPROM? One definite step to take: we should expand the ring buffer in the drv_local_vpd driver to capture every error path.

@bcantrill bcantrill self-assigned this Mar 13, 2023
@cbiffle
Copy link
Collaborator

cbiffle commented Mar 13, 2023

(Edit: this was in response to @rmustacc above -- Github decided not to show me Bryan's 11 hour old comment before I posted for some reason.)

That ringbuf Error entry occurs when the underlying write_read operation returns an error, but frustratingly it does not indicate what error. Its comments suggest that the line number (334 here) should indicate it, but that line is recording an Error entry in response to any error code from the next layer down, so it doesn't tell us.

That's probably worth fixing.

@cbiffle
Copy link
Collaborator

cbiffle commented Mar 13, 2023

If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in drv_local_vpd

@bcantrill - This appears to be true for only a narrow subset of I2C failures. There are a lot of code paths in drv/local-vpd that can return Err without recording anything -- the ringbuf is only updated from the TlvcRead implementation.

For best assurance, we probably want to record any errors that hit this point in the netstack -- currently they're being discarded by the .ok().

@bcantrill
Copy link
Collaborator

This issue is still with us! We saw this in the rack on BRM42220070 (dump):

$ humility -d ./hubris.core.BRM42220070.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   50        1        1 LocalVpdError(NoRootChunk)
   1   67        1        1 LocalVpdError(NoRootChunk)
   2   50        1        1 LocalVpdError(NoRootChunk)
   3   67        1        1 LocalVpdError(NoRootChunk)
   4   50        1        1 LocalVpdError(NoRootChunk)
   5   67        1        1 LocalVpdError(NoRootChunk)
   6   50        1        1 LocalVpdError(NoRootChunk)
   7   67        1        1 LocalVpdError(NoRootChunk)
   8   50        1        1 LocalVpdError(NoRootChunk)
   9   67        1        1 LocalVpdError(NoRootChunk)

This is surprising, as it's not a device error, but rather seemingly getting the wrong data from the EEPROM. This eliminates one hypothesis: if we somehow had the wrong mux/segment selected, we would expect to still find a FRU0 tag, and get a LocalVpdError(NoSuchChunk). To explore this a bit better, I added slightly better error recording in #1277. I then put BRM42220067 in a reboot loop whereby we reset the SP when we hit A0PlusHP; after about three hours it hit this condition (
dump):

$ humility -d ./hubris.core.BRM42220067.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   51        1        1 MacLocalVpdError(BadRootChunk)
   1   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   2   51        1        1 MacLocalVpdError(BadRootChunk)
   3   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   4   51        1        1 MacLocalVpdError(BadRootChunk)
   5   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   6   51        1        1 MacLocalVpdError(BadRootChunk)
   7   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   8   51        1        1 MacLocalVpdError(BadRootChunk)
   9   68        1        1 BarcodeLocalVpdError(BadRootChunk)

This is surprising: it means we are getting the wrong data from the device -- and seemingly repeatedly. humility vpd yielded:

$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
humility vpd failed: bad header: HeaderCorrupt { stored_checksum: db3bc619, computed_checksum: db3bc749 }

Checking a Sharkfin VPD, it seem intact:

$ humility vpd --read --id 0
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000021:001:BRM39220519",
        ]),
    ]),
]

After running this, the Gimlet VPD data was retrieved correctly!

$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000019:006:BRM42220067",
        ]),
        ("MAC0", [
            [
                168,
                64,
                37,
                4,
                2,
                5,
                8,
                0,
                8,
            ],
        ]),
    ]),
]

From this point, the condition was seemingly chased away. Restarting gimlet_seq resulted in a correct read of the VPD.

Something very odd is going on; reopening this issue, and getting a Saleae on BRM42220067 while I attempt to reproduce again.

@mkeeter mkeeter reopened this Apr 10, 2023
bcantrill added a commit that referenced this issue Apr 15, 2023
We have seen a series of vexing VPD problems on reset, with a variety
of underlying causes.  While we fixed several of these, the problem
remained (see #1196) -- and one clear cause was a reset during an I2C
transaction, resulting in an errant first transaction after the reset.
This fix engages in the time-honored tradition of wiggling SCL on
reset, and then inducing a STOP condition to effectively terminate any
zombie I2C transactions.
@bcantrill
Copy link
Collaborator

At this point, we believe this issue to be resolved -- and if/as we see this again, a new issue is likely merited.

@mkeeter
Copy link
Collaborator

mkeeter commented Jun 14, 2023

It's back!

See #1413; posting here to cross-link the issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants