SP still comes up with self-assigned MAC address after #1171 #1196
Comments
#1171 doesn't look like a fix for this; it changes the I2C driver to be more tolerant of errors when selecting muxes. To avoid coming up with a self-assigned address, the netstack would need to block startup until VPD is available, which could be forever if the bus is locked. That in turn kind of suggests to me that a bus lock is happening.
Dumps are on catacomb from 3 different occurrences in
If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in
Whatever is causing the VPD to be reverting to the default must be happening elsewhere -- and I am wondering if the mux issue is causing us to see the wrong EEPROM? One definite step to take: we should expand the ring buffer in the
(Edit: this was in response to @rmustacc above -- GitHub decided not to show me Bryan's 11 hour old comment before I posted for some reason.) That ringbuf That's probably worth fixing.
@bcantrill - This appears to be true for only a narrow subset of I2C failures. There are a lot of code paths in For best assurance, we probably want to record any errors that hit this point in the netstack -- currently they're being discarded by the
…1204) This is the prep work for oxidecomputer/humility#334 and adds additional instrumentation per the discussion in #1196.
This issue is still with us! We saw this in the rack on BRM42220070 (dump):
This is surprising, as it's not a device error, but rather seemingly getting the wrong data from the EEPROM. This eliminates one hypothesis: if we somehow had the wrong mux/segment selected, we would expect to still find a
This is surprising: it means we are getting the wrong data from the device -- and seemingly repeatedly.
Checking a Sharkfin VPD, it seems intact:
After running this, the Gimlet VPD data was retrieved correctly!
From this point, the condition was seemingly chased away. Restarting Something very odd is going on; reopening this issue, and getting a Saleae on BRM42220067 while I attempt to reproduce again.
We have seen a series of vexing VPD problems on reset, with a variety of underlying causes. While we fixed several of these, the problem remained (see #1196) -- and one clear cause was a reset during an I2C transaction, resulting in an errant first transaction after the reset. This fix engages in the time-honored tradition of wiggling SCL on reset, and then inducing a STOP condition to effectively terminate any zombie I2C transactions.
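The recovery sequence described above (wiggle SCL, then induce a STOP) can be sketched roughly as follows. This is not the code from the actual fix: the `I2cLines` trait, `recover_bus`, and the `SimBus` test double are all hypothetical names introduced for illustration, and the sketch assumes the two lines can be driven as open-drain GPIOs during reset. The limit of nine clock pulses follows the conventional I2C bus-clear procedure.

```rust
/// Hypothetical abstraction over SCL/SDA driven as open-drain GPIOs.
/// (An assumption for this sketch; the real driver's interface may differ.)
pub trait I2cLines {
    fn set_scl(&mut self, high: bool);
    fn set_sda(&mut self, high: bool);
    /// Actual level of SDA, which a target may be holding low.
    fn read_sda(&self) -> bool;
    fn delay_half_period(&mut self);
}

/// Clock out any half-finished transfer, then issue a STOP.
/// Returns true if SDA ends up released (bus idle).
pub fn recover_bus<L: I2cLines>(lines: &mut L) -> bool {
    // Release SDA so that a target mid-transfer is the only one driving it.
    lines.set_sda(true);

    // Pulse SCL up to nine times, stopping early once SDA is released.
    for _ in 0..9 {
        lines.set_scl(false);
        lines.delay_half_period();
        lines.set_scl(true);
        lines.delay_half_period();
        if lines.read_sda() {
            break;
        }
    }

    // STOP condition: SDA goes low-to-high while SCL is high.
    lines.set_scl(false);
    lines.delay_half_period();
    lines.set_sda(false);
    lines.delay_half_period();
    lines.set_scl(true);
    lines.delay_half_period();
    lines.set_sda(true);
    lines.delay_half_period();

    lines.read_sda()
}

/// Simulated bus with one target that holds SDA low until it has been
/// clocked `bits_left` more times (a "zombie" transaction after a reset).
pub struct SimBus {
    scl: bool,
    sda_controller: bool,
    bits_left: u32,
    pub scl_pulses: u32,
    pub stop_seen: bool,
}

impl SimBus {
    pub fn stuck(bits_left: u32) -> Self {
        SimBus { scl: true, sda_controller: true, bits_left, scl_pulses: 0, stop_seen: false }
    }
}

impl I2cLines for SimBus {
    fn set_scl(&mut self, high: bool) {
        // Count rising edges; each one advances the stuck target by a bit.
        if high && !self.scl {
            self.scl_pulses += 1;
            self.bits_left = self.bits_left.saturating_sub(1);
        }
        self.scl = high;
    }
    fn set_sda(&mut self, high: bool) {
        // A STOP is only visible if the target is no longer holding SDA low.
        if high && !self.sda_controller && self.scl && self.bits_left == 0 {
            self.stop_seen = true;
        }
        self.sda_controller = high;
    }
    fn read_sda(&self) -> bool {
        self.sda_controller && self.bits_left == 0
    }
    fn delay_half_period(&mut self) {}
}
```

In the simulation, a target constructed with `SimBus::stuck(3)` releases SDA after three clock pulses, after which the STOP is observed and `recover_bus` returns `true`. A target that never releases SDA exhausts the nine pulses and `recover_bus` returns `false`, which a caller could treat as a bus that needs more drastic intervention.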
At this point, we believe this issue to be resolved -- and if/as we see this again, a new issue is likely merited.
It's back! See #1413; posting here to cross-link the issues.
While testing with rack 1 (and older RoTs), we've seen this while running a loop across the fleet of the form:
for f in BRM42220060 BRM42220041 BRM42220020 BRM42220035 BRM42220052 BRM42220087 BRM42220077 BRM42220019 BRM42220029 BRM42220075 BRM42220003 BRM44220002 BRM42220089 BRM42220053 BRM42220012 BRM42220068 BRM42220064 BRM42220048 BRM42220056 BRM42220054 BRM42220007 BRM42220042 BRM42220062 BRM42220076 BRM42220044 BRM42220010 BRM42220005 BRM42220024 BRM42220030 BRM42220061; do
    pfexec pilot sp exec -e 'update sp 0 /data/local/rack1/gimlet/c/sp.10ms/build-gimlet-c.zip' $f &&
        pfexec pilot sp exec -e 'reset' $f
done
I'm grabbing dumps and will upload them to catacomb, but recording this so we don't lose track. Assigning to MVP milestone right now. Feel free to move at your discretion.