Neighbor operation timeouts cause crashes on Dell S5248F-P-25G #20587
Comments
@rlebedys please provide tech-support.
@vdahiya12 @jeff-yin I can't upload the tech-support dump because it exceeds GitHub's size limits. Can I send it to you directly some other way?
Uploading a slightly smaller SONiC dump; I saved some space by removing the core dumps. @vdahiya12 @jeff-yin this dump was taken when the switch crashed and the logs mentioned in the main message were generated.
@vdahiya12 @jeff-yin do you have any news on what might be causing this?
Has this issue been isolated to the Dell HW platform rather than the Broadcom ASIC/SAI implementation? The logs don't seem to point to anything specific to the S5248F platform. Is this issue NOT seen on other TD3.X7 devices?
I can't confirm this, as we don't have any other equipment running SONiC.
@vdahiya12 @jeff-yin could you help forward the issue to somebody who could check this from the Broadcom ASIC/SAI side?
I am experiencing the same behavior on Accton-AS7326-56X (Broadcom ASIC).
It is easy to reproduce:
sudo ip neigh del <ip> dev <device>
show version:
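For anyone who wants to script the reproduction step above, here is a minimal sketch in Python. It only wraps the `sudo ip neigh del <ip> dev <device>` command quoted in this thread; the helper names, the `dry_run` flag, and the example address and interface (`192.0.2.10`, `Vlan100`) are hypothetical and not taken from the issue.

```python
import shlex
import subprocess


def build_neigh_del_command(ip, device):
    """Return the argv list for the neighbor delete command quoted in the thread."""
    return ["sudo", "ip", "neigh", "del", ip, "dev", device]


def delete_neighbor(ip, device, dry_run=True):
    """Print-or-run helper (hypothetical).

    With dry_run=True the command is only rendered as a string, so nothing
    touches the switch. With dry_run=False the command is executed and its
    exit code is returned.
    """
    cmd = build_neigh_del_command(ip, device)
    if dry_run:
        return shlex.join(cmd)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Example values are placeholders; substitute a real neighbor and interface.
    print(delete_neighbor("192.0.2.10", "Vlan100", dry_run=True))
```

Keeping the dry run as the default is deliberate: on an affected build, running the command for real may be enough to trigger the crash described in this issue.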
The issue might be caused by this request
@jeff-yin it looks like this issue is not limited to Dell platforms. @Ndancejic can you check whether your changes might have caused it?
Thanks for following up, folks. Can someone reassign this issue to @Ndancejic or an appropriate user?
Can confirm that the problem reproduces easily when deleting a neighbor from vlan with
This problem also reproduces on the latest 202311 branch build
Tested it with a
So the problem reproduces only on the 202311 and 202405 branches. Version:
@Ndancejic did you have a chance to check this out? Do you need any more information?
Hey, are there any updates on this issue?
Hi all, sorry for the delay. I'll take a look this week.
This doesn't seem to be related to sonic-net/sonic-swss#3148. That change only affects dualtor switchover functionality; regular neighbor operations should be unchanged. It looks like there was a delay in removing the neighbor in syncd:
which caused orchagent to crash. The sairedis record shows the API call:
My biggest lead right now is that the nexthop removed right before the neighbor remove seems to be for a different neighbor. However, I would expect a different error message (something like object still referenced) if this were the case...
Hey, are you planning to investigate it further?
I'll continue to investigate; these are just my initial findings.
@Ndancejic in the log provided in this thread, the nexthop and neighbor IPs being removed are the same, so I don't think your lead is right. Don't you think this is a vendor (Broadcom) SAI bug? I haven't tried it, but I'd assume that if I captured an SAI replay log and replayed it, it would hang, and if so, wouldn't that mean only Broadcom could fix it?
Description
We observe a switch crash during some neighbor operations, either adding or removing. It is sometimes triggered when an SFP module is installed. The neighbor update job runs for 30 seconds, then exits the containers and causes a restart.
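The failure pattern described here (an operation that does not finish within 30 seconds is treated as fatal) can be illustrated with a generic sketch. This is not orchagent or syncd code; `run_with_deadline` and `OperationTimeout` are hypothetical names, and the 30-second default is taken only from the description above.

```python
import concurrent.futures
import time


class OperationTimeout(Exception):
    """Raised when the wrapped operation misses its deadline (hypothetical)."""


def run_with_deadline(operation, timeout_s=30.0):
    """Run `operation` in a worker thread and give up after `timeout_s` seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # A caller that treats this as fatal would exit here, which is the
        # kind of behavior that surfaces as a container restart.
        raise OperationTimeout(
            f"operation did not finish within {timeout_s} s"
        ) from None
    finally:
        # Do not block on the stuck worker; it keeps running in the background.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    try:
        run_with_deadline(lambda: time.sleep(0.5), timeout_s=0.1)
    except OperationTimeout as exc:
        print("timed out:", exc)
```

The point of the sketch is that the timeout is a symptom: the caller crashes because a lower layer never answered, so the interesting question is why the underlying remove call stalled.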
Steps to reproduce the issue:
We managed to reproduce it on a production switch by installing a 100G QSFP module.
Describe the results you received:
We get a crash and container restart.
Logs
syslog:
/var/log/swss/sairedis.rec:
/var/log/swss/swss.rec:
Describe the results you expected:
No crash.
Output of show version:
Additional information you deem important (e.g. issue happens only occasionally):
Sometimes the issue happens even when there is no interaction with the switch.