Neighbor operation timeouts cause crashes on Dell S5248F-P-25G #20587

Open
rlebedys opened this issue Oct 23, 2024 · 21 comments
Labels: Triaged (this issue has been triaged)

@rlebedys

Description

We observe a switch crash during some neighbor operations (adding or removing). It is sometimes triggered when an SFP module is installed. The neighbor update hangs for more than 30 seconds (see the syncd watchdog messages in the syslog), after which orchagent exits and the containers restart.

Steps to reproduce the issue:

We managed to reproduce it on a production switch by installing a 100G QSFP module:

  1. Install 100G QSFP module

Describe the results you received:

We get a crash and container restart.

Logs syslog:

2024 Oct 23 12:04:35.313087 gs1-leaf68 INFO swss#supervisord: message repeated 28 times: [ orchagent ]
2024 Oct 23 12:04:35.313087 gs1-leaf68 INFO swss#supervisord: arp_update ping6: 
2024 Oct 23 12:04:35.313087 gs1-leaf68 INFO swss#supervisord: arp_update Warning: source address might be selected on device other than: Vlan10
2024 Oct 23 12:04:35.313087 gs1-leaf68 INFO swss#supervisord: arp_update 
2024 Oct 23 12:04:35.316676 gs1-leaf68 NOTICE swss#arp_update[14900]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5f:635e on Vlan10
2024 Oct 23 12:04:35.332522 gs1-leaf68 INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: Vlan10
2024 Oct 23 12:04:35.336200 gs1-leaf68 NOTICE swss#arp_update[14911]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:feb2:cc6 on Vlan10
2024 Oct 23 12:04:35.375682 gs1-leaf68 NOTICE swss#arp_update[14938]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5f:2a22 on Vlan10
2024 Oct 23 12:04:35.391063 gs1-leaf68 NOTICE swss#arp_update[14945]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5e:3098 on Vlan10
2024 Oct 23 12:04:35.428983 gs1-leaf68 NOTICE swss#arp_update[14972]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5e:51b6 on Vlan10
2024 Oct 23 12:04:35.443016 gs1-leaf68 NOTICE swss#arp_update[14979]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:feb2:c1c2 on Vlan10
2024 Oct 23 12:04:35.464435 gs1-leaf68 INFO swss#supervisord: message repeated 4 times: [ arp_update ping6: Warning: source address might be selected on device other than: Vlan10]
2024 Oct 23 12:04:35.464435 gs1-leaf68 INFO swss#supervisord: arp_update ping6: 
2024 Oct 23 12:04:35.465083 gs1-leaf68 INFO swss#supervisord: arp_update Warning: source address might be selected on device other than: Vlan10
2024 Oct 23 12:04:35.465661 gs1-leaf68 INFO swss#supervisord: arp_update 
2024 Oct 23 12:04:35.469962 gs1-leaf68 NOTICE swss#arp_update[14994]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5f:70a8 on Vlan10
2024 Oct 23 12:04:35.485457 gs1-leaf68 INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: Vlan10
2024 Oct 23 12:04:35.488964 gs1-leaf68 NOTICE swss#arp_update[15005]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:feb2:1ac4 on Vlan10
2024 Oct 23 12:04:35.508442 gs1-leaf68 NOTICE swss#arp_update[15016]: 212 mismatch v6 nbr entry, pinging fe80::6efe:54ff:fe3b:7e70 on Vlan10
2024 Oct 23 12:04:35.522436 gs1-leaf68 NOTICE swss#arp_update[15023]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5e:5e5e on Vlan10
2024 Oct 23 12:04:35.541614 gs1-leaf68 NOTICE swss#arp_update[15034]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5f:d4e on Vlan10
2024 Oct 23 12:04:35.576153 gs1-leaf68 NOTICE swss#arp_update[15057]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5e:52fa on Vlan10
2024 Oct 23 12:04:35.704994 gs1-leaf68 INFO swss#supervisord: message repeated 5 times: [ arp_update ping6: Warning: source address might be selected on device other than: Vlan10]
2024 Oct 23 12:04:35.705387 gs1-leaf68 INFO swss#supervisord: orchagent 
2024 Oct 23 12:04:35.786635 gs1-leaf68 NOTICE swss#arp_update[15064]: 212 mismatch v6 nbr entry, pinging fe80::262:bff:fe5e:2b52 on Vlan10
2024 Oct 23 12:04:39.717709 gs1-leaf68 INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: Vlan10
2024 Oct 23 12:04:43.823395 gs1-leaf68 NOTICE syncd#syncd: [none] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 21 if_id 100072
2024 Oct 23 12:04:43.823969 gs1-leaf68 NOTICE swss#orchagent: :- removeNeighbor: Removed next hop fe80::262:bff:fe5e:2b52 on Vlan10
2024 Oct 23 12:04:43.824971 gs1-leaf68 NOTICE syncd#syncd: [none] SAI_API_DASH_ENI:_brcm_sai_l2_ecmp_nbr_mac_delete:267 FDB : MAC:00-62-0B-5E-2B-52 vfi:0xa, is_fdb_del 0, dir 3
2024 Oct 23 12:04:44.172598 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 347 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:45.172732 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 1347 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:45.606814 gs1-leaf68 WARNING CCmisApi: Failed to get image 'docker-sonic-telemetry'. Error: '404 Client Error for http+docker://localhost/v1.43/images/docker-sonic-telemetry/json: Not Found ("No such image: docker-sonic-telemetry:latest")'
2024 Oct 23 12:04:46.172901 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 2347 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:46.317475 gs1-leaf68 INFO systemd[1]: run-docker-runtime\x2drunc-moby-65ea9b4ad4817c193364c5dbfdcb6761c3a4f407931cd56dfdc22137f07ff81f-runc.zjrQ8l.mount: Deactivated successfully.
2024 Oct 23 12:04:47.172989 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 3348 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:48.165782 gs1-leaf68 INFO systemd[1]: run-docker-runtime\x2drunc-moby-7e9ec2757cd9ca6f499156f90f401c7a77e26fef2fc270dcdad2d979188018e0-runc.RxH3BC.mount: Deactivated successfully.
2024 Oct 23 12:04:48.173118 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 4348 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:49.173260 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 5348 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:50.173388 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 6348 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:50.321990 gs1-leaf68 INFO systemd[1]: run-docker-runtime\x2drunc-moby-7c03e4b3d60603aa98de38ce504c1de4afe953b8ca4434ab9b98dc2c28c278d0-runc.eneKRD.mount: Deactivated successfully.
2024 Oct 23 12:04:50.873017 gs1-leaf68 INFO systemd[1]: run-docker-runtime\x2drunc-moby-6661b2953ffcf792c0d83a080ba853097c46d6efc9a562d1f996833f3666aa15-runc.RuHkBC.mount: Deactivated successfully.
2024 Oct 23 12:04:51.173518 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 7348 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:52.173664 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 8348 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:53.173769 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 9348 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:54.174169 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 10349 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:55.174285 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 11349 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:56.174406 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 12349 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:57.174520 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 13349 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:58.174645 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 14349 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:04:59.174857 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 15349 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:00.174942 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 16350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:01.175075 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 17350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:02.175203 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 18350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:03.175327 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 19350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:04.175458 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 20350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:05.175597 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 21350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:06.175717 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 22350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:07.175840 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 23350 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:08.175971 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 24351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:09.176153 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 25351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:10.176291 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 26351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:11.176391 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 27351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:12.176539 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 28351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:13.176660 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 29351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:14.176820 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 30351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:14.176890 gs1-leaf68 ERR syncd#syncd: :- threadFunction: time span WD exceeded 30351 ms for remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}
2024 Oct 23 12:05:14.176933 gs1-leaf68 ERR syncd#syncd: :- logEventData: op: remove, key: SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}
2024 Oct 23 12:05:35.763241 gs1-leaf68 WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
2024 Oct 23 12:05:43.882100 gs1-leaf68 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
2024 Oct 23 12:05:43.882100 gs1-leaf68 ERR swss#orchagent: :- wait: failed to get response for getresponse
2024 Oct 23 12:05:43.882100 gs1-leaf68 ERR swss#orchagent: :- remove: remove status: SAI_STATUS_FAILURE
2024 Oct 23 12:05:43.882100 gs1-leaf68 ERR swss#orchagent: :- removeNeighbor: Failed to remove neighbor 00:62:0b:5e:2b:52 on Vlan10, rv:-1
2024 Oct 23 12:05:43.882100 gs1-leaf68 ERR swss#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEIGHBOR, status: SAI_STATUS_FAILURE
2024 Oct 23 12:05:43.882100 gs1-leaf68 NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Oct 23 12:05:45.606722 gs1-leaf68 WARNING CCmisApi: Failed to get image 'docker-sonic-telemetry'. Error: '404 Client Error for http+docker://localhost/v1.43/images/docker-sonic-telemetry/json: Not Found ("No such image: docker-sonic-telemetry:latest")'
2024 Oct 23 12:05:45.777721 gs1-leaf68 INFO systemd[1]: run-docker-runtime\x2drunc-moby-155614239cebe61bf0013683490b3b63b3b98b94b98afd333cb0087479d2589b-runc.W2nRab.mount: Deactivated successfully.
2024 Oct 23 12:06:35.819250 gs1-leaf68 WARNING swss#supervisor-proc-exit-listener: message repeated 59 times: [ Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).]
2024 Oct 23 12:06:35.819250 gs1-leaf68 WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (2.0 minutes).
2024 Oct 23 12:06:43.942822 gs1-leaf68 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on notify
2024 Oct 23 12:06:43.942822 gs1-leaf68 ERR swss#orchagent: :- wait: failed to get response for notify
2024 Oct 23 12:06:43.942822 gs1-leaf68 ERR swss#orchagent: :- handleSaiFailure: Failed to take sai failure dump -1
2024 Oct 23 12:06:45.037967 gs1-leaf68 INFO swss#supervisord 2024-10-23 12:06:45,036 WARN exited: orchagent (terminated by SIGABRT (core dumped); not expected)
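
For anyone triaging similar logs, the growing `time span` counters above can be summarized per operation. This is a minimal Python sketch and not part of SONiC; the regular expression and the helper name `longest_spans` are assumptions based only on the line format visible in this syslog.

```python
import re

# Two syncd lines copied verbatim from the syslog above.
SAMPLE = r"""
2024 Oct 23 12:04:44.172598 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 347 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:14.176890 gs1-leaf68 ERR syncd#syncd: :- threadFunction: time span WD exceeded 30351 ms for remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}
""".strip().splitlines()

# Assumed pattern: progress lines quote the operation, the watchdog line does not.
SPAN_RE = re.compile(
    r"threadFunction: time span (?P<wd>WD exceeded )?(?P<ms>\d+) ms for '?(?P<op>[^']+)'?\s*$"
)


def longest_spans(lines):
    """Return {operation: (max_ms, watchdog_exceeded)} for syncd 'time span' lines."""
    result = {}
    for line in lines:
        match = SPAN_RE.search(line)
        if match is None:
            continue  # not a syncd time-span line
        op = match.group("op").strip()
        ms = int(match.group("ms"))
        prev_ms, prev_wd = result.get(op, (0, False))
        result[op] = (max(prev_ms, ms), prev_wd or match.group("wd") is not None)
    return result


if __name__ == "__main__":
    for op, (ms, wd) in longest_spans(SAMPLE).items():
        print(f"{ms} ms, watchdog exceeded: {wd}, op: {op}")
```

Feeding it a full syslog (for example `longest_spans(open(path))`, where `path` is wherever the log was saved) should show which SAI operation was blocked the longest.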

Logs /var/log/swss/sairedis.rec:

2024-10-23.12:04:43.822463|r|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a8f
2024-10-23.12:04:43.823794|r|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}
2024-10-23.12:05:43.881549|E|SAI_STATUS_FAILURE
2024-10-23.12:05:43.881679|a|SYNCD_INVOKE_DUMP
2024-10-23.12:06:43.942274|A|SAI_STATUS_FAILURE
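
The delay between the remove request and its failure status can be read straight off these records. A rough Python sketch follows; it assumes each line is `timestamp|op|payload`, that `r` marks a remove request, and that the next `E` line carries the status for the most recent `r`. Those meanings are inferred from this excerpt, not taken from sairedis documentation, and `response_delays` is a hypothetical helper name.

```python
from datetime import datetime

# The first three records copied verbatim from the sairedis.rec excerpt above.
SAMPLE = r"""
2024-10-23.12:04:43.822463|r|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a8f
2024-10-23.12:04:43.823794|r|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}
2024-10-23.12:05:43.881549|E|SAI_STATUS_FAILURE
""".strip().splitlines()

REC_TIME_FORMAT = "%Y-%m-%d.%H:%M:%S.%f"


def parse_rec_line(line):
    """Split 'timestamp|op|payload' into (datetime, op, payload)."""
    stamp, op, payload = line.strip().split("|", 2)
    return datetime.strptime(stamp, REC_TIME_FORMAT), op, payload


def response_delays(lines):
    """Pair each 'E' record with the most recent 'r' record (assumed pairing).

    Returns a list of (request_payload, status, seconds_waited).
    """
    pending = None
    delays = []
    for line in lines:
        when, op, payload = parse_rec_line(line)
        if op == "r":
            pending = (when, payload)
        elif op == "E" and pending is not None:
            waited = (when - pending[0]).total_seconds()
            delays.append((pending[1], payload, waited))
            pending = None
    return delays


if __name__ == "__main__":
    for request, status, waited in response_delays(SAMPLE):
        print(f"{status} after {waited:.3f} s for {request}")
```

On the excerpt above this reports roughly 60 seconds between the neighbor remove and `SAI_STATUS_FAILURE`, which lines up with the orchagent `TIMEOUT on getresponse` entry in the syslog.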

Logs /var/log/swss/swss.rec:

2024-10-23.12:04:40.495283|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:51b6|SET|neigh:00:62:0b:5e:51:b6|family:IPv6
2024-10-23.12:04:40.495908|NEIGH_TABLE:Vlan10:fe80::262:bff:feb2:1ac4|SET|neigh:00:62:0b:b2:1a:c4|family:IPv6
2024-10-23.12:04:40.496495|NEIGH_TABLE:Vlan10:fe80::262:bff:feb2:1ac4|SET|neigh:00:62:0b:b2:1a:c4|family:IPv6
2024-10-23.12:04:40.497032|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:51b6|SET|neigh:00:62:0b:5e:51:b6|family:IPv6
2024-10-23.12:04:40.498601|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:70a8|SET|neigh:00:62:0b:5f:70:a8|family:IPv6
2024-10-23.12:04:40.499187|NEIGH_TABLE:Vlan10:fe80::262:bff:feb2:c1c2|SET|neigh:00:62:0b:b2:c1:c2|family:IPv6
2024-10-23.12:04:40.499773|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:3098|SET|neigh:00:62:0b:5e:30:98|family:IPv6
2024-10-23.12:04:40.500356|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:2a22|SET|neigh:00:62:0b:5f:2a:22|family:IPv6
2024-10-23.12:04:40.500939|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:635e|SET|neigh:00:62:0b:5f:63:5e|family:IPv6
2024-10-23.12:04:40.501647|NEIGH_TABLE:Vlan10:fe80::262:bff:feb2:cc6|SET|neigh:00:62:0b:b2:0c:c6|family:IPv6
2024-10-23.12:04:40.502313|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:70a8|SET|neigh:00:62:0b:5f:70:a8|family:IPv6
2024-10-23.12:04:40.502909|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:3098|SET|neigh:00:62:0b:5e:30:98|family:IPv6
2024-10-23.12:04:40.503504|NEIGH_TABLE:Vlan10:fe80::262:bff:feb2:c1c2|SET|neigh:00:62:0b:b2:c1:c2|family:IPv6
2024-10-23.12:04:40.504097|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:635e|SET|neigh:00:62:0b:5f:63:5e|family:IPv6
2024-10-23.12:04:40.504694|NEIGH_TABLE:Vlan10:fe80::262:bff:feb2:cc6|SET|neigh:00:62:0b:b2:0c:c6|family:IPv6
2024-10-23.12:04:40.509931|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:2a22|SET|neigh:00:62:0b:5f:2a:22|family:IPv6
2024-10-23.12:04:40.751632|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:d4e|SET|neigh:00:62:0b:5f:0d:4e|family:IPv6
2024-10-23.12:04:40.752853|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:5e5e|SET|neigh:00:62:0b:5e:5e:5e|family:IPv6
2024-10-23.12:04:40.754553|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:2b52|SET|neigh:00:62:0b:5e:2b:52|family:IPv6
2024-10-23.12:04:40.757162|NEIGH_TABLE:Vlan10:fe80::6efe:54ff:fe3b:7e70|SET|neigh:6c:fe:54:3b:7e:70|family:IPv6
2024-10-23.12:04:40.758967|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:52fa|SET|neigh:00:62:0b:5e:52:fa|family:IPv6
2024-10-23.12:04:40.760159|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5f:d4e|SET|neigh:00:62:0b:5f:0d:4e|family:IPv6
2024-10-23.12:04:40.761313|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:52fa|SET|neigh:00:62:0b:5e:52:fa|family:IPv6
2024-10-23.12:04:40.764335|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:5e5e|SET|neigh:00:62:0b:5e:5e:5e|family:IPv6
2024-10-23.12:04:40.765610|NEIGH_TABLE:Vlan10:fe80::6efe:54ff:fe3b:7e70|SET|neigh:6c:fe:54:3b:7e:70|family:IPv6
2024-10-23.12:04:43.822301|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:2b52|DEL
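
To see which neighbors were deleted right after being set, the records can be grouped per neighbor. This is only a sketch under the assumption that every line is `timestamp|TABLE:interface:ip|OP|attributes...` as in the excerpt above; `neighbor_history` and `deleted_neighbors` are hypothetical helper names.

```python
from collections import defaultdict

# Three records copied verbatim from the swss.rec excerpt above.
SAMPLE = r"""
2024-10-23.12:04:40.754553|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:2b52|SET|neigh:00:62:0b:5e:2b:52|family:IPv6
2024-10-23.12:04:40.758967|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:52fa|SET|neigh:00:62:0b:5e:52:fa|family:IPv6
2024-10-23.12:04:43.822301|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:2b52|DEL
""".strip().splitlines()


def neighbor_history(lines):
    """Map (interface, ip) -> [(timestamp, op), ...] in file order for NEIGH_TABLE records."""
    history = defaultdict(list)
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 3 or not fields[1].startswith("NEIGH_TABLE:"):
            continue  # skip other tables and malformed lines
        # maxsplit=2 keeps the colons inside the IPv6 address intact.
        _table, iface, ip = fields[1].split(":", 2)
        history[(iface, ip)].append((fields[0], fields[2]))
    return history


def deleted_neighbors(history):
    """Return the (interface, ip) keys whose last recorded operation is DEL."""
    return [key for key, ops in history.items() if ops[-1][1] == "DEL"]


if __name__ == "__main__":
    history = neighbor_history(SAMPLE)
    for iface, ip in deleted_neighbors(history):
        print(iface, ip, history[(iface, ip)])
```

Here the only neighbor ending in `DEL` is `fe80::262:bff:fe5e:2b52` on Vlan10, the same entry whose removal hangs in syncd.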

Describe the results you expected:

No crash.

Output of show version:

SONiC Software Version: SONiC.202405.673948-aadab3251
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: aadab3251
Build date: Mon Oct 21 13:41:07 UTC 2024
Built by: azureuser@883a6a0bc000000

Platform: x86_64-dellemc_s5248f_c3538-r0
HwSKU: DellEMC-S5248f-P-25G
ASIC: broadcom
ASIC Count: 1
Serial Number: 61H6SR3
Model Number: 006Y6V
Hardware Revision: N/A
Uptime: 12:43:28 up  3:46,  2 users,  load average: 1.74, 1.90, 1.88
Date: Wed 23 Oct 2024 12:43:28

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-syncd-brcm             202405.673948-aadab3251   c1dffffd1ebd   742MB
docker-syncd-brcm             latest                    c1dffffd1ebd   742MB
docker-gbsyncd-broncos        202405.673948-aadab3251   f26cdd02d2dd   354MB
docker-gbsyncd-broncos        latest                    f26cdd02d2dd   354MB
docker-gbsyncd-credo          202405.673948-aadab3251   b8fd434ca826   327MB
docker-gbsyncd-credo          latest                    b8fd434ca826   327MB
docker-dhcp-relay             latest                    12ff7cdfe849   325MB
docker-platform-monitor       202405.673948-aadab3251   1c09318f61ba   442MB
docker-platform-monitor       latest                    1c09318f61ba   442MB
docker-macsec                 latest                    75e320e65e0d   347MB
docker-orchagent              202405.673948-aadab3251   7722d7ebf1da   357MB
docker-orchagent              latest                    7722d7ebf1da   357MB
docker-fpm-frr                202405.673948-aadab3251   ae2df63ff329   376MB
docker-fpm-frr                latest                    ae2df63ff329   376MB
docker-nat                    202405.673948-aadab3251   a3d51268f546   347MB
docker-nat                    latest                    a3d51268f546   347MB
docker-eventd                 202405.673948-aadab3251   b69ae3a96330   316MB
docker-eventd                 latest                    b69ae3a96330   316MB
docker-snmp                   202405.673948-aadab3251   d6817564e3c6   355MB
docker-snmp                   latest                    d6817564e3c6   355MB
docker-teamd                  202405.673948-aadab3251   8a0d8148af3d   344MB
docker-teamd                  latest                    8a0d8148af3d   344MB
docker-sflow                  202405.673948-aadab3251   b7da72b72ecd   345MB
docker-sflow                  latest                    b7da72b72ecd   345MB
docker-router-advertiser      202405.673948-aadab3251   015dcec8e6b2   316MB
docker-router-advertiser      latest                    015dcec8e6b2   316MB
docker-mux                    202405.673948-aadab3251   edeb39ecafc5   368MB
docker-mux                    latest                    edeb39ecafc5   368MB
docker-lldp                   202405.673948-aadab3251   30fecd2df8cf   361MB
docker-lldp                   latest                    30fecd2df8cf   361MB
docker-sonic-gnmi             202405.673948-aadab3251   e44ec23489b2   400MB
docker-sonic-gnmi             latest                    e44ec23489b2   400MB
docker-database               202405.673948-aadab3251   beaabf3ba8fa   324MB
docker-database               latest                    beaabf3ba8fa   324MB
docker-sonic-mgmt-framework   202405.673948-aadab3251   038f11f3719f   402MB
docker-sonic-mgmt-framework   latest                    038f11f3719f   402MB

Additional information you deem important (e.g. issue happens only occasionally):

Sometimes the issue happens even when there is no interaction with the switch.

@rlebedys changed the title from "Neighbor operations timeouts cause crashes on Dell S5248F-P-25G" to "Neighbor operation timeouts cause crashes on Dell S5248F-P-25G" on Oct 23, 2024
@vdahiya12
Contributor

@rlebedys please provide tech-support.

@vdahiya12 added the DELL label on Oct 23, 2024
@vdahiya12 added the Triaged (this issue has been triaged) label on Oct 23, 2024
@rlebedys
Author

@vdahiya12 @jeff-yin I can't upload the tech-support dump because it exceeds GitHub's size limits. Can I send it to you directly some other way?

@rlebedys
Author

Uploading a slightly smaller sonic dump. Managed to save some space by removing core dumps.

@vdahiya12 @jeff-yin this dump was taken when the switch crashed and the logs in the main message were generated.

sonic_dump_gs1-leaf68_20241023_120646.tar.gz

@rlebedys
Author

rlebedys commented Nov 5, 2024

@vdahiya12 @jeff-yin do you have any news on what might be causing this?

@jeff-yin
Collaborator

jeff-yin commented Nov 5, 2024

Has this issue been isolated to the Dell HW platform and not the Broadcom ASIC/SAI implementation? The logs don't seem to point to anything specific to the S5248F platform.

Is this issue NOT seen on other TD3.X7 devices?

@rlebedys
Author

rlebedys commented Nov 5, 2024

I can't confirm this, as we don't have any other equipment running SONiC.

@rlebedys
Author

rlebedys commented Nov 8, 2024

@vdahiya12 @jeff-yin could you help forward the issue to somebody who could check this from Broadcom ASIC/SAI side?

@sergeimonakhov

sergeimonakhov commented Nov 19, 2024

I am experiencing the same behavior on an Accton-AS7326-56X (Broadcom ASIC):

05:04:09.400055 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 9476 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:10.400195 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 10476 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:11.400339 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 11476 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:12.400457 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 12476 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:13.400616 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 13477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:14.400725 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 14477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:15.400864 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 15477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:16.400997 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 16477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:17.401101 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 17477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:18.401227 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 18477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:19.401350 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 19477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:20.401524 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 20477 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:21.401654 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 21478 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:22.401790 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 22478 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:23.401910 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 23478 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:24.402072 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 24478 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:25.402187 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 25478 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:26.402299 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 26478 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:27.402434 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 27478 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:27.460972 L-732656X2411199-0501 INFO dhclient[6684]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7
2024 Nov 19 05:04:28.402578 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 28479 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:29.402685 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 29479 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:30.101007 L-732656X2411199-0501 ERR monit[956]: 'container_checker' status failed (4) -- Unexpected running containers: nginx
2024 Nov 19 05:04:30.402766 L-732656X2411199-0501 NOTICE syncd#syncd: :- threadFunction: time span 30479 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
2024 Nov 19 05:04:30.402766 L-732656X2411199-0501 ERR syncd#syncd: :- threadFunction: time span WD exceeded 30479 ms for remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}
2024 Nov 19 05:04:30.402766 L-732656X2411199-0501 ERR syncd#syncd: :- logEventData: op: remove, key: SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1","rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}
2024 Nov 19 05:04:34.307099 L-732656X2411199-0501 INFO dhclient[6684]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 18
2024 Nov 19 05:04:52.794793 L-732656X2411199-0501 INFO dhclient[6684]: No DHCPOFFERS received.
2024 Nov 19 05:04:52.794999 L-732656X2411199-0501 INFO dhclient[6684]: No working leases in persistent database - sleeping.
2024 Nov 19 05:04:53.114708 L-732656X2411199-0501 ALERT dhcp_relay#dhcpmon[78]: dhcpmon detected disparity in DHCP Relay behavior. Duration: 1908 (sec) for vlan: 'Agg-Vlan12'
2024 Nov 19 05:04:53.114858 L-732656X2411199-0501 NOTICE dhcp_relay#dhcpmon[78]: :- publish: EVENT_PUBLISHED: {"sonic-events-dhcp-relay:dhcp-relay-disparity":{"duration":"1908","timestamp":"2024-11-19T05:04:53.114326Z","vlan":"Agg-Vlan12"}}
2024 Nov 19 05:04:53.114858 L-732656X2411199-0501 NOTICE dhcp_relay#dhcpmon[78]: [      Agg-Vlan12-Snapshot rx/tx] Discover:        31/        0, Offer:         0/        0, Request:         0/        0, ACK:         0/        0
2024 Nov 19 05:04:53.114858 L-732656X2411199-0501 NOTICE dhcp_relay#dhcpmon[78]: [      Agg-Vlan12- Current rx/tx] Discover:        32/        0, Offer:         0/        0, Request:         0/        0, ACK:         0/        0
2024 Nov 19 05:04:54.358626 L-732656X2411199-0501 WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
2024 Nov 19 05:04:59.983241 L-732656X2411199-0501 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
2024 Nov 19 05:04:59.983297 L-732656X2411199-0501 ERR swss#orchagent: :- wait: failed to get response for getresponse
2024 Nov 19 05:04:59.983297 L-732656X2411199-0501 ERR swss#orchagent: :- remove: remove status: SAI_STATUS_FAILURE
2024 Nov 19 05:04:59.983297 L-732656X2411199-0501 ERR swss#orchagent: :- removeNeighbor: Failed to remove neighbor 3c:ec:ef:5c:97:6c on Vlan12, rv:-1
2024 Nov 19 05:04:59.983330 L-732656X2411199-0501 ERR swss#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEIGHBOR, status: SAI_STATUS_FAILURE
2024 Nov 19 05:04:59.983375 L-732656X2411199-0501 NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Nov 19 05:05:30.135391 L-732656X2411199-0501 ERR monit[956]: 'container_checker' status failed (4) -- Unexpected running containers: nginx
2024 Nov 19 05:05:50.618419 L-732656X2411199-0501 INFO dhclient[6790]: XMT: Solicit on eth0, interval 124140ms.
2024 Nov 19 05:05:54.425293 L-732656X2411199-0501 WARNING swss#supervisor-proc-exit-listener: message repeated 59 times: [ Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).]
2024 Nov 19 05:05:54.425293 L-732656X2411199-0501 WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (2.0 minutes).
2024 Nov 19 05:06:00.024013 L-732656X2411199-0501 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on notify
2024 Nov 19 05:06:00.024013 L-732656X2411199-0501 ERR swss#orchagent: :- wait: failed to get response for notify
2024 Nov 19 05:06:00.024013 L-732656X2411199-0501 ERR swss#orchagent: :- handleSaiFailure: Failed to take sai failure dump -1
2024 Nov 19 05:06:00.699142 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:00,698 WARN exited: orchagent (terminated by SIGABRT (core dumped); not expected)
2024 Nov 19 05:06:01.702710 L-732656X2411199-0501 WARNING swss#supervisor-proc-exit-listener: message repeated 7 times: [ Process 'orchagent' is stuck in namespace 'host' (2.0 minutes).]
2024 Nov 19 05:06:01.702710 L-732656X2411199-0501 INFO swss#supervisor-proc-exit-listener: Process 'orchagent' exited unexpectedly. Terminating supervisor 'swss'
2024 Nov 19 05:06:01.702823 L-732656X2411199-0501 NOTICE swss#supervisor-proc-exit-listener: :- publish: EVENT_PUBLISHED: {"sonic-events-host:process-exited-unexpectedly":{"ctr_name":"swss","process_name":"orchagent","timestamp":"2024-11-19T05:06:01.702669Z"}}
2024 Nov 19 05:06:01.703334 L-732656X2411199-0501 WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (2.0 minutes).
2024 Nov 19 05:06:01.703791 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:01,703 WARN received SIGTERM indicating exit request
2024 Nov 19 05:06:01.704007 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:01,703 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd, coppmgrd, arp_update, ndppd, neighsyncd, vlanmgrd, intfmgrd, fabricmgrd, portmgrd, buffermgrd, vrfmgrd, nbrmgrd, vxlanmgrd, fdbsyncd, tunnelmgrd to die
2024 Nov 19 05:06:01.704699 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:01,704 WARN stopped: tunnelmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:01.705554 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:01,705 WARN stopped: fdbsyncd (terminated by SIGTERM)
2024 Nov 19 05:06:01.706589 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:01,706 WARN stopped: vxlanmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:01.707574 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:01,707 WARN stopped: nbrmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:01.708505 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:01,708 WARN stopped: vrfmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:02.711125 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:02,710 WARN stopped: buffermgrd (terminated by SIGTERM)
2024 Nov 19 05:06:02.712046 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:02,711 WARN stopped: portmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:02.713033 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:02,712 WARN stopped: fabricmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:02.714488 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:02,714 WARN stopped: intfmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:02.715365 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:02,715 WARN stopped: vlanmgrd (terminated by SIGTERM)
2024 Nov 19 05:06:03.717721 L-732656X2411199-0501 INFO swss#supervisord 2024-11-19 05:06:03,717 WARN stopped: neighsyncd (terminated by SIGTERM)
2024 Nov 19 05:06:03.717748 L-732656X2411199-0501 INFO swss#supervisord: message repeated 5 times: [ orchagent ]
2024 Nov 19 05:06:03.717748 L-732656X2411199-0501 INFO swss#supervisord: ndppd (error) Shutting down...
2024 Nov 19 05:06:03.717769 L-732656X2411199-0501 INFO swss#supervisord: ndppd (notice) Bye
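
For anyone triaging similar logs, the syncd watchdog lines can be pulled out mechanically. Below is a minimal Python sketch, assuming the syslog format shown above. The helper name `find_slow_sai_calls` and the regular expression are illustrative only and not part of SONiC; the pattern matches only entries whose key is a JSON object, such as neighbor entries.

```python
import json
import re

# Matches syncd watchdog lines of the form seen in this thread:
#   ... time span WD exceeded 30479 ms for remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{...}
WD_RE = re.compile(
    r"time span WD exceeded (?P<ms>\d+) ms for "
    r"(?P<op>\w+):(?P<obj>SAI_OBJECT_TYPE_\w+):(?P<key>\{.*\})"
)


def find_slow_sai_calls(lines):
    """Return (op, object_type, ip, duration_ms) for each watchdog hit."""
    hits = []
    for line in lines:
        m = WD_RE.search(line)
        if m is None:
            continue
        key = json.loads(m.group("key"))  # the entry key is a JSON object
        hits.append((m.group("op"), m.group("obj"), key.get("ip"), int(m.group("ms"))))
    return hits


# Sample line copied from the log above.
sample = (
    '2024 Nov 19 05:04:30.402766 L-732656X2411199-0501 ERR syncd#syncd: '
    ':- threadFunction: time span WD exceeded 30479 ms for '
    'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"169.254.0.1",'
    '"rif":"oid:0x6000000000a33","switch_id":"oid:0x21000000000000"}'
)
print(find_slow_sai_calls([sample]))
```

In practice the lines would come from the switch's syslog file rather than a hard-coded sample.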

It is easy to reproduce:

sudo ip neigh del <ip> dev <device>

show version:

SONiC Software Version: SONiC.202405.0-5cdb00cc2
SONiC OS Version: 12
Distribution: Debian 12.7
Kernel: 6.1.0-22-2-amd64
Build commit: 5cdb00cc2
Build date: Wed Oct  2 18:20:32 UTC 2024
Built by: ubuntu@sonic-build

Platform: x86_64-accton_as7326_56x-r0
HwSKU: Accton-AS7326-56X
ASIC: broadcom
ASIC Count: 1
Serial Number:
Model Number:
Hardware Revision: N/A
Uptime: 18:08:27 up 4 days, 18:23,  1 user,  load average: 0.63, 0.59, 0.59
Date: Tue 19 Nov 2024 18:08:27

@sergeimonakhov

The issue might be caused by this request

@rlebedys
Author

@jeff-yin, it looks like this issue is not limited to Dell platforms. @Ndancejic, can you check whether your changes might have caused this issue?

@jeff-yin removed the DELL label on Nov 19, 2024
@jeff-yin
Collaborator

Thanks for following up, folks. Can someone reassign this issue to @Ndancejic or an appropriate user?

@rlebedys
Author

I can confirm that the problem reproduces easily when deleting a neighbor from a VLAN with sudo ip neigh del <IP> dev <DEVICE>.

The problem also reproduces on the latest 202311 branch build:

SONiC Software Version: SONiC.202311.480461-bacd21577
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: bacd21577
Build date: Sun Feb 18 12:27:37 UTC 2024
Built by: AzDevOps@vmss-soni0033YT

Platform: x86_64-dellemc_s5248f_c3538-r0
HwSKU: DellEMC-S5248f-P-25G
ASIC: broadcom
ASIC Count: 1
Serial Number: N/A
Model Number: N/A
Hardware Revision: N/A
Uptime: 08:59:43 up 15 min,  2 users,  load average: 3.14, 2.21, 1.55
Date: Thu 21 Nov 2024 08:59:43

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-gbsyncd-broncos        202311.480461-bacd21577   9c2fc3c859cf   351MB
docker-gbsyncd-broncos        latest                    9c2fc3c859cf   351MB
docker-gbsyncd-credo          202311.480461-bacd21577   35c00630607b   323MB
docker-gbsyncd-credo          latest                    35c00630607b   323MB
docker-syncd-brcm             202311.480461-bacd21577   4fd20b859b27   714MB
docker-syncd-brcm             latest                    4fd20b859b27   714MB
docker-dhcp-relay             latest                    c4f53024f7f3   310MB
docker-macsec                 latest                    0e2845b7a6e6   329MB
docker-eventd                 202311.480461-bacd21577   482dd63753fe   300MB
docker-eventd                 latest                    482dd63753fe   300MB
docker-orchagent              202311.480461-bacd21577   646dfac0e10f   339MB
docker-orchagent              latest                    646dfac0e10f   339MB
docker-fpm-frr                202311.480461-bacd21577   84b2257ec3d0   358MB
docker-fpm-frr                latest                    84b2257ec3d0   358MB
docker-nat                    202311.480461-bacd21577   5aff44d40fc3   330MB
docker-nat                    latest                    5aff44d40fc3   330MB
docker-sflow                  202311.480461-bacd21577   331b0d8cd627   328MB
docker-sflow                  latest                    331b0d8cd627   328MB
docker-teamd                  202311.480461-bacd21577   4a8067230d89   327MB
docker-teamd                  latest                    4a8067230d89   327MB
docker-snmp                   202311.480461-bacd21577   589578f68b0a   340MB
docker-snmp                   latest                    589578f68b0a   340MB
docker-mux                    202311.480461-bacd21577   2aeff3915188   349MB
docker-mux                    latest                    2aeff3915188   349MB
docker-platform-monitor       202311.480461-bacd21577   7aca1b675027   421MB
docker-platform-monitor       latest                    7aca1b675027   421MB
docker-router-advertiser      202311.480461-bacd21577   6cb67ccd1715   301MB
docker-router-advertiser      latest                    6cb67ccd1715   301MB
docker-lldp                   202311.480461-bacd21577   e0f04ba09890   343MB
docker-lldp                   latest                    e0f04ba09890   343MB
docker-database               202311.480461-bacd21577   cde6051445df   301MB
docker-database               latest                    cde6051445df   301MB
docker-sonic-gnmi             202311.480461-bacd21577   0d52a52ae6d6   388MB
docker-sonic-gnmi             latest                    0d52a52ae6d6   388MB
docker-sonic-mgmt-framework   202311.480461-bacd21577   5505bce40dff   416MB
docker-sonic-mgmt-framework   latest                    5505bce40dff   416MB

@rlebedys
Author

I tested the latest build of the 202305 branch and can confirm that the problem does not reproduce there.

So, of the branches tested, the problem reproduces on 202311 and 202405 but not on 202305.

Version:

SONiC Software Version: SONiC.202305.700297-254035eb8
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 254035eb8
Build date: Wed Nov 20 14:00:21 UTC 2024
Built by: azureuser@c095c4ddc00000B

Platform: x86_64-dellemc_s5248f_c3538-r0
HwSKU: DellEMC-S5248f-P-25G
ASIC: broadcom
ASIC Count: 1
Serial Number: 61W5SR3
Model Number: 006Y6V
Hardware Revision: N/A
Uptime: 10:19:38 up 7 min,  2 users,  load average: 3.50, 3.27, 1.69
Date: Thu 21 Nov 2024 10:19:38

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-gbsyncd-broncos        202305.700297-254035eb8   412eb6253a41   350MB
docker-gbsyncd-broncos        latest                    412eb6253a41   350MB
docker-gbsyncd-credo          202305.700297-254035eb8   508db24c309c   323MB
docker-gbsyncd-credo          latest                    508db24c309c   323MB
docker-syncd-brcm             202305.700297-254035eb8   c85126fdc39c   674MB
docker-syncd-brcm             latest                    c85126fdc39c   674MB
docker-teamd                  202305.700297-254035eb8   51a88340d07e   318MB
docker-teamd                  latest                    51a88340d07e   318MB
docker-sflow                  202305.700297-254035eb8   5cdd13b381aa   319MB
docker-sflow                  latest                    5cdd13b381aa   319MB
docker-orchagent              202305.700297-254035eb8   741ccff186a0   330MB
docker-orchagent              latest                    741ccff186a0   330MB
docker-fpm-frr                202305.700297-254035eb8   38c63ed499ea   349MB
docker-fpm-frr                latest                    38c63ed499ea   349MB
docker-nat                    202305.700297-254035eb8   b7b3dea5b42f   321MB
docker-nat                    latest                    b7b3dea5b42f   321MB
docker-macsec                 latest                    352f43b8b872   320MB
docker-eventd                 202305.700297-254035eb8   e27534b8ad78   300MB
docker-eventd                 latest                    e27534b8ad78   300MB
docker-dhcp-relay             latest                    f64e4107453e   308MB
docker-snmp                   202305.700297-254035eb8   c3af5a21ee37   339MB
docker-snmp                   latest                    c3af5a21ee37   339MB
docker-router-advertiser      202305.700297-254035eb8   4013c03e641a   300MB
docker-router-advertiser      latest                    4013c03e641a   300MB
docker-platform-monitor       202305.700297-254035eb8   e4db0c7a0f97   422MB
docker-platform-monitor       latest                    e4db0c7a0f97   422MB
docker-mux                    202305.700297-254035eb8   86fcce1efaf7   349MB
docker-mux                    latest                    86fcce1efaf7   349MB
docker-lldp                   202305.700297-254035eb8   cfb7f1e35a1e   343MB
docker-lldp                   latest                    cfb7f1e35a1e   343MB
docker-database               202305.700297-254035eb8   f48ff69643ba   300MB
docker-database               latest                    f48ff69643ba   300MB
docker-sonic-telemetry        202305.700297-254035eb8   f5e68afee702   387MB
docker-sonic-telemetry        latest                    f5e68afee702   387MB
docker-sonic-mgmt-framework   202305.700297-254035eb8   38cf57635056   414MB
docker-sonic-mgmt-framework   latest                    38cf57635056   414MB

@rlebedys
Author

@Ndancejic did you have a chance to check this out? Do you need any more information?

@NerijusRazvodovskis

Hey, are there any updates on this issue?

@Ndancejic
Contributor

Hi all, sorry for the delay. I'll take a look this week.

@Ndancejic
Contributor

This doesn't seem to be related to sonic-net/sonic-swss#3148. That change only affects dualtor switchover functionality; regular neighbor operations should be unchanged.

It looks like there was a delay of about 30 seconds in removing the neighbor in syncd:

2024 Oct 23 12:05:14.176820 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 30351 ms for 'remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}'
2024 Oct 23 12:05:14.176890 gs1-leaf68 ERR syncd#syncd: :- threadFunction: time span WD exceeded 30351 ms for remove:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}

This delay caused orchagent to crash. The sairedis record shows the API call:

2024-10-23.12:04:43.823794|r|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}
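
As a sanity check on the correlation, the sairedis record timestamp plus the time span reported by syncd should land on the syslog timestamp of the watchdog message. This is a small Python sketch using only the values quoted in this comment; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# sairedis record: 2024-10-23.12:04:43.823794|r|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:...
sairedis_call = datetime.strptime("2024-10-23.12:04:43.823794", "%Y-%m-%d.%H:%M:%S.%f")

# syslog: 2024 Oct 23 12:05:14.176820 ... time span 30351 ms for 'remove:...'
# (built with the constructor to avoid locale-dependent month parsing)
syslog_done = datetime(2024, 10, 23, 12, 5, 14, 176820)
reported_span = timedelta(milliseconds=30351)

expected_done = sairedis_call + reported_span
skew = syslog_done - expected_done

print("expected completion:", expected_done)
print("difference from syslog timestamp:", skew)
```

The two timestamps agree to within about 2 ms, which suggests the remove request recorded at 12:04:43 is the same call that syncd reported as taking 30351 ms.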

My biggest lead right now is that the nexthop removed right before the neighbor remove seems to be for a different neighbor. However, I would expect a different error message (something like "object still referenced") if this were the case...

2024-10-23.09:12:11.549686|c|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5f:2a22","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}|SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS=00:62:0B:5F:2A:22
2024-10-23.09:12:11.551108|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a8f|SAI_NEXT_HOP_ATTR_TYPE=SAI_NEXT_HOP_TYPE_IP|SAI_NEXT_HOP_ATTR_IP=fe80::262:bff:fe5f:2a22|SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID=oid:0x6000000000a50
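
Whether a next hop belongs to a given neighbor can be checked directly from the sairedis records by comparing the IP address and router interface. This is a minimal Python sketch, assuming the pipe-separated record layout visible in the lines above; `parse_record` and `next_hop_matches_neighbor` are hypothetical helper names, not sairedis tooling.

```python
import ipaddress
import json


def parse_record(line):
    """Split one sairedis record line into (op, object_type, key, attrs)."""
    _timestamp, op, obj, *attr_fields = line.strip().split("|")
    object_type, key = obj.split(":", 1)
    attrs = dict(field.split("=", 1) for field in attr_fields)
    return op, object_type, key, attrs


def next_hop_matches_neighbor(neighbor_line, next_hop_line):
    """True if the next hop record points at the neighbor's IP on the same RIF."""
    _, _, nbr_key, _ = parse_record(neighbor_line)
    _, _, _, nh_attrs = parse_record(next_hop_line)
    nbr = json.loads(nbr_key)  # neighbor entry keys are JSON objects
    same_ip = ipaddress.ip_address(nbr["ip"]) == ipaddress.ip_address(
        nh_attrs["SAI_NEXT_HOP_ATTR_IP"]
    )
    same_rif = nbr["rif"] == nh_attrs["SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID"]
    return same_ip and same_rif


# The two create records quoted above.
neighbor_create = (
    '2024-10-23.09:12:11.549686|c|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:'
    '{"ip":"fe80::262:bff:fe5f:2a22","rif":"oid:0x6000000000a50",'
    '"switch_id":"oid:0x21000000000000"}'
    '|SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS=00:62:0B:5F:2A:22'
)
next_hop_create = (
    '2024-10-23.09:12:11.551108|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a8f'
    '|SAI_NEXT_HOP_ATTR_TYPE=SAI_NEXT_HOP_TYPE_IP'
    '|SAI_NEXT_HOP_ATTR_IP=fe80::262:bff:fe5f:2a22'
    '|SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID=oid:0x6000000000a50'
)
print(next_hop_matches_neighbor(neighbor_create, next_hop_create))
```

The same comparison could be run against the remove records that precede the hanging call; the sketch only compares fields and says nothing about why the SAI call takes 30 seconds.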

@audmas

audmas commented Dec 17, 2024

Hey, are you planning to investigate this further?

@Ndancejic
Contributor

I'll continue to investigate; these are just my initial findings.

@bradh352
Contributor

bradh352 commented Dec 23, 2024

As @tomvil pointed out to me, it looks like I am also hitting this issue on Trident3-X3 (Dell N3248TE). See #21247.

I'm available to test anything that might need testing as I can readily reproduce in a lab environment. I also have private 202411 and master forks I can use to build and test.

@bradh352
Contributor

> My biggest lead right now is that the nexthop removed right before the neighbor remove seems to be for a different neighbor. However, I would expect a different error message (something like "object still referenced") if this were the case...
>
> 2024-10-23.09:12:11.549686|c|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5f:2a22","rif":"oid:0x6000000000a50","switch_id":"oid:0x21000000000000"}|SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS=00:62:0B:5F:2A:22
> 2024-10-23.09:12:11.551108|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a8f|SAI_NEXT_HOP_ATTR_TYPE=SAI_NEXT_HOP_TYPE_IP|SAI_NEXT_HOP_ATTR_IP=fe80::262:bff:fe5f:2a22|SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID=oid:0x6000000000a50

@Ndancejic, the log provided in this thread shows that the nexthop and the neighbor being removed have the same IP, so I don't think your lead is right.

Do you not think this is a vendor (Broadcom) SAI bug? Though I haven't tried it, I'd assume that if I captured a SAI replay log and replayed it, it would hang. If so, wouldn't that mean only Broadcom could fix it?
