swss restarts during service initialization due to sai_redis_internal_notify_syncd timeout #3444

stephenxs · 2019-09-12T05:30:57Z

Description
swss restarts during service initialization due to sai_redis_internal_notify_syncd timeout.
During the initialization, orchagent calls initSaiRedis to set up SAI interfaces. One step of it is to wait for some table via calling select and returns once SAI writes something to the table. If no data received within 1 minute the select returns timeout and orchagent fails, causing the whole swss docker restarted. The log is like the following:
The log is like the following:

Aug 31 07:54:25.100769 mtbc-sonic-03-2700 NOTICE swss#orchagent: :- initSaiRedis: Enable redis pipeline
Aug 31 07:54:25.100853 mtbc-sonic-03-2700 NOTICE swss#orchagent: :- sai_redis_notify_syncd: sending syncd INIT view
Aug 31 07:54:25.101245 mtbc-sonic-03-2700 NOTICE swss#orchagent: :- sai_redis_internal_notify_syncd: wait for notify response
... ...
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- sai_redis_internal_notify_syncd: notify syncd failed to get response result from select: 2
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- sai_redis_internal_notify_syncd: notify syncd failed to get response
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- sai_redis_notify_syncd: notify syncd failed: SAI_STATUS_FAILURE
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- initSaiRedis: Failed to notify syncd INIT_VIEW, rv:-1
Aug 31 07:55:25.162912 mtbc-sonic-03-2700 INFO swss#supervisord: orchagent terminate called without an active exception

Steps to reproduce the issue:
It can be reproduced when "config reload" or "systemctl restart swss.service" is executed by at a very low probability, less then 5%.

Describe the results you received:

Describe the results you expected:
orchagent should start normally.

Additional information you deem important (e.g. issue happens only occasionally):
It seems that a lot of services initialize simultaneously without an explicitly defined order when system is starting. Sometimes bgp starts between orchagent called select in sai_redis_notify_syncd and timeout and a lot of configuration is deployed via bgpcfd like the following log. in this case this issue is more likely to reproduce.

Sep  4 05:46:02.559053 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: Receive message : ('fc00::72', 'SET', (('rrclient', '0'), ('name', 'ARISTA13T0'), ('local_addr', 'fc00::71'), ('nhopself', '0'), ('holdtime', '10'), ('asn', '64013'), ('keepalive', '3')))
Sep  4 05:46:02.559053 mtbc-sonic-03-2700 INFO bgp#bgpcfgd: value for fc00::72 changed to {'rrclient': '0', 'name': 'ARISTA13T0', 'local_addr': 'fc00::71', 'nhopself': '0', 'holdtime': '10', 'asn': '64013', 'keepalive': '3'}
Sep  4 05:46:02.559053 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor fc00::72 remote-as 64013'.
Sep  4 05:46:02.804103 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor fc00::72 description ARISTA13T0'.
Sep  4 05:46:03.046172 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: Receive message : ('fc00::e', 'SET', (('rrclient', '0'), ('name', 'ARISTA04T2'), ('local_addr', 'fc00::d'), ('nhopself', '0'), ('holdtime', '10'), ('asn', '65200'), ('keepalive', '3')))
Sep  4 05:46:03.046172 mtbc-sonic-03-2700 INFO bgp#bgpcfgd: value for fc00::e changed to {'rrclient': '0', 'name': 'ARISTA04T2', 'local_addr': 'fc00::d', 'nhopself': '0', 'holdtime': '10', 'asn': '65200', 'keepalive': '3'}
Sep  4 05:46:03.046172 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor fc00::e remote-as 65200'.
Sep  4 05:46:03.290509 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor fc00::e description ARISTA04T2'.
Sep  4 05:46:03.375306 mtbc-sonic-03-2700 INFO dhcp_relay#supervisord: start.sh rsyslogd: started
Sep  4 05:46:03.632767 mtbc-sonic-03-2700 INFO systemd[1]: Started Platform monitor container.
Sep  4 05:46:03.633545 mtbc-sonic-03-2700 NOTICE root: Started pmon service
Sep  4 05:46:03.696286 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: Receive message : ('10.0.0.15', 'SET', (('rrclient', '0'), ('name', 'ARISTA08T2'), ('local_addr', '10.0.0.14'), ('nhopself', '0'), ('holdtime', '10'), ('asn', '65200'), ('keepalive', '3')))
Sep  4 05:46:03.696286 mtbc-sonic-03-2700 INFO bgp#bgpcfgd: value for 10.0.0.15 changed to {'rrclient': '0', 'name': 'ARISTA08T2', 'local_addr': '10.0.0.14', 'nhopself': '0', 'holdtime': '10', 'asn': '65200', 'keepalive': '3'}
Sep  4 05:46:03.696286 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor 10.0.0.15 remote-as 65200'.
Sep  4 05:46:03.726310 mtbc-sonic-03-2700 INFO systemd[1]: Starting LLDP container...
Sep  4 05:46:04.107318 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor 10.0.0.15 description ARISTA08T2'.
Sep  4 05:46:04.508871 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: Receive message : ('10.0.0.31', 'SET', (('rrclient', '0'), ('name', 'ARISTA16T2'), ('local_addr', '10.0.0.30'), ('nhopself', '0'), ('holdtime', '10'), ('asn', '65200'), ('keepalive', '3')))
Sep  4 05:46:04.508871 mtbc-sonic-03-2700 INFO bgp#bgpcfgd: value for 10.0.0.31 changed to {'rrclient': '0', 'name': 'ARISTA16T2', 'local_addr': '10.0.0.30', 'nhopself': '0', 'holdtime': '10', 'asn': '65200', 'keepalive': '3'}
Sep  4 05:46:04.508871 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor 10.0.0.31 remote-as 65200'.
Sep  4 05:46:04.737037 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor 10.0.0.31 description ARISTA16T2'.
Sep  4 05:46:05.023041 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: Receive message : ('10.0.0.13', 'SET', (('rrclient', '0'), ('name', 'ARISTA07T2'), ('local_addr', '10.0.0.12'), ('nhopself', '0'), ('holdtime', '10'), ('asn', '65200'), ('keepalive', '3')))
Sep  4 05:46:05.023041 mtbc-sonic-03-2700 INFO bgp#bgpcfgd: value for 10.0.0.13 changed to {'rrclient': '0', 'name': 'ARISTA07T2', 'local_addr': '10.0.0.12', 'nhopself': '0', 'holdtime': '10', 'asn': '65200', 'keepalive': '3'}
Sep  4 05:46:05.023041 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor 10.0.0.13 remote-as 65200'.
Sep  4 05:46:05.228645 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor 10.0.0.13 description ARISTA07T2'.
Sep  4 05:46:05.418170 mtbc-sonic-03-2700 INFO syncd.sh[25691]: Starting MST (Mellanox Software Tools) driver set
Sep  4 05:46:05.475872 mtbc-sonic-03-2700 INFO syncd.sh[25691]: Loading MST PCI module - Success
Sep  4 05:46:05.489038 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: Receive message : ('fc00::76', 'SET', (('rrclient', '0'), ('name', 'ARISTA14T0'), ('local_addr', 'fc00::75'), ('nhopself', '0'), ('holdtime', '10'), ('asn', '64014'), ('keepalive', '3')))
Sep  4 05:46:05.489038 mtbc-sonic-03-2700 INFO bgp#bgpcfgd: value for fc00::76 changed to {'rrclient': '0', 'name': 'ARISTA14T0', 'local_addr': 'fc00::75', 'nhopself': '0', 'holdtime': '10', 'asn': '64014', 'keepalive': '3'}
Sep  4 05:46:05.489038 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor fc00::76 remote-as 64014'.
Sep  4 05:46:05.515094 mtbc-sonic-03-2700 INFO syncd.sh[25691]: Loading MST PCI configuration module - Success
Sep  4 05:46:05.523446 mtbc-sonic-03-2700 INFO syncd.sh[25691]: Create devices
Sep  4 05:46:05.762107 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65100' -c 'neighbor fc00::76 description ARISTA14T0'.
Sep  4 05:46:05.999826 mtbc-sonic-03-2700 DEBUG bgp#bgpcfgd: Receive message : ('fc00::66', 'SET', (('rrclient', '0'), ('name', 'ARISTA10T0'), ('local_addr', 'fc00::65'), ('nhopself', '0'), ('holdtime', '10'), ('asn', '64010'), ('keep

**Output of `show version`:**

This issue occurs on any of the recent versions.
```
SONiC Software Version: SONiC.HEAD.71-47504d13
Distribution: Debian 9.9
Kernel: 4.9.0-9-2-amd64
Build commit: 47504d1
Build date: Fri Sep 6 08:56:45 UTC 2019
Built by: johnar@jenkins-worker-4

Platform: x86_64-mlnx_msn3700-r0
HwSKU: ACS-MSN3700
ASIC: mellanox
Serial Number: MT1851X02961
Uptime: 04:52:44 up 5 min, 1 user, load average: 3.31, 2.54, 1.18

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx HEAD.71-47504d13 d33d1c98b196 369MB
docker-syncd-mlnx latest d33d1c98b196 369MB
docker-fpm-frr HEAD.71-47504d13 504e32de8520 319MB
docker-fpm-frr latest 504e32de8520 319MB
docker-lldp-sv2 HEAD.71-47504d13 a60d4a514f57 298MB
docker-lldp-sv2 latest a60d4a514f57 298MB
docker-dhcp-relay HEAD.71-47504d13 7e817834a169 289MB
docker-dhcp-relay latest 7e817834a169 289MB
docker-database HEAD.71-47504d13 0e08472e952f 281MB
docker-database latest 0e08472e952f 281MB
docker-snmp-sv2 HEAD.71-47504d13 881dacc459d5 323MB
docker-snmp-sv2 latest 881dacc459d5 323MB
docker-orchagent HEAD.71-47504d13 eda196786f03 321MB
docker-orchagent latest eda196786f03 321MB
docker-teamd HEAD.71-47504d13 f76a423ee993 302MB
docker-teamd latest f76a423ee993 302MB
docker-sonic-telemetry HEAD.71-47504d13 dccf8e795dd0 304MB
docker-sonic-telemetry latest dccf8e795dd0 304MB
docker-router-advertiser HEAD.71-47504d13 26a418f74696 281MB
docker-router-advertiser latest 26a418f74696 281MB
docker-platform-monitor HEAD.71-47504d13 7f4d55f192bf 560MB
docker-platform-monitor latest 7f4d55f192bf 560MB

```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```

The text was updated successfully, but these errors were encountered:

jangidrk · 2019-09-13T11:56:29Z

timeout is reduced to 1 minute from 6 as per below PR. you can try to increase and see if that helps.

sonic-net/sonic-sairedis#472

yxieca assigned lguohan and biblix and unassigned biblix Sep 20, 2019

stephenxs closed this as completed Nov 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

swss restarts during service initialization due to sai_redis_internal_notify_syncd timeout #3444

swss restarts during service initialization due to sai_redis_internal_notify_syncd timeout #3444

stephenxs commented Sep 12, 2019

jangidrk commented Sep 13, 2019

swss restarts during service initialization due to sai_redis_internal_notify_syncd timeout #3444

swss restarts during service initialization due to sai_redis_internal_notify_syncd timeout #3444

Comments

stephenxs commented Sep 12, 2019

jangidrk commented Sep 13, 2019