Description
swss restarts during service initialization due to a sai_redis_internal_notify_syncd timeout.
During initialization, orchagent calls initSaiRedis to set up the SAI interfaces. One of its steps is to wait on a response table by calling select(), which returns once syncd writes a response to that table. If no data is received within one minute, select() times out and orchagent fails, causing the whole swss docker to restart (a simplified sketch of this wait-and-timeout pattern follows the log). The log looks like the following:
Aug 31 07:54:25.100769 mtbc-sonic-03-2700 NOTICE swss#orchagent: :- initSaiRedis: Enable redis pipeline
Aug 31 07:54:25.100853 mtbc-sonic-03-2700 NOTICE swss#orchagent: :- sai_redis_notify_syncd: sending syncd INIT view
Aug 31 07:54:25.101245 mtbc-sonic-03-2700 NOTICE swss#orchagent: :- sai_redis_internal_notify_syncd: wait for notify response
... ...
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- sai_redis_internal_notify_syncd: notify syncd failed to get response result from select: 2
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- sai_redis_internal_notify_syncd: notify syncd failed to get response
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- sai_redis_notify_syncd: notify syncd failed: SAI_STATUS_FAILURE
Aug 31 07:55:25.162280 mtbc-sonic-03-2700 ERR swss#orchagent: :- initSaiRedis: Failed to notify syncd INIT_VIEW, rv:-1
Aug 31 07:55:25.162912 mtbc-sonic-03-2700 INFO swss#supervisord: orchagent terminate called without an active exception
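For reference, here is a minimal sketch of the wait-and-timeout pattern described above. It uses a plain POSIX select() on a pipe in place of the Redis notification channel that sairedis actually waits on; the function names and the shortened timeout are illustrative only, not the real sairedis code.
```
// Simplified illustration of "wait for notify response" with a deadline.
// Not SONiC/sairedis code; names and timeout values are made up for the demo.
#include <sys/select.h>
#include <unistd.h>
#include <cstdio>

// Wait up to timeout_sec seconds for the responder to write something to fd.
// Returns true if data arrived, false on timeout (the failure path in the log).
static bool waitForNotifyResponse(int fd, int timeout_sec)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    struct timeval tv;
    tv.tv_sec = timeout_sec;   // orchagent's wait is reported as one minute
    tv.tv_usec = 0;

    int rc = select(fd + 1, &readfds, nullptr, nullptr, &tv);
    if (rc <= 0)
    {
        // rc == 0: timeout, rc < 0: error -- either way the caller gives up,
        // which is what produces "notify syncd failed to get response".
        return false;
    }
    return true;
}

int main()
{
    int pipefd[2];
    if (pipe(pipefd) != 0) return 1;

    // Nobody ever writes to the pipe, so this times out after 5 seconds
    // (shortened from 60 for the demo) and takes the failure path.
    if (!waitForNotifyResponse(pipefd[0], 5))
    {
        std::fprintf(stderr, "notify syncd failed to get response (timeout)\n");
        return 1;
    }
    std::puts("got response");
    return 0;
}
```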
Steps to reproduce the issue:
It can be reproduced by running "config reload" or "systemctl restart swss.service", but only with very low probability, less than 5%.
Describe the results you received:
orchagent failed to get the response to the INIT_VIEW notification within one minute and exited, causing the whole swss docker to restart (see the log above).
Describe the results you expected:
orchagent should start normally.
Additional information you deem important (e.g. issue happens only occasionally):
It seems that many services initialize simultaneously, without an explicitly defined order, while the system is starting. Sometimes bgp starts in the window between orchagent calling select() in sai_redis_notify_syncd and the timeout, and a lot of configuration is deployed via bgpcfgd during that window; in that case the issue is more likely to reproduce (a toy model of this race is sketched after this section).
This issue occurs on any of the recent versions.
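To make the timing hypothesis above concrete, the following is a toy model only (none of it is SONiC code; the threads, delays, and deadline values are invented for illustration): a responder delayed past the waiter's deadline by competing startup work hits exactly the timeout path seen in the log.
```
// Toy model of the hypothesized race. Build with -pthread.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main()
{
    std::mutex m;
    std::condition_variable cv;
    bool responded = false;

    // Responder: stands in for the syncd side answering INIT_VIEW only after a
    // long delay caused by competing startup load (e.g. a flood of configuration
    // pushed while everything initializes at once).
    std::chrono::seconds responder_delay(3);   // imagine this exceeding 60 s
    std::thread responder([&] {
        std::this_thread::sleep_for(responder_delay);
        {
            std::lock_guard<std::mutex> lk(m);
            responded = true;
        }
        cv.notify_one();
    });

    // Waiter: stands in for orchagent waiting for the response with a deadline.
    std::chrono::seconds deadline(2);          // stands in for the one-minute wait
    {
        std::unique_lock<std::mutex> lk(m);
        if (!cv.wait_for(lk, deadline, [&] { return responded; }))
        {
            // Deadline elapsed first: this is the path that ends with
            // "Failed to notify syncd INIT_VIEW" and the swss restart.
            std::fprintf(stderr, "timed out waiting for response\n");
        }
        else
        {
            std::puts("response arrived in time");
        }
    }

    responder.join();
    return 0;
}
```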
```
SONiC Software Version: SONiC.HEAD.71-47504d13
Distribution: Debian 9.9
Kernel: 4.9.0-9-2-amd64
Build commit: 47504d1
Build date: Fri Sep 6 08:56:45 UTC 2019
Built by: johnar@jenkins-worker-4
Platform: x86_64-mlnx_msn3700-r0
HwSKU: ACS-MSN3700
ASIC: mellanox
Serial Number: MT1851X02961
Uptime: 04:52:44 up 5 min, 1 user, load average: 3.31, 2.54, 1.18
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx HEAD.71-47504d13 d33d1c98b196 369MB
docker-syncd-mlnx latest d33d1c98b196 369MB
docker-fpm-frr HEAD.71-47504d13 504e32de8520 319MB
docker-fpm-frr latest 504e32de8520 319MB
docker-lldp-sv2 HEAD.71-47504d13 a60d4a514f57 298MB
docker-lldp-sv2 latest a60d4a514f57 298MB
docker-dhcp-relay HEAD.71-47504d13 7e817834a169 289MB
docker-dhcp-relay latest 7e817834a169 289MB
docker-database HEAD.71-47504d13 0e08472e952f 281MB
docker-database latest 0e08472e952f 281MB
docker-snmp-sv2 HEAD.71-47504d13 881dacc459d5 323MB
docker-snmp-sv2 latest 881dacc459d5 323MB
docker-orchagent HEAD.71-47504d13 eda196786f03 321MB
docker-orchagent latest eda196786f03 321MB
docker-teamd HEAD.71-47504d13 f76a423ee993 302MB
docker-teamd latest f76a423ee993 302MB
docker-sonic-telemetry HEAD.71-47504d13 dccf8e795dd0 304MB
docker-sonic-telemetry latest dccf8e795dd0 304MB
docker-router-advertiser HEAD.71-47504d13 26a418f74696 281MB
docker-router-advertiser latest 26a418f74696 281MB
docker-platform-monitor HEAD.71-47504d13 7f4d55f192bf 560MB
docker-platform-monitor latest 7f4d55f192bf 560MB
```