-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce the timeout (GET_RESPONSE_TIMEOUT) from 6 minutes to 1 minute. #472
Conversation
@@ -43,7 +43,7 @@ void check_notifications_pointers( | |||
|
|||
// if we don't receive response from syncd in 60 seconds | |||
// there is something wrong and we should fail | |||
#define GET_RESPONSE_TIMEOUT (6*60*1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just fyi i remember that this was changed on purpose but i cant remember why ;/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Let us face it again. If we face, I would prefer to fix the other end.
Removed from 201811 branch until further validation is done. |
What I did: Increase SAI Sync operation timeout from 1 min to 6 min by reverting #472 Why I did: Issue as mention sonic-net/sonic-buildimage#3832 is seen when there is link/port-channel flaps towards T2 which results in sequence of this operation on T1 topology:- Link Going down - Next Hop Member removal from Nexthop Group - Creation of New Nexthop Group - All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes) Link Going up - Creation of New Nexthop Group - All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes) Above Sequence with flaps create many SAI operation of CREATE/SET and if SAI is slow to process them if there in any intermediate Get operation (common GET case being CRM polling at default 5 mins) will result timeout if there is no response in 1 min which will cause next Get operation to fail because of attribute mismatch as it make Request and Reponse are out of sync. To fix this:- - Have increase the timeout to 6 min. Even 201811 image/branch is having same timeout value. - Hard to predict what is good timeout value but based on below experiment and since 201811 is having 6 min using that value. How I verify: - Ran the below script for 800 iteration and orchagent did not got timeout error and no crash. Without fix OA used to crash consistently with most of the time less than 100 iteration of flaps. - Based on sairedis.rec of the first CRM `GET` operation between `Create/Set` operation in these 800 iteration worst time for GET response was see 4+ min. ``` #!/bin/bash j=0 while [ True ]; do echo "iteration $j" config interface shutdown PortChannel0002 sleep 2 config interface startup PortChannel0002 sleep 2 j=$((j+1)) done ``` ``` admin@str-xxxx-06:/var/log/swss$ sudo zgrep -A 1 "|g|.*AVAILABLE_IPV4_ROUTE" sairedis.rec* sairedis.rec:2021-05-04.17:40:13.431643|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec:2021-05-04.17:46:18.132115|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530 sairedis.rec:-- sairedis.rec:2021-05-04.17:46:21.128261|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec:2021-05-04.17:46:31.258628|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530 sairedis.rec.1:2021-05-04.17:30:13.900135|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.1:2021-05-04.17:34:47.448092|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531 sairedis.rec.1:-- sairedis.rec.1:2021-05-04.17:35:12.827247|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.1:2021-05-04.17:35:32.939798|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52533 sairedis.rec.2.gz:2021-05-04.17:20:13.103322|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.2.gz:2021-05-04.17:24:47.796720|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531 sairedis.rec.2.gz:-- sairedis.rec.2.gz:2021-05-04.17:25:15.073981|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.2.gz:2021-05-04.17:26:08.525235|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52532 ``` Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
What I did: Increase SAI Sync operation timeout from 1 min to 6 min by reverting sonic-net#472 Why I did: Issue as mention sonic-net/sonic-buildimage#3832 is seen when there is link/port-channel flaps towards T2 which results in sequence of this operation on T1 topology:- Link Going down - Next Hop Member removal from Nexthop Group - Creation of New Nexthop Group - All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes) Link Going up - Creation of New Nexthop Group - All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes) Above Sequence with flaps create many SAI operation of CREATE/SET and if SAI is slow to process them if there in any intermediate Get operation (common GET case being CRM polling at default 5 mins) will result timeout if there is no response in 1 min which will cause next Get operation to fail because of attribute mismatch as it make Request and Reponse are out of sync. To fix this:- - Have increase the timeout to 6 min. Even 201811 image/branch is having same timeout value. - Hard to predict what is good timeout value but based on below experiment and since 201811 is having 6 min using that value. How I verify: - Ran the below script for 800 iteration and orchagent did not got timeout error and no crash. Without fix OA used to crash consistently with most of the time less than 100 iteration of flaps. - Based on sairedis.rec of the first CRM `GET` operation between `Create/Set` operation in these 800 iteration worst time for GET response was see 4+ min. ``` #!/bin/bash j=0 while [ True ]; do echo "iteration $j" config interface shutdown PortChannel0002 sleep 2 config interface startup PortChannel0002 sleep 2 j=$((j+1)) done ``` ``` admin@str-xxxx-06:/var/log/swss$ sudo zgrep -A 1 "|g|.*AVAILABLE_IPV4_ROUTE" sairedis.rec* sairedis.rec:2021-05-04.17:40:13.431643|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec:2021-05-04.17:46:18.132115|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530 sairedis.rec:-- sairedis.rec:2021-05-04.17:46:21.128261|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec:2021-05-04.17:46:31.258628|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530 sairedis.rec.1:2021-05-04.17:30:13.900135|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.1:2021-05-04.17:34:47.448092|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531 sairedis.rec.1:-- sairedis.rec.1:2021-05-04.17:35:12.827247|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.1:2021-05-04.17:35:32.939798|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52533 sairedis.rec.2.gz:2021-05-04.17:20:13.103322|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.2.gz:2021-05-04.17:24:47.796720|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531 sairedis.rec.2.gz:-- sairedis.rec.2.gz:2021-05-04.17:25:15.073981|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696 sairedis.rec.2.gz:2021-05-04.17:26:08.525235|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52532 ``` Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
No description provided.